Fajri Abdillah

News Web Crawler

News Web Crawler is the thesis project for my bachelor's degree, so I put my best work into it.

This project has three parts:
1) Capture RSS data from the given providers
2) Crawl the news articles linked from the RSS data
3) Show the data using Django
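
As a rough illustration of how part 1 feeds part 2, the sketch below parses an RSS document and collects the article links that a crawler would then visit. It is a minimal sketch using only the Python standard library, not the project's actual code; the sample feed, the `news.example.com` URLs, and the `parse_rss_items` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real provider's feed would be downloaded instead.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://news.example.com/articles/1</link>
      <pubDate>Mon, 01 Oct 2012 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://news.example.com/articles/2</link>
      <pubDate>Mon, 01 Oct 2012 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(rss_text):
    """Return one dict per <item>, holding its title, link, and pubDate."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items


# Part 1: capture the RSS entries. Part 2 would crawl each of these links.
rss_items = parse_rss_items(SAMPLE_RSS)
links_to_crawl = [entry["link"] for entry in rss_items]
```

Storing the RSS entries separately from the crawled articles is what makes it possible to search both kinds of data independently.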

Guests can search either the RSS data or the crawled articles. News sites often carry a lot of distractions. This project strips them out, so only the headline and body of each article go into the database: no banner ads, no click-triggered ads, just plain news.

New data arrives every 30 minutes, so query results are cached in Redis and refreshed on the same 30-minute cycle. At the time, I used Google Reader to collect the RSS feeds. Sadly, that awesome service has since been shut down, and I have not found a good replacement for it.
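
The caching idea can be sketched roughly as follows: look the query up in the cache first, and only hit the database on a miss, storing the result with a 30-minute expiry. This is an assumption about how such a cache might look, not the project's actual code, and it is written for current Python 3 rather than Python 2.6. `InMemoryCache`, `cached_search`, and `fake_database_search` are hypothetical names. The in-memory class stands in for a Redis client so the example runs without a server; a redis-py client could be passed instead, since it offers `get` and `set(..., ex=...)`.

```python
import json

CACHE_TTL_SECONDS = 30 * 60  # assumed to match the 30-minute data refresh


class InMemoryCache(object):
    """Stand-in for a Redis client in this sketch (it ignores the expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value, ex=None):
        self._store[key] = value


def cached_search(cache, query, run_query):
    """Return cached results for the query, or compute and cache them."""
    key = "search:" + query.strip().lower()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    results = run_query(query)
    # With a real Redis client, ex= makes the key expire after 30 minutes.
    cache.set(key, json.dumps(results), ex=CACHE_TTL_SECONDS)
    return results


database_calls = []


def fake_database_search(query):
    """Hypothetical database query; records each call so misses are visible."""
    database_calls.append(query)
    return [{"title": "Headline about " + query}]


cache = InMemoryCache()
first = cached_search(cache, "Election", fake_database_search)
second = cached_search(cache, "election ", fake_database_search)  # cache hit
```

Normalizing the query before building the key means that searches differing only in case or surrounding whitespace share one cache entry.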

Django is an awesome web framework with an automatically generated administration backend. It is useful for inspecting the database and for creating, updating, and deleting records through a simple interface. If the models are designed correctly, foreign-key relations show up here as well.

Challenges

  • Scrapy & XPATH

    I had to write the XPath expressions individually for each news provider, because every provider structures its pages differently.

  • Django

    Django really is a web framework for meeting tight deadlines.

  • Python

    Writing good Python code is fun, because the indentation keeps it well structured.
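
The per-provider XPath work described in the first item can be sketched as a lookup table that maps each provider to its own expressions for the headline and the body. The project used Scrapy for this. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset and only well-formed markup, so it illustrates the idea rather than reproducing the original spiders. The provider hosts, the XPath expressions, the sample page, and the `extract_article` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-provider rules: each site needs its own expressions.
PROVIDER_XPATHS = {
    "news.example.com": {
        "title": ".//h1[@class='headline']",
        "body": ".//div[@id='article-body']/p",
    },
    "daily.example.org": {
        "title": ".//h2[@class='post-title']",
        "body": ".//div[@class='content']/p",
    },
}

# Hypothetical, well-formed sample page with ads around the article.
SAMPLE_PAGE = """<html><body>
<div class="ad">Buy now!</div>
<h1 class="headline">Flood hits the city</h1>
<div id="article-body"><p>Rain fell all night.</p><p>Roads are closed.</p></div>
<div class="ad">Click here</div>
</body></html>"""


def extract_article(provider, page_text):
    """Keep only the headline and body paragraphs selected by the rules."""
    rules = PROVIDER_XPATHS[provider]
    root = ET.fromstring(page_text)
    title_node = root.find(rules["title"])
    title = ""
    if title_node is not None:
        title = "".join(title_node.itertext()).strip()
    paragraphs = [
        "".join(p.itertext()).strip() for p in root.findall(rules["body"])
    ]
    return {"title": title, "body": "\n".join(paragraphs)}


article = extract_article("news.example.com", SAMPLE_PAGE)
```

Because only the nodes matched by the rules are kept, the ad blocks never reach the result. Real news pages are rarely well-formed XML, which is one reason a tool such as Scrapy is the more practical choice.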

Year

2012

Stack

  • Python 2.6
  • Django 1.x
  • Apache 2
  • Scrapy 0.16
  • Redis
  • Ubuntu Server 12.04
  • Highcharts

Screenshots

Use Case Diagram
 nwc_1.png

Architecture
 nwc_2.png

Class Diagram
 nwc_3.png

Statistics
 nwc_4.png

Search RSS
 nwc_5.png

Search Content
 nwc_6.png

Django Administration
 nwc_7.png

Database Implementation
 nwc_8.png

Lesson learned

Django is fun, a good choice of Python web framework, and fast to develop with.