Fajri Abdillah

Blog Crawler

This project was built to collect blog URLs to be crawled later. I use the highest-traffic blogs as the starting point.


  • Python

    Writing good Python code is fun.

  • Scrapy & XPath

    Confusing at first, but it went well after some trial and error with XPath expressions just to capture a string inside an HTML element.

  • Duplicate URLs

    To deal with this, I use Redis. Redis has simple commands that we can use to build a list of unique URLs. Before I knew Redis, I used MySQL INSERT IGNORE with a UNIQUE column, but the I/O was super high. Redis handles this much better.
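
The duplicate-URL check can be sketched roughly like this. It is a minimal sketch, not the project's actual code: it assumes the unique list is kept as a Redis set and filled with the SADD command, which returns the number of members that were really added. The key name `crawler:seen_urls` and the helpers `normalize` and `is_new_url` are made up for illustration, and the sketch targets current Python rather than the Python 2.6 listed below.

```python
SEEN_KEY = "crawler:seen_urls"  # hypothetical key name, not from the project


def normalize(url):
    """Trim whitespace and drop the #fragment so trivial variants compare equal."""
    return url.strip().split("#", 1)[0]


def is_new_url(client, url, key=SEEN_KEY):
    """Return True the first time a URL is seen, False on every later call.

    `client` is anything with a Redis-style sadd(key, *values) method.
    SADD reports how many members were newly added, so 1 means "not seen before".
    """
    return client.sadd(key, normalize(url)) == 1
```

With the redis-py package, `client` would be something like `redis.Redis(host="localhost", port=6379)`, and the crawler would only schedule a URL when `is_new_url(client, url)` returns True. The check and the insert happen in a single command, so there is no separate "does it exist" query as with the MySQL approach.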




  • Python 2.6
  • Ubuntu Server 12.04
  • Apache 2
  • Scrapy 0.16
  • Redis


Lesson learned

Scrapy is a complete web crawling framework, and it is fast.