This project is built to collect blog URLs to be crawled later. I use the highest-traffic blogs as the starting point.
Writing good Python code is fun.
Scrapy & XPath
Confusing at first, but it went well. It took a lot of trial and error with XPath expressions just to capture a string inside an HTML element.
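To show the kind of XPath expression this trial and error ends up with, here is a minimal sketch. It is not the project's actual spider: the HTML snippet, the `blog` class name, and the example.com URLs are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) so it runs without Scrapy installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment standing in for a downloaded page.
HTML = """
<html><body>
  <div class="blog-list">
    <a class="blog" href="https://example.com/one">One</a>
    <a class="blog" href="https://example.com/two">Two</a>
    <a class="ad" href="https://example.com/ad">Ad</a>
  </div>
</body></html>
""".strip()

root = ET.fromstring(HTML)

# Select every <a> whose class attribute is exactly "blog", then read its href.
blog_urls = [a.get("href") for a in root.findall(".//a[@class='blog']")]

print(blog_urls)
```

Inside a Scrapy callback the same idea would probably look like `response.xpath("//a[@class='blog']/@href").getall()`, since Scrapy selectors accept full XPath and can return the attribute value directly.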
To deal with duplicate URLs, I use Redis. Redis has simple commands that we can use to build a list of unique URLs. Before I knew about Redis, I used MySQL INSERT IGNORE with a UNIQUE column, but the I/O was very high. Redis works best for this.
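One way to keep a unique URL list in Redis is a set filled with the SADD command, which reports how many members were actually new. The text above does not say which command the project uses, so treat this as an assumption. The sketch below is only an illustration: `remember_url`, the `blog:urls` key, and the `FakeRedis` stand-in are hypothetical names, and the stand-in exists only so the example runs without a Redis server.

```python
class FakeRedis:
    """In-memory stand-in (hypothetical) that mimics only SADD and SCARD."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Like Redis SADD: return the number of members that were newly added.
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before

    def scard(self, key):
        # Like Redis SCARD: return the number of members in the set.
        return len(self._sets.get(key, set()))


def remember_url(client, url, key="blog:urls"):
    """Return True if the URL was new, False if it was already stored."""
    return client.sadd(key, url) == 1


# With a real server and the redis-py package this would be something like:
#   import redis
#   client = redis.Redis(decode_responses=True)
client = FakeRedis()

results = [
    remember_url(client, url)
    for url in [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",  # duplicate, should be rejected
    ]
]
print(results, client.scard("blog:urls"))
```

Because the uniqueness check and the insert happen in one command, there is no separate "does it exist?" query, which is likely part of why this feels lighter than INSERT IGNORE against a UNIQUE column.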
Scrapy is a full web crawling framework, and it is fast.