Hi Nutch users! For the last 8 months at Scrapinghub we've been working on a new web crawling framework called Frontera. It is a distributed implementation of the crawl frontier, the part of a web crawler that decides what to crawl next, when, and when to stop. So it is not a complete web crawler; however, it does suggest an overall crawler design, and there is a clean and tested way to build such a crawler from existing components in half a day.
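
For anyone who hasn't met the term, the crawl frontier boils down to the bookkeeping sketched below. This is a toy Python illustration only; the class and method names are mine, not Frontera's API. In Frontera this logic is split across pluggable backends and a separate crawling strategy module (see the feature list below).

from collections import deque

class SimpleFrontier(object):
    """Tracks discovered URLs and decides what to fetch next."""

    def __init__(self, seeds):
        self.queue = deque(seeds)   # URLs waiting to be fetched
        self.seen = set(seeds)      # URLs already scheduled, to avoid repeats

    def get_next_requests(self, max_n=10):
        """Return the next batch of URLs for the fetcher (what to crawl next)."""
        batch = []
        while self.queue and len(batch) < max_n:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, url, extracted_links):
        """Update frontier state with links found on a fetched page."""
        for link in extracted_links:
            if link not in self.seen:
                self.seen.add(link)
                self.queue.append(link)

    def finished(self):
        """Decide when to stop: here, simply when nothing is left to fetch."""
        return not self.queue
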
Here is a list of the main features:

- Online operation: scheduling of new batches and updating of DB state. No need to stop the crawl to change the crawling strategy.
- Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
- Canonical URL resolution abstraction: each document has many URLs; which one to use? We provide a place where you can code your own logic.
- Scrapy ecosystem: good documentation, a big community, and ease of customization.
- The communication layer is Apache Kafka: easy to plug in somewhere and debug.
- Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
- Polite by design: each website is downloaded by at most one spider process.
- Workers are implemented in Python.

In general, such a web crawler should be very easy to customize and to plug into existing infrastructure, and its online operation could be useful for crawling frequently changing web pages, news websites for example. We tested it at some scale by crawling part of the Spanish internet; you can find the details in my presentation:
http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf

The project currently lives on GitHub and is open source, under its own license:
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/distributed-frontera

The questions are: what do you guys think? Is this a useful thing? If so, what kind of use cases do you see? I'm currently looking for businesses that could benefit from it, so please write to me if you have any ideas on that.

A.

