Hi Nutch users!

For the last 8 months at Scrapinghub we’ve been working on a new web crawling framework 
called Frontera. It is a distributed implementation of the crawl frontier, the part of a 
web crawler that decides what to crawl next, when, and when to stop. So it’s not a 
complete web crawler. However, it suggests an overall crawler design, and there is a 
clean and tested way to build such a crawler from existing components in half a day.
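To give an idea of the "existing components" part: hooking Frontera into a Scrapy 
project is mostly a matter of a few settings. The sketch below is from memory, so 
please check the module paths against the documentation; 'myproject.frontera_settings' 
is just a placeholder for your own Frontera settings module:

    # Scrapy settings.py -- minimal sketch, paths may differ between versions
    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    FRONTERA_SETTINGS = 'myproject.frontera_settings'  # placeholder module name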

Here is a list of the main features:
- Online operation: scheduling of new batches and updating of the DB state happen 
  while the crawl is running. No need to stop crawling to change the crawling strategy.
- Storage abstraction: write your own backend (SQLAlchemy and HBase backends are 
  included); a rough sketch of what a backend looks like follows this list.
- Canonical URL resolution abstraction: each document can be reached by many URLs, 
  so which one should be used? We provide a place where you can plug in your own logic.
- Scrapy ecosystem: good documentation, big community, ease of customization.
- Communication layer is Apache Kafka: easy to plug into existing infrastructure and to debug.
- Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are 
  coded in a separate module; see the second sketch after this list.
- Polite by design: each website is downloaded by at most one spider process.
- Workers are implemented in Python.
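
To make the storage abstraction more concrete, here is a rough sketch of a toy 
in-memory FIFO backend. The method names follow my recollection of the Backend 
interface and may differ in detail from the released version, so treat it as an 
illustration rather than copy-paste code:

    from frontera.core.components import Backend

    class MemoryFIFOBackend(Backend):
        """Toy in-memory FIFO backend, for illustration only."""

        def __init__(self, manager):
            self.queue = []

        @classmethod
        def from_manager(cls, manager):
            return cls(manager)

        def frontier_start(self):
            pass

        def frontier_stop(self):
            pass

        def add_seeds(self, seeds):
            self.queue.extend(seeds)

        def page_crawled(self, response, links):
            # newly discovered links go to the end of the queue
            self.queue.extend(links)

        def request_error(self, page, error):
            pass

        def get_next_requests(self, max_next_requests, **kwargs):
            batch = self.queue[:max_next_requests]
            self.queue = self.queue[max_next_requests:]
            return batch

A real backend would of course persist the queue and per-URL state in SQLAlchemy, 
HBase or whatever store you prefer.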
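
Similarly, to show where the crawling strategy abstraction fits, here is a purely 
hypothetical sketch of a strategy module. The class and method names below are 
illustrative, not the exact distributed-frontera interface; the point is that the 
crawl goal and scoring live in one small, replaceable piece of code:

    class BreadthFirstStrategy(object):
        """Hypothetical strategy: shallow pages get higher scores."""

        def __init__(self, schedule):
            # 'schedule' stands for a callable(request, score) supplied by the framework
            self.schedule = schedule

        def add_seeds(self, seeds):
            for seed in seeds:
                self.schedule(seed, 1.0)

        def page_crawled(self, response, links):
            depth = response.meta.get('depth', 0)
            for link in links:
                # score decays with depth, so the crawl stays close to the seeds
                self.schedule(link, 1.0 / (depth + 2))

        def page_error(self, request, error):
            pass  # could deprioritize or drop the failed request here

Swapping this module for, say, a topic-focused scorer changes the crawling goal 
without touching the rest of the system.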
In general, such a web crawler should be very easy to customize and to plug into 
existing infrastructure, and its online operation could be useful for crawling 
frequently changing web pages, news websites for example. We tested it at some scale 
by crawling a part of the Spanish internet; you can find the details in my 
presentation:
http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf

The project is currently on GitHub; it’s open source, under its own license.
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/distributed-frontera

The questions are: what do you guys think? Is this a useful thing? If yes, what 
kinds of use cases do you see? Currently I’m looking for businesses that could 
benefit from it, so please write to me if you have any ideas on that.

A.
