Sorry, I just re-read it and saw that it's open source, but under what license? I apologize if you're not trying to sell this.
On Fri, Oct 2, 2015 at 11:45 AM, Jessica Glover <[email protected]> wrote:

> Hmm... you're asking for a free consultation on an open source software
> user mailing list? First, this doesn't exactly seem like the appropriate
> place for that. Second, offer some incentive if you want someone to help
> you with your business.
>
> On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov <[email protected]> wrote:
>
>> Hi Nutch users!
>>
>> For the last 8 months at Scrapinghub we've been working on a new web
>> crawling framework called Frontera. It is a distributed implementation of
>> the crawl frontier part of a web crawler: the component that decides what
>> to crawl next, when, and when to stop. So it's not a complete web crawler;
>> however, it suggests an overall crawler design, and there is a clean,
>> tested way to build such a crawler from existing components in half a day.
>>
>> Here is a list of the main features:
>>
>> Online operation: scheduling of new batches and updating of DB state. No
>> need to stop crawling to change the crawling strategy.
>> Storage abstraction: write your own backend (SQLAlchemy and HBase
>> backends are included).
>> Canonical URL resolution abstraction: each document has many URLs; which
>> one should be used? We provide a place where you can code your own logic.
>> Scrapy ecosystem: good documentation, a big community, and ease of
>> customization.
>> The communication layer is Apache Kafka: easy to plug in anywhere and
>> debug.
>> Crawling strategy abstraction: the crawling goal, URL ordering, and
>> scoring model are coded in a separate module.
>> Polite by design: each website is downloaded by at most one spider
>> process.
>> Workers are implemented in Python.
>>
>> In general, such a web crawler should be very easy to customize and easy
>> to plug into existing infrastructure, and its online operation could be
>> useful for crawling frequently changing web pages, such as news websites.
>> We tested it at some scale by crawling part of the Spanish internet; you
>> can find the details in my presentation:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>>
>> The project is currently on GitHub; it's open source, under its own
>> license:
>> https://github.com/scrapinghub/frontera
>> https://github.com/scrapinghub/distributed-frontera
>>
>> The questions are: what do you guys think? Is this a useful thing? If so,
>> what kinds of use cases do you see? Currently, I'm looking for businesses
>> that could benefit from it; please write to me if you have any ideas.
>>
>> A.
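[Editor's note: for readers unfamiliar with how the Scrapy integration mentioned in the announcement is wired up, here is a minimal sketch of a Scrapy settings module routed through Frontera. Module paths and setting names follow Frontera's documentation from around that time and may differ between versions; `myproject.frontera_settings` is a hypothetical module name.]

```python
# settings.py of a Scrapy project -- a sketch, not a definitive setup.
# Exact module paths and setting names may vary by Frontera version.

# Replace Scrapy's default scheduler with Frontera's, so the crawl
# frontier decides what to fetch next.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Middlewares that hand requests/responses back and forth between
# Scrapy and the Frontera frontier.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Frontera keeps its own settings in a separate module (hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'

# myproject/frontera_settings.py would then pick a backend, e.g. the
# bundled in-memory one for testing, or the HBase backend at scale:
# BACKEND = 'frontera.contrib.backends.memory.FIFO'
# MAX_NEXT_REQUESTS = 256  # size of each batch handed to the spider
```

This illustrates the "easy to plug into existing infrastructure" claim: the spider code stays untouched, and only the scheduling layer is swapped out.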

