Sorry, I just re-read it and saw that it's open source, but under what license? I apologize if you're not trying to sell this.
On Fri, Oct 2, 2015 at 11:45 AM, Jessica Glover <[email protected]> wrote:

> Hmm... you're asking for a free consultation on an open source software
> user mailing list? First, this doesn't exactly seem like the appropriate
> place for that. Second, offer some incentive if you want someone to help
> you with your business.
>
> On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov <[email protected]> wrote:
>
>> Hi Nutch users!
>>
>> For the last 8 months at Scrapinghub we've been working on a new web
>> crawling framework called Frontera. It is a distributed implementation of
>> the crawl frontier part of a web crawler: the component that decides what
>> to crawl next, when, and when to stop. So it's not a complete web crawler;
>> however, it suggests an overall crawler design, and there is a clean,
>> tested way to build such a crawler from existing components in half a day.
>>
>> Here is a list of the main features:
>>
>> Online operation: scheduling of new batches and updating of DB state. No
>> need to stop crawling to change the crawling strategy.
>> Storage abstraction: write your own backend (SQLAlchemy and HBase
>> backends are included).
>> Canonical URL resolution abstraction: each document has many URLs; which
>> one should be used? We provide a place where you can code your own logic.
>> Scrapy ecosystem: good documentation, a big community, and ease of
>> customization.
>> The communication layer is Apache Kafka: easy to plug in anywhere and
>> debug.
>> Crawling strategy abstraction: the crawling goal, URL ordering, and
>> scoring model are coded in a separate module.
>> Polite by design: each website is downloaded by at most one spider
>> process.
>> Workers are implemented in Python.
>>
>> In general, such a web crawler should be very easy to customize and easy
>> to plug into existing infrastructure, and its online operation could be
>> useful for crawling frequently changing web pages, such as news websites.
>> We tested it at some scale by crawling part of the Spanish internet; you
>> can find the details in my presentation:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>>
>> The project is currently on GitHub; it's open source, under its own
>> license:
>> https://github.com/scrapinghub/frontera
>> https://github.com/scrapinghub/distributed-frontera
>>
>> The questions are: what do you guys think? Is this a useful thing? If so,
>> what kinds of use cases do you see? Currently, I'm looking for businesses
>> that could benefit from it; please write to me if you have any ideas.
>>
>> A.
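[Editor's note: for readers unfamiliar with how the Scrapy integration mentioned in the announcement is wired up, here is a minimal sketch of a Scrapy settings module routed through Frontera. Module paths and setting names follow Frontera's documentation from around that time and may differ between versions; `myproject.frontera_settings` is a hypothetical module name.]

```python
# settings.py of a Scrapy project -- a sketch, not a definitive setup.
# Exact module paths and setting names may vary by Frontera version.

# Replace Scrapy's default scheduler with Frontera's, so the crawl
# frontier decides what to fetch next.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Middlewares that hand requests/responses back and forth between
# Scrapy and the Frontera frontier.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Frontera keeps its own settings in a separate module (hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'

# myproject/frontera_settings.py would then pick a backend, e.g. the
# bundled in-memory one for testing, or the HBase backend at scale:
# BACKEND = 'frontera.contrib.backends.memory.FIFO'
# MAX_NEXT_REQUESTS = 256  # size of each batch handed to the spider
```

This illustrates the "easy to plug into existing infrastructure" claim: the spider code stays untouched, and only the scheduling layer is swapped out.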

