Hmm... you're asking for a free consultation on an open source software
user mailing list? First, this doesn't exactly seem like the appropriate
place for that. Second, offer some incentive if you want someone to help
you with your business.

On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov <[email protected]>
wrote:

> Hi Nutch users!
>
> For the last 8 months at Scrapinghub, we’ve been working on a new web
> crawling framework called Frontera. It is a distributed implementation of
> the crawl frontier, the component of a web crawler that decides what to
> crawl next, when, and when to stop. So it is not a complete web crawler.
> However, it suggests an overall crawler design. There is a clean and tested
> way to build such a crawler in half a day from existing components.
>
> Here is a list of the main features:
> - Online operation: scheduling of new batches and updating of DB state.
>   No need to stop crawling to change the crawling strategy.
> - Storage abstraction: write your own backend (SQLAlchemy and HBase
>   backends are included).
> - Canonical URL resolution abstraction: each document has many URLs, so
>   which one should be used? We provide a place where you can code your own
>   logic.
> - Scrapy ecosystem: good documentation, big community, ease of
>   customization.
> - Communication layer is Apache Kafka: easy to plug in and debug.
> - Crawling strategy abstraction: the crawling goal, URL ordering, and
>   scoring model are coded in a separate module.
> - Polite by design: each website is downloaded by at most one spider
>   process.
> - Workers are implemented in Python.
>
> In general, such a web crawler should be very easy to customize and easy
> to plug into existing infrastructure, and its online operation could be
> useful for crawling frequently changing web pages, news websites for
> example. We tested it at some scale by crawling part of the Spanish
> internet; you can find details in my presentation.
>
> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>
> The project is currently on GitHub; it is open source, under its own
> license.
> https://github.com/scrapinghub/frontera
> https://github.com/scrapinghub/distributed-frontera
>
> The questions are: what do you guys think? Is this a useful thing? If yes,
> what kind of use cases do you see? Currently, I’m looking for businesses
> that can benefit from it; please write to me if you have any ideas on that.
>
> A.
