Hi Alex,

I didn’t see any more traffic about this. Are you still looking
for feedback? Are there any plans to make Frontera and Nutch
work together?

I’m still interested, of course.

Thanks,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Alexander Sibiryakov <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, October 2, 2015 at 8:33 AM
To: "[email protected]" <[email protected]>
Subject: Frontera: large-scale, distributed web crawling framework

>Hi Nutch users!
>
>For the last 8 months at Scrapinghub we’ve been working on a new web
>crawling framework called Frontera. It is a distributed implementation of
>the crawl frontier part of a web crawler: the component which decides what
>to crawl next, when to crawl it, and when to stop. So it’s not a complete
>web crawler. However, it does suggest an overall crawler design, and there
>is a clean and tested way to build such a crawler from existing components
>in half a day.
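>
>To make the frontier’s role concrete, here is a minimal, purely
>illustrative sketch in Python (the class and method names are hypothetical
>and are not Frontera’s actual API): the frontier decides what to fetch
>next and when to stop, while the fetcher only downloads pages and reports
>the results back.
>
>    # Illustrative sketch only -- hypothetical names, not Frontera's API.
>    class Frontier:
>        """Holds the queue of pending URLs and the crawl state."""
>
>        def __init__(self, seeds):
>            self.queue = list(seeds)        # what to crawl next
>            self.seen = set(seeds)
>
>        def get_next_requests(self, max_n=10):
>            batch, self.queue = self.queue[:max_n], self.queue[max_n:]
>            return batch                    # an empty batch means "stop"
>
>        def page_crawled(self, url, links):
>            for link in links:              # feed extracted links back in
>                if link not in self.seen:
>                    self.seen.add(link)
>                    self.queue.append(link)
>
>    def crawl(frontier, fetch):
>        """Fetch loop: the frontier decides, the fetcher only downloads."""
>        while True:
>            batch = frontier.get_next_requests()
>            if not batch:
>                break
>            for url in batch:
>                links = fetch(url)          # download page, extract links
>                frontier.page_crawled(url, links)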
>
>Here is a list of the main features:
>- Online operation: scheduling of new batches and updating of DB state,
>  with no need to stop the crawl to change the crawling strategy.
>- Storage abstraction: write your own backend (SQLAlchemy and HBase
>  backends are included).
>- Canonical URL resolution abstraction: each document can have many URLs,
>  so which one should be used? We provide a place where you can code your
>  own logic.
>- Scrapy ecosystem: good documentation, a big community, and easy
>  customization.
>- Communication layer is Apache Kafka: easy to plug in anywhere and to
>  debug.
>- Crawling strategy abstraction: the crawling goal, URL ordering, and
>  scoring model are coded in a separate module (see the sketch after this
>  list).
>- Polite by design: each website is downloaded by at most one spider
>  process.
>- Workers are implemented in Python.
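>
>As a rough illustration of the crawling strategy abstraction, the sketch
>below keeps the crawl goal, the URL ordering, and the scoring model in a
>single small module. This is a hypothetical example, not Frontera’s actual
>strategy interface; the class name, hosts, and scoring formula are made up
>for the sake of the example.
>
>    # Illustrative sketch only -- not Frontera's actual strategy classes.
>    from urllib.parse import urlparse
>
>    class NewsFirstStrategy:
>        """Scores URLs so that news hosts and shallow paths come first."""
>
>        NEWS_HOSTS = {"example-news.com", "example-daily.es"}  # crawl goal
>
>        def score(self, url):
>            parts = urlparse(url)
>            depth = parts.path.count("/")
>            base = 1.0 if parts.hostname in self.NEWS_HOSTS else 0.5
>            return base / (1 + depth)       # shallower pages score higher
>
>        def order(self, urls):
>            # URL ordering used when scheduling the next batch
>            return sorted(urls, key=self.score, reverse=True)
>
>Swapping a module like this out, or swapping the storage backend between,
>say, SQLAlchemy and HBase, is the kind of customization the list above
>refers to.
>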
>In general, such a web crawler should be very easy to customize and to
>plug into existing infrastructure, and its online operation could be
>useful for crawling frequently changing web pages, such as news websites.
>We tested it at some scale by crawling part of the Spanish internet; you
>can find the details in my presentation:
>http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>
>The project is currently on GitHub; it’s open source, under its own
>license:
>https://github.com/scrapinghub/frontera
>https://github.com/scrapinghub/distributed-frontera
>
>The questions are: what do you guys think? Is this a useful thing? If so,
>what kind of use cases do you see? Currently I’m looking for businesses
>that could benefit from it, so please write to me if you have any ideas
>on that.
>
>A.
