Re: Frontera: large-scale, distributed web crawling framework

Alexander Sibiryakov Wed, 13 Jan 2016 10:12:54 -0800

Hi Chris,

Sorry for the long delay; your questions weren’t easy to answer, so I needed
time to think. Please forgive me if I state anything about Nutch that isn’t
true; this is mostly because of my time limitations.


Here are the possible goals of integration of Frontera and Nutch:
- to get the best of both: Nutch is good at scale and faster at
fetching/parsing, while Frontera/Scrapy is online, much easier to customize,
has good docs and is written in Python,
- to ease migration from Frontera to Nutch and vice versa,
- to identify and fix design problems.

Now, a few words on how Nutch and Frontera could work together.
1. The Nutch Fetcher can easily be used with Frontera if it is implemented as
a service that communicates over Kafka or ZeroMQ and speaks the Frontera
protocol (which is documented). Fetching involves parsing and many string
operations, which could be more efficient on the JVM. FetchItem would need an
adapter for the Frontera Request, and the same goes for ParseData.

This could save Frontera users some time on fetching, but if the use case
requires scraping (broad crawling doesn’t), they would need to add a
scraping step later.
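
A minimal sketch of the adapter idea from option 1, in Python for brevity. Both
classes are hypothetical stand-ins, not the real Nutch FetchItem or Frontera
Request; the field names (queue_id, score, metadata, meta) are assumptions
chosen only to show the mapping in both directions.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class NutchFetchItem:
    """Hypothetical stand-in for what a Nutch-side fetch item carries."""
    url: str
    queue_id: str                      # assumed: per-host queue key
    score: float = 0.0
    metadata: dict = field(default_factory=dict)


@dataclass
class FronteraRequest:
    """Hypothetical stand-in for a Frontera-style request (url + meta)."""
    url: str
    method: str = "GET"
    meta: dict = field(default_factory=dict)


def request_to_fetch_item(request: FronteraRequest) -> NutchFetchItem:
    """Frontier -> fetcher: wrap a request for the Nutch side."""
    host = urlsplit(request.url).hostname or ""
    return NutchFetchItem(
        url=request.url,
        queue_id=host,
        score=float(request.meta.get("score", 0.0)),
        metadata=dict(request.meta),   # copy, so the request is not mutated
    )


def fetch_item_to_request(item: NutchFetchItem) -> FronteraRequest:
    """Fetcher -> frontier: carry the score and metadata back in meta."""
    meta = dict(item.metadata)
    meta["score"] = item.score
    return FronteraRequest(url=item.url, meta=meta)
```

A ParseData adapter would follow the same pattern: copy the fields both sides
understand, and pass everything else through untouched in a metadata map.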

2. Scrapy can be used as a fetcher for Nutch too. We just need to figure out
how to run a Scrapy spider in a Hadoop environment. Input/output adapters and
a process wrapper are needed. Some interface modifications are also required
to use items extracted from content in a Nutch-Solr (or other Lucene-based)
pipeline. Conceptually, Scrapy is much more efficient in network operations:
it has an asynchronous select()/epoll-based HTTP client and a connection
pool. This could be improved in Nutch.

This would make writing and debugging custom scraping code amazingly easy.
Plus, Nutch would serve as a crawl frontier for Scrapy, and Tika-based parsing
and indexing primitives could be used for building search.
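
To illustrate the process wrapper and input/output adapters from option 2,
here is a rough sketch that assumes a Hadoop Streaming style contract (one
record per line in, tab-separated key and value out). That contract is my
assumption, not something Nutch prescribes, and fake_fetch is a placeholder
for the real step that would run a Scrapy spider.

```python
import json
from typing import Callable, Iterable, Iterator


def fake_fetch(url: str) -> dict:
    """Placeholder for the real fetch step (e.g. driving a Scrapy spider)."""
    return {"url": url, "status": 200, "body": ""}


def wrap_fetcher(lines: Iterable[str],
                 fetch: Callable[[str], dict] = fake_fetch) -> Iterator[str]:
    """Input adapter + output adapter around a fetch function.

    Input:  one record per line, the URL in the first tab-separated field.
    Output: 'url<TAB>json' lines, one per fetched URL.
    """
    for line in lines:
        url = line.rstrip("\n").split("\t")[0].strip()
        if not url:
            continue                    # skip blank lines
        record = fetch(url)
        yield f"{url}\t{json.dumps(record, sort_keys=True)}\n"
```

As a standalone process this would be driven with something like
sys.stdout.writelines(wrap_fetcher(sys.stdin)), so the Hadoop side only sees
text on stdin and stdout.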

3. Frontera’s DB and strategy workers can be used in a Hadoop/Nutch pipeline,
with slight modifications, to generate Nutch segments and read fetcher output.
It’s possible to generate quite a big segment by continuously running the
get_next_requests() routine (meant to be used for small batches). The workers
use low-level storage; currently HBase and RDBMS are supported. The number of
workers can be scaled; they’re designed for this. The same problems apply
here: adapters and process wrappers are needed. An RDBMS could suffer from
concurrent access, but that’s solvable.

This would allow using Frontera as a crawl frontier with Nutch. It could be
helpful if someone wants to implement a crawling strategy in Python.
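
The segment generation in option 3 can be sketched as a loop that keeps asking
for small batches until the segment is full. InMemoryFrontier below is a
hypothetical stand-in for a real backend; only the get_next_requests() name
comes from Frontera, and the single max_next_requests argument is a
simplifying assumption for this example.

```python
from collections import deque
from typing import Iterable, List


class InMemoryFrontier:
    """Hypothetical stand-in for a frontier backend (HBase, RDBMS, ...)."""

    def __init__(self, urls: Iterable[str]) -> None:
        self._queue = deque(urls)

    def get_next_requests(self, max_next_requests: int) -> List[str]:
        """Return up to max_next_requests URLs, removing them from the queue."""
        batch = []
        while self._queue and len(batch) < max_next_requests:
            batch.append(self._queue.popleft())
        return batch


def build_segment(frontier, segment_size: int, batch_size: int = 256) -> List[str]:
    """Accumulate small batches into one large segment."""
    segment: List[str] = []
    while len(segment) < segment_size:
        want = min(batch_size, segment_size - len(segment))
        batch = frontier.get_next_requests(want)
        if not batch:
            break                       # frontier has nothing more to offer
        segment.extend(batch)
    return segment
```

A real backend may return an empty batch only temporarily, so a production
loop would probably retry with a timeout instead of stopping at the first
empty result.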

Nutch and Frontera use cases don’t completely overlap. The majority of people
who look into Frontera want to crawl a small number of websites, scrape some
data from them and revisit. Sometimes they need to scale the fetching (meaning
no polite crawling here) or the parsing/scraping part, and sometimes they need
some custom prioritization or external queue management. Quite few are using
it for broad crawling with Kafka and HBase.

I would appreciate it if you could write up your view of the major Nutch use
cases, so we could compare.

It’s up to us which direction to choose, but I think options 1 and 2 are the
most important.

Currently, Frontera is moving towards ease of use: ZeroMQ transport,
transport layer abstraction, a standalone Frontera/Scrapy-based crawler in
Docker, and a web UI.


A.

> On 28 Oct 2015, at 16:46, Mattmann, Chris A (3980)
> <[email protected]> wrote:
> 
> Hi Alex,
> 
> I didn’t see any more traffic about this. Are you still looking
> for feedback? Are there any plans to make Frontera and Nutch
> work together?
> 
> I’m still interested of course. Thanks.
> 
> Thanks,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Alexander Sibiryakov <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, October 2, 2015 at 8:33 AM
> To: "[email protected]" <[email protected]>
> Subject: Frontera: large-scale, distributed web crawling framework
> 
>> Hi Nutch users!
>> 
>> For the last 8 months at Scrapinghub we’ve been working on a new web
>> crawling framework called Frontera. It is a distributed implementation of
>> the crawl frontier part of a web crawler, the component which decides what
>> to crawl next, when, and when to stop. So it’s not a complete web crawler.
>> However, it suggests an overall crawler design. There is a clean and tested
>> way to build such a crawler in half a day from existing components.
>> 
>> Here is a list of main features:
>> - Online operation: scheduling of new batches, updating of DB state. No
>>   need to stop crawling to change the crawling strategy.
>> - Storage abstraction: write your own backend (SQLAlchemy and HBase
>>   backends are included).
>> - Canonical URL resolution abstraction: each document has many URLs, so
>>   which one to use? We provide a place where you can code your own logic.
>> - Scrapy ecosystem: good documentation, big community, ease of
>>   customization.
>> - Communication layer is Apache Kafka: easy to plug in somewhere and debug.
>> - Crawling strategy abstraction: the crawling goal, URL ordering and
>>   scoring model are coded in a separate module.
>> - Polite by design: each website is downloaded by at most one spider
>>   process.
>> - Workers are implemented in Python.
>> In general, such a web crawler should be very easy to customize and easy
>> to plug into existing infrastructure, and its online operation could be
>> useful for crawling frequently changing web pages, news websites for
>> example. We tested it at some scale by crawling part of the Spanish
>> internet; you can find details in my presentation.
>> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>> 
>> The project is currently on GitHub; it’s open source, under its own
>> license.
>> https://github.com/scrapinghub/frontera
>> https://github.com/scrapinghub/distributed-frontera
>> 
>> The questions are: what do you guys think? Is this a useful thing? If yes,
>> what kind of use cases do you see? Currently, I’m looking for businesses
>> that can benefit from it, so please write to me if you have any ideas on
>> that.
>> 
>> A.
>
