Hi Julien,

We currently generate 10 segments at a time, and it helps quite a bit. The 
costly generate/updatedb jobs haven't been much of a problem, but once we began 
indexing, our large crawldb really slowed us down. We index one segment at a 
time rather than indexing a few hundred segments in one very long-running job. 
I only recently learned that we can index an arbitrary number of segments at a 
time (we were using an older version of Lewis' NUTCH-2184 patch for indexing 
without the crawldb, which was a quick fix and only supported indexing either a 
single segment or all of them at once). Enabling Hadoop output compression, at 
Markus' suggestion, has significantly improved indexing time (now 3-4x faster). 
We are still weighing indexing without the crawldb and retroactively adding URL 
scores to ES against investigating Nutch 2.x. 
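For anyone following along, indexing several segments in one job with stock Nutch 1.x looks roughly like this; the paths and segment names below are placeholders, and the exact options accepted by `bin/nutch index` depend on your version:

```shell
# Index every segment under crawl/segments in a single IndexingJob,
# instead of launching one job per segment. Paths are illustrative.
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

# Or list an arbitrary subset of segments explicitly:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb \
  crawl/segments/20160601000000 crawl/segments/20160602000000
```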

Thanks,
Joe

-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Monday, June 20, 2016 06:02
To: [email protected]
Subject: Re: Nutch 2.x for large-scale crawls

Hi Joseph,

I meant to update the benchmarks for a while but haven't found the time to do 
so. I will probably add StormCrawler to the mix next time.

One thing that helped performance when I was running very large crawls with 
Nutch 1.x was to generate multiple segments in one go, fetch and parse them 
sequentially, then update the crawldb with the whole lot.
This saves you those costly generate and update steps. The number of segments 
to generate is entirely up to you, but even a modest value like 3 or 5 can 
have quite an impact on the performance of the crawler. Do you already do this 
with 1.x?
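A rough sketch of that cycle with the Nutch 1.x CLI follows; the -topN value, segment count, and directory layout are placeholders, and `-dir crawl/segments` assumes the directory holds only the freshly generated batch:

```shell
# Generate several segments in one go (one generate job instead of five),
# then fetch and parse each segment sequentially.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 5

for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# A single updatedb pass over the whole batch of segments:
bin/nutch updatedb crawl/crawldb -dir crawl/segments
```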

Julien

On 17 June 2016 at 21:40, Sebastian Nagel <[email protected]>
wrote:

> Hi,
>
> > 1. Does Nutch 2.x's architecture alleviate some of this issue?
>
> That is/was the objective of Nutch 2.x, inspired by the Bigtable [1] 
> and Percolator [2] papers.
>
> > 2. Does there exist for 2.x any diagrams similar to Sebastian's 
> > Nutch 1.x "Storage and Data Flow" diagram
>
> Don't know. But it's pretty simple - everything is stored in one table.
> Rows are pages, and from the code it is clear which fields/columns are 
> accessed (read or written) by each step/tool, e.g. by the ParserJob:
>   static {
>     FIELDS.add(WebPage.Field.STATUS);
>     FIELDS.add(WebPage.Field.CONTENT);
>     FIELDS.add(WebPage.Field.CONTENT_TYPE);
>     FIELDS.add(WebPage.Field.SIGNATURE);
>     FIELDS.add(WebPage.Field.MARKERS);
>     FIELDS.add(WebPage.Field.PARSE_STATUS);
>     FIELDS.add(WebPage.Field.OUTLINKS);
>     FIELDS.add(WebPage.Field.METADATA);
>     FIELDS.add(WebPage.Field.HEADERS);
>     FIELDS.add(WebPage.Field.SITEMAPS);
>     FIELDS.add(WebPage.Field.STM_PRIORITY);
>   }
> That's a notable simplification compared to 1.x where it is really 
> hard to understand the data flow.
>
> > 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x,
> specifically 2.3?
>
> I guess you know about Julien's "Nutch fight! 1.7 vs 2.2.1" [3]. Afaik, 
> there's no recent update.
>
> Sebastian
>
> [1] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; 
> Wallach, Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; 
> Gruber, Robert E., 2006: Bigtable: A distributed storage system for 
> structured data. In: Proceedings of the 7th USENIX Symposium on 
> Operating Systems Design and Implementation (OSDI ’06), vol. 7, 
> pp. 205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf
>
> [2] Peng, Daniel; Dabek, Frank, 2010: Large-scale incremental 
> processing using distributed transactions and notifications. In: 9th 
> USENIX Symposium on Operating Systems Design and Implementation, 
> pp. 4–6, http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf
>
> [3] 
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
>
> On 06/17/2016 03:00 PM, Joseph Naegele wrote:
> > Hi folks,
> >
> > I am curious as to whether Nutch 2.x might solve some of the problems 
> > we are experiencing with Nutch 1.11 at a very large scale (multiple 
> > billions of URLs). For now, the primary issue is the size of the 
> > crawldb and the time it takes to update, as well as the time it takes 
> > to index individual segments. I'm aware of the development on 
> > NUTCH-2184, enabling indexing without the crawldb, and if we stick 
> > with Nutch 1.x I'll rely heavily on that feature. We also compute 
> > LinkRank, which is very time-consuming, but I imagine that won't 
> > change much.
> >
> > 1. Does Nutch 2.x's architecture alleviate some of this issue? I 
> > know, for example, the updatedb step is intended to be much more 
> > efficient using Gora rather than reading/writing the entire crawldb 
> > using Hadoop data structures.
> >
> > 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 
> > 1.x "Storage and Data Flow" diagram (
> > http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690
> > )? I found that diagram very helpful in understanding Nutch 1.x 
> > segments.
> >
> > 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, 
> > specifically 2.3?
> >
> > I'd love to give 2.x a spin and evaluate it myself, but it would be 
> > very costly to compare the two at the scale I'm referring to.
> >
> > Thanks,
> > Joe
> >
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
