Re: Nutch 2.x for large-scale crawls

Sebastian Nagel Fri, 17 Jun 2016 13:41:10 -0700

Hi,

> 1. Does Nutch 2.x's architecture alleviate some of this issue?


That is/was the objective of Nutch 2.x, inspired by the Bigtable [1]
and Percolator [2] papers.

> 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> "Storage and Data Flow" diagram

I don't know of any. But it's pretty simple: everything is stored in one
table. Rows are pages, and the code makes clear which fields/columns each
step/tool accesses (reads or writes), e.g. for the ParserJob:
  static {
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.HEADERS);
    FIELDS.add(WebPage.Field.SITEMAPS);
    FIELDS.add(WebPage.Field.STM_PRIORITY);
  }
That's a notable simplification compared to 1.x, where it is really hard
to understand the data flow.
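
To illustrate how such per-job field lists can stand in for a data flow
diagram, here is a minimal, self-contained sketch. It does not use the Nutch
or Gora APIs: the class FieldAccessSketch, its local Field enum, the
jobsTouching helper and the "HypotheticalJob" entry are assumptions made up
for illustration. Only the ParserJob field list is taken from the snippet
above.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class FieldAccessSketch {

  // Local stand-in for WebPage.Field: one constant per column of the
  // single web page table (only the columns named above).
  public enum Field {
    STATUS, CONTENT, CONTENT_TYPE, SIGNATURE, MARKERS, PARSE_STATUS,
    OUTLINKS, METADATA, HEADERS, SITEMAPS, STM_PRIORITY
  }

  // Job name -> columns that job declares it accesses.
  public static final Map<String, Set<Field>> FIELDS_BY_JOB =
      new LinkedHashMap<>();

  static {
    // Field list copied from the ParserJob snippet quoted above.
    FIELDS_BY_JOB.put("ParserJob", EnumSet.of(
        Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE,
        Field.MARKERS, Field.PARSE_STATUS, Field.OUTLINKS, Field.METADATA,
        Field.HEADERS, Field.SITEMAPS, Field.STM_PRIORITY));
    // ASSUMPTION: invented job, present only to show a second row.
    FIELDS_BY_JOB.put("HypotheticalJob", EnumSet.of(
        Field.STATUS, Field.MARKERS, Field.OUTLINKS));
  }

  // Returns the names of all jobs that declare access to the given column,
  // in the order the jobs were registered.
  public static Set<String> jobsTouching(Field field) {
    Set<String> jobs = new LinkedHashSet<>();
    for (Map.Entry<String, Set<Field>> e : FIELDS_BY_JOB.entrySet()) {
      if (e.getValue().contains(field)) {
        jobs.add(e.getKey());
      }
    }
    return jobs;
  }

  public static void main(String[] args) {
    // Print a column-by-job listing, i.e. a textual "data flow" table.
    for (Field f : Field.values()) {
      System.out.println(f + " <- " + jobsTouching(f));
    }
  }
}
```

Filling the map from the real FIELDS sets of each 2.x job should give a
rough textual equivalent of the 1.x diagram, though it would not
distinguish reads from writes.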

> 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically 
> 2.3?

I guess you know about Julien's "Nutch fight! 1.7 vs 2.2.1" [3].
AFAIK, there has been no more recent update.

Sebastian

[1] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach,
Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber,
Robert E., 2006: Bigtable: A distributed storage system for structured
data. In: Proceedings of the 7th Conference on USENIX Symposium on
Operating Systems Design and Implementation (OSDI ’06), vol. 7, pp.
205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf

[2] Peng, Daniel; Dabek, Frank, 2010: Large-scale incremental processing
using distributed transactions and notifications. In: 9th USENIX
Symposium on Operating Systems Design and Implementation, pp. 4–6,
http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf

[3] http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html


On 06/17/2016 03:00 PM, Joseph Naegele wrote:
> Hi folks,
> 
> I am curious as to whether Nutch 2.x might solve some of the problems we are
> experiencing with Nutch 1.11 at a very large scale (multiple billions of
> URLs). For now, the primary issue is the size of the crawldb and the time it
> takes to update, as well as the time it takes to index individual segments.
> I'm aware of the development on NUTCH-2184, enabling indexing without the
> crawldb, and if we stick with Nutch 1.x I'll rely heavily on that feature.
> We also compute LinkRank, which is very time-consuming, but I imagine that
> won't change much.
> 
> 1. Does Nutch 2.x's architecture alleviate some of this issue? I know, for
> example, the updatedb step is intended to be much more efficient using Gora
> rather than reading/writing the entire crawldb using Hadoop data structures.
> 
> 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> "Storage and Data Flow" diagram
> (http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690)?
> I found that diagram very helpful in understanding Nutch 1.x segments.
> 
> 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically
> 2.3?
> 
> I'd love to give 2.x a spin and evaluate it myself, but it would be very
> costly to compare the two at the scale I'm referring to.
> 
> Thanks,
> Joe
>
