Hi,
> 1. Does Nutch 2.x's architecture alleviate some of this issue?
That is/was the objective of Nutch 2.x, inspired by the Bigtable [1]
and Percolator [2] papers.
> 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> "Storage and Data Flow" diagram
Don't know. But it's pretty simple: everything is stored in one table.
Rows are pages, and from the code it is clear which fields/columns are
accessed (read or written) by each step/tool, e.g. by the ParserJob:
  static {
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.HEADERS);
    FIELDS.add(WebPage.Field.SITEMAPS);
    FIELDS.add(WebPage.Field.STM_PRIORITY);
  }
That's a notable simplification compared to 1.x where it is really hard
to understand the data flow.
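To illustrate the "one table, each job declares its columns" idea, here is
a rough, self-contained sketch. It is not Nutch code: the Field enum is only
a stand-in for WebPage.Field (limited to the constants in the ParserJob
snippet above), and the second job, "HypotheticalStatusReport", is made up
purely for illustration. Only the ParserJob field list is taken from the
real code.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FieldAccessSketch {
    // Stand-in for WebPage.Field (assumption: only the constants quoted above).
    enum Field {
        STATUS, CONTENT, CONTENT_TYPE, SIGNATURE, MARKERS, PARSE_STATUS,
        OUTLINKS, METADATA, HEADERS, SITEMAPS, STM_PRIORITY
    }

    // Job name -> columns of the single page table that the job declares.
    static final Map<String, EnumSet<Field>> JOB_FIELDS = new LinkedHashMap<>();

    static {
        // Field list copied from the ParserJob snippet.
        JOB_FIELDS.put("ParserJob", EnumSet.of(
                Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE,
                Field.MARKERS, Field.PARSE_STATUS, Field.OUTLINKS, Field.METADATA,
                Field.HEADERS, Field.SITEMAPS, Field.STM_PRIORITY));
        // Hypothetical job, invented for this example only.
        JOB_FIELDS.put("HypotheticalStatusReport",
                EnumSet.of(Field.STATUS, Field.MARKERS));
    }

    // Returns the names of all jobs that declare the given column, sorted.
    static Set<String> jobsTouching(Field field) {
        Set<String> jobs = new TreeSet<>();
        for (Map.Entry<String, EnumSet<Field>> e : JOB_FIELDS.entrySet()) {
            if (e.getValue().contains(field)) {
                jobs.add(e.getKey());
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Print a crude "data flow" listing: column -> jobs that access it.
        for (Field f : Field.values()) {
            System.out.println(f + " <- " + jobsTouching(f));
        }
    }
}
```

Collecting the declared field sets of the real 2.x jobs in the same way
would give a per-column view of which step touches what, which is roughly
what a data flow diagram for 2.x would show.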
> 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically
> 2.3?
I guess you know about Julien's "Nutch fight! 1.7 vs 2.2.1" [3].
AFAIK, there's no more recent update.
Sebastian
[1] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach,
Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber,
Robert E., 2006: Bigtable: A distributed storage system for structured
data. In: Proceedings of the 7th Conference on USENIX Symposium on
Operating Systems Design and Implementation (OSDI ’06), vol. 7, pp.
205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf
[2] Peng, Daniel; Dabek, Frank, 2010: Large-scale incremental processing
using distributed transactions and notifications. In: 9th USENIX
Symposium on Operating Systems Design and Implementation, pp. 4–6,
http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf
[3] http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
On 06/17/2016 03:00 PM, Joseph Naegele wrote:
> Hi folks,
>
> I am curious as to whether Nutch 2.x might solve some of the problems we are
> experiencing with Nutch 1.11 at a very large scale (multiple billions of
> URLs). For now, the primary issue is the size of the crawldb and the time it
> takes to update, as well as the time it takes to index individual segments.
> I'm aware of the development on NUTCH-2184, enabling indexing without the
> crawldb, and if we stick with Nutch 1.x I'll rely heavily on that feature.
> We also compute LinkRank, which is very time-consuming, but I imagine that
> won't change much.
>
> 1. Does Nutch 2.x's architecture alleviate some of this issue? I know, for
> example, the updatedb step is intended to be much more efficient using Gora
> rather than reading/writing the entire crawldb using Hadoop data structures.
>
> 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> "Storage and Data Flow" diagram
> (http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690)?
> I found that diagram very helpful in understanding Nutch 1.x segments.
>
> 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically
> 2.3?
>
> I'd love to give 2.x a spin and evaluate it myself, but it would be very
> costly to compare the two at the scale I'm referring to.
>
> Thanks,
> Joe
>