Sebastian,

Thanks! That explains a lot. We're computing LinkRank, and I don't specify a LinkDB to the indexer. Our CrawlDB is very large, however, so yes, I'm very interested in NUTCH-2184. I'm planning to help finish https://github.com/apache/nutch/pull/95.
1. A related question: is it possible to index without a CrawlDB but still use LinkRank scores? This is exactly what I need to do.

2. On a similar note, I believe there is another issue related to indexing with LinkRank scores. If no scoring plugins are configured, then IndexerMapReduce sets each document's "boost" value to 1.0f. I was under the impression that I shouldn't use a scoring filter when computing LinkRank (see http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), but in reality I should use an "identity scoring filter" that just reads the score from the CrawlDatum, correct? (If so, then I've answered question 1: the CrawlDB *is* necessary for indexing with LinkRank.)

Thanks,
Joe

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Monday, June 13, 2016 13:35
To: [email protected]
Subject: Re: improving distributed indexing performance

Hi Joseph,

you're right that the mapper does not do much; all potentially heavy computations in the index or scoring filters are run in the reduce step.

> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284

There are 5 billion records passed through the map step:

  Map input records=5115370813
  Map output records=5115370813
  Reduce input records=5115370813
  Reduce output records=2401924

That would mean that either your segment contains a large number of "unindexable" documents, or the crawldb and/or linkdb are quite large. In the latter case, you could try not to use them for indexing. LinkDb has been optional for a long time; for the CrawlDb there is https://issues.apache.org/jira/browse/NUTCH-2184

Sebastian

On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> Hi folks,
>
> I'm in the process of indexing a large number of docs using Nutch 1.11
> and the indexer-elastic plugin. I've observed slow indexing
> performance and narrowed it down to the map phase and the first part of
> the reduce phase taking 80% of the total runtime per segment.
> Here are some statistics:
>
> - Average segment contains around 2.4M "indexable" URLs, meaning
>   successfully fetched and parsed.
> - Using a 9-datanode Hadoop cluster running on 4-CPU, 16 GB RAM EC2
>   machines.
> - Time to index 2.4M URLs (one segment): around 3.25 hours.
> - Actual time spent sending docs to Elasticsearch: 0.75 hours.
> - No additional indexing options specified, i.e. *not* filtering or
>   normalizing URLs, etc.
>
> This means that for every segment, 2.5 hours is spent splitting
> inputs, shuffling, and whatever else Hadoop does, and only about 40
> minutes is spent actually sending docs to ES. From another
> perspective, that means we should expect an indexing rate of 1000
> docs/sec, but the effective rate is only 200 docs/sec.
>
> I fully understand Nutch's indexer code, so I know that it actually
> does very little in both the Map and Reduce phases (the map phase does
> almost nothing since I'm not filtering/normalizing URLs), so my best
> guess is that there's just a ton of Hadoop overhead. Is it possible to
> optimize this?
>
> I've included below a link to a gist containing job output and
> counters for a single segment, hoping that it will provide some hints.
> For example, is it normal that indexing segments of this size requires
> >5000 input splits? I imagine that's far too many Map tasks.
>
> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>
> Thanks for taking a look,
> Joe
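P.S. To make question 2 above concrete, here is a minimal standalone sketch of the "identity scoring filter" behavior I have in mind. The Nutch types are stubbed out here so it compiles on its own; a real plugin would implement org.apache.nutch.scoring.ScoringFilter, whose indexerScore method takes more arguments (url, document, fetch datum, inlinks). The core idea is just: pass the CrawlDatum score (e.g. the LinkRank value) through as the boost instead of the default 1.0f.

```java
// Standalone sketch: "identity" indexer scoring. Stub types stand in
// for the real Nutch classes so the logic is visible in isolation.
public class IdentityScoringSketch {

    /** Stand-in stub for org.apache.nutch.crawl.CrawlDatum. */
    static class CrawlDatum {
        private final float score;
        CrawlDatum(float score) { this.score = score; }
        float getScore() { return score; }
    }

    /**
     * Mirrors the spirit of ScoringFilter#indexerScore: ignore the
     * incoming initScore (the 1.0f default boost) and return the score
     * already stored in the CrawlDatum, so the LinkRank value becomes
     * the document's indexing boost unchanged.
     */
    static float indexerScore(CrawlDatum dbDatum, float initScore) {
        return dbDatum.getScore();
    }

    public static void main(String[] args) {
        CrawlDatum datum = new CrawlDatum(0.73f);
        // the default initScore of 1.0f is ignored
        System.out.println(indexerScore(datum, 1.0f)); // prints 0.73
    }
}
```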
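P.P.S. Regarding the >5000 input splits: one knob that might be worth trying (untested on my side; the value below is illustrative only) is raising the minimum split size so each map task covers more data, either in mapred-site.xml or via -D on the job. Note that SequenceFileInputFormat still creates at least one split per input part file, so if the split count comes from many small crawldb/segment part files, this setting alone may not reduce it much.

```xml
<!-- Illustrative sketch, not a verified fix: larger minimum split
     size means fewer, bigger map tasks (subject to the one-split-
     per-file floor for many small part files). -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>536870912</value> <!-- 512 MB -->
</property>
```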

