Sebastian,

Thanks! That explains a lot. We're computing LinkRank, and I don't specify a LinkDB to the indexer. Our CrawlDB is very large, however, so yes, I'm very interested in NUTCH-2184. I'm planning to help finish https://github.com/apache/nutch/pull/95.
1. A related question: is it possible to index without a CrawlDB but still use LinkRank scores? This is exactly what I need to do.

2. On a similar note, I believe there is another issue related to indexing with LinkRank scores. If no scoring plugins are configured, then IndexerMapReduce sets each document's "boost" value to 1.0f. I was under the impression that I shouldn't use a scoring filter when computing LinkRank (see http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), but in reality I should use an "identity scoring filter" that just reads the score from the CrawlDatum, correct? (If so, then I've answered question 1: the CrawlDB *is* necessary for indexing with LinkRank.)

Thanks,
Joe

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Monday, June 13, 2016 13:35
To: [email protected]
Subject: Re: improving distributed indexing performance

Hi Joseph,

you're right that the mapper does not do much; all potentially heavy computations in the index or scoring filters are run in the reduce step.

> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284

There are 5 billion records passed through the map step:

  Map input records=5115370813
  Map output records=5115370813
  Reduce input records=5115370813
  Reduce output records=2401924

That would mean that either your segment contains a large number of "unindexable" documents, or the crawldb and/or linkdb are quite large. In the latter case, you could try not to use them for indexing. LinkDb has been optional for a long time; for the CrawlDb there is https://issues.apache.org/jira/browse/NUTCH-2184

Sebastian

On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> Hi folks,
>
> I'm in the process of indexing a large number of docs using Nutch 1.11
> and the indexer-elastic plugin. I've observed slow indexing
> performance and narrowed it down to the map phase and the first part of
> the reduce phase taking 80% of the total runtime per segment.
> Here are some statistics:
>
> - Average segment contains around 2.4M "indexable" URLs, meaning
>   successfully fetched and parsed.
> - Using a 9-datanode Hadoop cluster running on 4-CPU, 16 GB RAM EC2
>   machines.
> - Time to index 2.4M URLs (one segment): around 3.25 hours.
> - Actual time spent sending docs to Elasticsearch: 0.75 hours.
> - No additional indexing options specified, i.e. *not* filtering or
>   normalizing URLs, etc.
>
> This means that for every segment, 2.5 hours is spent splitting
> inputs, shuffling, and whatever else Hadoop does, and only about 40
> minutes is spent actually sending docs to ES. From another
> perspective, that means we should expect an indexing rate of 1000
> docs/sec, but the effective rate is only 200 docs/sec.
>
> I fully understand Nutch's indexer code, so I know that it actually
> does very little in both the Map and Reduce phases (the map phase does
> almost nothing since I'm not filtering/normalizing URLs), so my best
> guess is that there's just a ton of Hadoop overhead. Is it possible to
> optimize this?
>
> I've included below a link to a gist containing job output and
> counters for a single segment, hoping that it will provide some hints.
> For example, is it normal that indexing segments of this size requires
> >5000 input splits? I imagine that's far too many Map tasks.
>
> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>
> Thanks for taking a look,
> Joe
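P.S. To make question 2 above concrete, here is a minimal standalone sketch of the "identity scoring filter" behavior I have in mind. The Nutch types are stubbed out here so it compiles on its own; a real plugin would implement org.apache.nutch.scoring.ScoringFilter, whose indexerScore method takes more arguments (url, document, fetch datum, inlinks). The core idea is just: pass the CrawlDatum score (e.g. the LinkRank value) through as the boost instead of the default 1.0f.

```java
// Standalone sketch: "identity" indexer scoring. Stub types stand in
// for the real Nutch classes so the logic is visible in isolation.
public class IdentityScoringSketch {

    /** Stand-in stub for org.apache.nutch.crawl.CrawlDatum. */
    static class CrawlDatum {
        private final float score;
        CrawlDatum(float score) { this.score = score; }
        float getScore() { return score; }
    }

    /**
     * Mirrors the spirit of ScoringFilter#indexerScore: ignore the
     * incoming initScore (the 1.0f default boost) and return the score
     * already stored in the CrawlDatum, so the LinkRank value becomes
     * the document's indexing boost unchanged.
     */
    static float indexerScore(CrawlDatum dbDatum, float initScore) {
        return dbDatum.getScore();
    }

    public static void main(String[] args) {
        CrawlDatum datum = new CrawlDatum(0.73f);
        // the default initScore of 1.0f is ignored
        System.out.println(indexerScore(datum, 1.0f)); // prints 0.73
    }
}
```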
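P.P.S. Regarding the >5000 input splits: one knob that might be worth trying (untested on my side; the value below is illustrative only) is raising the minimum split size so each map task covers more data, either in mapred-site.xml or via -D on the job. Note that SequenceFileInputFormat still creates at least one split per input part file, so if the split count comes from many small crawldb/segment part files, this setting alone may not reduce it much.

```xml
<!-- Illustrative sketch, not a verified fix: larger minimum split
     size means fewer, bigger map tasks (subject to the one-split-
     per-file floor for many small part files). -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>536870912</value> <!-- 512 MB -->
</property>
```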

