Joseph - LinkRank and LinkDB are not related to each other. LinkRank scores the WebGraph; the LinkDB is created with invertlinks.
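The two pipelines Markus describes are driven by separate commands. A rough sketch, assuming a standard Nutch 1.x checkout (the `crawl/...` paths are illustrative, not from the thread):

```shell
# LinkRank pipeline: build the WebGraph, score it, push scores into the CrawlDB.
# The LinkDB is never touched here.
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
bin/nutch linkrank -webgraphdb crawl/webgraphdb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb

# LinkDB (inverted links / anchor texts) is built independently:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```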
In any case, consider enabling Hadoop sequence file compression. It greatly reduces CrawlDB size and increases throughput. The CrawlDB is very well suited to compression.

Markus

-----Original message-----
> From: Joseph Naegele <[email protected]>
> Sent: Monday 13th June 2016 22:16
> To: [email protected]
> Subject: RE: improving distributed indexing performance
>
> Sebastian,
>
> Thanks! That explains a lot. We're computing LinkRank and I don't specify a
> LinkDB to the indexer. Our CrawlDB is very large, however, so yes, I'm very
> interested in NUTCH-2184. I'm planning to finish helping with
> https://github.com/apache/nutch/pull/95.
>
> 1. A related question: Is it possible to index without a CrawlDB but still
> use LinkRank scores? This is exactly what I need to do.
>
> 2. On a similar note, I believe there is another issue related to indexing
> with LinkRank scores. If no scoring plugins are configured, then
> IndexerMapReduce sets each document's "boost" value to 1.0f. I was under the
> impression I shouldn't use a scoring filter when computing LinkRank (see
> http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), but in
> reality I should use an "identity scoring filter" that just reads the score
> from the CrawlDatum, correct? (If so, then I've answered question 1: the
> CrawlDB *is* necessary for indexing with LinkRank.)
>
> Thanks,
> Joe
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Monday, June 13, 2016 13:35
> To: [email protected]
> Subject: Re: improving distributed indexing performance
>
> Hi Joseph,
>
> you're right, the mapper does not do much; all potentially heavy computations
> in the index or scoring filters are run in the reduce step.
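Markus's suggestion at the top of this message, enabling sequence file compression, is configured in Hadoop rather than Nutch. One possible sketch for `mapred-site.xml`; the property names are standard Hadoop 2.x, and the choice of Snappy as codec is an assumption, not something from the thread:

```xml
<!-- Sketch: compress MapReduce job output (e.g. CrawlDB sequence files) -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Also compress intermediate map output to cut shuffle volume -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```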
> > > https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>
> There are 5 billion records passed through the map step:
>   Map input records=5115370813
>   Map output records=5115370813
>   Reduce input records=5115370813
>   Reduce output records=2401924
>
> That would mean that either your segment contains a large number of
> "unindexable" documents or crawldb and/or linkdb are quite large.
> In the latter case, you could try not to use them for indexing.
> LinkDb has been optional for a long time; for the CrawlDb there is
> https://issues.apache.org/jira/browse/NUTCH-2184
>
> Sebastian
>
> On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> > Hi folks,
> >
> > I'm in the process of indexing a large number of docs using Nutch 1.11
> > and the indexer-elastic plugin. I've observed slow indexing
> > performance and narrowed it down to the map phase and the first part of
> > the reduce phase taking 80% of the total runtime per segment. Here are
> > some statistics:
> >
> > - The average segment contains around 2.4M "indexable" URLs, meaning
> >   successfully fetched and parsed.
> > - Using a 9-datanode Hadoop cluster running on 4-CPU, 16 GB RAM EC2
> >   machines.
> > - Time to index 2.4M URLs (one segment): around 3.25 hours.
> > - Actual time spent sending docs to Elasticsearch: 0.75 hours.
> > - No additional indexing options specified, i.e. *not* filtering or
> >   normalizing URLs, etc.
> >
> > This means that for every segment, 2.5 hours is spent splitting
> > inputs, shuffling, and whatever else Hadoop does, and only about 40
> > minutes is spent actually sending docs to ES. From another
> > perspective, that means we could expect an indexing rate of 1000
> > docs/sec, but the effective rate is only 200 docs/sec.
> >
> > I fully understand Nutch's indexer code, so I know that it actually
> > does very little in both the Map and Reduce phases (the map phase does
> > almost nothing since I'm not filtering/normalizing URLs), so my best
> > guess is that there's just a ton of Hadoop overhead.
> > Is it possible to optimize this?
> >
> > I've included below a link to a gist containing job output and
> > counters for a single segment, hoping that it will provide some hints.
> > For example, is it normal that indexing segments of this size requires
> > more than 5000 input splits? I imagine that's far too many Map tasks.
> >
> > https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
> >
> > Thanks for taking a look,
> > Joe
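Joseph's 1000 vs. 200 docs/sec figures follow directly from the numbers he quotes; a quick arithmetic check (values taken from his statistics above):

```python
# Throughput check for the figures quoted in the thread.
docs = 2_400_000           # "indexable" URLs per segment
total_hours = 3.25         # wall-clock time to index one segment
es_hours = 0.75            # time actually spent sending docs to Elasticsearch

effective_rate = docs / (total_hours * 3600)   # end-to-end docs/sec
send_rate = docs / (es_hours * 3600)           # docs/sec while talking to ES

print(round(effective_rate))  # ~205 docs/sec -- the "only 200 docs/sec"
print(round(send_rate))       # ~889 docs/sec -- roughly the "1000 docs/sec"
```

So about 80% of the wall-clock time per segment is Hadoop overhead rather than Elasticsearch indexing, which is exactly the gap the thread is trying to close.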

