Hello Joseph - you only need the LinkDB (invertlinks) if you want to use the 
index-anchors plugin, which indexes the hyperlink anchors of referring pages 
and can be very useful. You need the WebGraph (and LinkRank) if you want to 
calculate scores and write them back to the CrawlDB before indexing.
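
For reference, the two pipelines look roughly like this in Nutch 1.x (the paths are examples; check `bin/nutch` in your version for the exact options):

```shell
# LinkDB: only needed by plugins such as index-anchors
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# WebGraph + LinkRank: build the link graph, score it, then write
# the scores back into the CrawlDB before indexing
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
bin/nutch linkrank -webgraphdb crawl/webgraphdb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```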

Both are very resource intensive and you might not need them. We only use 
LinkRank in very specific situations and rarely use the LinkDB. We don't use 
either in the regular crawls for our site-search engines.

Markus
 
-----Original message-----
> From:Joseph Naegele <[email protected]>
> Sent: Tuesday 14th June 2016 14:37
> To: [email protected]
> Subject: RE: improving distributed indexing performance
> 
> Thanks Markus, I'll start with Hadoop's sequence file compression.
> 
> To clarify, if we're fetching, parsing, then building a webgraph and 
> computing LinkRank on all segments before indexing, there's no reason I need 
> to create a LinkDB, correct?
> 
> Joe
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]] 
> Sent: Tuesday, June 14, 2016 06:52
> To: [email protected]
> Subject: RE: improving distributed indexing performance
> 
> Joseph - LinkRank and LinkDB are not related to each other. LinkRank scores 
> the WebGraph; the LinkDB is created with invertlinks.
> 
> In any case, consider enabling Hadoop sequence file compression. It greatly 
> reduces CrawlDB size and increases throughput. The CrawlDB is very suitable 
> for compression.
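
For illustration, sequence-file output compression is enabled through the standard Hadoop properties (Hadoop 2.x property names shown; the Snappy codec is just one example and needs the native library to be installed):

```xml
<!-- mapred-site.xml: compress intermediate map output and job output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```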
> 
> Markus 
>  
> -----Original message-----
> > From:Joseph Naegele <[email protected]>
> > Sent: Monday 13th June 2016 22:16
> > To: [email protected]
> > Subject: RE: improving distributed indexing performance
> > 
> > Sebastian,
> > 
> > Thanks! That explains a lot. We're computing LinkRank and I don't specify a 
> > LinkDB to the indexer. Our CrawlDB is very large, however, so yes, I'm very 
> > interested in NUTCH-2184. I'm planning to help finish 
> > https://github.com/apache/nutch/pull/95.
> > 
> > 1. A related question: Is it possible to index without a crawldb but still 
> > use LinkRank scores? This is exactly what I need to do.
> > 
> > 2. On a similar note, I believe there is another issue related to 
> > indexing with LinkRank scores. If no scoring plugins are configured, 
> > then IndexerMapReduce sets each document's "boost" value to 1.0f. I 
> > was under the impression I shouldn't use a scoring filter when 
> > computing LinkRank (see 
> > http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), 
> > but in reality I should use an "identity scoring filter", that just 
> > reads the score from the CrawlDatum, correct? (If so, then I've 
> > answered question 1: crawldb *is* necessary for indexing with 
> > LinkRank)
> > 
> > Thanks,
> > Joe
> > 
> > -----Original Message-----
> > From: Sebastian Nagel [mailto:[email protected]]
> > Sent: Monday, June 13, 2016 13:35
> > To: [email protected]
> > Subject: Re: improving distributed indexing performance
> > 
> > Hi Joseph,
> > 
> > you're right, the mapper does not do much; all potentially heavy 
> > computations in the index or scoring filters run in the reduce step.
> > 
> > > https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
> > There are 5 billion records passed through the map step:
> >  Map input records=5115370813
> >  Map output records=5115370813
> >  Reduce input records=5115370813
> >  Reduce output records=2401924
> > 
> > That would mean that either your segment contains a large number of 
> > "unindexable" documents or crawldb and/or linkdb are quite large.
> > In the latter case, you could try not to use them for indexing.
> > The LinkDb has been optional for a long time; for the CrawlDb there is
> >   https://issues.apache.org/jira/browse/NUTCH-2184
> > 
> > Sebastian
> > 
> > On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> > > Hi folks,
> > > 
> > > I'm in the process of indexing a large number of docs using Nutch 
> > > 1.11 and the indexer-elastic plugin. I've observed slow indexing 
> > > performance and narrowed it down to the map phase and first part of 
> > > the reduce phase taking 80% of the total runtime per segment. Here are 
> > > some statistics:
> > > 
> > > - Average segment contains around 2.4M "indexable" URLs, meaning 
> > > successfully fetched and parsed.
> > > - Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2 
> > > machines.
> > > - Time to index 2.4M URLs (one segment): around 3.25 hours.
> > > - Actual time spent sending docs to Elasticsearch: 0.75 hours.
> > > - No additional indexing options specified, i.e. *not* filtering or 
> > > normalizing URLs, etc.
> > > 
> > > This means that for every segment, 2.5 hours is spent splitting 
> > > inputs, shuffling, whatever else Hadoop does, and only about 40 
> > > minutes is spent actually sending docs to ES. From another 
> > > perspective, that means we can expect an indexing rate of 1000 
> > > docs/sec, but the effective rate is only 200 docs/sec.
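
A quick back-of-envelope check of those rates, using the figures quoted above (2.4M docs, 3.25 hours total, 0.75 hours in Elasticsearch):

```python
# Rough sanity check of the per-segment indexing rates described above.
docs = 2_400_000       # indexable URLs in one segment
total_hours = 3.25     # wall-clock time to index the segment
es_hours = 0.75        # time actually spent sending docs to Elasticsearch

effective_rate = docs / (total_hours * 3600)   # end-to-end docs/sec
es_rate = docs / (es_hours * 3600)             # docs/sec while writing to ES
overhead_hours = total_hours - es_hours        # time lost to everything else
```

This works out to roughly 205 docs/sec end to end versus roughly 890 docs/sec while actually writing, with about 2.5 hours of per-segment overhead, consistent with the ~200 and ~1000 docs/sec figures above.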
> > > 
> > > I fully understand Nutch's indexer code, so I know that it actually 
> > > does very little in both the Map and Reduce phase (the map phase 
> > > does almost nothing since I'm not filtering/normalizing URLs), so my 
> > > best guess is that there's just a ton of Hadoop overhead. Is it possible 
> > > to optimize this?
> > > 
> > > I've included below a link to a gist containing job output and 
> > > counters for a single segment, hoping that it will provide some hints.
> > > For example, is it normal that indexing segments of this size 
> > > requires >5000 input splits? I imagine that's far too many Map tasks.
> > > 
> > > https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
> > > 
> > > Thanks for taking a look,
> > > Joe
> > > 
> > 
> > 
> > 
> 
> 
