Hi folks,

I'm in the process of indexing a large number of docs using Nutch 1.11 and
the indexer-elastic plugin. I've observed slow indexing performance and
narrowed it down to the map phase and first part of the reduce phase taking
80% of the total runtime per segment. Here are some statistics:

- Average segment contains around 2.4M "indexable" URLs, meaning
successfully fetched and parsed.
- Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2
machines.
- Time to index 2.4M URLs (one segment): around 3.25 hours.
- Actual time spent sending docs to Elasticsearch: 0.75 hours (45 minutes).
- No additional indexing options specified, i.e. *not* filtering or
normalizing URLs, etc.

This means that for every segment, roughly 2.5 hours go to splitting inputs,
shuffling, and whatever else Hadoop does, and only about 45 minutes is spent
actually sending docs to ES. Put another way, the ES-only rate works out to
roughly 900 docs/sec (2.4M docs in 45 minutes), but the effective end-to-end
rate is only about 200 docs/sec (2.4M docs in 3.25 hours).

I know Nutch's indexer code reasonably well, and it does very little in
either the map or reduce phase (the map phase does almost nothing here since
I'm not filtering/normalizing URLs), so my best guess is that the time is
going to plain Hadoop overhead. Is it possible to optimize this?
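
One knob I've been wondering about, in case it's relevant (the property name
is taken from the Hadoop docs and the value below is just illustrative):
raising the minimum split size so each map task covers more of the segment,
e.g. in mapred-site.xml:

  <!-- illustrative only: don't create splits smaller than 256 MB,
       which should reduce the number of map tasks per segment -->
  <property>
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>268435456</value>
  </property>

But I'd rather understand whether the split count is actually the problem
before tuning blindly.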

I've included below a link to a gist containing job output and counters for
a single segment, hoping that it will provide some hints. For example, is it
normal that indexing segments of this size requires > 5000 input splits? I
imagine that's far too many Map tasks.

https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284

Thanks for taking a look,
Joe
