Hi folks, I'm in the process of indexing a large number of docs using Nutch 1.11 and the indexer-elastic plugin. I've observed slow indexing performance and narrowed it down to the map phase and the first part of the reduce phase, which together take about 80% of the total runtime per segment. Here are some statistics:
- Average segment contains around 2.4M "indexable" URLs, meaning successfully fetched and parsed.
- Using a 9-datanode Hadoop cluster running on 4-CPU, 16 GB RAM EC2 machines.
- Time to index 2.4M URLs (one segment): around 3.25 hours.
- Actual time spent sending docs to Elasticsearch: 0.75 hours.
- No additional indexing options specified, i.e. *not* filtering or normalizing URLs, etc.

This means that for every segment, roughly 2.5 hours is spent splitting inputs, shuffling, and whatever else Hadoop does, and only about 45 minutes is spent actually sending docs to ES. Put another way, we could expect an indexing rate of about 1000 docs/sec, but the effective rate is only around 200 docs/sec.

I fully understand Nutch's indexer code, so I know that it actually does very little in both the map and reduce phases (the map phase does almost nothing since I'm not filtering/normalizing URLs), so my best guess is that there's just a ton of Hadoop overhead. Is it possible to optimize this?

I've included below a link to a gist containing job output and counters for a single segment, hoping that it will provide some hints. For example, is it normal that indexing segments of this size requires > 5000 input splits? I imagine that's far too many map tasks.

https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284

Thanks for taking a look,
Joe
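P.P.S. One mitigation I've been considering (untested, and the value below is purely illustrative) is forcing Hadoop to create larger, and therefore fewer, input splits by raising the minimum split size, e.g. in mapred-site.xml (or passed with -D on the job):

```xml
<!-- Illustrative only: raise the minimum split size so the indexing job
     creates fewer, larger map tasks per segment. 256 MB shown here;
     tune to your block size and data layout. -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value>
</property>
```

If anyone has tried this (or something like a combined input format) with the Nutch indexer job, I'd love to hear whether it actually moved the needle.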
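P.S. To put a rough number on the split concern, here's the back-of-envelope arithmetic (assuming the split count really is about 5000, per the gist; the true input also includes crawldb/linkdb parts, so this is only a ballpark):

```python
# Back-of-envelope: how much work does each map task get?
docs = 2_400_000   # indexable URLs per segment (from the stats above)
splits = 5000      # approximate input-split count reported in the gist

docs_per_split = docs / splits
print(docs_per_split)  # → 480.0 docs per map task
```

A few hundred docs per map task suggests the per-task startup/teardown overhead could easily dominate the actual indexing work.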

