Re: indexing cpu utilization

Uwe Reh Wed, 02 Jan 2013 13:39:46 -0800

Hi,

while trying to optimize our indexing workflow I reached the same point that gabriel shen described in his mail: my Solr server won't utilize more than 40% of the computing power. I made some tests, but I'm not able to find the bottleneck. Could anybody help to solve this puzzle?


At first let me describe the environment:

Server:
- Two-socket Opteron (Interlagos) => 32 cores
- 64 GB RAM (1600 MHz)
- SATA Disks: spindle and ssd
- Solaris 5.11
- JRE 1.7.0
- Solr 4.0
- ApplicationServer Jetty
- 1 Gbit network interface

Client:
- same hardware as the server

- either a multi-threaded solrj client using multiple instances of HttpSolrServer

- or a multi-threaded solrj client using a ConcurrentUpdateSolrServer with 100 threads


Problem:
- 10,000,000 docs of bibliographic data (~4k each)

- with a simplified schema definition it takes 10 hours to index <=> ~250 docs/second

- with the real schema.xml it takes 50 hours to index <=> ~50 docs/second

In both cases the client takes just 2% of the CPU resources and the server 35%. It's obvious that there is some optimization potential in the schema definition, but why does the server never use more than 40% of the CPU power?
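As a quick sanity check of the rates quoted above, here is a minimal sketch that recomputes the average throughput from the document count and the two run times. The function and variable names are only illustrative; the inputs are the figures from this mail.

```python
# Back-of-the-envelope check of the indexing rates quoted in the mail.
TOTAL_DOCS = 10_000_000  # number of bibliographic records


def docs_per_second(total_docs: int, hours: float) -> float:
    """Average indexing rate for a run of the given duration."""
    return total_docs / (hours * 3600)


simple_rate = docs_per_second(TOTAL_DOCS, 10)  # simplified schema, 10 hours
real_rate = docs_per_second(TOTAL_DOCS, 50)    # real schema.xml, 50 hours

print(f"simplified schema: {simple_rate:.0f} docs/s")
print(f"real schema:       {real_rate:.0f} docs/s")
print(f"slowdown factor:   {simple_rate / real_rate:.1f}")
```

The exact averages come out slightly above the rounded figures in the list (about 278 and 56 docs/second), and the slowdown between the two schemas is the factor of 5 referred to further down.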



Discarded possible bottlenecks:
- Ram for the JVM

Solr takes only up to 12G of heap and there is just negligible GC activity. Accordingly, increasing the maximum heap from 16G to 32G made no difference.

- Bandwidth of the net

The transmitted data is identical in both cases. The size of the transmitted data is somewhat below 50G. Since both machines have a dedicated 1G line to the switch, the raw transmission should not take much more than 10 minutes.

- Performance of the client

As above, the client is fast enough for the simplified case (10h). A dry run (just preprocessing, no indexing) can finish in 75 minutes.

- Server's disk IO

The size of the simpler index is ~100G, the size of the other is ~150G. This makes a factor of 1.5, not 5. The difference between an SSD and a spinning disk is not noticeable. The output of 'iostat' and 'zpool iostat' is unsuspicious.

- Bad thread distribution

'mpstat' shows a well distributed load over all CPUs and a sensible amount of cross-calls (less than ten/cpu).

- Solr update parameter (solrconfig.xml)

Inspired by http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1 I'm using:

<ramBufferSizeMB>256</ramBufferSizeMB>
<mergeFactor>40</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<lockType>native</lockType>
<unlockOnStartup>true</unlockOnStartup>

Any changes to these parameters made it worse.
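For the "Bandwidth of the net" item in the list above, the transfer-time estimate can be reproduced with a small sketch. It assumes that "50G" means 50 gigabytes in decimal units and that "1G" means a 1 Gbit/s link, and it ignores protocol overhead; the names are illustrative only.

```python
# Rough lower bound for moving the indexing payload over the network.
# Assumptions: decimal units, 1 Gbit/s link, no protocol overhead.

def transfer_seconds(payload_bytes: float, link_bits_per_second: float) -> float:
    """Time to push payload_bytes through a link of the given raw speed."""
    return payload_bytes * 8 / link_bits_per_second


payload_bytes = 50e9          # "somewhat below 50G" of transmitted data
link_bits_per_second = 1e9    # dedicated 1 Gbit/s line to the switch

seconds = transfer_seconds(payload_bytes, link_bits_per_second)
minutes = seconds / 60
share_of_fast_run = seconds / (10 * 3600)  # compared to the 10 hour run

print(f"raw transfer time: {minutes:.1f} minutes")
print(f"share of the 10 h run: {share_of_fast_run:.1%}")
```

Under these assumptions the raw transfer takes under seven minutes, roughly one percent of even the faster 10 hour run, which matches the conclusion that the network is not the limiting factor.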

To get an idea of what's going on, I've done some statistics with VisualVM (see attachment). The distribution of real and CPU time looks significant, but I'm not smart enough to interpret the results. The method org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock() is active 80% of the time but takes only 1% of the CPU time. On the other hand, the second method, org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append(), is active 12% of the time and is always running on a CPU.

So again the question: "When there are free resources in all dimensions, why does Solr not utilize more than 40% of the computing power?"

Bandwidth of the RAM?? I can't believe this. How to verify?
???

Any hints are welcome.
Uwe
