Hi,

While trying to optimize our indexing workflow, I reached the same point that Gabriel Shen described in his mail: my Solr server won't use more than 40% of the available computing power. I ran some tests, but I'm not able to find the bottleneck. Could anybody help me solve this puzzle?

First, let me describe the environment:

Server:
- Two-socket Opteron (Interlagos) => 32 cores
- 64 GB RAM (1600 MHz)
- SATA disks: spindle and SSD
- Solaris 5.11
- JRE 1.7.0
- Solr 4.0
- Application server: Jetty
- 1 Gbit/s network interface

Client:
- same hardware as the server
- either a multi-threaded SolrJ client using multiple instances of HttpSolrServer, or a multi-threaded SolrJ client using a ConcurrentUpdateSolrServer with 100 threads

Problem:
- 10,000,000 docs of bibliographic data (~4k each)
- with a simplified schema definition it takes 10 hours to index (~250 docs/second)
- with the real schema.xml it takes 50 hours to index (~50 docs/second)
In both cases the client uses just 2% of its CPU resources and the server 35%. There is obviously some optimization potential in the schema definition, but why does the server never use more than 40% of its CPU power?
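
As a quick sanity check on the rates quoted above, the arithmetic can be sketched as follows. The function name docs_per_second is only illustrative, and the inputs are the rounded figures from this mail, so the results are approximate.

```python
def docs_per_second(total_docs: int, hours: float) -> float:
    """Average indexing rate for a run of the given duration."""
    return total_docs / (hours * 3600)


TOTAL_DOCS = 10_000_000  # bibliographic records, as stated above

simple_rate = docs_per_second(TOTAL_DOCS, 10)  # simplified schema, 10 hours
real_rate = docs_per_second(TOTAL_DOCS, 50)    # real schema.xml, 50 hours

print(f"simplified schema: {simple_rate:.0f} docs/s")
print(f"real schema:       {real_rate:.0f} docs/s")
print(f"slowdown factor:   {simple_rate / real_rate:.1f}")
```

The exact values come out slightly above the rounded ~250 and ~50 docs/second, but the factor of 5 between the two schemas is the same.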


Discarded possible bottlenecks:
- RAM for the JVM
Solr uses at most 12G of heap and there is only negligible GC activity, so increasing the maximum heap from 16G to 32G made no difference.
- Network bandwidth
The transmitted data is identical in both cases, and its size is somewhat below 50G. Since both machines have a dedicated 1 Gbit/s line to the switch, the raw transmission should not take much more than 10 minutes.
- Performance of the client
As above, the client is fast enough for the simplified case (10h). A dry run (preprocessing only, no indexing) finishes after 75 minutes.
- Server disk I/O
The simpler index is ~100G in size, the other ~150G. That is a factor of 1.5, not 5. The difference between an SSD and a spinning disk is not noticeable. The output of 'iostat' and 'zpool iostat' shows nothing suspicious.
- Bad thread distribution
'mpstat' shows the load well distributed over all CPUs and a reasonable number of cross-calls (fewer than ten per CPU).
- Solr update parameters (solrconfig.xml)
Inspired by http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1 I'm using:
<ramBufferSizeMB>256</ramBufferSizeMB>
<mergeFactor>40</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<lockType>native</lockType>
<unlockOnStartup>true</unlockOnStartup>
Any change to these parameters made it worse.
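
The back-of-the-envelope reasoning behind the network and disk items above can be sketched like this. It assumes that "50G" means decimal gigabytes and ignores protocol overhead, so the transfer time is a lower bound; the function name transfer_minutes is only illustrative.

```python
def transfer_minutes(payload_bytes: float, link_bits_per_second: float) -> float:
    """Lower bound on transfer time in minutes, ignoring protocol overhead."""
    return payload_bytes * 8 / link_bits_per_second / 60


PAYLOAD_BYTES = 50e9  # "somewhat below 50G", read as decimal gigabytes (assumption)
LINK_BPS = 1e9        # dedicated 1 Gbit/s line to the switch

network_minutes = transfer_minutes(PAYLOAD_BYTES, LINK_BPS)
index_size_ratio = 150 / 100  # ~150G index vs ~100G index
runtime_ratio = 50 / 10       # 50 hours vs 10 hours of indexing

print(f"raw transfer time: {network_minutes:.1f} minutes")
print(f"index size ratio:  {index_size_ratio:.1f}")
print(f"runtime ratio:     {runtime_ratio:.1f}")
```

Even with generous overhead, the transfer stays in the range of minutes while the indexing runs take hours, and the index grows by far less than the runtime does.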

To get an idea of what's going on, I collected some statistics with VisualVM (see attachment). The distribution of real time versus CPU time looks significant, but I'm not smart enough to interpret the results. The method org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock() is active 80% of the time but takes only 1% of the CPU time. On the other hand, the second method, org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append(), is active 12% of the time and is always running on a CPU.
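
One possible reading of such a profile, offered as an assumption and not as a diagnosis: a method that is active for most of the real time but uses almost no CPU time is typically one that spends its time waiting, for example on a lock. The following Python sketch has nothing to do with the Lucene code itself; it only shows how a blocked thread accumulates real time without accumulating CPU time. All names in it are illustrative.

```python
import threading
import time

lock = threading.Lock()
result = {}


def waiter():
    """Block on the lock, then record this thread's wall-clock and CPU time."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()
    with lock:  # blocks here until the main thread releases the lock
        pass
    result["wall"] = time.perf_counter() - wall_start
    result["cpu"] = time.thread_time() - cpu_start


lock.acquire()  # hold the lock so that the waiter has to block
t = threading.Thread(target=waiter)
t.start()
time.sleep(0.5)  # stand-in for work done while holding the lock
lock.release()
t.join()

print(f"real time in waiter: {result['wall']:.3f} s")
print(f"CPU time in waiter:  {result['cpu']:.3f} s")
```

A sampling profiler would report the waiter as active for nearly the whole run, although it consumed almost no CPU.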

So again the question: when there are free resources in all dimensions, why does Solr not use more than 40% of the computing power?
RAM bandwidth? I can't believe that. How could I verify it?

Any hints are welcome.
Uwe





