How about throwing a BlockingQueue,
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
between your document creator and the SolrServer? Give it a capacity of 10,000
or so, with one thread feeding it and one thread waiting for it to get near
full and then draining it. Take the drained results and add them to the
server in batches (maybe try not using StreamingUpdateSolrServer). Something
like that worked well for me: about 5,000,000 documents of ~5 KB each took
about 8 hours.
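
Roughly like this (just an untested sketch; CommonsHttpSolrServer from SolrJ,
the URL, the queue capacity and the batch size are all placeholders you'd
swap in for your own setup):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueuedIndexer {
    private static final int QUEUE_CAPACITY = 10000;
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        final BlockingQueue<SolrInputDocument> queue =
                new ArrayBlockingQueue<SolrInputDocument>(QUEUE_CAPACITY);
        final SolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Producer: builds documents and blocks when the queue is full,
        // so memory use stays bounded.
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (long i = 0; i < 1000000; i++) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("objectId", i);
                        queue.put(doc);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Consumer: waits for documents, drains up to a batch at a time,
        // and sends each batch to Solr in a single add() call.
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                List<SolrInputDocument> batch =
                        new ArrayList<SolrInputDocument>(BATCH_SIZE);
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        batch.add(queue.take());
                        queue.drainTo(batch, BATCH_SIZE - 1);
                        server.add(batch);
                        batch.clear();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        // In real code you'd signal the consumer to shut down (e.g. with a
        // poison-pill document) and commit once at the end.
    }
}

The exact numbers matter less than the batching itself: sending documents a
few hundred at a time cuts the per-request HTTP overhead way down compared
to one add() per document.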

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index more quickly than it
does at the moment.

I have to index (and re-index) some 3-5 million documents. These
documents are preprocessed by a Java application that effectively
combines multiple database tables with each other to form the
SolrInputDocument.

What I'm seeing, however, is that the queue of documents that are ready to
be sent to the Solr server exceeds my preset limit, which tells me that Solr
somehow can't process the documents fast enough.

(I have created my own queue in front of SolrJ's StreamingUpdateSolrServer,
as it would not process the documents fast enough, causing
OutOfMemoryErrors due to the large number of documents building up
in its queue.)
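
For reference, and assuming I have the SolrJ 1.4 API right, its constructor
does take a bounded queue size and a thread count, e.g.

new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);

where the URL and the numbers are just placeholders.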

I have an index that consists 95% of IDs (longs). We don't do any
analysis on the fields that are being indexed. The schema is rather
straightforward.

Most fields look like this:
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" 
required="true" />
<field name="listId" type="long" stored="false" indexed="true" 
multiValued="true"/>

The relevant part of solrconfig.xml:
<indexDefaults>
     <useCompoundFile>false</useCompoundFile>
     <mergeFactor>100</mergeFactor>
     <RAMBufferSizeMB>256</RAMBufferSizeMB>
     <maxMergeDocs>2147483647</maxMergeDocs>
     <maxFieldLength>10000</maxFieldLength>
     <writeLockTimeout>1000</writeLockTimeout>
     <commitLockTimeout>10000</commitLockTimeout>
     <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @
2.83GHz with 4 GB of RAM, running Linux, Java 1.6.0_17, Tomcat 6, and
Solr 1.4.

What I'm seeing is that the network almost never reaches more than 10% of
the 1 Gb/s connection, that CPU utilization is always below 25% (only one
of the four cores is used), and that I don't see heavy disk I/O.
Also, while indexing, memory consumption is:
Free memory: 212.15 MB, Total memory: 509.12 MB, Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2 ms per insert, but this
slows to 18-19 ms per insert.

Are there any tips/tricks I can use to speed up my indexing? I have a
feeling that my machine is capable of doing more (using more CPU cores);
I just can't figure out how.

Thijs
