I already have a BlockingQueue in place (that's my custom queue), and luckily I'm indexing faster than you were. Currently it takes about 2 hours to index the 5 million documents I'm talking about. But I still feel as if my machine is underutilized.

Thijs


On 20-5-2010 17:16, Nagelberg, Kallin wrote:
How about throwing a BlockingQueue,
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
between your document creator and the SolrServer? Give it a size of 10,000 or
so, with one thread feeding it and one thread waiting for it to get near full
and then draining it. Take the drained results and add them to the server
(maybe try not using StreamingUpdateSolrServer). Something like that worked
well for me with about 5,000,000 documents of ~5 KB each, taking about 8 hours.
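A minimal sketch of that producer/consumer hand-off, assuming a placeholder Document class and counts rather than real SolrJ types; the "add to server" step is only marked by a comment:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Placeholder for SolrInputDocument.
class Document { }

public class BatchDrainExample {
    static final int TOTAL = 50_000; // hypothetical document count

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Document> queue = new ArrayBlockingQueue<>(10_000);

        // Producer: put() blocks automatically once the queue is full.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < TOTAL; i++) {
                try {
                    queue.put(new Document());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        // Consumer: block for one document, drain whatever else has
        // accumulated, and hand the whole batch to the server at once.
        Thread consumer = new Thread(() -> {
            List<Document> batch = new ArrayList<>();
            int drained = 0;
            while (drained < TOTAL) {
                try {
                    batch.add(queue.take());
                    queue.drainTo(batch);
                    drained += batch.size();
                    // server.add(batch) would go here.
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("drained " + TOTAL + " documents");
    }
}
```

Batching this way means one round trip per drained batch instead of one per document, which is usually where the speedup comes from.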

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does
at the moment.

I have to index (and re-index) some 3-5 million documents. These
documents are preprocessed by a java application that effectively
combines multiple database tables with each-other to form the
SolrInputDocument.

What I'm seeing, however, is that the queue of documents that are ready to
be sent to the Solr server exceeds my preset limit, telling me that Solr
somehow can't process the documents fast enough.

(I have created my own queue in front of SolrJ's StreamingUpdateSolrServer,
as it would not process the documents fast enough, causing
OutOfMemoryExceptions due to the large number of documents building up
in its queue.)
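For illustration, the bounded hand-off I mean looks roughly like this; BoundedFeed and its capacity are hypothetical names, not SolrJ API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A capacity-limited queue in front of the indexer: the producer blocks
// instead of piling up documents in memory, so heap use stays bounded.
// String stands in here for SolrInputDocument.
public class BoundedFeed {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000);

    // Blocks when the queue is full: natural backpressure on the producer.
    public void enqueue(String doc) throws InterruptedException {
        queue.put(doc);
    }

    // Blocks until a document is available for the indexing thread.
    public String next() throws InterruptedException {
        return queue.take();
    }

    public int pending() {
        return queue.size();
    }
}
```

The capacity caps memory at roughly capacity times the average document size, which is what prevents the OutOfMemoryExceptions described above.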

I have an index that for 95% consists of IDs (Long). We don't do any
analysis on the fields that are being indexed. The schema is rather
straightforward.

Most fields look like:
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true"
required="true" />
<field name="listId" type="long" stored="false" indexed="true"
multiValued="true"/>

The relevant part of solrconfig.xml:
<indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeFactor>100</mergeFactor>
      <RAMBufferSizeMB>256</RAMBufferSizeMB>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>10000</maxFieldLength>
      <writeLockTimeout>1000</writeLockTimeout>
      <commitLockTimeout>10000</commitLockTimeout>
      <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have a:
Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
With 4GB of ram.
Running on linux java version 1.6.0_17, tomcat 6 and solr version 1.4

What I'm seeing is that the network almost never reaches more than 10%
of the 1 Gb/s connection, that CPU utilization is always below 25% (one
core is used, not the others), and that there is no heavy disk I/O.
Also, while indexing, the memory consumption is:
Free memory: 212.15 MB, Total memory: 509.12 MB, Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2 ms per insert, but
this slows to 18-19 ms per insert.

Are there any tips/tricks I can use to speed up my indexing? I have a
feeling that my machine is capable of doing more (using more CPUs);
I just can't figure out how.

Thijs
