Re: Machine utilization while indexing

Thijs Tue, 25 May 2010 04:45:38 -0700

Hi all,

I did some further investigation and (after turning off some filters in YourKit) found that it was actually the machine sending the files to Solr that was slowing things down.

At first I couldn't find this, as it turned out that YourKit hides org.apache.* classes. When I removed this filter, it turned out that at least 50% of the CPU time was taken by org.apache.solr.client.solrj.util.ClientUtils.writeXML(SolrInputDocument, Writer). This was taking so much time that the commit queues were filling up on the client side instead of on the Solr server.

I have now switched back to my custom BlockingQueue with multiple CommonsHttpSolrServers that use the BinaryRequestWriter. I'm now able to index 800,000 documents in 8 minutes (including optimize), and 2.9 million documents in 32 minutes (incl. optimize).
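
A rough sketch of that queue-and-workers setup might look like the following. It is an illustration only, not the code used for the numbers above: `DocSink`, `QueueIndexer`, and `indexAll` are hypothetical names, documents are plain strings, and the sink stands in for a Solr connection so the example runs without SolrJ on the classpath. In a real client each worker would presumably own its own CommonsHttpSolrServer with the BinaryRequestWriter set as its request writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one Solr connection (assumption, not a SolrJ type).
// A real implementation would call add(batch) on a CommonsHttpSolrServer here.
interface DocSink {
    void send(List<String> batch) throws Exception;
}

class QueueIndexer {
    // Sentinel telling a worker to stop; compared by identity, never sent.
    private static final String POISON = new String("POISON");

    // Feeds docs through a bounded queue to `workers` threads, each sending
    // batches of up to `batchSize` docs. Returns the number of docs sent.
    static int indexAll(List<String> docs, int workers, int batchSize, DocSink sink)
            throws InterruptedException {
        // Bounded queue: the producer blocks when the workers fall behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        AtomicInteger sent = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                List<String> batch = new ArrayList<>();
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == POISON) break;
                        batch.add(doc);
                        if (batch.size() >= batchSize) {
                            sink.send(batch);
                            sent.addAndGet(batch.size());
                            batch = new ArrayList<>();
                        }
                    }
                    // Flush whatever is left in a partial batch.
                    if (!batch.isEmpty()) {
                        sink.send(batch);
                        sent.addAndGet(batch.size());
                    }
                } catch (Exception e) {
                    // Sketch only: a real client needs retry/error handling.
                    throw new RuntimeException(e);
                }
            });
            t.start();
            threads.add(t);
        }

        for (String doc : docs) queue.put(doc);
        // One stop marker per worker, queued after all documents.
        for (int i = 0; i < workers; i++) queue.put(POISON);
        for (Thread t : threads) t.join();
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add("doc-" + i);
        int n = indexAll(docs, 4, 100, batch -> { /* no-op sink */ });
        System.out.println("sent " + n + " docs");
    }
}
```

The bounded queue is the important part: if serialization on the client is the bottleneck, the queue stays nearly empty and adding workers helps; if Solr is the bottleneck, the queue fills and the producer simply blocks.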

As the StreamingUpdateSolrServer only supports XML, I can't use that.

So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler) aren't turned on by default (esp. considering some threads on the dev list some time ago about setting a default schema for optimum performance). Also, finding out about this performance enhancement wasn't easy, as it's hardly mentioned on the Wiki. I'll see if I can update this.


Thanks for all the advice, and esp. for the great work on Solr & Lucene.
Thijs


On 20-5-2010 21:34, Chris Hostetter wrote:


: StreamingUpdateSolrServer already has multiple threads and uses multiple
: connections under the covers. At least the api says ' Uses an internal

Hmmm... i think one of us misunderstands the point behind
StreamingUpdateSolrServer and its internal threads/queues.  (it's very
possible that it's me)

my understanding is that this allows it to manage the batching of multiple
operations for you, reusing connections as it goes -- so the
queueSize is how many individual requests it buffers before sending the
batch to Solr, and the threadCount controls how many batches it can send
in parallel (in the event that one thread is still waiting for the
response when the queue next fills up)

But if you are only using a single thread to feed SolrRequests to a single
instance of StreamingUpdateSolrServer then there can still be lots of
opportunities for Solr itself to be idle -- as i said, it's not clear to
me if you are using multiple threads to write to your
StreamingUpdateSolrServer ... even if you reuse the same
StreamingUpdateSolrServer instance, multiple threads in your client code
may increase the throughput (assuming that at the moment the threads in
StreamingUpdateSolrServer are largely idle)

But as i said ... this is all mostly a guess.  I'm not intimately
familiar with solrj.


-Hoss
