Hi all,

I did some further investigation and (after turning off some filters in YourKit) found that it was actually the machine sending the files to Solr that was slowing things down.

At first I couldn't find this because YourKit was hiding org.apache.* classes. When I removed this filter, it turned out that at least 50% of the CPU time was taken by org.apache.solr.client.solrj.util.ClientUtils.writeXML(SolrInputDocument, Writer). This was taking so much time that the commit queues were filling up on the client side instead of on the Solr server.

I have now switched back to my custom BlockingQueue with multiple CommonsHttpSolrServers that use the BinaryRequestWriter, and I'm now able to index 800,000 documents in 8 minutes (including optimize) and 2.9 million documents in 32 minutes (including optimize).
As the StreamingUpdateSolrServer only supports XML, I can't use that.
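
For anyone who wants to try something similar, here is a minimal sketch of the general pattern: a bounded BlockingQueue drained by several sender threads. It is not the actual code from my indexer. QueueIndexer and BatchSender are made-up names for illustration, and the sender is left as a plain callback so the sketch runs without SolrJ on the classpath. In a real setup I would expect each worker's callback to call add(batch) on its own CommonsHttpSolrServer that was configured with setRequestWriter(new BinaryRequestWriter()), with the BinaryUpdateRequestHandler registered on the server side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: a bounded queue drained by several sender threads. */
class QueueIndexer<D> {

    /** Hypothetical hook; with SolrJ this would wrap one server instance per worker. */
    interface BatchSender<D> {
        void send(List<D> batch) throws Exception;
    }

    private final BlockingQueue<D> queue;
    private final List<Thread> workers = new ArrayList<>();
    private final BatchSender<D> sender;
    private final int batchSize;
    private volatile boolean closed = false;
    private volatile Exception failure = null;

    QueueIndexer(int threads, int queueCapacity, int batchSize, BatchSender<D> sender) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.batchSize = batchSize;
        this.sender = sender;
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(this::drain, "indexer-" + i);
            t.start();
            workers.add(t);
        }
    }

    /** Blocks when the queue is full, so the producer cannot outrun the senders. */
    void add(D doc) {
        try {
            queue.put(doc);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Call after the last add(): waits until every queued document has been sent. */
    void close() {
        closed = true;
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (failure != null) {
            throw new IllegalStateException("at least one batch failed", failure);
        }
    }

    private void drain() {
        List<D> batch = new ArrayList<>(batchSize);
        try {
            while (true) {
                // Read the flag before polling: an empty poll after close means no more work.
                boolean wasClosed = closed;
                D doc = queue.poll(50, TimeUnit.MILLISECONDS);
                if (doc != null) {
                    batch.add(doc);
                    if (batch.size() >= batchSize) flush(batch);
                } else {
                    flush(batch); // queue ran dry: send the partial batch
                    if (wasClosed) return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void flush(List<D> batch) {
        if (batch.isEmpty()) return;
        try {
            sender.send(new ArrayList<>(batch));
        } catch (Exception e) {
            failure = e; // remember the failure; close() reports it
        } finally {
            batch.clear();
        }
    }
}
```

The producer just calls add() for every document and close() at the end. The thread count, queue capacity and batch size are knobs to tune, not values I measured.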

So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler) aren't turned on by default, especially considering some threads on the dev list some time ago about setting a default schema for optimum performance. Also, finding out about this performance enhancement wasn't easy, as it's hardly mentioned on the wiki. I'll see if I can update this.

Thanks for all the advice and especially the great work on Solr & Lucene.
Thijs


On 20-5-2010 21:34, Chris Hostetter wrote:

: StreamingUpdateSolrServer already has multiple threads and uses multiple
: connections under the covers. At least the api says ' Uses an internal

Hmmm... i think one of us misunderstands the point behind
StreamingUpdateSolrServer and its internal threads/queues.  (it's very
possible that it's me)

my understanding is that this allows it to manage the batching of multiple
operations for you, reusing connections as it goes -- so the
queueSize is how many individual requests it buffers before sending the
batch to Solr, and the threadCount controls how many batches it can send
in parallel (in the event that one thread is still waiting for the
response when the queue next fills up)

But if you are only using a single thread to feed SolrRequests to a single
instance of StreamingUpdateSolrServer then there can still be lots of
opportunities for Solr itself to be idle -- as i said, it's not clear to
me if you are using multiple threads to write to your
StreamingUpdateSolrServer ... even if you reuse the same
StreamingUpdateSolrServer instance, multiple threads in your client code
may increase the throughput (assuming that at the moment the threads in
StreamingUpdateSolrServer are largely idle)

But as i said ... this is all mostly a guess.  I'm not intimately
familiar with solrj.


-Hoss

