Hello,
Sorry for re-posting this, but it seems my message got lost in the
mailing list's message stream without catching anyone's attention... =D
In short, has anyone experienced dramatic indexing slowdowns during
large bulk imports with overwriteDupes turned on and a fairly high
duplicate rate (around 4-8x)?
It seems to produce a lot of deletions, which in turn appear to make
segment merging pretty slow by significantly increasing the number of
small read operations occurring alongside the regular large write
operations of the merge. Combined with the poor IO performance of a
commodity SATA drive, indexing takes ages.
I temporarily worked around the problem by disabling the overwriting of
duplicates, but that changes the way I query the index, requiring me
to turn on field collapsing at search time.
Is this a known limitation?
Does anyone have a few hints on how to optimize the handling of
index-time deduplication?
More details on my setup and on my current understanding are in my
previous message, quoted below.
Thank you very much in advance.
Regards,
Tanguy
On 05/25/11 15:35, Tanguy Moal wrote:
Dear list,
I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.
I'm sending a sizeable initial batch of documents (~250M) and want to
overwrite duplicate documents at index time, during the update,
taking advantage of the UpdateProcessorChain.
At the beginning of the indexing stage, everything is quite fast;
documents arrive at a rate of about 1000 doc/s.
The only extra processing during the import is the computation of a
couple of hashes used to uniquely identify documents by their content,
using both a stock (MD5Signature) and a custom (derived from
Lookup3Signature) update processor.
I send a commit command to the server every 500k documents sent.
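For reference, the pushing side looks roughly like this (simplified
sketch, exception handling omitted; StreamingUpdateSolrServer and the
500k commit interval are what I actually use, while the URL, queue
size, thread count, field names and MyRecord are just placeholders):

    // Simplified sketch of my push client.
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

    long count = 0;
    for (MyRecord rec : records) {                 // my own iterator over the source data
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rec.getId());           // placeholder fields
        doc.addField("content", rec.getContent());
        server.add(doc);
        if (++count % 500000 == 0) {
            server.commit();                       // commit every 500k documents
        }
    }
    server.commit();                               // final commit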
At first, the server is CPU bound. After a short while (~10 minutes),
the rate at which documents are ingested starts to fall dramatically
and the server becomes IO bound.
At first I thought this was the normal slowdown during a commit, while
my push client waits for the flush to occur. That would have been an
expected slowdown.
What caught my attention was that, unexpectedly, the server was
performing a lot of small reads, far more than the writes, which seem
to be larger.
The combination of the many small reads with the steady stream of
larger writes seems to create a lot of IO contention on my commodity
SATA drive, and the ETA of my index build started to increase
scarily =D
I then restarted the JVM with JMX enabled so I could investigate a
little more. I realized that the UpdateHandler was performing many
reads while processing the update requests.
Are there any known limitations around the UpdateProcessorChain when
overwriteDupes is set to true?
I turned that off, which of course defeats the purpose of my index,
but it's useful for comparison.
That did the trick: indexing is fast again, even with the periodic
commits.
I therefore have two questions, an interesting first one and a boring
second one:
1 / What's the workflow of the UpdateProcessorChain when one or more
processors have overwriting of duplicates turned on? What happens
under the hood?
I tried to answer that myself by looking at DirectUpdateHandler2, and
my understanding stopped at the following:
- the document is added to the Lucene IndexWriter;
- the duplicates are deleted from the Lucene IndexWriter.
The dark magic I couldn't understand seems to occur around the idTerm
and updateTerm things in the addDoc method. The deletions seem to be
buffered somewhere, I just didn't get it :-)
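For what it's worth, here is my rough paraphrase of that part of
addDoc, as far as I could follow it (simplified from my reading of the
code, so the details may well be wrong):

    // Rough, probably over-simplified paraphrase of addDoc when a signature
    // processor has set cmd.updateTerm (i.e. overwriteDupes=true).
    // This is my reading, not the exact source; "id" is my uniqueKey field.
    Term idTerm = new Term("id", indexedId);
    Term updateTerm = (cmd.updateTerm != null) ? cmd.updateTerm   // signature term
                                               : idTerm;

    // updateDocument() atomically deletes every doc matching updateTerm and adds
    // the new one; the delete seems to be buffered and only resolved against the
    // existing segments later (at flush/merge time?), which might be where all
    // the small reads come from.
    writer.updateDocument(updateTerm, cmd.getLuceneDocument(schema));

    if (cmd.updateTerm != null) {
        // Also drop any older doc with the same uniqueKey but a different signature.
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(updateTerm), BooleanClause.Occur.MUST_NOT);
        bq.add(new TermQuery(idTerm), BooleanClause.Occur.MUST);
        writer.deleteDocuments(bq);                 // another buffered delete
    }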
I might be wrong since I didn't read the code further than that, but
the crux might be how Solr handles deletions, which is still unclear
to me. In any case, a lot of reads seem to occur for that precise task
and they produce a lot of IO, killing indexing performance when
overwriteDupes is on. I don't even understand why so many read
operations occur at this stage, since my process has a comfortable
amount of RAM (Xms=Xmx=8GB) with only 4.5GB used so far.
Any help, recommendation or idea is welcome :-)
2 / In case there isn't a simple fix for this, I'll have to live with
duplicates in my index. I don't mind, since Solr offers a great
grouping feature, which I already use in some other applications. The
only thing I don't know yet is: if I rely on grouping at search time,
in combination with the Stats component (which is the point of that
index), and limit the results to 1 document per group, will the
computed statistics take those duplicates into account or not?
In short, how well does the Stats component behave when combined with
hits collapsing?
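To make the question concrete, the kind of request I have in mind
looks roughly like this (SolrJ sketch; the signature and stats field
names are just placeholders for my schema):

    // Sketch of the grouped query I would run; I don't know yet whether the
    // stats would be computed over all matching docs or only one per group.
    SolrQuery q = new SolrQuery("*:*");
    q.set("group", true);
    q.set("group.field", "signature");  // the dedup hash field (placeholder name)
    q.set("group.limit", 1);            // keep a single document per duplicate group
    q.set("stats", true);
    q.set("stats.field", "price");      // the field I compute statistics on (placeholder)
    QueryResponse rsp = server.query(q);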
I had initially implemented my solution using overwriteDupes because
it would have reduced both the final size of my index and the
complexity of the queries used to obtain statistics on the search
results, at the same time.
Thank you very much in advance.
--
Tanguy