Hi Erick, Dimitry and Mikhail

Thank you all for your time. I tried all of the suggestions below and am happy
to report that indexing speeds have improved. There were several confounding
problems, including:

- a bank of (~20) regexes that were poorly optimized and compiled at each 
indexing step
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging
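
To illustrate the first item, here is a minimal sketch of compiling patterns once rather than at each indexing step. The two patterns and the class and method names are hypothetical stand-ins, not the actual regexes from our indexer.

```java
import java.util.regex.Pattern;

public final class PrecompiledRegexes {
    // Hypothetical patterns, compiled once when the class is loaded.
    // Calling Pattern.compile (or String.replaceAll) per document would
    // recompile them every time.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Cleans one field value by reusing the precompiled patterns.
    public static String clean(String text) {
        String noTags = TAGS.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello Solr"
        System.out.println(clean("<p>Hello   <b>Solr</b></p>"));
    }
}
```

Pattern instances are immutable and safe to share between threads, so the same constants can be reused if the indexer is later made multi-threaded.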

However, the biggest bottleneck was two Lucene searches (across ~9MM docs) at the
time of building the Solr document. Indexing sped up after precomputing these
values offline.
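
A bottleneck like this can be located by timing document construction separately from the add call. The sketch below is one way to do that under stated assumptions: the class name is hypothetical, and the two functional arguments are stand-ins for the real document-building code and the real SolrJ add call, so it runs without Solr.

```java
import java.util.function.Consumer;
import java.util.function.Function;

public final class IndexStageTimer<R, D> {
    private final Function<R, D> buildDoc; // stand-in for building the document
    private final Consumer<D> sendDoc;     // stand-in for the SolrJ add call
    private long buildNanos;
    private long sendNanos;

    public IndexStageTimer(Function<R, D> buildDoc, Consumer<D> sendDoc) {
        this.buildDoc = buildDoc;
        this.sendDoc = sendDoc;
    }

    // Processes one record and accumulates the time spent in each stage.
    public void index(R record) {
        long start = System.nanoTime();
        D doc = buildDoc.apply(record);
        long built = System.nanoTime();
        sendDoc.accept(doc);
        long sent = System.nanoTime();
        buildNanos += built - start;
        sendNanos += sent - built;
    }

    public long buildNanos() { return buildNanos; }
    public long sendNanos() { return sendNanos; }

    public static void main(String[] args) {
        IndexStageTimer<String, String> timer = new IndexStageTimer<>(
            raw -> raw.toUpperCase(), // pretend this hides slow lookups
            doc -> { }                // pretend this sends the doc to Solr
        );
        for (int i = 0; i < 1000; i++) {
            timer.index("doc " + i);
        }
        System.out.println("build ms: " + timer.buildNanos() / 1_000_000
            + ", send ms: " + timer.sendNanos() / 1_000_000);
    }
}
```

If the build total dominates, the time is going into preparing the document rather than into Solr's add.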

Thank you all for your help. 

best

Peyman 

On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:

> How have you determined that it's the solr add? By timing the call on the
> SolrJ side or by looking at the machine where Solr is running? This is the
> very first thing you have to answer. You can get a rough idea with any
> simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows
> box). The point is just to see whether the indexer machine is being
> well utilized. I'd guess it's not actually.
> 
> One quick experiment would be to try using StreamingUpdateSolrServer
> (SUSS), which has the capability of having multiple threads
> fire at Solr at once. It is possible that your performance is spent
> waiting for I/O.
> 
> Once you have that question answered, you can refine. But until you
> know which side of the wire the problem is on, you're flying blind.
> 
> Both Yandong and Peyman:
> These times are quite surprising. Running everything locally on my laptop,
> I'm indexing between 5-7K documents/second. The source is
> the Wikipedia dump.
> 
> I'm particularly surprised by the difference Yandong is seeing based
> on the various analysis chains. The first thing I'd back off is the
> MaxPermSize. 512M is huge for this parameter.
> If you're getting that kind of time differential and your CPU isn't
> pegged, you're probably swapping, in which case you need
> to give the processes more memory. I'd just take the MaxPermSize
> out completely as a start.
> 
> Not sure if you've seen this page, something there might help.
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> 
> But throw a profiler at the indexer as a first step, just to see
> where the problem is, CPU or I/O.
> 
> Best
> Erick
> 
> On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin <pey...@robustlinks.com> 
> wrote:
>> Hi
>> 
>> I am trying to index 12MM docs faster than is currently happening in Solr 
>> (using solrj). We have identified solr's add method as the bottleneck (and 
>> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and 
>> jvm ram).
>> 
>> Adding 1000 docs is taking approximately 25 seconds. We are making sure we 
>> add and commit in batches. And we've tried both CommonsHttpSolrServer and 
>> EmbeddedSolrServer (assuming removing http overhead would speed things up 
>> with embedding) but the difference is marginal.
>> 
>> The docs being indexed are on average 20 fields long, mostly indexed but 
>> none stored. The major size contributors are two fields:
>> 
>>        - content, and
>>        - shingledContent (populated using copyField of content).
>> 
>> The length of the content field is (likely) gaussian distributed (few large 
>> docs 50-80K tokens, but majority around 2k tokens). We use shingledContent 
>> to support phrase queries and content for unigram queries (following the 
>> advice of Solr Enterprise Search Server - p. 305, section "The 
>> Solution: Shingling").
>> 
>> Clearly the size of the docs is a contributor to the slow adds (confirmed by 
>> removing these 2 fields resulting in halving the indexing time). We've tried 
>> compressed=true also but that is not working.
>> 
>> Any guidance on how to support our application logic (without having to 
>> change the schema too much) and improve the indexing speed (from current 212 
>> days for 12MM docs) would be much appreciated.
>> 
>> thank you
>> 
>> Peyman
>> 
