Hi Erick, Dimitry and Mikhail

Thank you all for your time. I tried all of the suggestions below and am happy
to report that indexing speeds have improved. There were several confounding
problems, including:

- a bank of ~20 regexes that were poorly optimized and recompiled at each
indexing step
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging
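
For the regex item, a minimal sketch of compiling the patterns once rather than
at every indexing step. The class name and the two patterns are hypothetical
placeholders, not the actual bank of ~20 regexes:

```java
import java.util.regex.Pattern;

// Hypothetical text cleaner; the patterns are illustrative placeholders only.
final class PrecompiledCleaner {
    // Compiled once when the class loads. Pattern objects are immutable and
    // thread-safe, so they can be shared by multiple indexing threads.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("<[^>]+>"), // markup-like tags
        Pattern.compile("\\s+")     // runs of whitespace
    };

    private PrecompiledCleaner() {}

    // Applies each precompiled pattern in order, replacing matches with a space.
    static String clean(String text) {
        String out = text;
        for (Pattern p : PATTERNS) {
            out = p.matcher(out).replaceAll(" ");
        }
        return out.trim();
    }
}
```

Calling PrecompiledCleaner.clean(rawText) per document then reuses the same
compiled Pattern objects, instead of paying for Pattern.compile (or
String.replaceAll, which compiles internally) on every call.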

However, the biggest bottleneck was two Lucene searches (across ~9MM docs) run
while building each Solr document. Indexing sped up after precomputing these
values offline.
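
A rough sketch of what such a precomputed lookup could look like, assuming the
offline job writes one "docId<TAB>value" line per document. The class name, the
line format and the fallback argument are all assumptions made for
illustration, not the code that was actually used:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory table of values computed offline, keyed by document id.
final class PrecomputedLookup {
    private final Map<String, String> values = new HashMap<String, String>();

    // Builds the table from lines of the assumed form "docId<TAB>value".
    // Lines without a tab are skipped.
    static PrecomputedLookup fromLines(Iterable<String> lines) {
        PrecomputedLookup lookup = new PrecomputedLookup();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;
            }
            lookup.values.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return lookup;
    }

    // Constant-time replacement for a per-document search at indexing time.
    String get(String docId, String fallback) {
        String value = values.get(docId);
        return value != null ? value : fallback;
    }

    int size() {
        return values.size();
    }
}
```

At document-building time the indexer would call lookup.get(docId, "") where it
previously ran a search, so the per-document cost becomes a hash lookup.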

Thank you all for your help. 



On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:

> How have you determined that it's the solr add? By timing the call on the
> SolrJ side or by looking at the machine where Solr is running? This is the
> very first thing you have to answer. You can get a rough idea with any
> simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows
> box). The point is just to see whether the indexer machine is being
> well utilized. I'd guess it's not actually.
> One quick experiment would be to try using StreamingUpdateSolrServer
> (SUSS), which has the capability of having multiple threads
> fire at Solr at once. It is possible that your performance is spent
> waiting for I/O.
> Once you have that question answered, you can refine. But until you
> know which side of the wire the problem is on, you're flying blind.
> Both Yandong and Peyman:
> These times are quite surprising. Running everything locally on my laptop,
> I'm indexing between 5-7K documents/second. The source is
> the Wikipedia dump.
> I'm particularly surprised by the difference Yandong is seeing based
> on the various analysis chains. The first thing I'd back off is the
> MaxPermSize. 512M is huge for this parameter.
> If you're getting that kind of time differential and your CPU isn't
> pegged, you're probably swapping, in which case you need
> to give the processes more memory. I'd just take the MaxPermSize
> out completely as a start.
> Not sure if you've seen this page, something there might help.
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> But throw a profiler at the indexer as a first step, just to see
> where the problem is, CPU or I/O.
> Best
> Erick
> On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin <pey...@robustlinks.com> 
> wrote:
>> Hi
>> I am trying to index 12MM docs faster than is currently happening in Solr 
>> (using solrj). We have identified solr's add method as the bottleneck (and 
>> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and 
>> jvm ram).
>> Adding 1000 docs is taking approximately 25 seconds. We are making sure we 
>> add and commit in batches. And we've tried both CommonsHttpSolrServer and 
>> EmbeddedSolrServer (assuming removing http overhead would speed things up 
>> with embedding) but the difference is marginal.
>> The docs being indexed are on average 20 fields long, mostly indexed but 
>> none stored. The major size contributors are two fields:
>>        - content, and
>>        - shingledContent (populated using copyField of content).
>> The length of the content field is (likely) Gaussian distributed (a few large 
>> docs of 50-80K tokens, but the majority around 2K tokens). We use 
>> shingledContent to support phrase queries and content for unigram queries 
>> (following the advice in Solr Enterprise Search Server, p. 305, section "The 
>> Solution: Shingling").
>> Clearly the size of the docs is a contributor to the slow adds (confirmed by 
>> removing these 2 fields resulting in halving the indexing time). We've tried 
>> compressed=true also but that is not working.
>> Any guidance on how to support our application logic (without having to 
>> change the schema too much) and improve the indexing speed (from the current 
>> 212 days for 12MM docs) would be much appreciated.
>> thank you
>> Peyman

