Hi Mike,

>>Do you use multiple threads for indexing?  Large RAM buffer size is
>>also good, but I think perf peaks out maybe around 512 MB (at least
>>based on past tests)?

We are using Solr; I'm not sure whether Solr uses multiple threads for 
indexing.  We have 30 "producers", each sending documents to one of 12 Solr 
shards on a round-robin basis, so each shard receives multiple concurrent 
requests.
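
For what it's worth, the producer side amounts to something like the sketch 
below; a rough SolrJ version (the class name and the HttpSolrClient wiring 
are made up for illustration, not our actual code):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.stream.Collectors;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Each producer picks the next shard in rotation, so all 12 shards
    // see a steady stream of concurrent add requests.
    public class RoundRobinProducer {
        private final List<SolrClient> shards;
        private final AtomicLong next = new AtomicLong();

        public RoundRobinProducer(List<String> shardUrls) {
            this.shards = shardUrls.stream()
                .map(u -> (SolrClient) new HttpSolrClient.Builder(u).build())
                .collect(Collectors.toList());
        }

        public void send(SolrInputDocument doc) throws Exception {
            int i = (int) (next.getAndIncrement() % shards.size());
            shards.get(i).add(doc);  // commit policy is handled elsewhere
        }
    }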

>>Believe it or not, merging is typically compute bound.  It's costly to
>>decode & re-encode all the vInts.

Sounds like we need to do some monitoring during merging to see what the CPU 
use and I/O wait look like during large merges.
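
One thing that might help there: Lucene can log merge activity itself.  Solr 
exposes this as the infoStream setting in solrconfig.xml; at the Lucene level 
it's roughly the following (a minimal sketch, with the index path and the 
512 MB buffer as placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.PrintStreamInfoStream;

    public class MergeLogging {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            iwc.setRAMBufferSizeMB(512);  // the buffer size discussed above
            // Logs every merge (segments, sizes, timings), which we can line
            // up with iostat/top samples taken while the merge runs.
            iwc.setInfoStream(new PrintStreamInfoStream(System.out));
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
                // ... add documents; merge activity is printed as it happens.
            }
        }
    }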

>>Larger merge factor is good because it means the postings are copied 
>>fewer times, but, it's bad because you could risk running out of
>>descriptors, and, if the OS doesn't have enough RAM, you'll start to
>>thin out the readahead that the OS can do (which makes the merge less
>>efficient since the disk heads are seeking more).

Is there a way to estimate the amount of RAM needed for the readahead?  Once 
we start the re-indexing we will be running 12 shards on a 16-processor box 
with 144 GB of memory.
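
My back-of-the-envelope attempt (the per-shard heap size here is a guess, 
not our real setting): whatever the JVM heaps don't claim is what the OS has 
left for the page cache, so with, say, a 6 GB heap per shard:

    144 GB total - 12 shards x 6 GB heap = 72 GB left for the OS page cache
    72 GB / 12 shards = ~6 GB of cache/readahead per shard

Does that sound like the right way to think about it?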

>>Do you do any deleting?

Deletes would happen as a byproduct of updating a record.  This shouldn't 
happen too frequently during re-indexing, but we update records when a 
document gets re-scanned and re-OCR'd.  This would probably amount to a few 
thousand deletes.
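In Solr terms each update is just a re-add with the same uniqueKey, and the 
delete of the old version happens under the hood; a minimal SolrJ fragment 
(field names and the surrounding client setup are made up):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Re-adding a document whose uniqueKey already exists replaces it:
    // Solr deletes the old version and indexes the new one.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", recordId);     // same uniqueKey as the re-scanned record
    doc.addField("ocr", newOcrText);  // the fresh OCR text
    client.add(doc);
    client.commit();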


>>Do you use stored fields and/or term vectors?  If so, try to make
>>your docs "uniform" if possible, ie add the same fields in the same
>>order.  This enables Lucene to use bulk byte copy merging under the hood.

We use 4 or 5 stored fields.  They are very small compared to our huge OCR 
field.  Since we construct our Solr documents programmatically, I'm fairly 
certain that they are always in the same order.  I'll have to look at the code 
when I get back to make sure.
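
Assuming the construction code looks something like this (field names are 
made up), keeping the addField calls in one fixed order should be all that's 
needed:

    import org.apache.solr.common.SolrInputDocument;

    // Add fields in the same order for every document so segment merges
    // can take Lucene's bulk byte-copy path for stored fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    doc.addField("title", title);
    doc.addField("author", author);
    doc.addField("ocr", ocrText);  // the huge OCR field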

We aren't using term vectors now, but we plan to add them as well as a number 
of fields based on MARC (cataloging) metadata in the future.
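
When we do, I believe it's just termVectors="true" on the field in Solr's 
schema.xml; at the Lucene level that corresponds to something like the 
following (a sketch with a made-up field name, using the FieldType API):

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;

    // Index the OCR text with term vectors (plus positions/offsets,
    // which highlighting and MoreLikeThis typically want).
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setStoreTermVectors(true);
    ft.setStoreTermVectorPositions(true);
    ft.setStoreTermVectorOffsets(true);
    ft.freeze();
    Field ocrField = new Field("ocr", ocrText, ft);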

Tom
