Optimize takes a 'maxSegments' option. This tells it to stop when
there are N segments left instead of merging all the way down to one.

If you use a very high mergeFactor and then call optimize with a sane
number like 50, it only merges the small segments and leaves the big
ones alone.
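
As a sketch, in Solr you can pass maxSegments either in the update
message or as a request parameter (the host, core path, and the value
50 below are just illustrative):

```xml
<!-- Update message: merge down to at most 50 segments -->
<optimize maxSegments="50"/>
```

or equivalently:

```
curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=50'
```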

On Thu, May 3, 2012 at 8:28 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
>>
>> We have a fairly large-scale system - about 200 million docs and fairly
>> high indexing activity - about 300k docs per day, with peak ingestion rates
>> of about 20 docs per sec. I want to work out what a good mergeFactor setting
>> would be by testing with different mergeFactor settings. I think the default
>> of 10 might be high; I want to try 5 and compare. Unless I know when a
>> merge starts and finishes, it would be quite difficult to work out the
>> impact of changing mergeFactor. I want to be able to measure how long merges
>> take, run queries during the merge activity, see what the response times
>> are, etc.
>
>
> With a lot of indexing activity, if you are attempting to avoid large
> merges, I would think you would want a higher mergeFactor, not a lower one,
> and do occasional optimizes during non-peak hours.  With a small
> mergeFactor, you will be merging a lot more often, and you are more likely
> to encounter merges of already-merged segments, which can be very slow.
>
> My index is nearing 70 million documents.  I've got seven shards - six large
> indexes with about 11.5 million docs each, and a small index that I try to
> keep below half a million documents.  The small index contains the newest
> documents, between 3.5 and 7 days worth.  With this setup and the way I
> manage it, large merges pretty much never happen.
>
> Once a minute, I do an update cycle.  This looks for and applies deletions,
> reinserts, and new document inserts.  New document inserts happen only on
> the small index, and there are usually a few dozen documents to insert on
> each update cycle.  Deletions and reinserts can happen on any of the seven
> shards, but there are not usually deletions and reinserts on every update
> cycle, and the number of reinserts is usually very very small.  Once an
> hour, I optimize the small index, which takes about 30 seconds.  Once a day,
> I optimize one of the large indexes during non-peak hours, so every large
> index gets optimized once every six days.  This takes about 15 minutes,
> during which deletes and reinserts are not applied, but new document inserts
> continue to happen.
>
> My mergeFactor is set to 35.  I wanted a large value here, and this
> particular number has a side effect -- uniformity in segment filenames on
> the disk during full rebuilds.  Lucene uses a base-36 segment numbering
> scheme.  I usually end up with fewer than 10 segments in the larger indexes,
> which means they don't do merges.  The small index does do merges, but I
> have never had a problem with those merges going slowly.
>
> Because I do occasionally optimize, I am fairly sure that even when I do
> have merges, they happen with 35 very small segment files, and leave the
> large initial segment alone.  I have not tested this theory, but it seems
> the most sensible way to do things, and I've found that Lucene/Solr usually
> does things in a sensible manner.  If I am wrong here (using 3.5 and its
> improved merging), I would appreciate knowing.
>
> Thanks,
> Shawn
>



-- 
Lance Norskog
goks...@gmail.com
