Optimize takes a 'maxSegments' option. This tells it to stop merging once the index is down to N segments, instead of going all the way down to one.
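For what it's worth, here is a rough sketch of how such a request could be put together against Solr's update handler. The `optimize`, `maxSegments` and `waitSearcher` request parameters are the real ones; the helper name `optimize_url` and the host, port and path in the example are only assumptions for illustration, so adjust them for your own setup.

```python
from urllib.parse import urlencode


def optimize_url(core_url, max_segments, wait_searcher=True):
    """Build an update-handler URL for an optimize capped at max_segments.

    core_url is a hypothetical base URL such as http://localhost:8983/solr;
    this helper only builds the URL, it does not send the request.
    """
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    params = {
        "optimize": "true",
        "maxSegments": str(max_segments),
        "waitSearcher": "true" if wait_searcher else "false",
    }
    # Strip any trailing slash so the path is not doubled up.
    return core_url.rstrip("/") + "/update?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance; replace with your own core URL.
    print(optimize_url("http://localhost:8983/solr", 50))
```

Fetching the resulting URL (with curl or anything else) during non-peak hours should ask Solr to merge down to at most 50 segments rather than one.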
If you use a very high mergeFactor and then call optimize with a sane number like 50, it only merges the little teeny segments.

On Thu, May 3, 2012 at 8:28 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
>>
>> We have a fairly large scale system - about 200 million docs and fairly
>> high indexing activity - about 300k docs per day with peak ingestion rates
>> of about 20 docs per sec. I want to work out what a good mergeFactor setting
>> would be by testing with different mergeFactor settings. I think the default
>> of 10 might be high, I want to try with 5 and compare. Unless I know when a
>> merge starts and finishes, it would be quite difficult to work out the
>> impact of changing mergeFactor. I want to be able to measure how long merges
>> take, run queries during the merge activity and see what the response times
>> are etc..
>
>
> With a lot of indexing activity, if you are attempting to avoid large
> merges, I would think you would want a higher mergeFactor, not a lower one,
> and do occasional optimizes during non-peak hours. With a small
> mergeFactor, you will be merging a lot more often, and you are more likely
> to encounter merges of already-merged segments, which can be very slow.
>
> My index is nearing 70 million documents. I've got seven shards - six large
> indexes with about 11.5 million docs each, and a small index that I try to
> keep below half a million documents. The small index contains the newest
> documents, between 3.5 and 7 days worth. With this setup and the way I
> manage it, large merges pretty much never happen.
>
> Once a minute, I do an update cycle. This looks for and applies deletions,
> reinserts, and new document inserts. New document inserts happen only on
> the small index, and there are usually a few dozen documents to insert on
> each update cycle. Deletions and reinserts can happen on any of the seven
> shards, but there are not usually deletions and reinserts on every update
> cycle, and the number of reinserts is usually very very small. Once an
> hour, I optimize the small index, which takes about 30 seconds. Once a day,
> I optimize one of the large indexes during non-peak hours, so every large
> index gets optimized once every six days. This takes about 15 minutes,
> during which deletes and reinserts are not applied, but new document inserts
> continue to happen.
>
> My mergeFactor is set to 35. I wanted a large value here, and this
> particular number has a side effect -- uniformity in segment filenames on
> the disk during full rebuilds. Lucene uses a base-36 segment numbering
> scheme. I usually end up with less than 10 segments in the larger indexes,
> which means they don't do merges. The small index does do merges, but I
> have never had a problem with those merges going slowly.
>
> Because I do occasionally optimize, I am fairly sure that even when I do
> have merges, they happen with 35 very small segment files, and leave the
> large initial segment alone. I have not tested this theory, but it seems
> the most sensible way to do things, and I've found that Lucene/Solr usually
> does things in a sensible manner. If I am wrong here (using 3.5 and its
> improved merging), I would appreciate knowing.
>
> Thanks,
> Shawn
>

--
Lance Norskog
goks...@gmail.com