Mark Miller wrote:
> Phillip Farber wrote:
>   
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option.  I don't do deletes.
>>     
> Even if you did do deletes, it's not really a 3x problem - that's just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs as segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.
>   
>
And for completeness for those following along:

Let's say you did do some crazy deleting, and deleted half the docs in
your index. Those docs stay around, and their ids are just added to a list
that keeps those docs from being "seen". Later, as natural merging
occurs, or if you force merges with an optimize, those deleted docs will
physically be removed. Let's then say you managed to re-add all of
those docs without any merging occurring while adding them (say
you wanted to see this effect so badly that you wrote and put in a custom
merge policy that doesn't find any segments to merge). Even if you do
all that, before you do the optimize, you're going to look at the size of
your index and see it's n GB. That's your current index size. Now say you
kick off the optimize. It's not even going to take 2x that n size to
optimize, because all those deletes will be removed as the
index is optimized down to one segment. It's going to take <2x.
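
To put rough numbers on that, here is a toy model in Python. It is only a
sketch: it assumes on-disk size is proportional to document count, that
deleted docs keep occupying space until they are merged away, and that the
optimize writes one new segment of live docs while the old segments are
still on disk. It ignores compound-file copies and open readers holding old
files. The function name peak_optimize_ratio is made up for illustration
and is not part of Lucene.

```python
def peak_optimize_ratio(deleted_fraction: float) -> float:
    """Peak disk use during optimize, as a multiple of the pre-optimize size n.

    Toy model (assumption, not Lucene's accounting): size is proportional
    to doc count, and every deleted doc was re-added with no merging.
    """
    if not 0.0 <= deleted_fraction <= 1.0:
        raise ValueError("deleted_fraction must be between 0 and 1")

    original = 1.0                 # original index, deleted docs still on disk
    readded = deleted_fraction     # re-added docs sitting in new, unmerged segments
    n = original + readded         # what you see on disk before optimizing
    live = original                # live docs: (1 - f) survivors + f re-added

    # During the optimize the old segments (n) and the new single
    # segment (live) coexist on disk.
    return (n + live) / n


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 1.0):
        print(f"deleted {f:.0%}: peak is {peak_optimize_ratio(f):.2f}x n")
```

Under these assumptions the half-deleted case peaks at about 1.67x n, and
the ratio only reaches 2x when there are no deletes at all.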

This delete thing, as I said, is more of a fun mental exercise. It has
little relation to how much space you need to optimize in comparison to
how big your index is before optimizing. And it's really worse than a
worst-case scenario unless you write a custom merge policy, or crank
some settings insanely high and have enough RAM to do it (all the indexing
would have to take place in one huge segment in RAM that would then get
flushed).
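
If what you want is a quick check of index size against free space before
kicking off an optimize, something like the following Python sketch might
do. It assumes the index lives in a single flat directory and that free
space equal to the current index size is enough headroom. The helper names
and the headroom parameter are hypothetical, so adjust the factor to your
own setup.

```python
import os
import shutil


def index_size_bytes(index_dir: str) -> int:
    """Sum the sizes of the regular files directly inside index_dir."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(index_dir)
        if entry.is_file()
    )


def has_room_to_optimize(index_dir: str, headroom: float = 1.0) -> bool:
    """True if free space on the index volume >= headroom * index size.

    headroom=1.0 is an assumed rule of thumb (room for one extra copy
    of the index), not a figure taken from Lucene.
    """
    needed = index_size_bytes(index_dir) * headroom
    free = shutil.disk_usage(index_dir).free
    return free >= needed
```

For example, has_room_to_optimize("/path/to/index") could be run from a
cron job to warn you before the index outgrows the volume.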


-- 
- Mark

http://www.lucidimagination.com


