Mark Miller wrote:
> Phillip Farber wrote:
>
>> Resuming this discussion in a new thread to focus only on this question:
>>
>> What is the best way to get the size of an index so it does not get
>> too big to be optimized (or to allow a very large segment merge) given
>> space limits?
>>
>> I already have the largest 15,000rpm SCSI direct attached storage so
>> buying storage is not an option. I don't do deletes.
>>
> Even if you did do deletes, it's not really a 3x problem - that's just
> theory - you'd have to work to get there. Deletes are merged out as you
> index additional docs and segments are merged over time. The 3x scenario
> brought up is more of a fun mind exercise than anything that would
> realistically happen.
>
> And for completeness for those following along:
Let's say you did some crazy deleting and deleted half the docs in your index. Those docs stay around, and their ids are just added to a list that keeps those docs from being "seen". Later, as natural merging occurs, or if you force merges with an optimize, those deleted docs will be physically removed.

Now let's say you then managed to re-add all of those docs without any merging occurring while adding them (say you wanted to see this effect so badly that you wrote and plugged in a custom merge policy that doesn't find any segments to merge). Even if you do all that, before you run the optimize, you're going to look at the size of your index and see it's n GB. That's your current index size. Now say you kick off the optimize. It's not even going to take 2x that size n to optimize - this is because all those deletes are removed as the index is optimized down to one segment. It's going to take <2x.

This delete thing, as I said, is more of a fun mental exercise. It has little relation to how much space you need to optimize compared to how big your index is before optimizing. And it's really worse than a worst-case scenario unless you write a custom merge policy, or crank some settings insanely high and have enough RAM to do it (all the indexing would have to take place in one huge segment in RAM that would then get flushed).

-- 
- Mark

http://www.lucidimagination.com
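To make the arithmetic above concrete, here is a small back-of-the-envelope model (not Lucene code; the function name and the deleted-doc fraction are illustrative assumptions) of peak disk usage while an index is optimized down to one segment. During the merge the old segments and the new segment coexist on disk, and deleted docs are dropped from the new segment, so any pending deletes pull the peak below 2x:

```python
def peak_disk_during_optimize(index_size_gb, deleted_fraction):
    """Model peak disk usage while optimizing down to one segment.

    While the optimize runs, the old segments (index_size_gb) and the
    new merged segment must coexist on disk. Deleted docs are not
    copied into the new segment, so it only holds the live portion.
    """
    new_segment_gb = index_size_gb * (1.0 - deleted_fraction)
    return index_size_gb + new_segment_gb

# Half the docs deleted: peak is 1.5x the pre-optimize size, not 2x.
print(peak_disk_during_optimize(100, 0.5))  # 150.0
# No deletes at all: the true worst case, exactly 2x.
print(peak_disk_during_optimize(100, 0.0))  # 200.0
```

The model matches the point in the post: the more deletes pending at optimize time, the *less* headroom the optimize needs relative to the current index size, so the "3x" figure never materializes in practice.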