On Wed, May 19, 2021 at 09:03:52AM -0400, Kudrettin Güleryüz wrote:
> Sorry, I meant I am trying to reduce the index size... I am not using the
> index optimize feature at this point.
> 
> Experiment one:
> Index a document of size ~10KB only once. Total index size across multiple
> shards: ~117KB
> 
> Experiment two:
> Index a document of size ~10KB 10,000 times. Total index size across multiple
> shards: ~250MB
> 
> I am assuming that the terms (keys) in the inverted index wouldn't increase
> when the same document is indexed multiple times. Therefore I would expect
> the increase in index size to be minimal compared to indexing a totally
> different document. Can you tell me what I am missing?

https://lucidworks.com/post/solr-segment-merge-frees-wasted-space-caused-by-deleted-documents/

In short:  Solr doesn't re-use the space occupied by deleted index
entries.  Replacing a document causes the entries for the previous
version to be deleted.  Eventually Solr will reorganize parts of the
index into new files, and this drops *some* deleted index entries.  At
any point in time, Solr will be holding some "wasted" space, but it's
under control and normally you don't need to worry about it.
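
The behaviour described above can be illustrated with a toy model. This is
only a sketch, not Solr or Lucene code: the names ToyIndex, Segment, and
merge_all are hypothetical, each add is assumed to create its own tiny
segment, sizes are raw document sizes rather than real index sizes, and
merge_all rewrites everything at once, whereas a real background merge
usually covers only some segments and so drops only some deleted entries.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for an append-only index segment (hypothetical, not Lucene)."""
    docs: list                                   # (doc_id, size_bytes) tuples
    deleted: set = field(default_factory=set)    # positions marked as deleted

    def live(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]

    def size_on_disk(self):
        # Entries marked deleted still occupy space until the segment is rewritten.
        return sum(size for _, size in self.docs)


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add(self, doc_id, size_bytes):
        # Replacing a document marks earlier copies as deleted; nothing is overwritten.
        for seg in self.segments:
            for i, (existing_id, _) in enumerate(seg.docs):
                if existing_id == doc_id:
                    seg.deleted.add(i)
        # Assumption: one new segment per add, to keep the model small.
        self.segments.append(Segment(docs=[(doc_id, size_bytes)]))

    def merge_all(self):
        # Copy only live documents into one new segment; deleted entries are dropped.
        live = [d for seg in self.segments for d in seg.live()]
        self.segments = [Segment(docs=live)]

    def num_docs(self):
        return sum(len(seg.live()) for seg in self.segments)

    def max_doc(self):
        return sum(len(seg.docs) for seg in self.segments)

    def size_on_disk(self):
        return sum(seg.size_on_disk() for seg in self.segments)


index = ToyIndex()
for _ in range(100):
    index.add("doc-1", 10_000)   # same id every time, so each add replaces the last

# One live document, but 100 copies still held on disk.
print(index.num_docs(), index.max_doc(), index.size_on_disk())  # 1 100 1000000

index.merge_all()
# After the rewrite only the live copy remains.
print(index.num_docs(), index.max_doc(), index.size_on_disk())  # 1 1 10000
```

In this model the gap between max_doc and num_docs is the "wasted" space,
and it shrinks only when segments are rewritten.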

> On Tue, May 18, 2021 at 12:48 PM Dave <[email protected]> wrote:
> 
> > At a certain point the index size doesn’t matter. When you reindex a
> > document you do not delete the existing document; you mark it as
> > deleted and add the replacement.  An optimize is what removes the
> > documents marked as deleted, but an optimize is really no longer a
> > recommended process, since Solr is very good at merging and disk is
> > inexpensive.  The reason the index increased, I’m guessing, is that even
> > though it’s only indexed, that data is still stored and of course
> > duplicated.  If its performance has not been adversely affected I would
> > not ever run the optimize command. I’ve pushed an index that is naturally
> > 450GB all the way to 800GB+ and it ran great, assuming you have the disk
> > space available.
> >
> > > On May 18, 2021, at 12:37 PM, Kudrettin Güleryüz <[email protected]>
> > wrote:
> > >
> > > Hello,
> > >
> > > Experimenting with optimizing the index size.
> > >
> > > Can you help me understand why indexing but not storing a file 10,000
> > > times increases the index size by 2,500 times? 7.3 here. Schema and all
> > > other conditions are kept constant.
> > >
> > > Thanks
> >

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
