On Wed, May 19, 2021 at 09:03:52AM -0400, Kudrettin Güleryüz wrote:
> Sorry, I meant I am trying to reduce the index size... I am not using the
> index optimize feature at this point.
>
> Experiment one:
> Index document of size ~10KB for only once. Total index size in multiple
> shards ~117KB
>
> Experiment two:
> Index document of size ~10KB for 10,000 times. Total index size in multiple
> shards ~250MB
>
> I am assuming that the terms (keys) in the inverted index wouldn't increase
> by indexing the same document multiple times. Therefore I would expect the
> increase in index size would be minimal compared to indexing a totally
> different document. Can you tell me what I am missing?

https://lucidworks.com/post/solr-segment-merge-frees-wasted-space-caused-by-deleted-documents/

In short: Solr doesn't re-use the space occupied by deleted index
entries. Replacing a document causes the entries for the previous
version to be marked as deleted. Eventually Solr will reorganize parts
of the index into new files, and this drops *some* deleted index
entries. At any point in time, Solr will be holding some "wasted"
space, but it's under control and normally you don't need to worry
about it.

> On Tue, May 18, 2021 at 12:48 PM Dave <[email protected]> wrote:
>
> > At a certain point the index size doesn’t matter. When you reindex a
> > document you do not delete the actual residing document, you mark it as
> > deleted and add on the replacement. An optimize is what removes the marked
> > deleted files, but an optimize is really no longer a recommended process
> > since Solr is very good at merging as well as the fact disk is
> > inexpensive. The reason the index increased, I’m guessing, is that even
> > though it’s only indexed, that data is still stored and of course
> > duplicated. If its performance has not been adversely affected I would
> > not ever run the optimize command. I’ve pushed an index that is naturally
> > 450gb all the way to 800gb+ and it ran great, assuming you have the disk
> > space available
> >
> > > On May 18, 2021, at 12:37 PM, Kudrettin Güleryüz <[email protected]>
> > wrote:
> > >
> > > Hello,
> > >
> > > Experimenting with optimizing the index size.
> > >
> > > Can you help me understand why indexing but not storing a file 10,000
> > > times increases the index size by 2,500 times? 7.3 here. Schema and all
> > > other conditions are kept constant.
> > >
> > > Thanks
> >

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
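
P.S. The behaviour described above (a replaced document is only marked
as deleted, and its space comes back only when a merge rewrites the
entries into new files) can be sketched with a toy model. This is an
illustration, not how Solr or Lucene is actually implemented: the class
ToyIndex, its methods, and the size units are all made up for the
example, and the toy merge drops every deleted entry at once, whereas a
real merge only rewrites some segments and so drops only *some* of them.

```python
class ToyIndex:
    """Hypothetical append-only index: updates never reuse space in place."""

    def __init__(self):
        self.entries = []   # every entry ever written and not yet merged away
        self.latest = {}    # document id -> its current live entry

    def index(self, doc_id, size):
        """Add a document; an existing version is only marked deleted."""
        previous = self.latest.get(doc_id)
        if previous is not None:
            previous["live"] = False   # space is still occupied
        entry = {"id": doc_id, "size": size, "live": True}
        self.entries.append(entry)
        self.latest[doc_id] = entry

    def size_on_disk(self):
        """Space taken by live and deleted entries together."""
        return sum(e["size"] for e in self.entries)

    def live_size(self):
        """Space taken by entries that are still searchable."""
        return sum(e["size"] for e in self.entries if e["live"])

    def merge(self):
        """Rewrite entries into a 'new file', dropping the deleted ones."""
        self.entries = [e for e in self.entries if e["live"]]


# Re-index the same document id 10,000 times, 10 size units each time.
index = ToyIndex()
for _ in range(10_000):
    index.index("doc-1", 10)

print(index.size_on_disk(), index.live_size())  # 100000 10
index.merge()
print(index.size_on_disk())                     # 10
```

Note that the toy only shrinks because every call uses the same document
id. If each of the 10,000 copies were indexed under a different id,
nothing would be marked deleted and a merge would free nothing.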
