The keys are the same, but the index is bigger. Solr indexes the position of each term in each document. One term in one document is one position. One term in 10k documents is 10k positions. One term occurring twice in each of 10k documents is 20k positions.
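The arithmetic above can be sketched with a toy positional index. This is only an illustration, not Lucene's actual on-disk format; the helper names `build_positional_index` and `count_positions` and the sample text are made up for the example. It shows that indexing the same content many times leaves the number of distinct terms unchanged while the number of stored positions grows linearly with the document count.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Toy inverted index: term -> {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer; a real Solr analyzer chain does much more.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def count_positions(index):
    """Total number of (term, document, position) entries in the index."""
    return sum(len(positions)
               for postings in index.values()
               for positions in postings.values())


# Hypothetical content: 10 tokens, 9 distinct terms ("solr" occurs twice).
content = "solr indexes term positions and solr stores them per document"

one = build_positional_index({"doc-0": content})
many = build_positional_index({f"doc-{i}": content for i in range(10_000)})

print(len(one), count_positions(one))    # 9 10
print(len(many), count_positions(many))  # 9 100000
```

The term dictionary (the keys) stays at 9 entries in both cases, but the postings grow from 10 positions to 100,000.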
Also, indexing many copies of the same document is not a good way to forecast index size. The size depends on the statistics of the actual documents and on the schema. Measure it with real data and the schema you expect to use.

wunder
Walter Underwood
http://observer.wunderwood.org/  (my blog)

> On May 19, 2021, at 1:18 PM, Kudrettin Güleryüz wrote:
>
> Thanks for the insight. I forgot to mention a key piece of information while
> explaining experiment two:
>
> Although their content is exactly the same, each document is
> different because of its filename. The names of the 10,000 files are
> different. Therefore the content of some fields, such as filename, id, etc.,
> is always different. The most significant field in terms of storage
> size is the content field, and that is exactly the same for all files in
> this experiment.
>
> Since that is the case, I think no Solr document deletions are necessary.
> In fact, when I run update?optimize=true, there is no significant change in
> the total size of the index.
>
> On Wed, May 19, 2021 at 11:23 AM Mark H. Wood wrote:
>
>> On Wed, May 19, 2021 at 09:03:52AM -0400, Kudrettin Güleryüz wrote:
>>> Sorry, I meant I am trying to reduce the index size... I am not using the
>>> index optimize feature at this point.
>>>
>>> Experiment one:
>>> Index a document of size ~10KB only once. Total index size across
>>> multiple shards: ~117KB
>>>
>>> Experiment two:
>>> Index a document of size ~10KB 10,000 times. Total index size across
>>> multiple shards: ~250MB
>>>
>>> I am assuming that the terms (keys) in the inverted index wouldn't
>>> increase by indexing the same document multiple times. Therefore I would
>>> expect the increase in index size to be minimal compared to indexing a
>>> totally different document. Can you tell me what I am missing?
>>
>> https://lucidworks.com/post/solr-segment-merge-frees-wasted-space-caused-by-deleted-documents/
>>
>> In short: Solr doesn't re-use the space occupied by deleted index
>> entries. Replacing a document causes the entries for the previous
>> version to be deleted. Eventually Solr will reorganize parts of the
>> index into new files, and this drops *some* deleted index entries. At
>> any point in time, Solr will be holding some "wasted" space, but it's
>> under control and normally you don't need to worry about it.
>>
>>> On Tue, May 18, 2021 at 12:48 PM Dave wrote:
>>>
>>>> At a certain point the index size doesn’t matter. When you reindex a
>>>> document you do not delete the actual residing document; you mark it as
>>>> deleted and add on the replacement. An optimize is what removes the
>>>> marked deleted files, but an optimize is really no longer a recommended
>>>> process, since Solr is very good at merging and disk is
>>>> inexpensive. I'm guessing the reason the index increased is that even
>>>> though it’s only indexed, that data is still stored and of course
>>>> duplicated. If its performance has not been adversely affected I
>>>> would not ever run the optimize command. I’ve pushed an index that is
>>>> naturally 450GB all the way to 800GB+ and it ran great, assuming you
>>>> have the disk space available
>>>>
>>>>> On May 18, 2021, at 12:37 PM, Kudrettin Güleryüz wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Experimenting with optimizing the index size.
>>>>>
>>>>> Can you help me understand why indexing but not storing a file 10,000
>>>>> times increases the index size by 2,500 times? 7.3 here. Schema and all
>>>>> other conditions are kept constant.
>>>>>
>>>>> Thanks
>>>>
>>
>> --
>> Mark H. Wood
>> Lead Technology Analyst
>>
>> University Library
>> Indiana University - Purdue University Indianapolis
>> 755 W. Michigan Street
>> Indianapolis, IN 46202
>> 317-274-0749
>> www.ulib.iupui.edu
>>
