Hello all,

Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again). In Solr 3.6 and earlier, we needed to reduce the memory used for the in-memory representation of the tii file. We originally used the termInfosIndexDivisor, which controls how the tii file is sampled when it is read into memory. While this solved our problem for searching, the termInfosIndexDivisor was unfortunately not read during indexing, which caused OOM problems once our indexes grew beyond a certain size. See: https://issues.apache.org/jira/browse/SOLR-2290

Has this been changed in Solr 4.0?

The advantage of the termInfosIndexDivisor is that it can be changed without re-indexing, so we were able to experiment with different values and find a good setting without re-indexing several terabytes of data. When we ran into problems during indexing with the memory used by the in-memory representation of the tii file, we changed the termIndexInterval instead. The termIndexInterval is an index-time setting that affects the size of the tii file: it sets the sampling of the tis file that gets written to the tii file.

In Solr 4.0, termInfosIndexDivisor has been replaced with termIndexDivisor. The documentation for these two features, the index-time termIndexInterval and the run-time termIndexDivisor, no longer seems to be on the Solr config page of the wiki, and the documentation in the example file does not explain what the termIndexDivisor does.

Would it be appropriate to add these back to the wiki page? If not, could someone add a line or two to the comments in the Solr 4.0 example file explaining what the termIndexDivisor does?

Tom
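
For readers less familiar with the two settings, their combined effect can be sketched with some rough arithmetic. This is only an illustration, not how Lucene computes anything internally: the function names are hypothetical, the default interval of 128 is assumed, integer division stands in for the actual sampling, and the 2 billion terms are treated as if they lived in a single segment.

```python
# Back-of-the-envelope estimate of how many term index entries end up in memory.
# Assumptions (not taken from the Solr/Lucene source): one segment, uniform
# sampling, and a default termIndexInterval of 128.

def tii_entries(unique_terms: int, term_index_interval: int = 128) -> int:
    """Approximate entries written to the tii file at index time:
    roughly every term_index_interval-th term from the tis file."""
    return unique_terms // term_index_interval


def in_memory_entries(unique_terms: int,
                      term_index_interval: int = 128,
                      divisor: int = 1) -> int:
    """Approximate tii entries held in memory when the index is opened:
    roughly every divisor-th entry of the tii file."""
    return tii_entries(unique_terms, term_index_interval) // divisor


if __name__ == "__main__":
    terms = 2_000_000_000
    # Default interval, no divisor.
    print(in_memory_entries(terms))
    # Run-time change only: divisor of 8, no re-indexing needed.
    print(in_memory_entries(terms, divisor=8))
    # Index-time change only: interval of 1024, requires re-indexing.
    print(in_memory_entries(terms, term_index_interval=1024))
```

Under these assumptions, a divisor of 8 on the default interval and an interval of 1024 with no divisor give the same in-memory entry count, since 128 * 8 = 1024. The practical difference is the one described above: the divisor can be changed without re-indexing, while the interval cannot.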