Hello all,

Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
In Solr 3.6 and earlier we needed to reduce the memory used for storing
the in-memory representation of the tii file. We originally used the
termInfosIndexDivisor, which affects how the tii file is sampled when it
is read into memory. While this solved our problem for searching, the
termInfosIndexDivisor was unfortunately not read during indexing, which
caused OOM problems once our indexes grew beyond a certain size. See:
https://issues.apache.org/jira/browse/SOLR-2290.

Has this been changed in Solr 4.0?

The advantage of using the termInfosIndexDivisor is that it can be changed
without re-indexing, so we were able to experiment with different settings
to determine a good setting without re-indexing several terabytes of data.

When we ran into problems with the memory used by the in-memory
representation of the tii file during indexing, we changed the
termIndexInterval. The termIndexInterval is an indexing-time setting
that affects the size of the tii file: it sets the sampling of the tis
file that gets written to the tii file.
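
To make the two sampling steps concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes the simple model described above (every Nth term
of the tis file is written to the tii file, and the divisor then keeps every
Mth tii entry in memory) and ignores per-segment effects. The interval of 128
is, as far as I recall, the Lucene 3.x default, and the divisor of 8 is purely
a hypothetical value; the function name is made up for illustration.

```python
def sampled_entries(unique_terms, term_index_interval, divisor=1):
    """Estimate term index sizes under a simple every-Nth-term sampling model.

    Returns (entries written to the tii file, entries held in memory).
    This is an illustrative approximation, not Lucene's actual code.
    """
    # Index-time sampling: every term_index_interval-th term goes to the tii file.
    on_disk = -(-unique_terms // term_index_interval)  # ceiling division
    # Read-time sampling: only every divisor-th tii entry is loaded into memory.
    in_memory = -(-on_disk // divisor)
    return on_disk, in_memory


if __name__ == "__main__":
    terms = 2_000_000_000  # roughly the 2 billion unique terms mentioned above
    # Assumed interval of 128 with no divisor, then with a hypothetical divisor of 8.
    print(sampled_entries(terms, 128))
    print(sampled_entries(terms, 128, divisor=8))
```

Under this model, raising the divisor shrinks only the in-memory count, which
is why it can be tuned without re-indexing, whereas changing the interval
changes what is written to the tii file.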

In Solr 4.0 termInfosIndexDivisor has been replaced with
termIndexDivisor. The documentation for these two features, the
index-time termIndexInterval and the run-time termIndexDivisor, no longer
seems to be on the Solr config page of the wiki, and the documentation in
the example file does not explain what the termIndexDivisor does.

Would it be appropriate to add these back to the wiki page?  If not, could
someone add a line or two to the comments in the Solr 4.0 example file
explaining what the termIndexDivisor does?


Tom
