Thanks Robert,

I'll have to spend some time understanding the default codec for Solr 4.0.
Did I miss something in the changes file?

I'll be digging into the default codec docs and testing sometime in the next
week or two (with a 2 billion term index). If I understand it well enough,
I'll be happy to draft some changes for either the wiki or the Solr example
solrconfig.xml file.

Does this mean that the default codec will reduce memory use for the terms
index enough so I don't need to use either of these settings to deal with
my > 2 billion term indexes?

If neither of these parameters makes sense for the default codec, then
maybe they need to be commented out or removed from the Solr example
solrconfig.xml.

Tom

On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir <rcm...@gmail.com> wrote:

> Hi Tom: I already enhanced the javadocs about this for Lucene, putting
> warnings everywhere in bold:
>
> NOTE: This parameter does not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for term indexes that are implemented as a fixed gap
> between terms.
> NOTE: divisor settings > 1 do not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for terms indexes that can efficiently re-sample terms at
> load time.
> etc
>
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29
>
> In the future I expect these parameters will be removed completely:
> anything like this is specific to the codec/implementation.
>
> In Lucene 4.0 the terms index works completely differently: these
> parameters don't make sense for it.
>
> On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> > Hello all,
> >
> > Due to multiple languages and dirty OCR, our indexes have over 2 billion
> > unique terms
> > (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> > In Solr 3.6 and previous we needed to reduce the memory used for storing
> > the in-memory representation of the tii file. We originally used the
> > termInfosIndexDivisor, which affects the sampling of the tii file when
> > read into memory. While this solved our problem for searching,
> > unfortunately the termInfosIndexDivisor was not read during indexing and
> > caused OOM problems once our indexes grew beyond a certain size. See:
> > https://issues.apache.org/jira/browse/SOLR-2290.
> >
> > Has this been changed in Solr 4.0?
> >
> > The advantage of using the termInfosIndexDivisor is that it can be
> > changed without re-indexing, so we were able to experiment with
> > different settings to determine a good setting without re-indexing
> > several terabytes of data.
> >
> > When we ran into problems with the memory use for the in-memory
> > representation of the tii file during indexing, we changed the
> > termIndexInterval. The termIndexInterval is an indexing-time setting
> > which affects the size of the tii file. It sets the sampling of the tis
> > file that gets written to the tii file.
> >
> > In Solr 4.0 termInfosIndexDivisor has been replaced with
> > termIndexDivisor. The documentation for these two features, the
> > index-time termIndexInterval and the run-time termIndexDivisor, no
> > longer seems to be on the Solr config page of the wiki, and the
> > documentation in the example file does not explain what the
> > termIndexDivisor does.
> >
> > Would it be appropriate to add these back to the wiki page? If not,
> > could someone add a line or two to the comments in the Solr 4.0 example
> > file explaining what the termIndexDivisor does?
> >
> >
> > Tom
>
>
>
> --
> lucidworks.com
>
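
For readers following the thread, the interaction between the two settings described in the quoted messages can be sketched with some simple arithmetic. This is only a rough model of a fixed-gap terms index as described above (every termIndexInterval-th term of the tis file is written to the tii file at index time, and every divisor-th tii entry is kept when the index is loaded). The function name is hypothetical, the interval of 128 is an illustrative value rather than one taken from the thread, and per Robert's reply none of this applies to the default Lucene 4.0 terms index.

```python
def sampled_index_entries(num_terms, term_index_interval, divisor=1):
    """Rough entry counts for a fixed-gap terms index (hypothetical helper).

    Returns (entries written to the index file at index time,
             entries kept in memory at load time).
    """
    if num_terms < 0 or term_index_interval < 1 or divisor < 1:
        raise ValueError("num_terms must be >= 0; interval and divisor must be >= 1")
    # Index time: one index entry per term_index_interval terms (rounded up).
    on_disk = -(-num_terms // term_index_interval)
    # Load time: only every divisor-th index entry is held in memory (rounded up).
    in_memory = -(-on_disk // divisor)
    return on_disk, in_memory


if __name__ == "__main__":
    terms = 2_000_000_000  # order of magnitude mentioned in the thread
    # Illustrative settings only; not values reported by anyone in the thread.
    print(sampled_index_entries(terms, 128))             # divisor left at 1
    print(sampled_index_entries(terms, 128, divisor=8))  # sub-sample at load time
    print(sampled_index_entries(terms, 1024))            # larger interval at index time
```

In this model a larger interval shrinks the index file itself and so requires re-indexing, while the divisor only changes what is loaded, which matches the trade-off Tom describes.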
