Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that
have about 10GB devoted to Solr.  To avoid OOMs on Solr versions prior to
4.0, we use separate indexing machine(s) from the search server machine(s)
and also set termIndexInterval to 8 times the default of 128:
<termIndexInterval>1024</termIndexInterval>
(See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
for a description of the problem, although the solution we use is
different: termIndexInterval rather than termInfosDivisor.)
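
For reference, this is roughly where that setting sits in solrconfig.xml.
Treat it as a sketch: the enclosing element varies by version (it goes
under <indexDefaults> in the 3.x example configs and under <indexConfig>
in 4.x), so check the example solrconfig.xml shipped with your release:

  <!-- Build one terms-index entry every 1024 terms instead of every 128,
       so the in-memory terms index holds roughly 1/8 as many entries. -->
  <indexDefaults>
    <termIndexInterval>1024</termIndexInterval>
  </indexDefaults>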

I would like to second Otis' suggestion that you consider breaking large
documents into smaller sub-documents.  We are currently not doing that, and
we believe that, as a result, relevance ranking is not working well at all.

If you consider that most relevance ranking algorithms were designed,
tested, and tuned on TREC newswire-size documents (average 300 words) or
truncated web documents (average 1,000-3,000 words), it seems likely that
they may not work well with book-size documents (average 100,000 words).
Ranking algorithms that use IDF will be particularly affected.
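
To make that concrete, here is a rough sketch of the relevant pieces of
Lucene's default TF-IDF similarity (as documented for Lucene/Solr 4.x);
the numbers below are for illustration only, not measurements from our
index:

  idf(t)        = 1 + ln( numDocs / (docFreq(t) + 1) )
  lengthNorm(d) = 1 / sqrt( number of terms in d )

In a corpus of full books, common query terms occur in nearly every
document, so docFreq approaches numDocs and idf flattens toward 1 for most
terms.  Meanwhile lengthNorm penalizes a 100,000-word book by roughly
sqrt(100,000/300), i.e. about 18x, relative to a 300-word news article, so
document length tends to swamp the term statistics.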


We are currently investigating grouping and block-join options.
Unfortunately, our data does not have good mark-up or metadata to allow
splitting books by chapter.  We have investigated indexing individual pages
of books, but due to a number of issues, including performance and
scalability (we index the full text of 11 million books, and indexing at
the page level would result in about 3.3 billion Solr documents), we
haven't arrived at a workable solution for our use case.  At the moment the
main bottleneck is memory use for faceting, but we intend to experiment
with docValues to see whether the increase in index size is worth the
reduction in memory use.
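
If you do experiment with docValues for faceting, the schema change itself
is small.  A minimal sketch of a schema.xml field definition follows; the
field name is invented for the example, and you need Solr 4.2+ plus a full
re-index before it takes effect:

  <!-- Facet field backed by docValues: per-document values are written at
       index time and read from disk-resident docValues structures instead
       of being un-inverted into the FieldCache on the Java heap. -->
  <field name="topic_facet" type="string" indexed="true" stored="false"
         multiValued="true" docValues="true"/>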

Presently, block-join querying does not implement scoring, although we hope
that will change in the near future.  The relevance ranking for grouping,
on the other hand, ranks each group by its highest-ranking member, so if
you split a book into chapters, it would rank the book by its
highest-ranking chapter.  As Otis suggested, this may be appropriate for
your use case.  In our use case it is sometimes appropriate, but we are
investigating other methods of scoring the group based on a more flexible
function of the scores of its members (i.e. scoring a book based on some
function of the scores of its chapters).
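
For anyone who wants to experiment with the block-join route, the indexing
side looks roughly like the sketch below (nested documents in the XML
update format, Solr 4.5+).  The ids and field names here are invented for
the example:

  <add>
    <doc>
      <!-- parent document: one per book -->
      <field name="id">book_001</field>
      <field name="doc_type">book</field>
      <field name="title">War and Peace</field>
      <!-- child documents: one per chapter -->
      <doc>
        <field name="id">book_001_ch01</field>
        <field name="doc_type">chapter</field>
        <field name="chapter_text">(full text of chapter one)</field>
      </doc>
      <doc>
        <field name="id">book_001_ch02</field>
        <field name="doc_type">chapter</field>
        <field name="chapter_text">(full text of chapter two)</field>
      </doc>
    </doc>
  </add>

At query time you match chapters and return the enclosing books with
something like {!parent which="doc_type:book"}chapter_text:(terms), though,
as noted above, the parent currently comes back without a score derived
from its children.  The grouping alternative is to index flat chapter
documents carrying a book-id field and query with group=true and
group.field set to that id; the groups are then ordered by their
top-scoring member.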

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



On Tue, Mar 18, 2014 at 11:17 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I think you probably want to split giant documents because you / your users
> probably want to be able to find smaller sections of those big docs that
> are best matches to their queries.  Imagine querying War and Peace.  Almost
> any regular word you query for will produce a match.  Yes, you may want to
> enable field collapsing aka grouping.  I've seen facet counts get messed up
> when grouping is turned on, but have not confirmed if this is a (known) bug
> or not.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann <
> stephen_kottm...@h3biomedicine.com> wrote:
>
> > Hi Solr Users,
> >
> > I'm looking for advice on best practices when indexing large documents
> > (100's of MB or even 1 to 2 GB text files). I've been hunting around on
> > google and the mailing list, and have found some suggestions of splitting
> > the logical document up into multiple solr documents. However, I haven't
> > been able to find anything that seems like conclusive advice.
> >
> > Some background...
> >
> > We've been using solr with great success for some time on a project that is
> > mostly indexing very structured data - ie. mainly based on ingesting
> > through DIH.
> >
> > I've now started a new project and we're trying to make use of solr again -
> > however, in this project we are indexing mostly unstructured data - pdfs,
> > powerpoint, word, etc. I've not done much configuration - my solr instance
> > is very close to the example provided in the distribution aside from some
> > minor schema changes. Our index is relatively small at this point ( ~3k
> > documents ), and for initial indexing I am pulling documents from a http
> > data source, running them through Tika, and then pushing to solr using
> > solrj. For the most part this is working great... until I hit one of these
> > huge text files and then OOM on indexing.
> >
> > I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
> > it, but it seems like maybe there's a more robust solution that would scale
> > better.
> >
> > Is splitting the logical document into multiple solr documents best
> > practice here? If so, what are the considerations or pitfalls of doing this
> > that I should be paying attention to. I guess when querying I always need
> > to use a group by field to prevent multiple hits for the same document. Are
> > there issues with term frequency, etc that you need to work around?
> >
> > Really interested to hear how others are dealing with this.
> >
> > Thanks everyone!
> > Stephen
> >
>
