Re: Indexing large documents

2014-03-19 Thread Alexei Martchenko
Even the most unstructured data has to have some natural breakpoint. I've seen
projects running Solr that indexed whole books as one document per
chapter, plus a boosted synopsis document. The question here is how you need to
search and match those docs.
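
For illustration only, a minimal SolrJ sketch of that chapter-per-document
layout (field names such as book_id and the boosted synopsis document are
assumptions of mine, not details from Alexei's projects):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.List;

    public class ChapterIndexer {
        // Index one Solr document per chapter, plus a boosted synopsis document,
        // all sharing a book_id key so results can be collapsed per book later.
        public static void indexBook(String bookId, String title, String synopsis,
                                     List<String> chapters) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            for (int i = 0; i < chapters.size(); i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", bookId + "_ch" + (i + 1)); // unique per chapter
                doc.addField("book_id", bookId);              // shared grouping key
                doc.addField("title", title);
                doc.addField("chapter", i + 1);
                doc.addField("text", chapters.get(i));
                solr.add(doc);
            }
            SolrInputDocument syn = new SolrInputDocument();
            syn.addField("id", bookId + "_synopsis");
            syn.addField("book_id", bookId);
            syn.addField("title", title);
            syn.addField("text", synopsis);
            syn.setDocumentBoost(2.0f); // index-time boost for the synopsis doc
            solr.add(syn);
            solr.commit();
            solr.shutdown();
        }
    }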


alexei martchenko
Facebook http://www.facebook.com/alexeiramone |
LinkedIn http://br.linkedin.com/in/alexeimartchenko |
Steam http://steamcommunity.com/id/alexeiramone/ |
4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone |
Github https://github.com/alexeiramone | (11) 9 7613.0966 |


2014-03-18 23:52 GMT-03:00 Stephen Kottmann 
stephen_kottm...@h3biomedicine.com:

 Hi Solr Users,

 I'm looking for advice on best practices when indexing large documents
 (hundreds of MB or even 1-2 GB text files). I've been hunting around on
 Google and the mailing list, and have found some suggestions of splitting
 the logical document up into multiple solr documents. However, I haven't
 been able to find anything that seems like conclusive advice.

 Some background...

 We've been using solr with great success for some time on a project that is
 mostly indexing very structured data, i.e. mainly based on ingesting
 through DIH.

 I've now started a new project and we're trying to make use of solr again;
 however, in this project we are indexing mostly unstructured data: PDFs,
 PowerPoint, Word, etc. I've not done much configuration; my solr instance
 is very close to the example provided in the distribution, aside from some
 minor schema changes. Our index is relatively small at this point (~3k
 documents), and for initial indexing I am pulling documents from an HTTP
 data source, running them through Tika, and then pushing to solr using
 SolrJ. For the most part this is working great... until I hit one of these
 huge text files and then OOM on indexing.

 I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
 it, but it seems like maybe there's a more robust solution that would scale
 better.
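
One way to sidestep the OOM, sketched below with hedged assumptions (the chunk
size, field names, and id scheme are mine, not from this thread): index the
Tika output as several smaller Solr documents instead of one huge field value.
It assumes the extracted text already fits in memory; streaming the extraction
in pieces would go further, but this keeps the sketch short.

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ChunkedIndexer {
        private static final int CHUNK_CHARS = 1_000_000; // roughly 1-2 MB of text per Solr doc

        // Split extracted text into fixed-size chunks, one Solr document each,
        // all carrying a shared file_id so they can be collapsed at query time.
        public static void indexInChunks(HttpSolrServer solr, String fileId, String text)
                throws Exception {
            int part = 0;
            for (int start = 0; start < text.length(); start += CHUNK_CHARS) {
                int end = Math.min(start + CHUNK_CHARS, text.length());
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", fileId + "_part" + part++);
                doc.addField("file_id", fileId);
                doc.addField("text", text.substring(start, end));
                solr.add(doc);
            }
            solr.commit();
        }
    }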

 Is splitting the logical document into multiple solr documents best
 practice here? If so, what are the considerations or pitfalls of doing this
 that I should be paying attention to? I guess when querying I always need
 to use a group-by field to prevent multiple hits for the same document. Are
 there issues with term frequency, etc. that you need to work around?
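
A hedged sketch of that query side, using standard result-grouping parameters
(the file_id field name carries over from the chunking sketch above and is an
assumption):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupedSearch {
        // Collapse chunk/chapter hits back to one group per logical document.
        public static QueryResponse search(HttpSolrServer solr, String userQuery)
                throws Exception {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("group", true);            // enable field collapsing / grouping
            q.set("group.field", "file_id"); // one group per original file
            q.set("group.limit", 3);         // top few matching sections per file
            return solr.query(q);
        }
    }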

 Really interested to hear how others are dealing with this.

 Thanks everyone!
 Stephen

 --
 [This e-mail message may contain privileged, confidential and/or
 proprietary information of H3 Biomedicine. If you believe that it has been
 sent to you in error, please contact the sender immediately and delete the
 message including any attachments, without copying, using, or distributing
 any of the information contained therein. This e-mail message should not be
 interpreted to include a digital or electronic signature that can be used
 to authenticate an agreement, contract or other legal document, nor to
 reflect an intention to be bound to any legally-binding agreement or
 contract.]



Re: Indexing large documents

2014-03-19 Thread Tom Burton-West
Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that
have about 10GB devoted to Solr.  In order to avoid OOMs on Solr versions
prior to Solr 4.0, we use separate indexing machine(s) from the search
server machine(s) and also set the termIndexInterval to 8 times the
default of 128, i.e. <termIndexInterval>1024</termIndexInterval> (see
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for
a description of the problem, although the solution we are using is
different: termIndexInterval rather than termInfosDivisor).
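
For reference, a hedged sketch of where that setting sits in solrconfig.xml
(under <indexConfig> in 4.x-style configs, <indexDefaults>/<mainIndex> in
older ones); the value is the one Tom mentions:

    <indexConfig>
      <termIndexInterval>1024</termIndexInterval>
    </indexConfig>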

I would like to second Otis' suggestion that you consider breaking large
documents into smaller sub-documents.   We are currently not doing that and
we believe that relevance ranking is not working well at all.

If you consider that most relevance ranking algorithms were designed,
tested, and tuned on TREC newswire-size documents (average 300 words) or
truncated web documents (average 1,000-3,000 words), it seems likely that
they may not work well with book-size documents (average 100,000 words).
Ranking algorithms that use IDF will be particularly affected.
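
For background only (not something spelled out in this thread), a rough sketch
of the length-sensitive pieces of Lucene's classic TF-IDF similarity, which
Solr used by default at the time:

    idf(t)        = 1 + log( N / (df(t) + 1) )
    tf(t, d)      = sqrt( freq(t, d) )
    lengthNorm(d) = 1 / sqrt( numTerms(d) )

A 100,000-word book both accumulates much higher raw term frequencies and gets
a very small length norm, so its scores behave quite differently from those of
the newswire-sized documents these formulas were tuned on.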


We are currently investigating grouping and block-join options.
Unfortunately, our data does not have good mark-up or metadata to allow
splitting books by chapter.  We have investigated indexing pages of books,
but due to many issues including performance and scalability (we index
the full text of 11 million books, and indexing at the page level would
result in 3.3 billion Solr documents), we haven't arrived at a workable
solution for our use case.  At the moment the main bottleneck is memory
use for faceting, but we intend to experiment with docValues to see if the
increase in index size is worth the reduction in memory use.
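
A hedged illustration of that docValues experiment (Solr 4.2+ schema.xml
syntax; the field name is hypothetical):

    <field name="topic_facet" type="string" indexed="true" stored="false" docValues="true"/>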

Presently block-join indexing does not implement scoring, although we hope
that will change in the near future and that the relevance ranking for grouping
will rank the group by its highest-ranking member.  So if you split a book
into chapters, it would rank the book by its highest-ranking chapter.
This may be appropriate for your use case, as Otis suggested.  In our use
case this is sometimes appropriate, but we are investigating the
possibility of other methods of scoring the group based on a more flexible
function of the scores of its members (i.e. scoring a book based on a
function of the scores of its chapters).
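
For illustration, a hedged SolrJ 4.5+ sketch of the block-join layout being
investigated (field names and the doc_type discriminator are assumptions; as
Tom notes, scoring across the block was not yet implemented at the time):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.List;

    public class BlockJoinSketch {
        // Index a book and its chapters as one block: chapters are child documents.
        public static void indexBook(HttpSolrServer solr, String bookId, List<String> chapters)
                throws Exception {
            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", bookId);
            book.addField("doc_type", "book");
            for (int i = 0; i < chapters.size(); i++) {
                SolrInputDocument ch = new SolrInputDocument();
                ch.addField("id", bookId + "_ch" + (i + 1));
                ch.addField("doc_type", "chapter");
                ch.addField("text", chapters.get(i));
                book.addChildDocument(ch); // stored in the same index block as the parent
            }
            solr.add(book);
            solr.commit();
        }

        // Return parent (book) documents whose chapter children match the term.
        public static SolrQuery booksMatching(String term) {
            return new SolrQuery("{!parent which=\"doc_type:book\"}text:" + term);
        }
    }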

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



On Tue, Mar 18, 2014 at 11:17 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I think you probably want to split giant documents because you / your users
 probably want to be able to find smaller sections of those big docs that
 are best matches to their queries.  Imagine querying War and Peace.  Almost
 any regular word you query for will produce a match.  Yes, you may want to
 enable field collapsing aka grouping.  I've seen facet counts get messed up
 when grouping is turned on, but have not confirmed if this is a (known) bug
 or not.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann 
 stephen_kottm...@h3biomedicine.com wrote:

  Hi Solr Users,
 
  I'm looking for advice on best practices when indexing large documents
  (100's of MB or even 1 to 2 GB text files). I've been hunting around on
  google and the mailing list, and have found some suggestions of splitting
  the logical document up into multiple solr documents. However, I haven't
  been able to find anything that seems like conclusive advice.
 
  Some background...
 
  We've been using solr with great success for some time on a project that
 is
  mostly indexing very structured data - ie. mainly based on ingesting
  through DIH.
 
  I've now started a new project and we're trying to make use of solr
 again -
  however, in this project we are indexing mostly unstructured data - pdfs,
  powerpoint, word, etc. I've not done much configuration - my solr
 instance
  is very close to the example provided in the distribution aside from some
  minor schema changes. Our index is relatively small at this point ( ~3k
  documents ), and for initial indexing I am pulling documents from a http
  data source, running them through Tika, and then pushing to solr using
  solrj. For the most part this is working great... until I hit one of
 these
  huge text files and then OOM on indexing.
 
  I've got a modest JVM - 4GB allocated. Obviously I can throw more memory
 at
  it, but it seems like maybe there's a more robust solution that would
 scale
  better.
 
  Is splitting the logical document into multiple solr documents best
  practice here? If so, what are the considerations or pitfalls of doing
 this
  that I should be paying attention to. I guess when querying I always need
  to use a group by field to prevent multiple hits for the same document.
 Are
  there issues with term frequency, etc that you need to 

Re: Indexing large documents

2014-03-18 Thread Otis Gospodnetic
Hi,

I think you probably want to split giant documents because you / your users
probably want to be able to find smaller sections of those big docs that
are best matches to their queries.  Imagine querying War and Peace.  Almost
any regular word you query for will produce a match.  Yes, you may want to
enable field collapsing aka grouping.  I've seen facet counts get messed up
when grouping is turned on, but have not confirmed if this is a (known) bug
or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann 
stephen_kottm...@h3biomedicine.com wrote:

 Hi Solr Users,

 I'm looking for advice on best practices when indexing large documents
 (100's of MB or even 1 to 2 GB text files). I've been hunting around on
 google and the mailing list, and have found some suggestions of splitting
 the logical document up into multiple solr documents. However, I haven't
 been able to find anything that seems like conclusive advice.

 Some background...

 We've been using solr with great success for some time on a project that is
 mostly indexing very structured data - ie. mainly based on ingesting
 through DIH.

 I've now started a new project and we're trying to make use of solr again -
 however, in this project we are indexing mostly unstructured data - pdfs,
 powerpoint, word, etc. I've not done much configuration - my solr instance
 is very close to the example provided in the distribution aside from some
 minor schema changes. Our index is relatively small at this point ( ~3k
 documents ), and for initial indexing I am pulling documents from a http
 data source, running them through Tika, and then pushing to solr using
 solrj. For the most part this is working great... until I hit one of these
 huge text files and then OOM on indexing.

 I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
 it, but it seems like maybe there's a more robust solution that would scale
 better.

 Is splitting the logical document into multiple solr documents best
 practice here? If so, what are the considerations or pitfalls of doing this
 that I should be paying attention to. I guess when querying I always need
 to use a group by field to prevent multiple hits for the same document. Are
 there issues with term frequency, etc that you need to work around?

 Really interested to hear how others are dealing with this.

 Thanks everyone!
 Stephen

 --
 [This e-mail message may contain privileged, confidential and/or
 proprietary information of H3 Biomedicine. If you believe that it has been
 sent to you in error, please contact the sender immediately and delete the
 message including any attachments, without copying, using, or distributing
 any of the information contained therein. This e-mail message should not be
 interpreted to include a digital or electronic signature that can be used
 to authenticate an agreement, contract or other legal document, nor to
 reflect an intention to be bound to any legally-binding agreement or
 contract.]



RE: Indexing large documents

2007-08-20 Thread praveen jain
Hi,
 I want to know how to update my .xml file, which has fields other than the
default ones. Which file do I have to modify, and how?

pRAVEEN jAIN
+919890599250

-Original Message-
From: Fouad Mardini [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 20, 2007 4:00 PM
To: solr-user@lucene.apache.org
Subject: Indexing large documents

Hello,

I am using Solr to index text extracted from Word documents, and it is
working really well.
Recently I started noticing that some documents are not indexed; that is, I
know that the word foobar is in a document, but when I search for foobar the
id of that document is not returned.
I suspect that this has to do with the size of the document, and that
documents with a lot of text are not being indexed.
Please advise.

thanks,
fmardini



Re: Indexing large documents

2007-08-20 Thread Peter Manis
Fouad,

I would check the error log or console for any possible errors first.
They may not show up; it really depends on how you are processing the
Word document (custom Solr, feeding the text to it, etc.).  We are
using a custom version of Solr with PDF, DOC, XLS, etc. text extraction,
and I have successfully indexed 40 MB documents.  I did have indexing
problems with a large document or two, and simply increasing the heap
size fixed the problem.

 - Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
 Hello,

 I am using solr to index text extracted from word documents, and it is
 working really well.
 Recently i started noticing that some documents are not indexed, that is i
 know that the word foobar is in a document, but when i search for foobar the
 id of that document is not returned.
 I suspect that this has to do with the size of the document, and that
 documents with a lot of text are not being indexed.
 Please advise.

 thanks,
 fmardini



Re: Indexing large documents

2007-08-20 Thread Fouad Mardini
Well, I am using the Java textmining library to extract text from documents,
then I do a POST to Solr.
I do not have an error log; I only have *.request.log files in the logs
directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

 Fouad,

 I would check the error log or console for any possible errors first.
 They may not show up, it really depends on how you are processing the
 word document (custom solr, feeding the text to it, etc).  We are
 using a custom version of solr with PDF, DOC, XLS, etc text extraction
 and I have successfully indexed 40mb documents.  I did have indexing
 problems with a large document or two and simply increasing the heap
 size fixed the problem.

 - Pete

 On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
  Hello,
 
  I am using solr to index text extracted from word documents, and it is
  working really well.
  Recently i started noticing that some documents are not indexed, that is
 i
  know that the word foobar is in a document, but when i search for foobar
 the
  id of that document is not returned.
  I suspect that this has to do with the size of the document, and that
  documents with a lot of text are not being indexed.
  Please advise.
 
  thanks,
  fmardini
 



Re: Indexing large documents

2007-08-20 Thread Peter Manis
That should show some errors if something goes wrong; if not, the
console usually will.  The errors will look like Java stack-trace
output.  Did increasing the heap do anything for you?  Changing mine
to 256 MB max worked fine for all of our files.
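
For anyone following along, a hedged example of what that looked like with the
bundled Jetty start of that era (paths and the 256m figure, which mirrors
Pete's, are illustrative):

    java -Xmx256m -jar start.jar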

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
 Well, I am using the java textmining library to extract text from documents,
 then i do a post to solr
 I do not have an error log, i only have *.request.log files in the logs
 directory

 Thanks

 On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:
 
  Fouad,
 
  I would check the error log or console for any possible errors first.
  They may not show up, it really depends on how you are processing the
  word document (custom solr, feeding the text to it, etc).  We are
  using a custom version of solr with PDF, DOC, XLS, etc text extraction
  and I have successfully indexed 40mb documents.  I did have indexing
  problems with a large document or two and simply increasing the heap
  size fixed the problem.
 
  - Pete
 
  On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
   Hello,
  
   I am using solr to index text extracted from word documents, and it is
   working really well.
   Recently i started noticing that some documents are not indexed, that is
  i
   know that the word foobar is in a document, but when i search for foobar
  the
   id of that document is not returned.
   I suspect that this has to do with the size of the document, and that
   documents with a lot of text are not being indexed.
   Please advise.
  
   thanks,
   fmardini
  
 



Re: Indexing large documents

2007-08-20 Thread Pieter Berkel
You will probably need to increase the value of maxFieldLength in your
solrconfig.xml.  The default value is 10,000, which might explain why your
documents are not being completely indexed.
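
A hedged sketch of the change in solrconfig.xml (Solr 1.x-era configs have
this under both <indexDefaults> and <mainIndex>; the raised value here is
just an example):

    <maxFieldLength>2147483647</maxFieldLength>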

Piete


On 20/08/07, Peter Manis [EMAIL PROTECTED] wrote:

 The that should show some errors if something goes wrong, if not the
 console usually will.  The errors will look like a java stacktrace
 output.  Did increasing the heap do anything for you?  Changing mine
 to 256mb max worked fine for all of our files.

 On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
  Well, I am using the java textmining library to extract text from
 documents,
  then i do a post to solr
  I do not have an error log, i only have *.request.log files in the logs
  directory
 
  Thanks
 
  On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:
  
   Fouad,
  
   I would check the error log or console for any possible errors first.
   They may not show up, it really depends on how you are processing the
   word document (custom solr, feeding the text to it, etc).  We are
   using a custom version of solr with PDF, DOC, XLS, etc text extraction
   and I have successfully indexed 40mb documents.  I did have indexing
   problems with a large document or two and simply increasing the heap
   size fixed the problem.
  
   - Pete
  
   On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:
Hello,
   
I am using solr to index text extracted from word documents, and it
 is
working really well.
Recently i started noticing that some documents are not indexed,
 that is
   i
know that the word foobar is in a document, but when i search for
 foobar
   the
id of that document is not returned.
I suspect that this has to do with the size of the document, and
 that
documents with a lot of text are not being indexed.
Please advise.
   
thanks,
fmardini