Re: Indexing large documents
Even the most unstructured data has to have some breakpoint. I've seen projects running Solr that indexed whole books as one document per chapter, plus a boosted synopsis document. The question here is how you need to search and match those docs.

alexei martchenko
Facebook: http://www.facebook.com/alexeiramone | LinkedIn: http://br.linkedin.com/in/alexeimartchenko | Steam: http://steamcommunity.com/id/alexeiramone/ | 4sq: https://pt.foursquare.com/alexeiramone | Skype: alexeiramone | Github: https://github.com/alexeiramone | (11) 9 7613.0966

2014-03-18 23:52 GMT-03:00 Stephen Kottmann <stephen_kottm...@h3biomedicine.com>:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data - i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again - however, in this project we are indexing mostly unstructured data - PDFs, PowerPoint, Word, etc. I've not done much configuration - my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing. I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this.

Thanks everyone!
Stephen

[This e-mail message may contain privileged, confidential and/or proprietary information of H3 Biomedicine. If you believe that it has been sent to you in error, please contact the sender immediately and delete the message including any attachments, without copying, using, or distributing any of the information contained therein. This e-mail message should not be interpreted to include a digital or electronic signature that can be used to authenticate an agreement, contract or other legal document, nor to reflect an intention to be bound to any legally-binding agreement or contract.]
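For illustration, here is a minimal SolrJ sketch of the one-document-per-chapter approach suggested above, with a boosted synopsis document and a shared book_id field for later grouping. The field names (book_id, doc_type, chapter, text), the core name "books", and the boost value are assumptions made up for the example, not anything from the thread - adapt them to your own schema:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ChapterIndexer {

        public static void indexBook(SolrServer solr, String bookId, String synopsis,
                                     List<String> chapters) throws Exception {
            // One boosted synopsis document per book
            SolrInputDocument syn = new SolrInputDocument();
            syn.addField("id", bookId + "_synopsis");
            syn.addField("book_id", bookId);
            syn.addField("doc_type", "synopsis");
            syn.addField("text", synopsis, 2.0f);   // index-time boost on the synopsis text
            solr.add(syn);

            // One document per chapter, all sharing the same book_id
            for (int i = 0; i < chapters.size(); i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", bookId + "_ch" + (i + 1));
                doc.addField("book_id", bookId);
                doc.addField("doc_type", "chapter");
                doc.addField("chapter", i + 1);
                doc.addField("text", chapters.get(i));
                solr.add(doc);
            }
            solr.commit();
        }

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/books");
            // indexBook(solr, "war-and-peace", synopsisText, chapterTexts);
        }
    }

At query time you would then group (or filter) on book_id so one logical document doesn't appear several times in the results.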
Re: Indexing large documents
Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that have about 10GB devoted to Solr. In order to avoid OOMs on Solr versions prior to Solr 4.0, we use separate indexing machine(s) from the search server machine(s) and also set the termIndexInterval to 8 times the default of 128, i.e. <termIndexInterval>1024</termIndexInterval> in solrconfig.xml. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for a description of the problem, although the solution we are using is different: termIndexInterval rather than termInfosDivisor.)

I would like to second Otis' suggestion that you consider breaking large documents into smaller sub-documents. We are currently not doing that, and we believe that relevance ranking is not working well at all. If you consider that most relevance ranking algorithms were designed, tested, and tuned on TREC newswire-size documents (average 300 words) or truncated web documents (average 1,000-3,000 words), it seems likely that they may not work well with book-size documents (average 100,000 words). Ranking algorithms that use IDF will be particularly affected.

We are currently investigating grouping and block-join options. Unfortunately, our data does not have good mark-up or metadata to allow splitting books by chapter. We have investigated indexing pages of books, but due to many issues including performance and scalability (we index the full text of 11 million books, and indexing at the page level would result in 3.3 billion Solr documents), we haven't arrived at a workable solution for our use case. At the moment the main bottleneck is memory use for faceting, but we intend to experiment with docValues to see if the increase in index size is worth the reduction in memory use.

Presently block-join indexing does not implement scoring, although we hope that will change in the near future. The relevance ranking for grouping will rank the group by the highest-ranking member, so if you split a book into chapters, it would rank the book by the highest-ranking chapter. This may be appropriate for your use case, as Otis suggested. In our use case sometimes this is appropriate, but we are investigating the possibility of other methods of scoring the group based on a more flexible function of the scores of the members (i.e. scoring a book based on a function of the scores of its chapters).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
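As a rough illustration of the block-join route mentioned above (a sketch under assumptions, not HathiTrust's actual setup): with Solr/SolrJ 4.5 or later you can index chapters as child documents inside the same block as their parent book and use the {!parent} query parser to return the book whose chapters match. The field names (doc_type, text) and core name are invented for the example, and note the caveat above that block-join scoring was limited at the time of this thread:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BlockJoinSketch {

        // Index a book as a parent document with one child document per chapter.
        static void indexBook(SolrServer solr, String bookId, List<String> chapterTexts) throws Exception {
            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", bookId);
            book.addField("doc_type", "book");

            for (int i = 0; i < chapterTexts.size(); i++) {
                SolrInputDocument chapter = new SolrInputDocument();
                chapter.addField("id", bookId + "_ch" + (i + 1));
                chapter.addField("doc_type", "chapter");
                chapter.addField("text", chapterTexts.get(i));
                book.addChildDocument(chapter); // children are indexed in the same block as the parent
            }
            solr.add(book);
            solr.commit();
        }

        // Match on chapter text but return the parent (book) documents.
        static void searchBooks(SolrServer solr, String terms) throws Exception {
            SolrQuery q = new SolrQuery("{!parent which=doc_type:book}text:(" + terms + ")");
            System.out.println(solr.query(q).getResults().getNumFound() + " matching books");
        }

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/books");
            // indexBook(solr, "book-42", chapterTexts);
            // searchBooks(solr, "napoleon");
        }
    }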
Re: Indexing large documents
Hi,

I think you probably want to split giant documents, because you / your users probably want to be able to find the smaller sections of those big docs that are the best matches to their queries. Imagine querying War and Peace - almost any regular word you query for will produce a match.

Yes, you may want to enable field collapsing aka grouping. I've seen facet counts get messed up when grouping is turned on, but have not confirmed whether this is a (known) bug or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr Elasticsearch Support * http://sematext.com/
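If you do go the split-document route, grouping on a shared parent-id field is what keeps one logical document from showing up several times in the results. A minimal SolrJ sketch, assuming the hypothetical book_id field from the earlier example:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class GroupedSearch {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/books");

            SolrQuery q = new SolrQuery("text:napoleon");
            q.set("group", true);              // enable field collapsing / grouping
            q.set("group.field", "book_id");   // collapse all chapters of the same book
            q.set("group.limit", 3);           // show the top 3 matching chapters per book

            System.out.println(solr.query(q).getGroupResponse().getValues());
        }
    }

The facet-count interaction mentioned above is worth testing against your own data; the group.truncate and group.facet parameters change how facets are computed when grouping is on.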
RE: Indexing large documents
Hi,

I want to know how to update my .xml file so that it has fields other than the default ones - which file do I have to modify, and how?

Praveen Jain
+919890599250

-----Original Message-----
From: Fouad Mardini [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 20, 2007 4:00 PM
To: solr-user@lucene.apache.org
Subject: Indexing large documents

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

Thanks,
fmardini
Re: Indexing large documents
Fouad,

I would check the error log or console for any possible errors first. They may not show up - it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc.). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete
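As a concrete illustration of the heap fix (a sketch - the exact start command and value depend on your deployment): heap size is controlled by the JVM's -Xmx flag on whichever process is doing the work. If you run the bundled Jetty example, that would look something like:

    java -Xmx512m -jar start.jar

If a separate client program does the text extraction before posting to Solr, that JVM may be the one that needs the extra headroom instead.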
Re: Indexing large documents
Well, I am using the Java textmining library to extract text from the documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks
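In case it helps with debugging, here is a bare-bones sketch of the "extract, then POST to Solr" step using plain JDK HTTP. The update URL and field names (id, text) are placeholders for the example, and the extraction call is left out since it depends on which textmining API you are using; checking the HTTP response code and escaping the extracted text are the parts most often skipped:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SolrPoster {

        // Post one document's extracted text to Solr's XML update handler.
        static void postDocument(String solrUpdateUrl, String id, String text) throws Exception {
            String xml = "<add><doc>"
                    + "<field name=\"id\">" + escape(id) + "</field>"
                    + "<field name=\"text\">" + escape(text) + "</field>"
                    + "</doc></add>";

            HttpURLConnection conn = (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(xml.getBytes("UTF-8"));
            out.close();

            // A non-200 response is the kind of failure that is easy to miss
            // if the response is never read, and looks like "the doc was never indexed".
            if (conn.getResponseCode() != 200) {
                System.err.println("Solr rejected doc " + id + ": HTTP " + conn.getResponseCode());
            }
            conn.disconnect();
        }

        // Minimal XML escaping for text embedded in the <add> message.
        static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        public static void main(String[] args) throws Exception {
            // String text = ...extract with your Word-text library of choice...;
            // postDocument("http://localhost:8983/solr/update", "doc-1", text);
        }
    }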
Re: Indexing large documents
That should show some errors if something goes wrong; if not, the console usually will. The errors will look like Java stack-trace output. Did increasing the heap do anything for you? Changing mine to a 256MB max worked fine for all of our files.
Re: Indexing large documents
You will probably need to increase the value of maxFieldLength in your solrconfig.xml. The default value is 10000 (tokens per field), which might explain why your documents are not being completely indexed.

Piete