Re: Problems for indexing large documents on SolrCloud
Hi,

First, thanks for your advice. I ran several more tests and finally managed to index all the data on my SolrCloud cluster. The error was on the client side; it is documented in this post: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201406.mbox/%3ccfc09ae1.94f8%25rebecca.t...@ucsf.edu%3E

EofException from Jetty means one specific thing: the client software disconnected before Solr was finished with the request and sent its response. Chances are good that this is because of a configured socket timeout on your SolrJ client or its HttpClient. This might have been done with the setSoTimeout method on the server object.

So I increased the Solarium timeout from 5 to 60 seconds and all the data is now indexed correctly. The error was not reproducible on my development PC because the database and Solr were on the same local virtual machine with plenty of available resources, so indexing there was faster than in the SolrCloud cluster.

Thanks,
Olivier

2014-09-11 0:21 GMT+02:00 Shawn Heisey s...@elyograg.org:

On 9/10/2014 2:05 PM, Erick Erickson wrote:
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

The maximum HTTP request size is actually controlled by Solr itself since 4.1, with changes committed for SOLR-4265. Changing the setting on Tomcat probably will not help. An example from my own config which sets this to 32MB (the default is 2048, i.e. 2MB):

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="32768" formdataUploadLimitInKB="32768"/>

Thanks,
Shawn
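[Editor's note: for reference, a minimal SolrJ 4.x sketch of the timeout settings Shawn describes (setSoTimeout on the server object). The URL, core name, and the specific millisecond values are illustrative assumptions, not taken from the thread.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class TimeoutExample {
        public static void main(String[] args) {
            // Placeholder URL; point this at your own collection.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Socket (read) timeout: how long the client waits for Solr's response.
            // A large bulk update on a busy cluster can easily exceed a 5-second default.
            server.setSoTimeout(60000);         // 60 seconds
            // Connection timeout: how long to wait for the TCP connection itself.
            server.setConnectionTimeout(15000); // 15 seconds

            // ... send update requests as usual ...
            server.shutdown();
        }
    }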
Problems for indexing large documents on SolrCloud
Hi,

I have some problems indexing large documents in a SolrCloud cluster of 3 servers (Solr 4.8.1) with 3 shards and 2 replicas per shard, on Tomcat 7. One specific document (with 300K values in a multivalued field) could not be indexed on SolrCloud, but I could index it on a single Solr instance on my own PC. Indexing is done with Solarium from a database. The indexed data are e-commerce products with classic fields like name, price, description, instock, etc. The large field (type int) consists of the ids of other products. The only difference from the other documents that index fine is the size of that multivalued field: those documents all have between 100K and 200K values for that field. The index size is 11 MB for 20 documents.

To solve it, I tried to change several parameters, including the ZooKeeper timeouts in solr.xml.

In the solrcloud section:
    <int name="zkClientTimeout">6</int>
    <int name="distribUpdateConnTimeout">10</int>
    <int name="distribUpdateSoTimeout">10</int>

In the shardHandlerFactory section:
    <int name="socketTimeout">${socketTimeout:10}</int>
    <int name="connTimeout">${connTimeout:10}</int>

I also tried to increase these values in solrconfig.xml:
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="1" formdataUploadLimitInKB="10" addHttpRequestToContext="false"/>

I also tried to increase the amount of RAM (these are VMs): each server has 4 GB of RAM, with 3 GB for the JVM. Are there any other settings that could solve the problem and that I may have forgotten?

The error messages are:
ERROR SolrDispatchFilter null:java.lang.RuntimeException: [was class java.net.SocketException] Connection reset
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected EOF in attribute value
SolrCore org.apache.solr.common.SolrException: Unexpected end of input block in start tag

Thanks,
Olivier
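[Editor's note: the thread's indexing goes through Solarium (PHP); as a rough sketch of the same scenario in SolrJ, the snippet below builds a document with a very large multivalued field. The core name and field names are invented for illustration.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class LargeMultiValuedDoc {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "product-42");
            doc.addField("name", "Example product");

            // A very large multivalued int field (ids of related products).
            // 300K values make a multi-megabyte update request, which is where
            // request-size limits and client/server timeouts start to matter.
            for (int i = 0; i < 300000; i++) {
                doc.addField("related_ids", i);
            }

            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }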
Re: Problems for indexing large documents on SolrCloud
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

Best,
Erick

On Wed, Sep 10, 2014 at 9:11 AM, Olivier olivau...@gmail.com wrote:

Hi,

I have some problems indexing large documents in a SolrCloud cluster of 3 servers (Solr 4.8.1) with 3 shards and 2 replicas per shard, on Tomcat 7. One specific document (with 300K values in a multivalued field) could not be indexed on SolrCloud, but I could index it on a single Solr instance on my own PC. Indexing is done with Solarium from a database. The indexed data are e-commerce products with classic fields like name, price, description, instock, etc. The large field (type int) consists of the ids of other products. The only difference from the other documents that index fine is the size of that multivalued field: those documents all have between 100K and 200K values for that field. The index size is 11 MB for 20 documents.

To solve it, I tried to change several parameters, including the ZooKeeper timeouts in solr.xml.

In the solrcloud section:
    <int name="zkClientTimeout">6</int>
    <int name="distribUpdateConnTimeout">10</int>
    <int name="distribUpdateSoTimeout">10</int>

In the shardHandlerFactory section:
    <int name="socketTimeout">${socketTimeout:10}</int>
    <int name="connTimeout">${connTimeout:10}</int>

I also tried to increase these values in solrconfig.xml:
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="1" formdataUploadLimitInKB="10" addHttpRequestToContext="false"/>

I also tried to increase the amount of RAM (these are VMs): each server has 4 GB of RAM, with 3 GB for the JVM. Are there any other settings that could solve the problem and that I may have forgotten?

The error messages are:
ERROR SolrDispatchFilter null:java.lang.RuntimeException: [was class java.net.SocketException] Connection reset
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected EOF in attribute value
SolrCore org.apache.solr.common.SolrException: Unexpected end of input block in start tag

Thanks,
Olivier
Re: Problems for indexing large documents on SolrCloud
On 9/10/2014 2:05 PM, Erick Erickson wrote:
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

The maximum HTTP request size is actually controlled by Solr itself since 4.1, with changes committed for SOLR-4265. Changing the setting on Tomcat probably will not help. An example from my own config which sets this to 32MB (the default is 2048, i.e. 2MB):

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="32768" formdataUploadLimitInKB="32768"/>

Thanks,
Shawn
Re: Indexing large documents
Even the most non-structured data has to have some breakpoint. I've seen projects running Solr that indexed whole books as one document per chapter, plus a boosted synopsis document. The question here is how you need to search and match those docs.

alexei martchenko
Facebook http://www.facebook.com/alexeiramone | Linkedin http://br.linkedin.com/in/alexeimartchenko | Steam http://steamcommunity.com/id/alexeiramone | 4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone | Github https://github.com/alexeiramone | (11) 9 7613.0966

2014-03-18 23:52 GMT-03:00 Stephen Kottmann stephen_kottm...@h3biomedicine.com:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen
Re: Indexing large documents
Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that have about 10GB devoted to Solr. In order to avoid OOMs on Solr versions prior to Solr 4.0, we use separate indexing machine(s) from the search server machine(s) and also set the termIndexInterval to 8 times the default of 128:

    <termIndexInterval>1024</termIndexInterval>

(See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for a description of the problem, although the solution we are using is different: termIndexInterval rather than termInfosDivisor.)

I would like to second Otis' suggestion that you consider breaking large documents into smaller sub-documents. We are currently not doing that, and we believe that relevance ranking is not working well at all. If you consider that most relevance ranking algorithms were designed, tested, and tuned on TREC newswire-size documents (average 300 words) or truncated web documents (average 1,000-3,000 words), it seems likely that they may not work well with book-size documents (average 100,000 words). Ranking algorithms that use IDF will be particularly affected.

We are currently investigating grouping and block-join options. Unfortunately, our data does not have good markup or metadata to allow splitting books by chapter. We have investigated indexing pages of books, but due to many issues including performance and scalability (we index the full text of 11 million books, and indexing at the page level would result in 3.3 billion Solr documents), we haven't arrived at a workable solution for our use case. At the moment the main bottleneck is memory use for faceting, but we intend to experiment with docValues to see if the increase in index size is worth the reduction in memory use.

Presently, block-join indexing does not implement scoring, although we hope that will change in the near future, and the relevance ranking for grouping will rank the group by its highest-ranking member. So if you split a book into chapters, it would rank the book by the highest-ranking chapter. This may be appropriate for your use case, as Otis suggested. In our use case this is sometimes appropriate, but we are investigating other methods of scoring the group based on a more flexible function of the scores of the members (i.e. scoring a book based on a function of the scores of its chapters).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

On Tue, Mar 18, 2014 at 11:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

I think you probably want to split giant documents because you / your users probably want to be able to find smaller sections of those big docs that are best matches to their queries. Imagine querying War and Peace: almost any regular word you query for will produce a match. Yes, you may want to enable field collapsing aka grouping. I've seen facet counts get messed up when grouping is turned on, but have not confirmed if this is a (known) bug or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann stephen_kottm...@h3biomedicine.com wrote:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?
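[Editor's note: Tom mentions grouping and block-join as ways to split a book into chapter-level documents. A minimal SolrJ sketch of the block-join (parent/child) indexing idea follows, assuming a recent 4.x SolrJ with SolrInputDocument.addChildDocument; the core name and field names are invented for illustration.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BookChapterBlockJoin {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/books");

            // Parent document: the book itself (metadata only).
            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", "book-1");
            book.addField("type", "book");
            book.addField("title", "Example Book");

            // Child documents: one per chapter, each small enough to rank sensibly.
            for (int i = 1; i <= 3; i++) {
                SolrInputDocument chapter = new SolrInputDocument();
                chapter.addField("id", "book-1-chapter-" + i);
                chapter.addField("type", "chapter");
                chapter.addField("text", "full text of chapter " + i);
                book.addChildDocument(chapter); // indexed as one block with the parent
            }

            server.add(book);
            server.commit();
            server.shutdown();

            // At query time, a block-join parent query such as
            //   q={!parent which="type:book"}text:whale
            // returns the matching books rather than the individual chapters.
        }
    }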
Indexing large documents
Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen

--
[This e-mail message may contain privileged, confidential and/or proprietary information of H3 Biomedicine. If you believe that it has been sent to you in error, please contact the sender immediately and delete the message including any attachments, without copying, using, or distributing any of the information contained therein. This e-mail message should not be interpreted to include a digital or electronic signature that can be used to authenticate an agreement, contract or other legal document, nor to reflect an intention to be bound to any legally-binding agreement or contract.]
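[Editor's note: Stephen describes pulling documents, extracting text with Tika, and pushing to Solr with SolrJ. A minimal sketch of that pipeline is below, assuming the Tika facade class and SolrJ 4.x; the URL, file path, and field names are illustrative only.]

    import java.io.File;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class TikaToSolr {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/docs");

            Tika tika = new Tika();
            tika.setMaxStringLength(-1); // the Tika facade truncates to ~100k chars by default

            File file = new File("/path/to/huge-document.pdf"); // placeholder path

            // Extracting a 1-2 GB source file yields a correspondingly huge String,
            // which is typically where the indexing JVM runs out of heap.
            String text = tika.parseToString(file);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getName());
            doc.addField("content", text);

            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }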
Re: Indexing large documents
Hi,

I think you probably want to split giant documents because you / your users probably want to be able to find smaller sections of those big docs that are best matches to their queries. Imagine querying War and Peace: almost any regular word you query for will produce a match. Yes, you may want to enable field collapsing aka grouping. I've seen facet counts get messed up when grouping is turned on, but have not confirmed if this is a (known) bug or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann stephen_kottm...@h3biomedicine.com wrote:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen
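[Editor's note: Otis suggests grouping (field collapsing) so chunked documents don't flood results. A minimal SolrJ sketch of a grouped query follows; the core name and the source_doc_id field are assumptions for illustration, not from the thread.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupedSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/docs");

            SolrQuery q = new SolrQuery("peace");
            q.set("group", true);
            q.set("group.field", "source_doc_id"); // field tying each chunk back to its source document
            q.set("group.limit", 3);               // best few chunks per source document

            QueryResponse rsp = server.query(q);
            // One group per source document, so a single large file
            // no longer floods the result list with near-duplicate hits.
            System.out.println(rsp.getGroupResponse().getValues());

            server.shutdown();
        }
    }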
Re: Is indexing large documents still an issue?
You can still use highlighting without returning the content. Just set content as your alternate highlight field. Then, if no highlights are returned, you will receive the content. Make sure you set a character limit so you don't get the whole thing; I use 300. Does that make sense?

This is what I add to my query string:
hl=true&hl.fl=content&hl.snippets=3&hl.alternateField=content&hl.maxAlternateFieldLength=300

On Thu, May 2, 2013 at 7:32 AM, adfel70 adfe...@gmail.com wrote:

Well, returning the content field for highlighting is within my requirements. Did you solve this in some other way, or did you just not have to?

Bai Shen wrote:
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
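[Editor's note: the same highlighting parameters expressed through SolrJ, as a rough sketch; the field names (id, title, content) mirror the ones used in this thread, and the query term is arbitrary.]

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightWithoutContent {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("foobar");
            q.setFields("id", "title");            // don't return the huge content field itself

            q.setHighlight(true);
            q.addHighlightField("content");
            q.setHighlightSnippets(3);
            // Fall back to the first 300 characters of content when no snippet matches.
            q.set("hl.alternateField", "content");
            q.set("hl.maxAlternateFieldLength", 300);

            System.out.println(q); // prints the equivalent URL query parameters
        }
    }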
Is indexing large documents still an issue?
Hi,

In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?

thanks.
Re: Is indexing large documents still an issue?
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 adfe...@gmail.com wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
Re: Is indexing large documents still an issue?
Well, returning the content field for highlighting is within my requirements. Did you solve this in some other way, or did you just not have to?

Bai Shen wrote:
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
Indexing large documents
Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
RE: Indexing large documents
Hi,

I want to know how to update my .xml file, which has fields other than the default ones. Which file do I have to modify, and how?

Praveen Jain
+919890599250

-----Original Message-----
From: Fouad Mardini [mailto:[EMAIL PROTECTED]
Sent: Monday, August 20, 2007 4:00 PM
To: solr-user@lucene.apache.org
Subject: Indexing large documents

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
That should show some errors if something goes wrong; if not, the console usually will. The errors will look like a Java stack trace. Did increasing the heap do anything for you? Changing mine to a 256MB max worked fine for all of our files.

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
You will probably need to increase the value of maxFieldLength in your solrconfig.xml. The default value is 10000 tokens, which might explain why your documents are not being completely indexed.

Piete

On 20/08/07, Peter Manis [EMAIL PROTECTED] wrote:

That should show some errors if something goes wrong; if not, the console usually will. The errors will look like a Java stack trace. Did increasing the heap do anything for you? Changing mine to a 256MB max worked fine for all of our files.

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini