A few days ago, I posted about an issue with SOLR running out of memory when attempting to index large text files (say 300 MB). Details at http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html
Two things I need to point out:

1. I don't need Tika for content extraction, as the files are already in plain text format.
2. The heap space error was caused by a futile Tika/SOLR attempt at creating the corresponding huge XML document in memory.

I've decided to develop a custom handler that:

1. reads the file text directly, and
2. attempts to create a SOLR document and add the text data directly to the corresponding field.

One approach I've taken is to read manageable chunks of text data sequentially from the file and process them. We've used this approach successfully with Lucene in the past, and I'm attempting to make it work with SOLR too. I got most of the work done yesterday, but need a bit of guidance w.r.t. point 2: how can I update the same field multiple times? Looking at the SOLR source, processor.addField() merely (a) adds to the in-memory field map and (b) attempts to write EVERYTHING to the index later on. In my situation, (a) eventually causes a heap space error.

Here's a simplified sketch of the relevant part of the handler (below).

Thanks much.
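The sketch below is illustrative rather than the exact handler code: the class name, field name, and chunk size are placeholders. It shows the chunked read and the repeated addField() calls; each call just accumulates another value in the in-memory field map, which is what eventually exhausts the heap.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;

// Hypothetical helper used by the custom handler: reads the plain-text file
// in chunks and adds each chunk to the document's text field.
public class ChunkedFileIndexer {

    private static final int CHUNK_SIZE = 1 << 20; // roughly one million characters per chunk (placeholder)

    public SolrInputDocument buildDocument(String path, String fieldName) throws IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);

        char[] buffer = new char[CHUNK_SIZE];
        BufferedReader reader = new BufferedReader(new FileReader(path), CHUNK_SIZE);
        try {
            int read;
            while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                // Each call only appends another value to the in-memory field map,
                // so the whole file still ends up on the heap before the document
                // is handed off to be written to the index.
                doc.addField(fieldName, new String(buffer, 0, read));
            }
        } finally {
            reader.close();
        }
        return doc;
    }
}
```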