A few days ago I posted about an issue with SOLR running out of memory when
attempting to index large text files (say, 300 MB). Details at
http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html

Two things I need to point out: 

1. I don't need Tika for content extraction, as the files are already in
plain text format.
2. The heap space error was caused by Tika/SOLR futilely attempting to build
the corresponding huge XML document in memory.

I've decided to develop a custom handler that
1. reads the file text directly, and
2. creates a SOLR document and adds the text data directly to the
corresponding field.

One approach I've taken is to read manageable chunks of text data
sequentially from the file and process them. We've used this approach
successfully with Lucene in the past, and I'm attempting to make it work with
SOLR too. I got most of the work done yesterday, but need a bit of guidance
w.r.t. point 2.
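
For illustration, here's a rough sketch of the chunked-read idea. This is not
the real handler code: the class name, the "id"/"content" field names and the
chunk size are placeholders, and I'm using SolrInputDocument.addField() in
place of the handler's processor.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;

public class ChunkedTextReader {

    // Read roughly 1M characters at a time instead of slurping the whole file.
    private static final int CHUNK_CHARS = 1 << 20;

    public SolrInputDocument buildDocument(String path) throws IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);

        char[] buf = new char[CHUNK_CHARS];
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                // Each call appends another value to the same field of the
                // in-memory document; this is where point 2 comes in.
                doc.addField("content", new String(buf, 0, n));
            }
        } finally {
            reader.close();
        }
        return doc;
    }
}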

How can I update the same field multiple times? Looking at the SOLR source,
processor.addField() merely
a. adds the value to the in-memory field map, and
b. attempts to write EVERYTHING to the index later on.

In my situation, (a) eventually causes a heap space error.
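
In other words (again only an illustration, with SolrJ's SolrInputDocument and
SolrServer standing in for the handler-side classes, and made-up field names):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

class AddFieldBehaviour {
    static void illustrate(SolrServer server, String chunk1, String chunk2)
            throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "bigfile-1");
        doc.addField("content", chunk1); // (a) value held in the in-memory field map
        doc.addField("content", chunk2); // (a) appended to the same map entry
        // With a 300 MB file, all of its chunks sit on the heap at this point.
        server.add(doc);                 // (b) only now does anything go to the index
    }
}

What I'm really after is a way to push each chunk out to the index (or at
least off the heap) as soon as it has been added, rather than holding the
entire field in memory until the document is submitted.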




Here's part of the handler code.



Thanks much

