100MB of text for a single Lucene document, going into a single analyzed field.
The analyzer is basically the StandardAnalyzer, with minor changes (rough
sketch after the list):
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't
split URLs and email addresses (so we can do it ourselves in the next step).
2. Split those URLs and email addresses into their component tokens ourselves.
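Roughly, it looks like this (a trimmed-down sketch against a Lucene 8.x-style
API; the class name is made up, the lowercase/stopword chain stands in for the
StandardAnalyzer defaults, and step 2 is left as a placeholder):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

public class UrlEmailAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Step 1: keep URLs and email addresses as single tokens instead of
        // letting StandardTokenizer break them apart.
        Tokenizer source = new UAX29URLEmailTokenizer();
        // Rest of the chain mirrors StandardAnalyzer: lowercase + English stop words.
        TokenStream sink = new LowerCaseFilter(source);
        sink = new StopFilter(sink, EnglishAnalyzer.getDefaultStopSet());
        // Step 2 (splitting the URLs/emails ourselves) would be an extra
        // TokenFilter inserted here; omitted from this sketch.
        return new TokenStreamComponents(source, sink);
    }
}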
I've had success limiting the number of documents per batch by size, and
sending them one at a time works OK with a 2GB heap. I'm also hoping to
understand why memory usage would be so high to begin with, or maybe this is
expected?
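For what it's worth, the size limiting is essentially this (a rough sketch with
made-up names and a made-up 16MB cap; the 'send' callback is whatever actually
pushes a batch to the indexer):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class SizeCappedBatcher {

    // Example cap on total text bytes per batch; purely illustrative.
    private static final long SIZE_CAP = 16L * 1024 * 1024;

    /**
     * Groups items into batches whose combined size stays under SIZE_CAP and
     * hands each batch to 'send'. Items at or over the cap go by themselves.
     */
    public static <T> void sendBySize(List<T> items, List<Long> sizes,
                                      Consumer<List<T>> send) {
        List<T> batch = new ArrayList<>();
        long batchBytes = 0;
        for (int i = 0; i < items.size(); i++) {
            long size = sizes.get(i);
            if (size >= SIZE_CAP) {
                // Oversized documents are sent one at a time.
                send.accept(Collections.singletonList(items.get(i)));
                continue;
            }
            if (batchBytes + size > SIZE_CAP && !batch.isEmpty()) {
                send.accept(batch);
                batch = new ArrayList<>();
                batchBytes = 0;
            }
            batch.add(items.get(i));
            batchBytes += size;
        }
        if (!batch.isEmpty()) {
            send.accept(batch);
        }
    }
}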
I agree that indexing 100+M of text is a bit silly, but the use case is a
legal con
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote:
> Well
> 2> seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to
Is that 100MB for a single Lucene document? And is that 100MB for a single
field? Is that field analyzed text? How complex is the analyzer? Like, does
it do ngrams or something else that is token- or memory-intensive? Posting
the analyzer might help us spot the issue.
Try indexing
Well
1> don't send 20 docs at once. Or send docs over some size N by themselves.
2> seriously consider the utility of indexing a 100+M file. Assuming
it's mostly text, lots and lots and lots of queries will match it, and
it'll score pretty low due to length normalization. And you probably
can't return it to