100MB of text for a single Lucene document, going into a single analyzed field.
The analyzer is basically the StandardAnalyzer, with minor changes (rough
sketch after the list):
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't
split URLs and email addresses (so we can do it ourselves in the next step).
2. Split those URLs and email addresses into their component tokens ourselves.
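Roughly, it looks like this (a trimmed-down sketch against a Lucene 8.x-style
API; the class name is made up, the lowercase/stopword chain stands in for the
StandardAnalyzer defaults, and step 2 is left as a placeholder):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

public class UrlEmailAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Step 1: keep URLs and email addresses as single tokens instead of
        // letting StandardTokenizer break them apart.
        Tokenizer source = new UAX29URLEmailTokenizer();
        // Rest of the chain mirrors StandardAnalyzer: lowercase + English stop words.
        TokenStream sink = new LowerCaseFilter(source);
        sink = new StopFilter(sink, EnglishAnalyzer.getDefaultStopSet());
        // Step 2 (splitting the URLs/emails ourselves) would be an extra
        // TokenFilter inserted here; omitted from this sketch.
        return new TokenStreamComponents(source, sink);
    }
}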
I've had success limiting the number of documents per batch by size, and
sending them one at a time works OK with a 2GB heap. I'm also hoping to
understand why memory usage would be so high to begin with, or maybe this is
expected?
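For what it's worth, the size limiting is essentially this (a rough sketch with
made-up names and a made-up 16MB cap; the 'send' callback is whatever actually
pushes a batch to the indexer):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class SizeCappedBatcher {

    // Example cap on total text bytes per batch; purely illustrative.
    private static final long SIZE_CAP = 16L * 1024 * 1024;

    /**
     * Groups items into batches whose combined size stays under SIZE_CAP and
     * hands each batch to 'send'. Items at or over the cap go by themselves.
     */
    public static <T> void sendBySize(List<T> items, List<Long> sizes,
                                      Consumer<List<T>> send) {
        List<T> batch = new ArrayList<>();
        long batchBytes = 0;
        for (int i = 0; i < items.size(); i++) {
            long size = sizes.get(i);
            if (size >= SIZE_CAP) {
                // Oversized documents are sent one at a time.
                send.accept(Collections.singletonList(items.get(i)));
                continue;
            }
            if (batchBytes + size > SIZE_CAP && !batch.isEmpty()) {
                send.accept(batch);
                batch = new ArrayList<>();
                batchBytes = 0;
            }
            batch.add(items.get(i));
            batchBytes += size;
        }
        if (!batch.isEmpty()) {
            send.accept(batch);
        }
    }
}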
I agree that indexing 100+M of text is a bit silly, but the use case is a
legal con
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote:
> Well
> 2> seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to
Is that 100MB for a single Lucene document? And is that 100MB for a single
field? Is that field analyzed text? How complex is the analyzer? Like, does
it do ngrams or something else that is token- or memory-intensive? Posting
the analyzer might help us spot the issue.
Try indexing
Well
1> don't send 20 docs at once. Or send docs over some size N by themselves.
2> seriously consider the utility of indexing a 100+M file. Assuming
it's mostly text, lots and lots and lots of queries will match it, and
it'll score pretty low due to length normalization. And you probably
can't return it to