Re: Retrieve found terms

2014-11-26 Thread Ralf Heyde
Hi John, as far as I remember, the Highlighter has some functionality to provide that. Maybe you should have a look into the Lucene Highlighting project too. Ralf -Original Message- From: John Cecere [mailto:john.cec...@oracle.com] Sent: Tuesday, 25 November 2014 15:12 To:
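A minimal sketch of the Highlighter approach Ralf mentions (assuming Lucene 4.x; the field name "body", the analyzer, and documentText are placeholders, not from the thread):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    Query query = new PrefixQuery(new Term("body", "arch"));
    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query, "body"));
    TokenStream tokens = analyzer.tokenStream("body", documentText);
    // Each fragment comes back with the matched terms wrapped in <b>...</b>,
    // which is one way to recover the terms that were actually found.
    for (String fragment : highlighter.getBestFragments(tokens, documentText, 3)) {
        System.out.println(fragment);
    }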

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
100MB of text for a single Lucene document, into a single analyzed field. The analyzer is basically the StandardAnalyzer, with minor changes: 1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't split URLs and email addresses (so we can do it ourselves in the next step). 2. Split…
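A minimal sketch of such an analyzer (assuming Lucene 4.10; the custom URL/email splitting step hinted at in point 2 is omitted):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
    import org.apache.lucene.util.Version;

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // StandardAnalyzer's chain, with the URL/email-aware tokenizer swapped in
            Tokenizer source = new UAX29URLEmailTokenizer(Version.LUCENE_4_10_0, reader);
            TokenStream sink = new StandardFilter(Version.LUCENE_4_10_0, source);
            sink = new LowerCaseFilter(Version.LUCENE_4_10_0, sink);
            sink = new StopFilter(Version.LUCENE_4_10_0, sink, StandardAnalyzer.STOP_WORDS_SET);
            return new TokenStreamComponents(source, sink);
        }
    };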

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
I've had success limiting the number of documents by size, and doing them one at a time works OK with a 2G heap. I'm also hoping to understand why memory usage would be so high to begin with, or maybe this is expected? I agree that indexing 100+M of text is a bit silly, but the use case is a legal con…
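One knob worth checking (my assumption, not something suggested in the thread): the writer's RAM buffer, which caps how much buffered index data is held before a flush. Note it cannot bound the transient cost of analyzing one enormous field:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_4_10_0, new StandardAnalyzer(Version.LUCENE_4_10_0));
    cfg.setRAMBufferSizeMB(64.0); // flush after ~64MB of buffered postings
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), cfg);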

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Trejkaz
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote:
> Well
> 2> seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to…
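A minimal sketch of the splitting idea implied here (my reading of its shape, not code from the thread): index one huge text as several smaller chunk documents, grouped by a shared id so they can be reassembled or deduplicated at query time:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    static void indexInChunks(IndexWriter writer, String fileId, String text) throws IOException {
        final int CHUNK = 1_000_000; // characters per chunk; a tuning assumption
        for (int start = 0, part = 0; start < text.length(); start += CHUNK, part++) {
            Document doc = new Document();
            doc.add(new StringField("fileId", fileId, Field.Store.YES)); // groups the chunks
            doc.add(new StringField("part", Integer.toString(part), Field.Store.YES));
            doc.add(new TextField("body",
                text.substring(start, Math.min(start + CHUNK, text.length())), Field.Store.NO));
            writer.addDocument(doc);
        }
    }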

Re: Retrieve found terms

2014-11-26 Thread Barry Coughlan
You can enumerate the terms by wrapping the TermsEnum in a PrefixTermsEnum, e.g.:

    Terms terms = fields.terms(field);
    // PrefixTermsEnum takes the bare prefix; no trailing "*" wildcard
    TermsEnum termsEnum = new PrefixTermsEnum(terms.iterator(null), new BytesRef("arch"));
    BytesRef text;
    while ((text = termsEnum.next()) != null) {
        System.out.println(text.utf8ToString());
    }
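For completeness, a sketch of obtaining the Fields instance used above (assuming Lucene 4.x and an open IndexReader named reader):

    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.MultiFields;

    Fields fields = MultiFields.getFields(reader); // merged term view across segments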

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Jack Krupansky
Is that 100MB for a single Lucene document? And is that 100MB for a single field? Is that field analyzed text? How complex is the analyzer? Like, does it do ngrams or something else that is token- or memory-intensive? Posting the analyzer might help us see what the issue might be. Try indexing…