Hi John,
as far as I remember, the Highlighter has some functionality to provide that. Maybe you should also have a look at the Lucene Highlighter project.
Ralf
-Original Message-
From: John Cecere [mailto:john.cec...@oracle.com]
Sent: Tuesday, November 25, 2014 15:12
To:
100MB of text for a single Lucene document, into a single analyzed field. The
analyzer is basically the StandardAnalyzer, with minor changes:
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't
split URLs and email addresses (so we can do it ourselves in the next step).
2. Spli
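A sketch of what the analyzer described above might look like, against the Lucene 4.x API that was current for this thread. The class name `UrlEmailAnalyzer` is hypothetical, and step 2 is truncated in the original message, so only the tokenizer swap is shown:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

// Sketch: same filter chain as StandardAnalyzer, but with
// UAX29URLEmailTokenizer so URLs and email addresses survive
// tokenization as single tokens.
public class UrlEmailAnalyzer extends Analyzer {
  private static final Version V = Version.LUCENE_4_10_2;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new UAX29URLEmailTokenizer(V, reader);
    TokenStream stream = new StandardFilter(V, source);
    stream = new LowerCaseFilter(V, stream);
    stream = new StopFilter(V, stream, StandardAnalyzer.STOP_WORDS_SET);
    // Step 2 from the original message (splitting URLs/emails in a
    // follow-on filter) is truncated upstream, so it is omitted here.
    return new TokenStreamComponents(source, stream);
  }
}
```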
I've had success limiting the number of documents by size, and doing them one
at a time works OK with a 2G heap. I'm also hoping to understand why memory
usage would be so high to begin with, or maybe this is expected?
I agree that indexing 100+MB of text is a bit silly, but the use case is a
legal con
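Regarding the heap usage, one setting worth ruling out is IndexWriter's RAM buffer. A minimal sketch assuming the Lucene 4.x API; the buffer value is illustrative, not a recommendation:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class WriterSetup {
  // Sketch: the RAM buffer caps how much indexed data IndexWriter holds
  // in memory before flushing a segment (default is 16 MB). Note that a
  // single 100MB analyzed document can still spike the heap during
  // analysis regardless of this setting, but it is the cheapest knob to
  // check first.
  public static IndexWriterConfig config() {
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_2,
        new StandardAnalyzer(Version.LUCENE_4_10_2));
    iwc.setRAMBufferSizeMB(64.0); // illustrative value
    return iwc;
  }
}
```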
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote:
> Well
> 2> seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to
You can enumerate the terms by wrapping the TermsEnum in a PrefixTermsEnum,
e.g.:
Terms terms = fields.terms(field);
TermsEnum termsEnum = new PrefixTermsEnum(terms.iterator(null),
    new BytesRef("arch")); // literal prefix; PrefixTermsEnum adds no wildcard semantics
BytesRef text;
while ((text = termsEnum.next()) != null) {
  System.out.println(text.utf8ToString());
}
Is that 100MB for a single Lucene document? And is that 100MB for a single
field? Is that field analyzed text? How complex is the analyzer? For example,
does it do ngrams or something else that is token- or memory-intensive?
Posting the analyzer might help us see what the issue might be.
Try indexing