It might be a bit harsh to drop the document if it has a very long token
in it. I can imagine documents with embedded binary data, where the
text around the binary data is still useful for search.
My feeling is that long tokens (longer than 128 or 256 bytes) are not
useful for search, and should be handled by the tokenizers:
- enforce a maximum token size, choose between truncating or dropping
longer tokens, and in no case produce tokens longer than what
IndexWriter can digest
- perhaps come up with a clear policy on when a tokenizer should throw
an exception?
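
A minimal sketch of the kind of enforcement being proposed, written
against a recent Lucene analysis API (the thread itself predates it);
the class name and the choice to truncate rather than drop are
illustrative assumptions, not an existing Lucene filter:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: caps token length so IndexWriter never sees an
// over-long term. Truncating keeps the rest of the document searchable;
// dropping the token instead would be a small change (skip it and fetch
// the next token).
public final class MaxTokenLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLen; // e.g. 128 or 256, as suggested above

  public MaxTokenLengthFilter(TokenStream in, int maxLen) {
    super(in);
    this.maxLen = maxLen;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Note: this compares char count, not the UTF-8 byte count that the
    // 128/256-byte figures above refer to.
    if (termAtt.length() > maxLen) {
      termAtt.setLength(maxLen); // truncate instead of rejecting the document
    }
    return true;
  }
}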
Gabi Steinberg.
Yonik Seeley wrote:
On Dec 20, 2007 11:57 AM, Michael McCandless wrote:
Making the maximum token length configurable would allow you to choose a
relatively small max, such as 32 or 64, reducing the overhead caused by
junk in the documents while minimizing the chance of not finding something.
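
If dropping (rather than truncating) over-long tokens is acceptable,
Lucene's LengthFilter already does that; here is a sketch of wiring it
up with a small max such as 64, again against a recent Lucene API (the
class name CappedLengthAnalyzer and the limit of 64 are assumptions for
illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: analyzer that silently drops any token longer than maxTokenLength.
public final class CappedLengthAnalyzer extends Analyzer {
  private final int maxTokenLength;

  public CappedLengthAnalyzer(int maxTokenLength) { // e.g. 32 or 64
    this.maxTokenLength = maxTokenLength;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // LengthFilter keeps only tokens whose length is within [1, maxTokenLength].
    TokenStream sink = new LengthFilter(source, 1, maxTokenLength);
    return new TokenStreamComponents(source, sink);
  }
}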
Gabi.
Michael McCandless wrote:
Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I think