Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
It might be a bit harsh to drop the document if it has a very long token in it. I can imagine documents with embedded binary data, where the text around the binary data is still useful for search. My feeling is that long tokens (longer than 128 or 256 bytes) are not useful for search, and sho...
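
A minimal sketch of the "skip the token, keep the document" policy discussed here, written against the TokenStream API of that era (Lucene 2.x, where next() returns a Token); the class name MaxLengthFilter and the silent-drop behavior are illustrative, not something shipped in Lucene:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * Sketch: silently drops tokens longer than maxLen, so one oversized
     * term (e.g. embedded binary junk) cannot cause the whole document
     * to be rejected by the indexer.
     */
    public final class MaxLengthFilter extends TokenFilter {
      private final int maxLen;

      public MaxLengthFilter(TokenStream in, int maxLen) {
        super(in);
        this.maxLen = maxLen;
      }

      public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
          if (t.termText().length() <= maxLen) {
            return t;        // short enough: pass it through
          }
          // oversized: skip it and fetch the next token
        }
        return null;         // end of stream
      }
    }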

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
...ze, choose between truncating or dropping longer tokens, and in no case produce tokens longer than what IndexWriter can digest. - perhaps come up with a clear policy on when a tokenizer should throw an exception? Gabi Steinberg. Yonik Seeley wrote: On Dec 20, 2007 11:57 AM, Michael McCandless <[EM...
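
For the truncate-rather-than-drop alternative mentioned above, the next() method could shorten the term instead of discarding it. Again a sketch against the 2.x Token API (setTermText existed then); offset handling is deliberately left simple:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Sketch: truncate oversized tokens to maxLen instead of dropping them. */
    public final class TruncateFilter extends TokenFilter {
      private final int maxLen;

      public TruncateFilter(TokenStream in, int maxLen) {
        super(in);
        this.maxLen = maxLen;
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t != null && t.termText().length() > maxLen) {
          // keep the first maxLen chars; original offsets are left as-is
          t.setTermText(t.termText().substring(0, maxLen));
        }
        return t;
      }
    }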

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
...allow you to choose a relatively small max, such as 32 or 64, reducing the overhead caused by junk in the documents while minimizing the chance of not finding something. Gabi. Michael McCandless wrote: Gabi Steinberg wrote: On balance, I think that dropping the document makes sense. I think...
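
A small max like 32 or 64 can be wired in with the stock LengthFilter, which in the 2.x line lives in Lucene core as org.apache.lucene.analysis.LengthFilter (it moved packages in later releases). The wrapper analyzer below is a sketch; CappedAnalyzer and the choice of StandardAnalyzer as the delegate are illustrative:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    /** Sketch: cap token length at 64 chars via core LengthFilter. */
    public class CappedAnalyzer extends Analyzer {
      private final Analyzer delegate = new StandardAnalyzer();

      public TokenStream tokenStream(String fieldName, Reader reader) {
        // LengthFilter drops tokens whose length falls outside [min, max]
        return new LengthFilter(delegate.tokenStream(fieldName, reader), 1, 64);
      }
    }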