Re: DocumentsWriter.checkMaxTermLength issues

2008-01-01 Thread Michael McCandless
Doron Cohen wrote: On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customer's computer

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Doron Cohen
On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > I actually think indexing should try to be as robust as possible. You > could test like crazy and never hit a massive term, go into production > (say, ship your app to lots of your customer's computers) only to > suddenly se

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless
Grant Ingersoll wrote: On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote: I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customer's computers) only to su

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll
On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote: I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customer's computers) only to suddenly see this exception. I

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 12:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > I actually think indexing should try to be as robust as possible. You > could test like crazy and never hit a massive term, go into production > (say, ship your app to lots of your customer's computers) only to > suddenly se

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless
I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customer's computers) only to suddenly see this exception. In general it could be a long time before you "accidentally"

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Sure, but I mean in the >16K (in other words, in the case where > DocsWriter fails, which presumably only DocsWriter knows about) case. > I want the option to ignore tokens larger than that instead of failing/ > throwing an exce

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll
On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote: On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: I meant (1)... it leaves the core smaller. I don't see any reason to have logic to truncate or discard tokens in the cor

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: > > I meant (1)... it leaves the core smaller. > > I don't see any reason to have logic to truncate or discard tokens in > > the core indexing code (except to handle tokens >16

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: I think I like the 3rd option - is this what you meant? I meant (1)... it leaves the core smaller. I don't se

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: > > On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> > > wrote: > > > Doron Cohen <[EMAIL PROTECTED]> wrote: > > > > I like the approach of configur

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Doron Cohen
On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> > wrote: > > Doron Cohen <[EMAIL PROTECTED]> wrote: > > > I like the approach of configuration of this behavior in Analysis > > > (and so IndexWriter can throw an exce

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Doron Cohen <[EMAIL PROTECTED]> wrote: > > I like the approach of configuration of this behavior in Analysis > > (and so IndexWriter can throw an exception on such errors). > > > > It seems that this should be a property of An

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless
Doron Cohen <[EMAIL PROTECTED]> wrote: > I like the approach of configuration of this behavior in Analysis > (and so IndexWriter can throw an exception on such errors). > > It seems that this should be a property of Analyzer vs. > just StandardAnalyzer, right? > > It can probably be a "policy" prop

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-23 Thread Doron Cohen
I like the approach of configuration of this behavior in Analysis (and so IndexWriter can throw an exception on such errors). It seems that this should be a property of Analyzer vs. just StandardAnalyzer, right? It can probably be a "policy" property, with two parameters: 1) maxLength, 2) action:
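
As a rough illustration of the "policy" idea above (a maximum length plus an action), here is a minimal Java sketch. The message is cut off before it lists the actions, so the three used here (skip, truncate, fail) are an assumption drawn from options mentioned elsewhere in this thread. The names `LongTokenPolicy`, `Action`, and `apply` are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an analysis-side policy for over-long tokens.
// None of these names come from Lucene; they only illustrate the proposal.
final class LongTokenPolicy {
    // Assumed actions: drop the token, keep only a prefix, or reject it.
    enum Action { SKIP, TRUNCATE, FAIL }

    private final int maxLength;
    private final Action action;

    LongTokenPolicy(int maxLength, Action action) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        this.maxLength = maxLength;
        this.action = action;
    }

    // Applies the policy to a token list and returns the tokens to index.
    List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                out.add(token);
                continue;
            }
            switch (action) {
                case SKIP:
                    // Silently drop the over-long token.
                    break;
                case TRUNCATE:
                    // Keep only the first maxLength characters.
                    out.add(token.substring(0, maxLength));
                    break;
                case FAIL:
                    throw new IllegalArgumentException(
                        "token length " + token.length()
                        + " exceeds max length " + maxLength);
            }
        }
        return out;
    }
}
```

With a policy like this in the analysis chain, the indexing core would only ever see tokens within the limit, which is the separation of concerns argued for in the replies.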

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-21 Thread Michael McCandless
I think this is a good approach -- any objections? This way, IndexWriter is in-your-face (throws TermTooLongException on seeing a massive term), but StandardAnalyzer is robust (silently skips or prefixes the too-long terms). Mike Gabi Steinberg wrote: How about defaulting to a max token

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size? The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that d

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Gabi Steinberg wrote: On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job. StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and decidin

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
OK I will take this approach... create TermTooLongException (subclasses RuntimeException), listed in the javadocs but not the throws clause of add/updateDocument. DW throws this if it encounters any term >= 16383 chars in length. Whenever that exception (or others) is thrown from within
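
The plan described above could look roughly like the following Java sketch. Only the exception name, its `RuntimeException` superclass, and the 16383-character threshold come from the message. The `TermLengthCheck` helper, its `check` method, and the message text are hypothetical and are not the actual DocumentsWriter code.

```java
// Sketch only: an unchecked exception, so it can be documented in javadocs
// without appearing in the throws clause of addDocument/updateDocument.
class TermTooLongException extends RuntimeException {
    private final int termLength;

    TermTooLongException(int termLength, int limit) {
        super("term length " + termLength + " is not below limit " + limit);
        this.termLength = termLength;
    }

    int getTermLength() {
        return termLength;
    }
}

// Hypothetical helper standing in for the check inside the indexing code.
final class TermLengthCheck {
    // Threshold as stated in the message: terms of >= 16383 chars are rejected.
    static final int LIMIT = 16383;

    private TermLengthCheck() {}

    static void check(String term) {
        if (term.length() >= LIMIT) {
            throw new TermTooLongException(term.length(), LIMIT);
        }
    }
}
```

Because the exception is unchecked, existing callers keep compiling unchanged; callers that want to keep going past a bad document would need to catch it explicitly.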

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 2:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Makes sense. I wasn't sure if declaring new exceptions to be thrown > is violating back-compat. issues or not (even if they are runtime > exceptions) That's a good question... I know that declared RuntimeExceptions are containe

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
Makes sense. I wasn't sure if declaring new exceptions to be thrown is violating back-compat. issues or not (even if they are runtime exceptions) On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote: On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: But, I can see the value i

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > But, I can see the value in the throw the exception > case too, except I think the API should declare the exception is being > thrown. It could throw an extension of IOException. To be robust, user indexing code needs to catch

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote: Yonik Seeley wrote: On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: It might be a bit harsh to drop the document if it has a very long token in it. There are really two issues here. For long tokens, one could ei

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job. StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be in

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: > > On Dec 20, 2007 11:33 AM, Gabi Steinberg > > <[EMAIL PROTECTED]> wrote: > >> It might be a bit harsh to drop the document if it has a very long > >> token > >> in it. > > > > There are really two issues

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: It might be a bit harsh to drop the document if it has a very long token in it. There are really two issues here. For long tokens, one could either ignore them or generate an exception. I can see th

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: > It might be a bit harsh to drop the document if it has a very long token > in it. There are really two issues here. For long tokens, one could either ignore them or generate an exception. For all exceptions generated while index

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
On Dec 20, 2007, at 10:55 AM, Yonik Seeley wrote: On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I'm wondering if the IndexWriter should throw an explicit exception in this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components durin

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: Though ... we could simply immediately delete the document when any exception occurs during its processing. So if we think whenever any doc hits an exception, then it should be deleted, it's not so ha

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
It might be a bit harsh to drop the document if it has a very long token in it. I can imagine documents with embedded binary data, where the text around the binary data is still useful for search. My feeling is that long tokens (longer than 128 or 256 bytes) are not useful for search, and sho

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Though ... we could simply immediately delete the document when any > exception occurs during its processing. So if we think whenever any > doc hits an exception, then it should be deleted, it's not so hard to > implement th

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I'm wondering if the IndexWriter should throw an explicit exception in this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components during indexing anyway, so it seems

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: as it seems to me really long tokens should be handled more gracefully. It seems strange that the message says the terms were skipped (which the code does in fact do), but then there is a RuntimeException thrown which usually indicates to me the issue is not recoverable.

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I'm wondering if the IndexWriter should throw an explicit exception in > this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components during indexing anyway, so it seems like indexing code shou

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I am getting the following exception when running against trunk: > java.lang.IllegalArgumentException: at least one term (length 20079) > exceeds max term length 16383; these terms were skipped > at > org > .apache.lucene.ind

DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
I am getting the following exception when running against trunk: java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545) at org.apac