[jira] Updated: (LUCENE-1096) Deleting docs of all returned Hits during search causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1096: Fix Version/s: 2.3 Lucene Fields: [New, Patch Available] (was: [New]) > Deleting docs of all

[jira] Updated: (LUCENE-1096) Deleting docs of all returned Hits during search causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1096: Attachment: lucene-1096.patch Patch with tests for the two scenarios described above, and a fix fo

[jira] Commented: (LUCENE-1096) Deleting docs of all returned Hits during search causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553817 ] Doron Cohen commented on LUCENE-1096: It seems that this is a serious problem with Hits based search. An applic

[jira] Updated: (LUCENE-1097) IndexWriter.close(false) does not actually stop background merge threads

2007-12-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1097: Attachment: LUCENE-1097.patch Patch attached. I plan to commit in a day or two. I

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size? The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that d
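The proposal above (a generous default token-size limit with an option to lower it) could look roughly like the following. This is only a sketch under stated assumptions: LengthLimitedTokenizer is a hypothetical stand-alone class, not StandardTokenizer; it splits on whitespace only; and it assumes over-long tokens are dropped, which the message does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's StandardTokenizer.
class LengthLimitedTokenizer {
    // "16K" as mentioned in the message; the exact value is an assumption.
    static final int DEFAULT_MAX_TOKEN_LENGTH = 16 * 1024;

    private final int maxTokenLength;

    LengthLimitedTokenizer() {
        this(DEFAULT_MAX_TOKEN_LENGTH);
    }

    // The "option to reduce that size": callers may pass a smaller limit.
    LengthLimitedTokenizer(int maxTokenLength) {
        if (maxTokenLength <= 0) {
            throw new IllegalArgumentException("maxTokenLength must be positive");
        }
        this.maxTokenLength = maxTokenLength;
    }

    // Splits on whitespace and silently drops tokens longer than the limit
    // (dropping, rather than truncating, is an assumption of this sketch).
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String candidate : text.split("\\s+")) {
            if (candidate.isEmpty()) {
                continue; // leading whitespace yields an empty first element
            }
            if (candidate.length() > maxTokenLength) {
                continue; // over-long token never reaches the index writer
            }
            tokens.add(candidate);
        }
        return tokens;
    }
}
```

With a limit of 5, for example, tokenize("ab abcdefgh cd") keeps "ab" and "cd" and skips the eight-character token.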

[jira] Commented: (LUCENE-770) CfsExtractor tool

2007-12-20 Thread Daniel Naber (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553778 ] Daniel Naber commented on LUCENE-770: I think there's a small issue which is also in IndexReader.main: the javad

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Gabi Steinberg wrote: On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job. StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and decidin

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
OK I will take this approach... create TermTooLongException (subclasses RuntimeException), listed in the javadocs but not the throws clause of add/updateDocument. DW throws this if it encounters any term >= 16383 chars in length. Whenever that exception (or others) are thrown from within
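A minimal sketch of the approach described above, for illustration only. TermTooLongException and the 16383 limit come from the message; the constructor arguments, the getTermLength accessor, and the TermLengthCheck helper are hypothetical names and do not reflect the actual DocumentsWriter code. The check follows the ">= 16383" wording of the message.

```java
// Hypothetical sketch; not the actual Lucene implementation.
// Unchecked, so it can be documented in javadocs without appearing
// in the throws clause of add/updateDocument.
class TermTooLongException extends RuntimeException {
    private final int termLength;

    TermTooLongException(int termLength, int maxLength) {
        super("term length " + termLength
                + " reaches or exceeds max term length " + maxLength);
        this.termLength = termLength;
    }

    int getTermLength() {
        return termLength;
    }
}

class TermLengthCheck {
    // Limit quoted in the thread.
    static final int MAX_TERM_LENGTH = 16383;

    // Throws if the term is too long to index, per the ">= 16383" wording.
    static void checkTermLength(String term) {
        if (term.length() >= MAX_TERM_LENGTH) {
            throw new TermTooLongException(term.length(), MAX_TERM_LENGTH);
        }
    }

    // Caller-side pattern: catch the unchecked exception and carry on.
    static boolean tryIndex(String term) {
        try {
            checkTermLength(term);
            return true;
        } catch (TermTooLongException e) {
            return false;
        }
    }
}
```

Because the exception is unchecked, existing callers compile unchanged; code that wants to survive an over-long term has to catch it explicitly, as tryIndex does here.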

Re: Hudson Upgrade Dec 19

2007-12-20 Thread Nigel Daley
This is now complete. Please let me know if you see any problems with Hudson. Nige On Dec 18, 2007, at 10:59 PM, Nigel Daley wrote: I'd like to upgrade Hudson (http://lucene.zones.apache.org:8080/hudson/) from 1.136 to 1.161 tomorrow (Dec 19). I'll also be upgrading some existing plugin

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 2:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Makes sense. I wasn't sure if declaring new exceptions to be thrown > is violating back-compat. issues or not (even if they are runtime > exceptions) That's a good question... I know that declared RuntimeExceptions are containe

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
Makes sense. I wasn't sure if declaring new exceptions to be thrown is violating back-compat. issues or not (even if they are runtime exceptions) On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote: On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: But, I can see the value i

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > But, I can see the value in the throw the exception > case too, except I think the API should declare the exception is being > thrown. It could throw an extension of IOException. To be robust, user indexing code needs to catch

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote: Yonik Seeley wrote: On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: It might be a bit harsh to drop the document if it has a very long token in it. There are really two issues here. For long tokens, one could ei

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job. StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be in

Re: TeeTokenFilter performance testing

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 10:07 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Hmmm, I will have to take a look at Token.clone. I must admit I don't > know a lot about the perf. differences between clone and new, but I > would think the cost should be on par, if not a little cheaper, > otherwise what's th

Re: TeeTokenFilter performance testing

2007-12-20 Thread Karl Wettin
20 dec 2007 kl. 16.07 skrev Grant Ingersoll: I must admit I don't know a lot about the perf. differences between clone and new, but I would think the cost should be on par, if not a little cheaper, otherwise what's the point? My guess is that clone() is a convenience implementation that to

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: > > On Dec 20, 2007 11:33 AM, Gabi Steinberg > > <[EMAIL PROTECTED]> wrote: > >> It might be a bit harsh to drop the document if it has a very long > >> token > >> in it. > > > > There is really two issues

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: It might be a bit harsh to drop the document if it has a very long token in it. There are really two issues here. For long tokens, one could either ignore them or generate an exception. I can see th

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote: > It might be a bit harsh to drop the document if it has a very long token > in it. There are really two issues here. For long tokens, one could either ignore them or generate an exception. For all exceptions generated while index

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
On Dec 20, 2007, at 10:55 AM, Yonik Seeley wrote: On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I'm wondering if the IndexWriter should throw an explicit exception in this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components durin

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: Though ... we could simply immediately delete the document when any exception occurs during its processing. So if we think whenever any doc hits an exception, then it should be deleted, it's not so ha

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Gabi Steinberg
It might be a bit harsh to drop the document if it has a very long token in it. I can imagine documents with embedded binary data, where the text around the binary data is still useful for search. My feeling is that long tokens (longer than 128 or 256 bytes) are not useful for search, and sho

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Though ... we could simply immediately delete the document when any > exception occurs during its processing. So if we think whenever any > doc hits an exception, then it should be deleted, it's not so hard to > implement th

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I'm wondering if the IndexWriter should throw an explicit exception in this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components during indexing anyway, so it seems

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Michael McCandless
Yonik Seeley wrote: as it seems to me really long tokens should be handled more gracefully. It seems strange that the message says the terms were skipped (which the code does in fact do), but then there is a RuntimeException thrown which usually indicates to me the issue is not recoverable.

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I'm wondering if the IndexWriter should throw an explicit exception in > this case as opposed to a RuntimeException, RuntimeExceptions can happen in analysis components during indexing anyway, so it seems like indexing code shou

Re: TeeTokenFilter performance testing

2007-12-20 Thread Grant Ingersoll
Hmmm, I will have to take a look at Token.clone. I must admit I don't know a lot about the perf. differences between clone and new, but I would think the cost should be on par, if not a little cheaper, otherwise what's the point? It also seems like we shouldn't have to go through nulling

[jira] Created: (LUCENE-1097) IndexWriter.close(false) does not actually stop background merge threads

2007-12-20 Thread Michael McCandless (JIRA)
IndexWriter.close(false) does not actually stop background merge threads Key: LUCENE-1097 URL: https://issues.apache.org/jira/browse/LUCENE-1097 Project: Lucene - Java

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Yonik Seeley
On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I am getting the following exception when running against trunk: > java.lang.IllegalArgumentException: at least one term (length 20079) > exceeds max term length 16383; these terms were skipped > at > org > .apache.lucene.ind

DocumentsWriter.checkMaxTermLength issues

2007-12-20 Thread Grant Ingersoll
I am getting the following exception when running against trunk: java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545) at org.apac

[jira] Resolved: (LUCENE-1094) Exception in DocumentsWriter.addDocument can corrupt stored fields file (fdt)

2007-12-20 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1094. Resolution: Fixed > Exception in DocumentsWriter.addDocument can corrupt stored fi

[jira] Updated: (LUCENE-1096) Deleting docs of all returned Hits during search causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1096: Attachment: TestSearchDelete.java Test failing with this bug. > Deleting docs of all returned Hit

[jira] Updated: (LUCENE-1096) Deleting docs of all returned Hits during search causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1096: Description: For background user discussion: http://www.nabble.com/document-deletion-problem-to144

[jira] Assigned: (LUCENE-1096) Deletion from index causes ArrayIndexOutOfBoundsException

2007-12-20 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen reassigned LUCENE-1096: Assignee: Doron Cohen > Deletion from index causes ArrayIndexOutOfBoundsException >