[jira] Resolved: (LUCENE-1095) StopFilter should have option to incr positionIncrement after stop word
[ https://issues.apache.org/jira/browse/LUCENE-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-1095. - Resolution: Fixed Lucene Fields: [Patch Available] (was: [New]) Committed. (Already yesterday actually, I was sure that I had already resolved it...) > StopFilter should have option to incr positionIncrement after stop word > --- > > Key: LUCENE-1095 > URL: https://issues.apache.org/jira/browse/LUCENE-1095 > Project: Lucene - Java > Issue Type: Improvement > Reporter: Hoss Man > Assignee: Doron Cohen > Attachments: lucene-1095-pos-incr.patch, lucene-1095-pos-incr.patch, > lucene-1095-pos-incr.patch > > > I've seen this come up on the mailing list a few times in the last month, so > I'm filing a known bug/improvement around it... > StopFilter should have an option that, if set, records how many stop words are > "skipped" in a row, and then sets that value as the positionIncrement on the > "next" token that StopFilter does return. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
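For illustration, here is a minimal sketch of the behavior the issue describes, written against the 2.3-era TokenStream API. It is not the committed LUCENE-1095 patch; the class name and fields are invented. The idea: accumulate the position increments of skipped stop words and add them to the next token that is actually returned, so phrase and position-sensitive queries still "see" the gap.

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class PositionAwareStopFilter extends TokenFilter {
  private final Set stopWords; // Set of String (pre-generics, Java 1.4 style)

  public PositionAwareStopFilter(TokenStream in, Set stopWords) {
    super(in);
    this.stopWords = stopWords;
  }

  public Token next() throws IOException {
    int skippedPositions = 0;
    for (Token t = input.next(); t != null; t = input.next()) {
      if (stopWords.contains(t.termText())) {
        // remember how many positions the dropped stop words occupied
        skippedPositions += t.getPositionIncrement();
      } else {
        // pass the gap on to the first token we do return
        t.setPositionIncrement(t.getPositionIncrement() + skippedPositions);
        return t;
      }
    }
    return null; // end of stream
  }
}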
[jira] Created: (LUCENE-1112) Document is partially indexed on an unhandled exception
Document is partially indexed on an unhandled exception --- Key: LUCENE-1112 URL: https://issues.apache.org/jira/browse/LUCENE-1112 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.3 With LUCENE-843, it's now possible for a subset of a document's fields/terms to be indexed or stored when an exception is hit. This was not the case in the past (it was "all or none"). I plan to make it "all or none" again by immediately marking a document as deleted if any exception is hit while indexing it. Discussion leading up to this: http://www.gossamer-threads.com/lists/lucene/java-dev/56103 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
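To make the intended semantics concrete, here is a small hedged sketch (not part of the patch; the helper class is invented) of what the restored "all or none" behavior means for calling code: if addDocument() throws, the caller can skip that document and trust that no partial version of it remains in the index.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SkipFailedDocuments {
  // Adds a document, skipping it on failure. With the "all or none" guarantee
  // this issue restores, a failed add leaves no partially indexed fields behind.
  public static void addOrSkip(IndexWriter writer, Document doc) throws IOException {
    try {
      writer.addDocument(doc);
    } catch (RuntimeException e) {
      // e.g. an analyzer threw half way through the document's fields
      System.err.println("skipped document: " + e);
    }
  }
}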
Re: Let's release Lucene 2.3 soon?
I just opened a new issue, which I think should be fixed for 2.3, to fix IndexWriter.add/updateDocument to not "partially add" a document when an exception is hit: https://issues.apache.org/jira/browse/LUCENE-1112 I'll try to work out a patch by Thu but it may be tight... Mike Michael Busch <[EMAIL PROTECTED]> wrote: > Michael Busch wrote: > > > > I think a good target would be to complete all 2.3 issues by end of this > > year. Then we can start a code freeze beginning of next year, so that > > we'll have 2.3 out hopefully by mid/end of January '08. I would > > volunteer to act as the release manager again. > > > > Hi Team, > > perfect timing! As of today all 2.3 issues are committed (thanks > everyone!). If nobody objects I will create a 2.3 branch on Thursday, > Jan 3rd, and we will have a code freeze on the branch for approx. 10 > days. In this time period only critical/blocking issues and > documentation patches can be committed to the branch. > > -Michael > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
Doron Cohen <[EMAIL PROTECTED]> wrote: > I like the approach of configuration of this behavior in Analysis > (and so IndexWriter can throw an exception on such errors). > > It seems that this should be a property of Analyzer vs. > just StandardAnalyzer, right? > > It can probably be a "policy" property, with two parameters: > 1) maxLength, 2) action: chop/split/ignore/raiseException when > generating too long tokens. Agreed, this should be generic/shared to all analyzers. But maybe for 2.3, we just truncate any too-long term to the max allowed size, and then after 2.3 we make this a settable "policy"? > Doron > > On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]> > wrote: > > > > > I think this is a good approach -- any objections? > > > > This way, IndexWriter is in-your-face (throws TermTooLongException on > > seeing a massive term), but StandardAnalyzer is robust (silently > > skips or prefixes the too-long terms). > > > > Mike > > > > Gabi Steinberg wrote: > > > > > How about defaulting to a max token size of 16K in > > > StandardTokenizer, so that it never causes an IndexWriter > > > exception, with an option to reduce that size? > > > > > > The backward incompatibility is limited then - tokens exceeding 16K > > > will NOT cause an IndexWriter exception. In 3.0 we can reduce > > > that default to a useful size. > > > > > > The option to truncate the token can be useful, I think. It will > > > index the max size prefix of the long tokens. You can still find > > > them, pretty accurately - this becomes a prefix search, but is > > > unlikely to return multiple values because it's a long prefix. It > > > allows you to choose a relatively small max, such as 32 or 64, > > > reducing the overhead caused by junk in the documents while > > > minimizing the chance of not finding something. > > > > > > Gabi. > > > > > > Michael McCandless wrote: > > >> Gabi Steinberg wrote: > > >>> On balance, I think that dropping the document makes sense. I > > >>> think Yonik is right in that ensuring that keys are useful - and > > >>> indexable - is the tokenizer's job. > > >>> > > >>> StandardTokenizer, in my opinion, should behave similarly to a > > >>> person looking at a document and deciding which tokens should be > > >>> indexed. Few people would argue that a 16K block of binary data > > >>> is useful for searching, but it's reasonable to suggest that the > > >>> text around it is useful. > > >>> > > >>> I know that one can add the LengthFilter to avoid this problem, > > >>> but this is not really intuitive; one does not expect the > > >>> standard tokenizer to generate tokens that IndexWriter chokes on. > > >>> > > >>> My vote is to: > > >>> - drop documents with tokens longer than 16K, as Mike and Yonik > > >>> suggested > > >>> - because an uninformed user would start with StandardTokenizer, I > > >>> think it should limit token size to 128 bytes, and add options to > > >>> change that size, choose between truncating or dropping longer > > >>> tokens, and in no case produce tokens longer than what > > >>> IndexWriter can digest. > > >> I like this idea, though we probably can't do that until 3.0 so we > > >> don't break backwards compatibility? > > > ... 
> > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
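As a purely illustrative version of the "policy" idea in the thread above (the class name and constants are invented, this is not a committed Lucene filter), a TokenFilter can cap token length and either truncate or drop anything longer before it ever reaches IndexWriter. Lucene's existing LengthFilter already covers the drop case; this sketch adds the truncate option that was discussed.

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MaxTokenLengthFilter extends TokenFilter {
  public static final int TRUNCATE = 0;
  public static final int DROP = 1;

  private final int maxLength;
  private final int action;

  public MaxTokenLengthFilter(TokenStream in, int maxLength, int action) {
    super(in);
    this.maxLength = maxLength;
    this.action = action;
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      String text = t.termText();
      if (text.length() <= maxLength) {
        return t;
      }
      if (action == TRUNCATE) {
        // index only the prefix; keep the original offsets so the stored
        // text can still be highlighted in full
        Token truncated = new Token(text.substring(0, maxLength),
            t.startOffset(), t.endOffset(), t.type());
        truncated.setPositionIncrement(t.getPositionIncrement());
        return truncated;
      }
      // DROP: skip this token and fetch the next one. A fuller version would
      // also carry the skipped position increment forward, as in LUCENE-1095.
    }
    return null;
  }
}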
[jira] Resolved: (LUCENE-488) adding docs with large (binary) fields of 5mb causes OOM regardless of heap size
[ https://issues.apache.org/jira/browse/LUCENE-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-488. Resolution: Fixed This problem was resolved by LUCENE-843, after which stored fields are written directly into the directory (therefore not consuming aggregated RAM). It is interesting that the test provided here was allocating a new byte buffer of 2 - 10 MB for each added doc. This by itself could eventually lead to OOMs because as the program ran longer it was becoming harder to allocate consecutive chunks of those sizes. Enhancing binary fields with offset and length (?) would allow applications to reuse the input byte array and allocate less of those. > adding docs with large (binary) fields of 5mb causes OOM regardless of heap > size > > > Key: LUCENE-488 > URL: https://issues.apache.org/jira/browse/LUCENE-488 > Project: Lucene - Java > Issue Type: Bug > Affects Versions: 1.9 > Environment: Linux asimov 2.6.6.hoss1 #1 SMP Tue Jul 6 16:31:01 PDT > 2004 i686 GNU/Linux > Reporter: Hoss Man > Attachments: TestBigBinary.java > > > As reported by George Washington in a message to [EMAIL PROTECTED] with > subject "Storing large text or binary source documents in the index and memory > usage" around 2006-01-21 there seems to be a problem with adding docs > containing really large fields. > I'll attach a test case in a moment, note that (for me) regardless of how big > I make my heap size, and regardless of what value I set MIN_MB to, once it > starts trying to make documents containing 5mb of data, it can only add 9 > before it rolls over and dies. > here's the output from the code as I will attach in a moment... > [junit] Testsuite: org.apache.lucene.document.TestBigBinary > [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 78.656 sec > [junit] - Standard Output --- > [junit] NOTE: directory will not be cleaned up automatically... > [junit] Dir: > /tmp/org.apache.lucene.document.TestBigBinary.97856146.100iters.4mb > [junit] iters completed: 100 > [junit] totalBytes Allocated: 419430400 > [junit] NOTE: directory will not be cleaned up automatically... > [junit] Dir: > /tmp/org.apache.lucene.document.TestBigBinary.97856146.100iters.5mb > [junit] iters completed: 9 > [junit] totalBytes Allocated: 52428800 > [junit] - --- > [junit] Testcase: > testBigBinaryFields(org.apache.lucene.document.TestBigBinary):Caused an > ERROR > [junit] Java heap space > [junit] java.lang.OutOfMemoryError: Java heap space > [junit] Test org.apache.lucene.document.TestBigBinary FAILED -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1112) Document is partially indexed on an unhandled exception
[ https://issues.apache.org/jira/browse/LUCENE-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1112: Attachment: lucene-1112-test.patch Patch demonstrating the problem: testWickedLongTerm() modified to fail when numDocs grows although addDocument() throws an exception. > Document is partially indexed on an unhandled exception > --- > > Key: LUCENE-1112 > URL: https://issues.apache.org/jira/browse/LUCENE-1112 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: lucene-1112-test.patch > > > With LUCENE-843, it's now possible for a subset of a document's > fields/terms to be indexed or stored when an exception is hit. This > was not the case in the past (it was "all or none"). > I plan to make it "all or none" again by immediately marking a > document as deleted if any exception is hit while indexing it. > Discussion leading up to this: > http://www.gossamer-threads.com/lists/lucene/java-dev/56103 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through
[ https://issues.apache.org/jira/browse/LUCENE-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch resolved LUCENE-458. -- Resolution: Duplicate The problem here apparently is that when the JVM crashed, not all files were properly synced with the FS. This seems to be a similar problem to LUCENE-1044. > Merging may create duplicates if the JVM crashes half way through > - > > Key: LUCENE-458 > URL: https://issues.apache.org/jira/browse/LUCENE-458 > Project: Lucene - Java > Issue Type: Bug > Affects Versions: 1.4 > Environment: Windows XP SP2, JDK 1.5.0_04 (crash occurred in this > version. We've since updated to 1.5.0_05, but discovered this issue with an > older text index.) > Reporter: Trejkaz > > In the past, our indexing process crashed due to a Hotspot compiler bug on > SMP systems (although it could happen with any bad native code.) Everything > picked up and appeared to work, but now that it's a month later I've > discovered an oddity in the text index. > We have two documents which are identical in the text index. I know we only > stored it once for two reasons. First, we store the MD5 of every document > into the hash and the MD5s were the same. Second, we store a GUID into each > document which is generated uniquely for each document. The GUID and the MD5 > hash on these two documents, as well as all other fields, are exactly the same. > My conclusion is that a merge was occurring at the point the JVM crashed, > which is consistent with the time the process crashed. Is it possible that > Lucene did the copy of this document to the new location, and didn't get to > delete the original? > If so, I guess this issue should be prevented somehow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1102) EnwikiDocMaker id field
[ https://issues.apache.org/jira/browse/LUCENE-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-1102. - Resolution: Fixed Lucene Fields: (was: [New]) Committed > EnwikiDocMaker id field > --- > > Key: LUCENE-1102 > URL: https://issues.apache.org/jira/browse/LUCENE-1102 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1102.patch > > > The EnwikiDocMaker is fairly usable outside of the benchmarking class, but it > would benefit from indexing the ID field of the docs. > Patch to follow that adds an ID field. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Let's release Lucene 2.3 soon?
On Dec 30, 2007, at 1:02 PM, Michael Busch wrote: Grant Ingersoll wrote: On Dec 30, 2007, at 6:29 AM, Michael Busch wrote: In this time period only critical/blocking issues and documentation patches can be committed to the branch. I'd add that we should make some effort to clean up old JIRA issues... I think I have a deja-vu! :) I think when we released 2.2 I forgot to mention this and you reminded us! But I don't think that we really cleaned up that much. Shall we do it this time in a more coordinated manner? I think it would be great if we could work through all unresolved issues that haven't been updated in 2007. I created a private JIRA filter (not sure how I can share a private filter or create a public one?) and 117 issues is the result. I will go ahead and open a couple of JIRA issues with type 'Task' and fix version '2.3', a separate one for each package. Then whoever feels comfortable with an issue can comment on it, close it or update it. I'll also do a bulk update of all those 117 issues and change the priority to minor. Any objections? Cool. I would just add that cleaned up doesn't necessarily mean closed, it could just mean reviewed and left opened b/c it is still valid. That being said, my guess is that most of the really, really old ones could be marked as won't fix, although there are still some interesting ones in there, mostly having to do with adding features. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1113) fix for Document.getBoost() documentation
fix for Document.getBoost() documentation - Key: LUCENE-1113 URL: https://issues.apache.org/jira/browse/LUCENE-1113 Project: Lucene - Java Issue Type: Bug Components: Javadocs Affects Versions: 2.2 Reporter: Daniel Naber Priority: Minor Attachments: document-getboost.diff The attached patch fixes the javadoc to make clear that getBoost() will never return a useful value in most cases. I will commit this unless someone has a better wording or a real fix. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1113) fix for Document.getBoost() documentation
[ https://issues.apache.org/jira/browse/LUCENE-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Naber updated LUCENE-1113: - Attachment: document-getboost.diff > fix for Document.getBoost() documentation > - > > Key: LUCENE-1113 > URL: https://issues.apache.org/jira/browse/LUCENE-1113 > Project: Lucene - Java > Issue Type: Bug > Components: Javadocs >Affects Versions: 2.2 >Reporter: Daniel Naber >Priority: Minor > Attachments: document-getboost.diff > > > The attached patch fixes the javadoc to make clear that getBoost() will never > return a useful value in most cases. I will commit this unless someone has a > better wording or a real fix. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Fuzzy makes no sense for short tokens
Hi! It generally makes no sense to run a fuzzy search on short tokens, because changing even a single character already amounts to a large relative edit distance (i.e. a low similarity). So a fuzzy search actually only makes sense in this case: if( token.length() > 1f / (1f - minSimilarity) ) E.g. changing one character in a 3-letter token ("foo") leaves a similarity of only about 0.67 (one edit over three characters). And if minSimilarity (which is 0.5 by default :-) is higher, we can save all the expensive rewrite() logic. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
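A small sketch of the proposed shortcut (illustrative only, not committed code; the class and method names are invented). It relies on FuzzyTermEnum's rough similarity formula, similarity = 1 - editDistance / min(termLength, targetLength), so a single edit on a term of length L yields at best 1 - 1/L.

import org.apache.lucene.search.FuzzyQuery;

public class FuzzyLengthCheck {
  // True if a fuzzy rewrite can possibly be useful for this token: terms with
  // length <= 1 / (1 - minSimilarity) can never beat the threshold after even
  // one edit, so the expensive rewrite can be skipped for them.
  public static boolean worthFuzzyRewrite(String token, float minSimilarity) {
    return token.length() > 1f / (1f - minSimilarity);
  }

  public static void main(String[] args) {
    float minSim = FuzzyQuery.defaultMinSimilarity; // 0.5f
    System.out.println(worthFuzzyRewrite("foo", minSim)); // true: 3 > 2
    System.out.println(worthFuzzyRewrite("fo", minSim));  // false: 2 > 2 fails
  }
}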
[jira] Commented: (LUCENE-1113) fix for Document.getBoost() documentation
[ https://issues.apache.org/jira/browse/LUCENE-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555110 ] Doron Cohen commented on LUCENE-1113: - How about: {noformat} Returns, at indexing time, the boost factor as set by {@link #setBoost(float)}. Note that once a document is indexed this value is no longer available from the index. At search time, for retrieved documents, this method always returns 1. This however does not mean that the boost value set at indexing time was ignored - it was just combined with other indexing time factors and stored elsewhere, for better indexing and search performance. (For more info see the "norm(t,d)" part of the scoring formula in {@link org.apache.lucene.search.Similarity Similarity}.) {noformat} > fix for Document.getBoost() documentation > - > > Key: LUCENE-1113 > URL: https://issues.apache.org/jira/browse/LUCENE-1113 > Project: Lucene - Java > Issue Type: Bug > Components: Javadocs > Affects Versions: 2.2 > Reporter: Daniel Naber > Priority: Minor > Attachments: document-getboost.diff > > > The attached patch fixes the javadoc to make clear that getBoost() will never > return a useful value in most cases. I will commit this unless someone has a > better wording or a real fix. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
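A tiny illustration of the behavior that wording describes (a sketch, not part of the patch; the class name is invented): the boost is only meaningful before the document is indexed, and a Document retrieved from the index reports 1.0.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostExample {
  public static void main(String[] args) {
    Document doc = new Document();
    doc.add(new Field("title", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
    doc.setBoost(2.0f);
    System.out.println(doc.getBoost()); // 2.0 - only meaningful before indexing

    // After writer.addDocument(doc), the boost is folded into the field norms
    // (the "norm(t,d)" factor); a document retrieved via searcher.doc(id)
    // would return 1.0 from getBoost().
  }
}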
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Doron Cohen <[EMAIL PROTECTED]> wrote: > > I like the approach of configuration of this behavior in Analysis > > (and so IndexWriter can throw an exception on such errors). > > > > It seems that this should be a property of Analyzer vs. > > just StandardAnalyzer, right? > > > > It can probably be a "policy" property, with two parameters: > > 1) maxLength, 2) action: chop/split/ignore/raiseException when > > generating too long tokens. > > Agreed, this should be generic/shared to all analyzers. > > But maybe for 2.3, we just truncate any too-long term to the max > allowed size, and then after 2.3 we make this a settable "policy"? But we already have a nice component model for analyzers... why not just encapsulate truncation/discarding in a TokenFilter? -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> > wrote: > > Doron Cohen <[EMAIL PROTECTED]> wrote: > > > I like the approach of configuration of this behavior in Analysis > > > (and so IndexWriter can throw an exception on such errors). > > > > > > It seems that this should be a property of Analyzer vs. > > > just StandardAnalyzer, right? > > > > > > It can probably be a "policy" property, with two parameters: > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when > > > generating too long tokens. > > > > Agreed, this should be generic/shared to all analyzers. > > > > But maybe for 2.3, we just truncate any too-long term to the max > > allowed size, and then after 2.3 we make this a settable "policy"? > > But we already have a nice component model for analyzers... > why not just encapsulate truncation/discarding in a TokenFilter? Makes sense, especially for the implementation aspect. I'm not sure what API you have in mind: (1) leave that for applications, to append such a TokenFilter to their Analyzer (== no change), (2) DocumentsWriter to create such a TokenFilter under the cover, to force behavior that is defined (where?), or (3) have an IndexingTokenFilter assigned to IndexWriter, make the default such filter trim/ignore/whatever as discussed and then applications can set a different IndexingTokenFilter for changing the default behavior? I think I like the 3'rd option - is this what you meant? Doron
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: > > On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> > > wrote: > > > Doron Cohen <[EMAIL PROTECTED]> wrote: > > > > I like the approach of configuration of this behavior in Analysis > > > > (and so IndexWriter can throw an exception on such errors). > > > > > > > > It seems that this should be a property of Analyzer vs. > > > > just StandardAnalyzer, right? > > > > > > > > It can probably be a "policy" property, with two parameters: > > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when > > > > generating too long tokens. > > > > > > Agreed, this should be generic/shared to all analyzers. > > > > > > But maybe for 2.3, we just truncate any too-long term to the max > > > allowed size, and then after 2.3 we make this a settable "policy"? > > > > But we already have a nice component model for analyzers... > > why not just encapsulate truncation/discarding in a TokenFilter? > > > Makes sense, especially for the implementation aspect. > I'm not sure what API you have in mind: > > (1) leave that for applications, to append such a > TokenFilter to their Analyzer (== no change), > > (2) DocumentsWriter to create such a TokenFilter > under the cover, to force behavior that is defined (where?), or > > (3) have an IndexingTokenFilter assigned to IndexWriter, > make the default such filter trim/ignore/whatever as discussed > and then applications can set a different IndexingTokenFilter for > changing the default behavior? > > I think I like the 3'rd option - is this what you meant? I meant (1)... it leaves the core smaller. I don't see any reason to have logic to truncate or discard tokens in the core indexing code (except to handle tokens >16k as an error condition). Most of the time you want to catch those large tokens early on in the chain anyway (put the filter right after the tokenizer). Doing it later could cause exceptions or issues with other token filters that might not be expecting huge tokens. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
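A sketch of option (1) as composed by the application (the analyzer class and the 255 limit are made up for illustration; LengthFilter, StandardTokenizer, StandardFilter and LowerCaseFilter are existing Lucene classes): the length cap sits directly after the tokenizer, so no later filter ever sees an oversized token.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LengthLimitedAnalyzer extends Analyzer {
  private static final int MAX_TOKEN_LENGTH = 255; // arbitrary example limit

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new LengthFilter(ts, 1, MAX_TOKEN_LENGTH); // right after the tokenizer
    ts = new StandardFilter(ts);
    ts = new LowerCaseFilter(ts);
    return ts;
  }
}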
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote: On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: I think I like the 3'rd option - is this what you meant? I meant (1)... it leaves the core smaller. I don't see any reason to have logic to truncate or discard tokens in the core indexing code (except to handle tokens >16k as an error condition). I would agree here, with the exception that I want the option for it to be treated as an error. In some cases, I would be just as happy for it to silently ignore the token, or to log it. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: > > I meant (1)... it leaves the core smaller. > > I don't see any reason to have logic to truncate or discard tokens in > > the core indexing code (except to handle tokens >16k as an error > > condition). > > I would agree here, with the exception that I want the option for it > to be treated as an error. That should also be possible via an analyzer component throwing an exception. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote: On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote: I meant (1)... it leaves the core smaller. I don't see any reason to have logic to truncate or discard tokens in the core indexing code (except to handle tokens >16k as an error condition). I would agree here, with the exception that I want the option for it to be treated as an error. That should also be possible via an analyzer component throwing an exception. Sure, but I mean in the >16K (in other words, in the case where DocsWriter fails, which presumably only DocsWriter knows about) case. I want the option to ignore tokens larger than that instead of failing/ throwing an exception. Imagine I am charged w/ indexing some data that I don't know anything about (i.e. computer forensics), my goal would be to index as much as possible in my first raw pass, so that I can then begin to explore the dataset. Having it completely discard the document is not a good thing, but throwing away some large binary tokens would be acceptable (especially if I get warnings about said tokens) and robust. -Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Sure, but I mean in the >16K (in other words, in the case where > DocsWriter fails, which presumably only DocsWriter knows about) case. > I want the option to ignore tokens larger than that instead of failing/ > throwing an exception. I think the issue here is what the default behavior for IndexWriter should be. If configuration is required because something other than the default is desired, then one could use a TokenFilter to change the behavior rather than changing options on IndexWriter. Using a TokenFilter is much more flexible. > Imagine I am charged w/ indexing some data > that I don't know anything about (i.e. computer forensics), my goal > would be to index as much as possible in my first raw pass, so that I > can then begin to explore the dataset. Having it completely discard > the document is not a good thing, but throwing away some large binary > tokens would be acceptable (especially if I get warnings about said > tokens) and robust. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before you or your users "accidentally" see this. So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms? Then people can use a TokenFilter to change this behavior if they want. Mike Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > Sure, but I mean in the >16K (in other words, in the case where > > DocsWriter fails, which presumably only DocsWriter knows about) case. > > I want the option to ignore tokens larger than that instead of failing/ > > throwing an exception. > > I think the issue here is what the default behavior for IndexWriter should be. > > If configuration is required because something other than the default > is desired, then one could use a TokenFilter to change the behavior > rather than changing options on IndexWriter. Using a TokenFilter is > much more flexible. > > > Imagine I am charged w/ indexing some data > > that I don't know anything about (i.e. computer forensics), my goal > > would be to index as much as possible in my first raw pass, so that I > > can then begin to explore the dataset. Having it completely discard > > the document is not a good thing, but throwing away some large binary > > tokens would be acceptable (especially if I get warnings about said > > tokens) and robust. > > -Yonik > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 12:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > I actually think indexing should try to be as robust as possible. You > could test like crazy and never hit a massive term, go into production > (say, ship your app to lots of your customers' computers) only to > suddenly see this exception. In general it could be a long time before > you or your users "accidentally" see this. > > So I'm thinking we should have the default behavior, in IndexWriter, > be to skip immense terms? > > Then people can use a TokenFilter to change this behavior if they want. +1 -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1112) Document is partially indexed on an unhandled exception
[ https://issues.apache.org/jira/browse/LUCENE-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555119 ] Michael McCandless commented on LUCENE-1112: Thanks Doron; I'll fold this in (though, I'll move it to the testExceptionFromTokenStream case since it looks like we're going to no longer throw an exception on hitting a wicked-long-term). > Document is partially indexed on an unhandled exception > --- > > Key: LUCENE-1112 > URL: https://issues.apache.org/jira/browse/LUCENE-1112 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: lucene-1112-test.patch > > > With LUCENE-843, it's now possible for a subset of a document's > fields/terms to be indexed or stored when an exception is hit. This > was not the case in the past (it was "all or none"). > I plan to make it "all or none" again by immediately marking a > document as deleted if any exception is hit while indexing it. > Discussion leading up to this: > http://www.gossamer-threads.com/lists/lucene/java-dev/56103 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote: I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before you or your users "accidentally" see this. So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms? Then people can use a TokenFilter to change this behavior if they want. +1. We could log it, right? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Let's release Lucene 2.3 soon?
Michael McCandless wrote: > I just opened a new issue, which I think should be fixed for 2.3, to > fix IndexWriter.add/updateDocument to not "partially add" a document > when an exception is hit: > > https://issues.apache.org/jira/browse/LUCENE-1112 > > I'll try to work out a patch by Thu but it may be tight... > No rush! I can wait a couple more days before I create the branch. Let's say next Monday (8th)? -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1114) contrib/Highlighter javadoc example needs to be updated
contrib/Highlighter javadoc example needs to be updated --- Key: LUCENE-1114 URL: https://issues.apache.org/jira/browse/LUCENE-1114 Project: Lucene - Java Issue Type: Bug Components: contrib/* Reporter: Grant Ingersoll Priority: Trivial The Javadoc package.html example code is outdated, as it still uses QueryParser.parse. http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/contrib-highlighter/index.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1114) contrib/Highlighter javadoc example needs to be updated
[ https://issues.apache.org/jira/browse/LUCENE-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555145 ] Grant Ingersoll commented on LUCENE-1114: - It also only demonstrates using the Analyzer to get the tokenStream, and not term vectors (TokenSources) > contrib/Highlighter javadoc example needs to be updated > --- > > Key: LUCENE-1114 > URL: https://issues.apache.org/jira/browse/LUCENE-1114 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* > Reporter: Grant Ingersoll > Priority: Trivial > > The Javadoc package.html example code is outdated, as it still uses > QueryParser.parse. > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/contrib-highlighter/index.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
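For reference, a hedged sketch of what an updated example might look like against the 2.3-era contrib highlighter API (the wording of the committed package.html may differ; field names and text are made up): it uses the QueryParser instance constructor rather than the deprecated static QueryParser.parse, and pulls a TokenStream from the analyzer. A fuller example could also show TokenSources for term-vector-backed highlighting.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightExample {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    // instance constructor instead of the deprecated static QueryParser.parse
    Query query = new QueryParser("contents", analyzer).parse("lucene highlighter");

    String text = "The Lucene highlighter marks up matching terms in stored text.";
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
    System.out.println(highlighter.getBestFragment(tokens, text));
  }
}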
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll wrote: On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote: I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before you or your users "accidentally" see this. So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms? Then people can use a TokenFilter to change this behavior if they want. +1. We could log it, right? Yes, to IndexWriter's infoStream, if it's set. I'll do that... Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
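For completeness, a minimal sketch of the application side (the index path and setup are made up), assuming IndexWriter exposes a setInfoStream(PrintStream) setter in the 2.3 codebase as the reply above implies: with an infoStream set, diagnostics such as skipped immense terms are written there; with the default of null, nothing is logged.

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class InfoStreamExample {
  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setInfoStream(System.err); // null (the default) means no logging
    // ... writer.addDocument(...) would report skipped terms via the infoStream ...
    writer.close();
  }
}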
Build failed in Hudson: Lucene-Nightly #321
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/321/changes -- [...truncated 866 lines...] A contrib/db/bdb-je/src/java A contrib/db/bdb-je/src/java/org A contrib/db/bdb-je/src/java/org/apache A contrib/db/bdb-je/src/java/org/apache/lucene A contrib/db/bdb-je/src/java/org/apache/lucene/store A contrib/db/bdb-je/src/java/org/apache/lucene/store/je A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/File.java A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEDirectory.java A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEIndexInput.java A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEIndexOutput.java A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JELock.java A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/Block.java A contrib/db/bdb-je/build.xml A contrib/db/bdb A contrib/db/bdb/pom.xml.template A contrib/db/bdb/src A contrib/db/bdb/src/test A contrib/db/bdb/src/test/org A contrib/db/bdb/src/test/org/apache A contrib/db/bdb/src/test/org/apache/lucene A contrib/db/bdb/src/test/org/apache/lucene/store A contrib/db/bdb/src/test/org/apache/lucene/store/db A contrib/db/bdb/src/test/org/apache/lucene/store/db/DbStoreTest.java AU contrib/db/bdb/src/test/org/apache/lucene/store/db/SanityLoadLibrary.java A contrib/db/bdb/src/java A contrib/db/bdb/src/java/org A contrib/db/bdb/src/java/org/apache A contrib/db/bdb/src/java/org/apache/lucene A contrib/db/bdb/src/java/org/apache/lucene/store A contrib/db/bdb/src/java/org/apache/lucene/store/db AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/File.java AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/Block.java AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/DbDirectory.java A contrib/db/bdb/src/java/org/apache/lucene/store/db/DbIndexInput.java A contrib/db/bdb/src/java/org/apache/lucene/store/db/DbIndexOutput.java AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/DbLock.java A contrib/db/bdb/src/java/com A contrib/db/bdb/src/java/com/sleepycat A contrib/db/bdb/src/java/com/sleepycat/db AUcontrib/db/bdb/src/java/com/sleepycat/db/DbHandleExtractor.java AUcontrib/db/bdb/build.xml AUcontrib/db/build.xml A contrib/similarity A contrib/similarity/pom.xml.template A contrib/similarity/src A contrib/similarity/src/java A contrib/similarity/src/java/org A contrib/similarity/src/java/org/apache A contrib/similarity/src/java/org/apache/lucene A contrib/similarity/src/java/org/apache/lucene/search A contrib/similarity/src/java/org/apache/lucene/search/similar AU contrib/similarity/src/java/org/apache/lucene/search/similar/package.html AUcontrib/similarity/README.txt AUcontrib/similarity/.cvsignore AUcontrib/similarity/build.xml A contrib/swing A contrib/swing/pom.xml.template A contrib/swing/src A contrib/swing/src/test A contrib/swing/src/test/org A contrib/swing/src/test/org/apache A contrib/swing/src/test/org/apache/lucene A contrib/swing/src/test/org/apache/lucene/swing A contrib/swing/src/test/org/apache/lucene/swing/models A contrib/swing/src/test/org/apache/lucene/swing/models/TestSearchingList.java A contrib/swing/src/test/org/apache/lucene/swing/models/BaseTableModel.java A contrib/swing/src/test/org/apache/lucene/swing/models/TestUpdatingTable.java A contrib/swing/src/test/org/apache/lucene/swing/models/RestaurantInfo.java A contrib/swing/src/test/org/apache/lucene/swing/models/TableSearcherSimulator.java A contrib/swing/src/test/org/apache/lucene/swing/models/DataStore.java A contrib/swing/src/test/org/apache/lucene/swing/models/BaseListModel.java A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestUpdatingList.java A contrib/swing/src/test/org/apache/lucene/swing/models/ListSearcherSimulator.java A contrib/swing/src/test/org/apache/lucene/swing/models/TestBasicTable.java A contrib/swing/src/test/org/apache/lucene/swing/models/TestSearchingTable.java A contrib/swing/src/test/org/apache/lucene/swing/models/TestBasicList.java A contrib/swing/src/java A contrib/swing/src/java/org A contrib/swing/src/java/org/apache A contrib/swing/src/java/org/apache/lucene A contrib/swing/src/java/org/apache/lucene/swing A contrib/swing/src/java/org/apache/lucene/swing/models A contrib/swing/src/java/org/apache/lucene/swing/models/TableSearcher.java A contrib/swing/src/java/org/apache/luc
Re: DocumentsWriter.checkMaxTermLength issues
On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > I actually think indexing should try to be as robust as possible. You > could test like crazy and never hit a massive term, go into production > (say, ship your app to lots of your customers' computers) only to > suddenly see this exception. In general it could be a long time before > you or your users "accidentally" see this. > > So I'm thinking we should have the default behavior, in IndexWriter, > be to skip immense terms? > > Then people can use a TokenFilter to change this behavior if they want. > +1 At first I saw this as similar to IndexWriter.setMaxFieldLength(), but it was a wrong comparison, because #terms is a "real" indexing/search characteristic that many applications can benefit from being able to modify, whereas a huge token is in most cases a bug. Just to make sure about the scenario - the only change is to skip too-long tokens, while any other exception is thrown (not ignored). Also, for a skipped token I think the position increment of the following token should be incremented.