[jira] [Commented] (LUCENE-4880) Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640816#comment-13640816 ]

Tim Allison commented on LUCENE-4880:
-------------------------------------

Thank you!

Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory

Key: LUCENE-4880
URL: https://issues.apache.org/jira/browse/LUCENE-4880
Project: Lucene - Core
Issue Type: Bug
Components: core/index
Affects Versions: 4.2
Environment: Windows 7 (probably irrelevant)
Reporter: Tim Allison
Fix For: 5.0, 4.3
Attachments: LUCENE-4880.patch, MemoryIndexVsRamDirZeroLengthTermTest.java

MemoryIndex skips tokens that have length == 0 when building the index; the result is that it does not increment the token offset (nor does it store the position offsets if that option is set) for tokens of length == 0. A regular index (via, say, RAMDirectory) does not appear to do this.

When using the ICUFoldingFilter, it is possible to have a term of zero length (the \u0640 character separated by spaces). If that occurs in a document, the offsets returned at search time differ between the MemoryIndex and a regular index.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
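The mismatch described in the issue can be illustrated with a toy model. This is only a sketch and does not use Lucene: the class `ZeroLengthTermSketch`, the method `index`, and the `term@position` output format are all hypothetical, and it assumes every token carries a position increment of 1. With `skipEmpty` set, a zero-length term is dropped before its increment is applied, which is the behaviour the report attributes to MemoryIndex; without it, the empty term still consumes a slot, as the report says a regular index does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: not Lucene code. All names here are hypothetical.
public class ZeroLengthTermSketch {

    // Returns "term@position" for every token that gets recorded.
    // Assumes each token has a position increment of 1.
    static List<String> index(List<String> terms, boolean skipEmpty) {
        List<String> postings = new ArrayList<>();
        int position = -1;
        for (String term : terms) {
            if (skipEmpty && term.isEmpty()) {
                // Token is dropped before its increment is applied,
                // so every later token ends up one slot earlier.
                continue;
            }
            position++;
            postings.add(term + "@" + position);
        }
        return postings;
    }

    public static void main(String[] args) {
        // The middle token stands in for a term folded to zero length.
        List<String> terms = Arrays.asList("a", "", "b");
        System.out.println("skipping empty terms: " + index(terms, true));
        System.out.println("keeping empty terms:  " + index(terms, false));
    }
}
```

In this model the term `b` is recorded at position 1 when empty terms are skipped and at position 2 when they are kept, so anything that depends on those values sees different results from the two indexes.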
[jira] [Commented] (LUCENE-4880) Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614073#comment-13614073 ]

Robert Muir commented on LUCENE-4880:
-------------------------------------

Thanks for raising this, Timothy.

I think it's a bug in MemoryIndex: it shouldn't skip terms that are of zero length.
[jira] [Commented] (LUCENE-4880) Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614080#comment-13614080 ]

Uwe Schindler commented on LUCENE-4880:
---------------------------------------

Yes, I think this is a bug in MemoryIndex. In earlier Lucene versions I think we skipped empty terms in the standard IndexWriter, but that's no longer the case. So MemoryIndex must be consistent.
[jira] [Commented] (LUCENE-4880) Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614086#comment-13614086 ]

Robert Muir commented on LUCENE-4880:
-------------------------------------

I also think it's stupid you get U+0640 as a token by itself in any case. I don't agree with the Unicode property of letter for this character, as that doesn't make sense to me; in my opinion it should be format.

I sure hope there is some good reason for this, but to me it's crazy.