[jira] [Updated] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5098:
    Assignee: Adrien Grand

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
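Some context for readers: "bit selection" here means finding the position of the k-th set bit in a 64-bit word, which broadword algorithms compute without a bit-by-bit loop. A naive reference version (an illustrative sketch, not the attached patch) might look like:

```java
// Naive bit selection: return the position of the (r+1)-th set bit
// (r is 0-based) in a 64-bit word, or 72 when the word has fewer set
// bits -- the out-of-range convention used in the broadword selection
// literature. Illustration only, not the patch's code.
final class NaiveSelect {

  static int select(long word, int r) {
    for (int i = 0; i < 64; i++) {
      if ((word & (1L << i)) != 0) {
        if (r == 0) {
          return i; // this is the r-th set bit
        }
        r--;
      }
    }
    return 72; // fewer than r + 1 bits are set
  }

  public static void main(String[] args) {
    // 0b10110 has set bits at positions 1, 2 and 4
    System.out.println(select(0b10110L, 0)); // prints 1
    System.out.println(select(0b10110L, 2)); // prints 4
  }
}
```

Broadword select replaces this loop with a constant number of word-level operations, which is what makes it attractive for Elias-Fano style structures.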
[jira] [Updated] (LUCENE-5100) BaseDocIdSetTestCase
[ https://issues.apache.org/jira/browse/LUCENE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5100:
    Attachment: LUCENE-5100.patch

Thanks for the explanation, Robert. I tried to factor out some code between TestFixedBitSet and TestOpenBitSet by adding an abstraction level on top of both FixedBitSet and OpenBitSet, but its complexity made the tests even harder to read, so I think I won't touch the prevSetBit/nextSetBit/flip/... tests and will just add the tests from {{BaseDocIdSetTestCase}}. Updated patch. The modification in EliasFanoEncoder is there so that {{maxDoc - 1}} can always be passed as an upper bound, even when the set is empty (an assertion would trip otherwise). I think it is ready?

BaseDocIdSetTestCase
--------------------
Key: LUCENE-5100
URL: https://issues.apache.org/jira/browse/LUCENE-5100
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Attachments: LUCENE-5100.patch, LUCENE-5100.patch

As Robert said on LUCENE-5081, we would benefit from having common testing infrastructure for our DocIdSet implementations.
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13705827#comment-13705827 ]

Adrien Grand commented on LUCENE-5098:

It does.

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch
[jira] [Commented] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13705876#comment-13705876 ]

Adrien Grand commented on LUCENE-2750:

FYI I ran his benchmark, and the thing is that the version of Kamikaze he is using decompresses ints one by one instead of using routines that decompress a full block in one go. Here is the relevant part of the Kamikaze code base: https://github.com/linkedin/kamikaze/blob/master/src/main/java/com/kamikaze/pfordelta/PForDelta.java#L114 -- decompressBBitSlotsWithHardCodes is commented out in favor of decompressBBitSlots.

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13706372#comment-13706372 ]

Adrien Grand commented on LUCENE-5098:

bq. A safe conclusion is that moving selectNaive to the test cases now would be premature.

OK.

bq. I have not actually benchmarked rank9 against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and that it can be made package-private.

I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressive since bitCount is supposed to be an intrinsic). We used to have a utility method in BitUtil to compute pop counts on longs but we removed it in LUCENE-2221 in favor of Long.bitCount.

bq. How about putting the assembly version in BitUtil?

Or ToStringUtils?

bq. Should LuceneTestCase also be mentioned in the wiki at How to contribute?

We try to keep that page as concise as possible, so I added a mention of it at https://wiki.apache.org/lucene-java/DeveloperTips.

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch
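For reference, the kind of Long.bitCount-based rank that rank9 is being benchmarked against can be sketched as follows (an illustration, not BitUtil's actual code):

```java
// Popcount-based rank: count the set bits strictly before bit position i
// in a long[] bitset, relying on the Long.bitCount intrinsic mentioned
// above. Assumes i < 64 * bits.length. Illustration only.
final class PopcountRank {

  static long rank(long[] bits, int i) {
    final int word = i >>> 6;
    long count = 0;
    for (int w = 0; w < word; w++) {
      count += Long.bitCount(bits[w]); // one popcount per full word
    }
    // mask off bit i and everything above it in the last word;
    // Java shifts are mod 64, so (1L << i) - 1 is 0 when (i & 63) == 0
    return count + Long.bitCount(bits[word] & ((1L << i) - 1));
  }
}
```

rank9 avoids the linear scan over words by precomputing block counts, which is why it only pays off on large bitsets; on a single word, a plain Long.bitCount call is hard to beat.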
[jira] [Closed] (LUCENE-5105) IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect
[ https://issues.apache.org/jira/browse/LUCENE-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-5105.
    Resolution: Invalid

IndexOptions only apply to the inverted index. For term vectors, please use the FieldType.setStoreTermVectors* methods.

IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect
-------------------------------------------------------------------
Key: LUCENE-5105
URL: https://issues.apache.org/jira/browse/LUCENE-5105
Project: Lucene - Core
Issue Type: Bug
Environment: In Lucene 4.2
Reporter: milesli

In Lucene 4.2, setting indexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect on term vectors: positions and offsets are not stored with the term vector. I have to set StoreTermVectorOffsets to true and StoreTermVectorPositions to true for it to work.
[jira] [Resolved] (LUCENE-5100) BaseDocIdSetTestCase
[ https://issues.apache.org/jira/browse/LUCENE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5100.
    Resolution: Fixed
    Fix Version/s: 4.5

BaseDocIdSetTestCase
--------------------
Key: LUCENE-5100
URL: https://issues.apache.org/jira/browse/LUCENE-5100
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Fix For: 4.5
Attachments: LUCENE-5100.patch, LUCENE-5100.patch

As Robert said on LUCENE-5081, we would benefit from having common testing infrastructure for our DocIdSet implementations.
[jira] [Updated] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-2750:
    Attachment: LUCENE-2750.patch

I wrote an implementation of a PForDeltaDocIdSet based on the ones in Kamikaze and D. Lemire's JavaFastPFOR (both are licensed under the ASL 2.0). Unlike the original implementation, it uses FOR to encode exceptions (this was easier given that we already have lots of utility methods to pack integers).

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Attachments: LUCENE-2750.patch
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Commented] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707268#comment-13707268 ]

Adrien Grand commented on LUCENE-5101:

A quick note about the alternative DocIdSet implementations we now have. I wrote a benchmark (attached) to see how they compare to FixedBitSet; you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html Please note that EliasFanoDocIdSet is disadvantaged for advance() since it doesn't have an index yet; it will be interesting to run this benchmark again when it gets one. Maybe we could use these numbers to pick better defaults in CWF? (and only use FixedBitSet for dense sets, for example)

make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707384#comment-13707384 ]

Adrien Grand commented on LUCENE-5098:

Committed. Thanks Paul and Dawid!

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Updated] (LUCENE-5109) EliasFano value index
[ https://issues.apache.org/jira/browse/LUCENE-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5109:
    Assignee: Adrien Grand

EliasFano value index
---------------------
Key: LUCENE-5109
URL: https://issues.apache.org/jira/browse/LUCENE-5109
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5109.patch

Index the upper bits of the Elias-Fano sequence.
[jira] [Commented] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708113#comment-13708113 ]

Adrien Grand commented on LUCENE-5101:

bq. Do WAH8 and PFOR already have an index?

They do, but the index is naive: it is a plain binary search over a subset of the (docID, position) pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee O(log(cardinality)) advance performance.

bq. Block decoding might still be added to EliasFano, which should improve its nextDoc() performance

The main use-case I see for these sets is to be used as filters, so I think advance() performance is more important?

bq. The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

Well, the PFOR doc ID set is not tuned either. :-) But I agree this is a good surprise for the Elias-Fano set. I mean, even the WAH8 doc ID set should be pretty fast and is still slower than the Elias-Fano set.

bq. Another surprise is that Elias-Fano is best at advance() among the compressed sets for some cases.

That means that Long.bitCount() is doing well on the upper bits then. I'm looking forward to the index. :-)

bq. For bit densities above 1/2 there is a clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

I guess so.
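The low/high-bit split that this Elias-Fano discussion revolves around can be sketched as follows. This is a deliberately simplified encoder, not Lucene's EliasFanoEncoder: the low bits are kept one per long instead of packed, and get() does a naive linear select over the upper bits, which is exactly the part an index and Long.bitCount speed up.

```java
// Simplified Elias-Fano split of a non-decreasing sequence of
// non-negative values, to illustrate the "upper bits" that select and
// Long.bitCount operate on. Not Lucene's EliasFanoEncoder; i must be < n.
final class EliasFanoSketch {
  final int lowBits;  // number of low bits stored verbatim per value
  final long[] lows;  // low bits of each value (left unpacked for clarity)
  final long[] upper; // upper bits, unary-coded: bit ((v >>> lowBits) + i) is set

  EliasFanoSketch(long[] sorted, long upperBound) {
    final int n = sorted.length;
    // the classic choice: roughly floor(log2(upperBound / n)) low bits
    lowBits = Math.max(0,
        63 - Long.numberOfLeadingZeros(Math.max(1L, upperBound / Math.max(1, n))));
    lows = new long[n];
    final long upperLen = (upperBound >>> lowBits) + n + 1;
    upper = new long[(int) ((upperLen + 63) >>> 6)];
    for (int i = 0; i < n; i++) {
      lows[i] = sorted[i] & ((1L << lowBits) - 1);
      final long pos = (sorted[i] >>> lowBits) + i; // strictly increasing
      upper[(int) (pos >>> 6)] |= 1L << (pos & 63);
    }
  }

  // decode value i: select the i-th set bit in the upper bits (naively
  // here), subtract i to recover the high part, then glue the low bits on
  long get(int i) {
    int seen = 0;
    for (long pos = 0; ; pos++) {
      if ((upper[(int) (pos >>> 6)] & (1L << (pos & 63))) != 0) {
        if (seen == i) {
          return ((pos - i) << lowBits) | lows[i];
        }
        seen++;
      }
    }
  }
}
```

Because the positions of set upper bits grow strictly with i, advance() can skip whole upper words by popcounting them, which is why bitCount shows up so prominently in the numbers above.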
make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir
Attachments: LUCENE-5101.patch

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Updated] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5098:
    Fix Version/s: 4.5

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Fix For: 4.5
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Resolved] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5098.
    Resolution: Fixed

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Created] (LUCENE-5111) Fix WordDelimiterFilter
Adrien Grand created LUCENE-5111:

Summary: Fix WordDelimiterFilter
Key: LUCENE-5111
URL: https://issues.apache.org/jira/browse/LUCENE-5111
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand

WordDelimiterFilter is documented as broken in TestRandomChains (LUCENE-4641). Given how widely used it is, we should try to fix it.
[jira] [Created] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
Adrien Grand created LUCENE-5113:

Summary: Allow for packing the pending values of our AppendingLongBuffers
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
[jira] [Updated] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
[ https://issues.apache.org/jira/browse/LUCENE-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5113:
    Attachment: LUCENE-5113.patch

Here is a patch: it adds a new freeze() method that packs the pending values into the (Monotonic)AppendingLongBuffer. This freeze method is used for ordinal maps, index sorting and FieldCache.

Allow for packing the pending values of our AppendingLongBuffers
----------------------------------------------------------------
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5113.patch

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
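The freeze() idea can be illustrated with a toy buffer (a sketch under the stated trade-off, not the actual (Monotonic)AppendingLongBuffer code): while mutable it keeps a plain long[] of pending values; freezing repacks them with just enough bits per value and makes the buffer read-only.

```java
// Toy freezable buffer: growable long storage that, once frozen, repacks
// its values with the minimum number of bits per value. Assumes
// non-negative values. Illustration of the patch's trade-off only.
final class FreezableBuffer {
  private long[] pending = new long[16];
  private int size = 0;
  private long[] packed;   // non-null once frozen
  private int bitsPerValue;

  void add(long value) {
    if (packed != null) {
      throw new IllegalStateException("buffer is frozen, read-only");
    }
    if (size == pending.length) {
      pending = java.util.Arrays.copyOf(pending, 2 * size);
    }
    pending[size++] = value;
  }

  void freeze() {
    long max = 0;
    for (int i = 0; i < size; i++) {
      max |= pending[i];
    }
    bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    packed = new long[(size * bitsPerValue + 63) / 64 + 1]; // +1 pad for straddling reads
    for (int i = 0; i < size; i++) {
      final int bitPos = i * bitsPerValue, word = bitPos >>> 6, shift = bitPos & 63;
      packed[word] |= pending[i] << shift;
      if (shift + bitsPerValue > 64) { // value straddles two words
        packed[word + 1] |= pending[i] >>> (64 - shift);
      }
    }
    pending = null; // the pending array can now be reclaimed
  }

  long get(int i) {
    if (packed == null) {
      return pending[i]; // not frozen yet: read the pending array directly
    }
    final int bitPos = i * bitsPerValue, word = bitPos >>> 6, shift = bitPos & 63;
    long v = packed[word] >>> shift;
    if (shift + bitsPerValue > 64) {
      v |= packed[word + 1] << (64 - shift);
    }
    return v & (bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1);
  }
}
```

The drawback stated in the issue shows up as the IllegalStateException in add(): once frozen, the structure is read-only.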
[jira] [Created] (LUCENE-5115) Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
Adrien Grand created LUCENE-5115:

Summary: Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
Key: LUCENE-5115
URL: https://issues.apache.org/jira/browse/LUCENE-5115
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

DocIdSetIterator.cost() accuracy can be important for the performance of some queries (e.g. ConjunctionScorer). Since WAH8DocIdSet is immutable, we could compute its cardinality at building time and use it for the cost function.
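The idea above (hypothetical sketch, not WAH8DocIdSet's actual builder) is cheap to implement for any write-once set: maintain the count while documents are added, so an exact cardinality is available for free when cost() is later asked for.

```java
// Toy builder for an immutable doc ID set that maintains its cardinality
// incrementally at building time, so an iterator's cost() could return an
// exact value instead of an estimate. Hypothetical illustration only.
final class CountingBitsBuilder {
  private final long[] words;
  private int cardinality;
  private int lastDoc = -1;

  CountingBitsBuilder(int maxDoc) {
    words = new long[(maxDoc + 63) >>> 6];
  }

  void add(int doc) {
    if (doc <= lastDoc) {
      throw new IllegalArgumentException("doc IDs must be added in increasing order");
    }
    words[doc >>> 6] |= 1L << (doc & 63);
    cardinality++; // counted once per doc, no post-build popcount pass needed
    lastDoc = doc;
  }

  /** Exact cardinality, the value cost() could return. */
  int cardinality() {
    return cardinality;
  }

  boolean get(int doc) {
    return (words[doc >>> 6] & (1L << (doc & 63))) != 0;
  }
}
```

Since builders of this kind already see every document exactly once, the extra increment costs essentially nothing compared to re-deriving the cardinality from the compressed representation afterwards.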
[jira] [Resolved] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
[ https://issues.apache.org/jira/browse/LUCENE-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5113.
    Resolution: Fixed
    Fix Version/s: 4.5

Allow for packing the pending values of our AppendingLongBuffers
----------------------------------------------------------------
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Fix For: 4.5
Attachments: LUCENE-5113.patch

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
[jira] [Commented] (LUCENE-5117) DISI.iterator() should never return null.
[ https://issues.apache.org/jira/browse/LUCENE-5117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709890#comment-13709890 ]

Adrien Grand commented on LUCENE-5117:

+1

DISI.iterator() should never return null.
-----------------------------------------
Key: LUCENE-5117
URL: https://issues.apache.org/jira/browse/LUCENE-5117
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir

If you have a Filter, you have to check for null twice: Filter.getDocIdSet() can return a null DocIdSet, and then DocIdSet.iterator() can return a null iterator. There is no reason for this: I think iterator() should never return null (consistent with the terms/postings APIs).
[jira] [Updated] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5101:
    Attachment: DocIdSetBenchmark.java

Well spotted. Maybe I made a mistake when moving the data from the benchmark output to the charts. I modified the program so that it directly outputs the input of the charts. See the updated charts at http://people.apache.org/~jpountz/doc_id_sets.html. I also modified it so that memory uses a log scale too.

make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir
Attachments: DocIdSetBenchmark.java, LUCENE-5101.patch

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Updated] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-2750:
    Attachment: LUCENE-2750.patch

Updated patch: DISI.cost() now returns the cardinality of the set, computed at building time.

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Attachments: LUCENE-2750.patch, LUCENE-2750.patch
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Closed] (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-2949.
    Resolution: Won't Fix
    Fix Version/s: (was: 4.4)

There is no TermVectorMapper anymore.

FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
-------------------------------------------------------------------------------------
Key: LUCENE-2949
URL: https://issues.apache.org/jira/browse/LUCENE-2949
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 3.0.3, 4.0-ALPHA
Reporter: Grant Ingersoll
Priority: Minor
Labels: FastVectorHighlighter, Highlighter
Attachments: LUCENE-2949.patch

Based on my reading of the FieldTermStack constructor that loads the vector from disk, we could probably save a bunch of time and memory by using the TermVectorMapper callback mechanism instead of materializing the full array of terms into memory and then throwing most of them out.
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4734:
    Attachment: LUCENE-4734.patch

Ryan, I iterated on your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position. It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to LUCENE-2878 so that highlighting can be more efficient and no longer needs to duplicate the query interpretation logic.

FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
--------------------------------------------------------------------
Key: LUCENE-4734
URL: https://issues.apache.org/jira/browse/LUCENE-4734
Project: Lucene - Core
Issue Type: Bug
Components: modules/highlighter
Affects Versions: 4.0, 4.1, 5.0
Reporter: Ryan Lauck
Labels: fastvectorhighlighter, highlighter
Fix For: 4.4
Attachments: lucene-4734.patch, LUCENE-4734.patch

If a proximity phrase query overlaps with any other query term it will not be highlighted.

Example text: A B C D E F G

Example queries:
"B E"~10 D (D will be highlighted instead of B C D E)
"B E"~10 "C F"~10 (nothing will be highlighted)

This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D, which will not be found in the submap for "B E"~10 and will trigger a failed match.
[jira] [Closed] (LUCENE-4118) FastVectorHighlighter fail to highlight taking in input some proximity query.
[ https://issues.apache.org/jira/browse/LUCENE-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand closed LUCENE-4118. Resolution: Duplicate Duplicate of LUCENE-4734 FastVectorHighlighter fail to highlight taking in input some proximity query. - Key: LUCENE-4118 URL: https://issues.apache.org/jira/browse/LUCENE-4118 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 3.4, 5.0 Reporter: Emanuele Lombardi Assignee: Koji Sekiguchi Labels: FastVectorHighlighter Attachments: FVHPatch.txt There are 2 related bugs with proximity queries: 1) If a phrase contains n repeated terms, the FVH module fails to highlight it (see testRepeatedTermsWithSlop). 2) If you search the terms reversed, the FVH module fails to highlight them (see testReversedTermsWithSlop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable
[ https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4542: - Assignee: Adrien Grand (was: Chris Male) Make RECURSION_CAP in HunspellStemmer configurable -- Key: LUCENE-4542 URL: https://issues.apache.org/jira/browse/LUCENE-4542 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Reporter: Piotr Assignee: Adrien Grand Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, LUCENE-4542-with-solr.patch Currently there is private static final int RECURSION_CAP = 2; in the code of the HunspellStemmer class. It makes using hunspell with several dictionaries almost unusable due to bad performance (e.g. it costs 36 ms to stem a long sentence in Latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed. AFAIK this number (2) was chosen arbitrarily. (This is the first issue I have ever filed, so please forgive any mistakes.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable
[ https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4542. -- Resolution: Fixed Fix Version/s: 4.5 Committed, thanks! Make RECURSION_CAP in HunspellStemmer configurable -- Key: LUCENE-4542 URL: https://issues.apache.org/jira/browse/LUCENE-4542 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Reporter: Piotr Assignee: Adrien Grand Fix For: 4.5 Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, LUCENE-4542-with-solr.patch Currently there is private static final int RECURSION_CAP = 2; in the code of the HunspellStemmer class. It makes using hunspell with several dictionaries almost unusable due to bad performance (e.g. it costs 36 ms to stem a long sentence in Latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed. AFAIK this number (2) was chosen arbitrarily. (This is the first issue I have ever filed, so please forgive any mistakes.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
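The change discussed in this issue boils down to a common pattern: turning a hard-coded private static final constant into a constructor parameter whose default preserves the old behavior. A minimal sketch of that pattern (class and method names here are illustrative, not Lucene's actual HunspellStemmer API):

```java
// Sketch: the hard-coded cap becomes a constructor argument, with the
// historical value kept as the default so existing callers are unchanged.
public class ConfigurableStemmer {
    // Historical hard-coded value, kept as the default.
    public static final int DEFAULT_RECURSION_CAP = 2;

    private final int recursionCap;

    public ConfigurableStemmer() {
        this(DEFAULT_RECURSION_CAP);
    }

    public ConfigurableStemmer(int recursionCap) {
        if (recursionCap < 0) {
            throw new IllegalArgumentException("recursionCap must be >= 0: " + recursionCap);
        }
        this.recursionCap = recursionCap;
    }

    public int getRecursionCap() {
        return recursionCap;
    }
}
```

Users who hit the performance issue described above could then pass a lower cap (e.g. 1) explicitly, without affecting anyone relying on the default.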
[jira] [Commented] (LUCENE-5119) DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory
[ https://issues.apache.org/jira/browse/LUCENE-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713408#comment-13713408 ] Adrien Grand commented on LUCENE-5119: -- +1 I think it makes sense to make DiskDV deserve its name and store everything on disk. DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory -- Key: LUCENE-5119 URL: https://issues.apache.org/jira/browse/LUCENE-5119 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5119.patch These are accessed sequentially when e.g. faceting, and can be a fairly large amount of data (based on # of docs and # of unique terms). I think this was done so that conceptually random access to a specific docid would be faster than eg. stored fields, but I think we should instead target the DV datastructures towards real use cases (faceting,sorting,grouping,...) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5119) DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory
[ https://issues.apache.org/jira/browse/LUCENE-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713414#comment-13713414 ] Adrien Grand commented on LUCENE-5119: -- David, I think your use-case would still work pretty well with this change. In particular, if you have enough memory to store your ordinals mapping in memory, the file-system cache will likely be able to cache the whole ordinals mapping as well (you may just need to decrease the amount of memory given to the JVM a little), so random access should remain fast? DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory -- Key: LUCENE-5119 URL: https://issues.apache.org/jira/browse/LUCENE-5119 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5119.patch These are accessed sequentially when e.g. faceting, and can be a fairly large amount of data (based on # of docs and # of unique terms). I think this was done so that conceptually random access to a specific docid would be faster than e.g. stored fields, but I think we should instead target the DV datastructures towards real use cases (faceting, sorting, grouping, ...) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-5115) Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
[ https://issues.apache.org/jira/browse/LUCENE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5115. -- Resolution: Fixed Fix Version/s: 4.5 Make WAH8DocIdSet compute its cardinality at building time and use it for cost() Key: LUCENE-5115 URL: https://issues.apache.org/jira/browse/LUCENE-5115 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.5 Attachments: LUCENE-5115.patch DocIdSetIterator.cost() accuracy can be important for the performance of some queries (e.g. ConjunctionScorer). Since WAH8DocIdSet is immutable, we could compute its cardinality at building time and use it for the cost function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
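The idea behind LUCENE-5115 generalizes well: for an immutable set, counting the members once while building costs almost nothing, and cost() can then return an exact value in O(1) instead of an estimate. A rough sketch of that structure using java.util.BitSet (this is not the real WAH8DocIdSet, whose encoding is word-aligned hybrid compression):

```java
import java.util.BitSet;

// Sketch: an immutable doc-id set whose cardinality is computed once at
// build time, so cost() is exact and requires no per-call scan.
public class ImmutableDocIdSet {
    private final BitSet bits;
    private final long cardinality;

    private ImmutableDocIdSet(BitSet bits) {
        this.bits = bits;
        this.cardinality = bits.cardinality(); // paid once, at build time
    }

    // Exact cost, useful e.g. to order clauses in a conjunction.
    public long cost() {
        return cardinality;
    }

    public boolean contains(int docId) {
        return bits.get(docId);
    }

    public static class Builder {
        private final BitSet bits = new BitSet();

        public Builder add(int docId) {
            bits.set(docId);
            return this;
        }

        public ImmutableDocIdSet build() {
            return new ImmutableDocIdSet(bits);
        }
    }
}
```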
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5057: - Assignee: Adrien Grand Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna Assignee: Adrien Grand The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by exactly the same original word. Example for the Dutch language I'm working with: fiets (which means bicycle in Dutch); its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I only get fiets indexed. When I query for fietsen whatever, I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? 
I also wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. I would love to contribute on this and I'm looking forward to your feedback. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713480#comment-13713480 ] Adrien Grand commented on LUCENE-4734: -- Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code's programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of to avoid bad surprises). FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5057. -- Resolution: Won't Fix I checked with Luca and this is a dictionary issue: fietsen and fiets are both considered stems of fietsen with the Dutch dictionary. For people who have stemming issues, it is very easy to check whether the issue is in Lucene or in the dictionary: install hunspell-tools (apt-get install hunspell-tools on Debian and related distributions) and run:
{noformat}
% echo fietsen > tmp
% /usr/lib/hunspell/analyze nl_NL.aff nl_NL.dic tmp
fietsen
analyze(fietsen) = st:fietsen
analyze(fietsen) = st:fiets fl:N
stem(fietsen) = fietsen
stem(fietsen) = fiets
{noformat}
In this particular case, we can see that fietsen is both a stem (1st line) and a variation of fiets with the affix identified by N. Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna Assignee: Adrien Grand The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. 
All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by exactly the same original word. Example for the Dutch language I'm working with: fiets (which means bicycle in Dutch); its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I only get fiets indexed. When I query for fietsen whatever, I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I also wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. I would love to contribute on this and I'm looking forward to your feedback. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4734: - Assignee: Adrien Grand FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4734. -- Resolution: Fixed FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5122) DiskDV probably shouldnt use BlockPackedReader for SortedDV doc-to-ord
[ https://issues.apache.org/jira/browse/LUCENE-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713670#comment-13713670 ] Adrien Grand commented on LUCENE-5122: -- For SortingMP, we only provide the ability to sort by a NumericDocValues field out of the box because numbers feel more natural for defining a static rank. Maybe another case where BlockPackedReader could help is if almost all documents have the same value. In that case BlockPackedReader will be able to require 0 bits per value for all blocks that contain a single unique value. But I agree PackedInts would likely be better in general and remove one level of indirection. DiskDV probably shouldnt use BlockPackedReader for SortedDV doc-to-ord -- Key: LUCENE-5122 URL: https://issues.apache.org/jira/browse/LUCENE-5122 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir I don't think blocking provides any benefit here in general. we can assume the ordinals are essentially random and since SortedDV is single-valued, it's probably better to just use the simpler packedints directly? I guess the only case where it would help is if you sorted your segments by that DV field. But that seems kinda weird/esoteric to sort your index by a deref'ed string value, e.g. I don't think it's even supported by SortingMP. For the SortedSet ord stream, this can exceed 2B values so for now I think it should stay as blockpackedreader. but it could use a large blocksize... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5124) fix+document+rename DiskDV to Lucene45
[ https://issues.apache.org/jira/browse/LUCENE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713856#comment-13713856 ] Adrien Grand commented on LUCENE-5124: -- +1 fix+document+rename DiskDV to Lucene45 -- Key: LUCENE-5124 URL: https://issues.apache.org/jira/browse/LUCENE-5124 Project: Lucene - Core Issue Type: New Feature Affects Versions: 4.5 Reporter: Robert Muir The idea is that the default implementation should not hold everything in memory, we can have a Memory impl for that. I think stuff being all in heap memory is just a relic of FieldCache. In my benchmarking diskdv works well, and it's much easier to manage (keep a smaller heap, leave it to the OS, no OOMs etc from merging large FSTs, ...) If someone wants to optimize by forcing everything in memory, they can then use the usual approach (e.g. just use FileSwitchDirectory, or pick Memory for even more efficient stuff). I'll keep the issue here for a bit. If we decide to do this, I'll work up file format docs and so on. We should also fix a few things that are not great about it (LUCENE-5122) before making it the default. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reopened LUCENE-4734: -- FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4734: - Attachment: LUCENE-4734-2.patch The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix. FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715461#comment-13715461 ] Adrien Grand commented on LUCENE-4734: -- I agree this seems wasteful. Maybe we could open an issue about it? FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5131) CheckIndex is confusing for docvalues fields
[ https://issues.apache.org/jira/browse/LUCENE-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719404#comment-13719404 ] Adrien Grand commented on LUCENE-5131: -- Definitely +1 for this patch and printing statistics about unique value counts for SORTED and SORTED_SET. CheckIndex is confusing for docvalues fields Key: LUCENE-5131 URL: https://issues.apache.org/jira/browse/LUCENE-5131 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5131.patch, LUCENE-5131.patch it prints things like: {noformat} test: docvalues...OK [0 total doc count; 18 docvalues fields] {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720476#comment-13720476 ] Adrien Grand commented on LUCENE-4876: -- bq. We keep clone() on IWC, and the rest of the objects, and tell users that it's their responsibility to call IWC.clone() before passing to IW? That's like a 1-liner change (well + clarifying the jdocs), that will make 99% of the users happy. The rest should just do new IW(dir, conf.clone()) ... that's simple enough? Even though most users probably don't reuse their IndexWriterConfig objects, doing so should be safe and I'm a little scared of what could happen if a ConcurrentMergeScheduler was mistakenly shared by two different IndexWriters for example. Maybe another option for this issue would be to replace all these objects (MergePolicy, MergeScheduler, etc.) in IndexWriterConfig by factories for these objects that accept an IndexWriter as an argument (and maybe other objects depending on the factory). This would make it clear that IndexWriter has its own instance of these objects and reusing IndexWriterConfig instances would still be safe. An interesting side-effect is that we wouldn't need these SetOnce<?> instances in DWPT, FlushPolicy, and MergePolicy anymore, and ConcurrentMergeScheduler.indexWriter could be made final. IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
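The factory alternative floated in the comment above could look roughly like the following sketch. All names here are hypothetical (this is not Lucene's API); the point is that the config carries a stateless factory rather than a stateful scheduler, so every writer gets a private instance and reusing one config across writers is safe by construction:

```java
// Sketch of the "factories instead of stateful objects" idea: the writer
// asks the factory for its own scheduler at construction time, so the
// field can be final and no SetOnce guard is needed.
public class SchedulerFactorySketch {
    interface MergeScheduler {}

    interface MergeSchedulerFactory {
        MergeScheduler newMergeScheduler(IndexWriterLike writer);
    }

    static class ConfigLike {
        final MergeSchedulerFactory factory; // stateless, freely shareable

        ConfigLike(MergeSchedulerFactory factory) {
            this.factory = factory;
        }
    }

    static class IndexWriterLike {
        final MergeScheduler mergeScheduler; // final: one instance per writer

        IndexWriterLike(ConfigLike config) {
            this.mergeScheduler = config.factory.newMergeScheduler(this);
        }
    }

    public static void main(String[] args) {
        ConfigLike config = new ConfigLike(w -> new MergeScheduler() {});
        IndexWriterLike w1 = new IndexWriterLike(config);
        IndexWriterLike w2 = new IndexWriterLike(config); // same config, no clone()
        System.out.println(w1.mergeScheduler != w2.mergeScheduler); // distinct instances
    }
}
```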
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720526#comment-13720526 ] Adrien Grand commented on LUCENE-4876: -- bq. This is currently impossible because of SetOnce. The merge schedulers don't have a SetOnce<IndexWriter> so if a user replaces the MergePolicy and all objects that have a SetOnce in their IndexWriterConfig and forgets the merge scheduler, the problem remains. I don't really like this SetOnce<?> trick. If a variable should only be set once, it should be final and set in the constructor? bq. how cruel it is to expose clone semantics on end-users I fully agree. In this issue I tried to make clone consistently used across stateful objects held by an IndexWriterConfig object but ideally IndexWriterConfig should only carry stateless objects (in particular none of them should have an IndexWriter as a member) so that we never need to clone it or any of its members when reusing it. IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
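For readers unfamiliar with the idiom being debated, SetOnce is essentially a write-once holder: the first set() wins and any later set() throws. Lucene's real class is org.apache.lucene.util.SetOnce; this minimal sketch (with a different name to avoid confusion) just shows the idea:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the write-once holder discussed above. The first
// set() succeeds; any subsequent set() throws IllegalStateException.
public class WriteOnce<T> {
    private final AtomicBoolean written = new AtomicBoolean(false);
    private volatile T value;

    public void set(T v) {
        // compareAndSet makes the "only one writer wins" check atomic.
        if (!written.compareAndSet(false, true)) {
            throw new IllegalStateException("value already set");
        }
        value = v;
    }

    public T get() {
        return value; // null until set() has been called
    }
}
```

The comment's objection is that this runtime guard is a weaker substitute for a final field assigned in the constructor, which enforces the same invariant at compile time.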
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720806#comment-13720806 ] Adrien Grand commented on LUCENE-4876: -- The SetOnce<IndexWriter> on IWC addresses my main concern. Thanks Shai! IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merge threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
[jira] [Commented] (LUCENE-5127) FixedGapTermsIndex should use monotonic compression
[ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720856#comment-13720856 ] Adrien Grand commented on LUCENE-5127: -- This is a very nice cleanup! In FixedGapTermsIndexWriter, I think we could improve the buffering of offsets and addresses by buffering directly into a MonotonicBlockPackedWriter over a RAMOutputStream, and then copying the raw content of the RAMOutputStream to the IndexOutput. This would avoid an extra encoding/decoding step. FixedGapTermsIndex should use monotonic compression --- Key: LUCENE-5127 URL: https://issues.apache.org/jira/browse/LUCENE-5127 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch for the addresses in the big in-memory byte[] and disk blocks, we could save a good deal of RAM here. I think this codec just never got upgraded when we added these new packed improvements, but it might be interesting to try to use for the terms data of sorted/sortedset DV implementations. The patch works, but has nocommits and currently ignores the divisor. The annoying problem there is that we have the shared interface with get(int) for PackedInts.Mutable/Reader, but no equivalent base class for monotonic get(long)... Still, it's enough that we could benchmark/compare for now.
[jira] [Created] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
Adrien Grand created LUCENE-5140: Summary: Slowdown of the span queries caused by LUCENE-4946 Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Updated] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5140: - Attachment: LUCENE-5140.patch I think it is due to some overhead of our TimSorter implementation on small arrays. Here is a patch that replaces TimSorter with InPlaceMergeSorter, which should perform better on very small arrays but still has optimizations for sorted content, e.g. merging two sorted slices is a no-op if the highest element of the 1st slice is lower than the lowest element of the 2nd slice. luceneutil seems to be happy with this patch (left is trunk, right is with the patch applied):
{noformat}
LowSpanNear    143.65 (4.5%)   157.75 (3.9%)    9.8% ( 1% - 19%)
HighSpanNear     5.47 (4.4%)     6.20 (9.7%)   13.4% ( 0% - 28%)
MedSpanNear     94.27 (3.7%)   107.51 (3.7%)   14.1% ( 6% - 22%)
{noformat}
Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
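The no-op shortcut described above (skipping the merge when two adjacent sorted slices are already in order) can be sketched on a plain int[]; this is a simplified illustration, not Lucene's actual InPlaceMergeSorter, which merges without a temporary copy:

```java
import java.util.Arrays;

// Sketch of the "merge is a no-op" shortcut: when merging two adjacent
// sorted slices [from, mid) and [mid, to), if the last element of the
// first slice is <= the first element of the second, the whole range is
// already sorted and no work is needed.
class SortedSliceMerge {
    static void merge(int[] a, int from, int mid, int to) {
        if (from == mid || mid == to || a[mid - 1] <= a[mid]) {
            return; // slices already in order: nothing to do
        }
        // Fallback: naive merge via a temporary copy (a copy keeps the
        // sketch short; the real sorter works in place).
        int[] tmp = Arrays.copyOfRange(a, from, to);
        int i = 0, j = mid - from, k = from;
        while (i < mid - from && j < to - from) {
            a[k++] = tmp[i] <= tmp[j] ? tmp[i++] : tmp[j++];
        }
        while (i < mid - from) a[k++] = tmp[i++];
        while (j < to - from) a[k++] = tmp[j++];
    }
}
```

On the small position arrays that span queries sort, this cheap comparison avoids the bookkeeping overhead that TimSorter pays before doing any useful work.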
[jira] [Created] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
Adrien Grand created LUCENE-5141: Summary: CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5141: - Attachment: LUCENE-5141.patch Patch removing Codec from the arguments of CheckIndex.fixIndex. CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Resolved] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4734. -- Resolution: Fixed FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 5.0, 4.5 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match.
[jira] [Resolved] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5141. -- Resolution: Fixed CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5141: - Fix Version/s: 4.5 5.0 CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 5.0, 4.5 Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5145: - Assignee: Adrien Grand Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Attachments: LUCENE-5145.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Created] (LUCENE-5148) SortedSetDocValues caching / state
Adrien Grand created LUCENE-5148: Summary: SortedSetDocValues caching / state Key: LUCENE-5148 URL: https://issues.apache.org/jira/browse/LUCENE-5148 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Priority: Minor I just spent some time digging into a bug which was due to the fact that SORTED_SET doc values are stateful (setDocument/nextOrd) and are cached per thread. So if you try to get two instances for the same field in the same thread, you will actually get the same instance and won't be able to iterate over the ords of two documents in parallel. This is not necessarily a bug, and the behavior can be documented, but I think it would be nice if the API could prevent such mistakes by storing the state in a separate object or cloning the SortedSetDocValues object in SegmentCoreReaders.getSortedSetDocValues. What do you think?
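The pitfall is easy to reproduce with any per-thread-cached stateful cursor. A toy model of a setDocument/nextOrd-style iterator (illustrative only, not the actual Lucene class) shows why handing the same cached instance to two consumers breaks parallel iteration:

```java
// Toy model of a stateful ords cursor in the style of
// SortedSetDocValues.setDocument/nextOrd (not the real Lucene class).
// Because the iteration state (upto) lives in the instance, two callers
// that received the same cached instance trample each other's cursor.
class StatefulOrds {
    static final long NO_MORE_ORDS = -1;
    private final long[][] ordsPerDoc;
    private long[] current;
    private int upto;

    StatefulOrds(long[][] ordsPerDoc) { this.ordsPerDoc = ordsPerDoc; }

    void setDocument(int doc) { current = ordsPerDoc[doc]; upto = 0; }

    long nextOrd() { return upto < current.length ? current[upto++] : NO_MORE_ORDS; }
}
```

If a second consumer calls setDocument on the shared instance, the first consumer's cursor is silently reset; that is exactly the bug described above, and it is what moving the state into a separate object (or cloning per caller) would prevent.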
[jira] [Commented] (LUCENE-5127) FixedGapTermsIndex should use monotonic compression
[ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722649#comment-13722649 ] Adrien Grand commented on LUCENE-5127: -- +1 FixedGapTermsIndex should use monotonic compression --- Key: LUCENE-5127 URL: https://issues.apache.org/jira/browse/LUCENE-5127 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch for the addresses in the big in-memory byte[] and disk blocks, we could save a good deal of RAM here. I think this codec just never got upgraded when we added these new packed improvements, but it might be interesting to try to use for the terms data of sorted/sortedset DV implementations. The patch works, but has nocommits and currently ignores the divisor. The annoying problem there is that we have the shared interface with get(int) for PackedInts.Mutable/Reader, but no equivalent base class for monotonic get(long)... Still, it's enough that we could benchmark/compare for now.
[jira] [Commented] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722770#comment-13722770 ] Adrien Grand commented on LUCENE-5145: -- Thanks Boaz, the patch looks very good! - I like the fact that the addition of the new bulk API helped make fillValues final! - OrdinalMap.subIndexes, SortedDocValuesWriter.pending and SortedSetDocValuesWriter.pending are 0-based so they could use the new {{AppendingPackedLongBuffer}} instead of {{AppendingDeltaPackedLongBuffer}}, can you update the patch? Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Attachments: LUCENE-5145.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Created] (LUCENE-5150) WAH8DocIdSet: dense sets compression
Adrien Grand created LUCENE-5150: Summary: WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial
[jira] [Updated] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5150: - Description: In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets. WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
[jira] [Updated] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5150: - Attachment: LUCENE-5150.patch Here is a patch. It reserves an additional bit in the header to say whether the encoding should be inverted (meaning clean words are actually 0xFF instead of 0x00). It should reduce the amount of memory required to build and store dense sets. In spite of this change, compression ratios remain the same for sparse sets. For random dense sets, I observed compression ratios of 87% when the load factor is 90% and 20% when the load factor is 99% (vs. 100% before). WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5150.patch In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
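The inversion trick can be illustrated independently of the actual WAH8 on-disk format: a word-aligned encoder run-length encodes "clean" 0x00 words, and for a dense set most 8-bit words are 0xFF, so XOR-ing every word with 0xFF first turns the dense case into the sparse case. A toy sketch of the header-bit decision (an assumed simplification, not the real codec):

```java
// Toy illustration of the "inverse" encoding: count the words that would
// be run-length encoded as clean (0x00) with and without inverting every
// word, and record the better choice in a single header bit.
class InverseCleanWords {
    static boolean shouldInvert(byte[] words) {
        int clean = 0, cleanInverted = 0;
        for (byte w : words) {
            if (w == 0) clean++;                 // clean as-is (sparse set)
            if (w == (byte) 0xFF) cleanInverted++; // clean after inversion (dense set)
        }
        return cleanInverted > clean;
    }
}
```

With this choice made per set, a 99%-full bitmap compresses roughly as well as a 1%-full one, which matches the ratios reported above.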
[jira] [Resolved] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5145. -- Resolution: Fixed Fix Version/s: 4.5 5.0 Committed. Thanks Boaz! Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Fix For: 5.0, 4.5 Attachments: LUCENE-5145.patch, LUCENE-5145.v2.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Commented] (LUCENE-5153) Allow wrapping Reader from AnalyzerWrapper
[ https://issues.apache.org/jira/browse/LUCENE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723992#comment-13723992 ] Adrien Grand commented on LUCENE-5153: -- I think this is the right thing to do. On the contrary, if wrapReader inserted char filters at the end of the char filter chain, the behavior of the wrapped analyzer would be altered (it would allow inserting something between the first CharFilter and the last TokenFilter of the wrapped analyzer). Allow wrapping Reader from AnalyzerWrapper -- Key: LUCENE-5153 URL: https://issues.apache.org/jira/browse/LUCENE-5153 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-5153.patch It can be useful to allow AnalyzerWrapper extensions to wrap the Reader given to initReader, e.g. with a CharFilter.
[jira] [Commented] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724003#comment-13724003 ] Adrien Grand commented on LUCENE-5140: -- If there is no objection, I will commit the patch as-is soon and have a look at the lucenebench reports in the next few days. Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Resolved] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5140. -- Resolution: Fixed Fix Version/s: 4.5 5.0 Committed. I will have a look at lucenebench in the next few days. Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0, 4.5 Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Created] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
Adrien Grand created LUCENE-4634: Summary: PackedInts: streaming API that supports variable numbers of bits per value Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like:
{code}
Iterator {
  long next(int bitsPerValue);
}

Writer {
  void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue;
}
{code}
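A minimal sketch of such a streaming writer/iterator pair (a toy in-memory version for illustration, not the patch's implementation) packs each value with its caller-specified width; the caller must read values back with the same widths it wrote them with:

```java
import java.util.ArrayList;
import java.util.List;

// Toy bit stream illustrating the proposed API: write(value, bitsPerValue)
// appends exactly bitsPerValue bits; next(bitsPerValue) reads them back.
// A real implementation would pack into long[] blocks instead of a
// List<Boolean>, but the contract is the same.
class VarBitStream {
    private final List<Boolean> bits = new ArrayList<>();
    private int pos = 0;

    void write(long value, int bitsPerValue) {
        // Mirrors the assert in the proposed API: the value must fit.
        assert 64 - Long.numberOfLeadingZeros(value) <= bitsPerValue;
        for (int i = bitsPerValue - 1; i >= 0; i--) {
            bits.add(((value >>> i) & 1) != 0);
        }
    }

    long next(int bitsPerValue) {
        long v = 0;
        for (int i = 0; i < bitsPerValue; i++) {
            v = (v << 1) | (bits.get(pos++) ? 1 : 0);
        }
        return v;
    }
}
```

Because the width is a per-call argument rather than a stream constant, consecutive values can use different numbers of bits, which is exactly what fixed-size packed readers cannot do.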
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535079#comment-13535079 ] Adrien Grand commented on LUCENE-4599: -- Hey Shawn, I'm still working actively on this issue. I made good progress regarding compression ratio, but term vectors are more complicated than stored fields (with lots of corner cases like negative start offsets, negative lengths, fields that don't always have the same options, etc.), so I will need time and lots of Jenkins builds to feel comfortable making this the default term vectors impl. It will depend on the 4.1 release schedule, but given that it's likely to come rather soon and that I will have very little time to work on this issue until next month, it will probably only make it to 4.2. Compressed term vectors --- Key: LUCENE-4599 URL: https://issues.apache.org/jira/browse/LUCENE-4599 Project: Lucene - Core Issue Type: Task Components: core/codecs, core/termvectors Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 Attachments: LUCENE-4599.patch We should have codec-compressed term vectors similarly to what we have with stored fields.
[jira] [Updated] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
[ https://issues.apache.org/jira/browse/LUCENE-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4634: - Attachment: LUCENE-4634.patch Here is a patch. (I would like to use it for LUCENE-4599.) PackedInts: streaming API that supports variable numbers of bits per value -- Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4634.patch It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like: {code} Iterator { long next(int bitsPerValue); } Writer { void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue; } {code}
[jira] [Commented] (SOLR-4215) Optimize facets when multi-valued field is really single valued
[ https://issues.apache.org/jira/browse/SOLR-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535804#comment-13535804 ] Adrien Grand commented on SOLR-4215: Hi Ryan. I think this test should be done on every segment rather than on the top-level composite reader, because if the index has several segments, {{terms(field)}} will return a MultiTerms instance whose {{size()}} method always returns -1. Optimize facets when multi-valued field is really single valued --- Key: SOLR-4215 URL: https://issues.apache.org/jira/browse/SOLR-4215 Project: Solr Issue Type: Improvement Components: SearchComponents - other Reporter: Ryan McKinley Priority: Minor Fix For: 4.1 Attachments: SOLR-4215-check-single-valued.patch In lucene 4+, the Terms interface can quickly tell us if the index is actually single-valued. We should use that for better facet performance with multi-valued fields (when they are actually single valued)
[jira] [Updated] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4599: - Attachment: LUCENE-4599.patch New patch (still not committable yet) with better compression ratio thanks to the following optimizations: * the block of data compressed by LZ4 only contains term and payload bytes (without their lengths), everything else (positions, flags, term lengths, etc.) is stored using packed ints, * term freqs are encoded in a pfor-like way to save space (this was a 3x/4x decrease of the space needed to store freqs), * when all fields have the same flags (a 3-bits int that says whether positions/offsets/payloads are enabled), the flag is stored only once per distinct field, * when both positions and offsets are enabled, I compute average term lengths and only store the difference between the start offset and the expected start offset computed from the average term length and the position, * for lengths, this impl stores the difference between the indexed term length and the actual length (endOffset - startOffset), with an optimization when they are always equal to 0 (can happen with ASCII and an analyzer that does not perform stemming). Depending on the size of docs, not the same data takes most space in a single chunk: || || Small docs (28 * 1K) || Large doc (1 * 750K) || | Total chunk size (positions and offsets enabled) | 21K | 450K | | Term bytes | 11K (16K before compression) | 64K (84K before compression) | | Term lengths | 2K | 8K | | Positions | 3K | 215K | | Offsets | 3K (4K if positions are disabled) | 150K (240K if positions are disabled) | | Term freqs | 500 | 7K | the rest is negligible * So with small docs, most of space is occupied by term bytes whereas with large docs positions and offsets can easily take 80% of the chunk size. * Compression might not be as good as with stored fields, especially when docs are large because terms have already been deduplicated. 
Overall, the on-disk format is more compact than the Lucene40 term vectors format (positions and offsets enabled; note that the number of documents indexed is not the same for small and large docs):
|| || Small docs || Large docs ||
| Lucene40 tvx | 160033 | 1633 |
| Lucene40 tvd | 49971 | 232 |
| Lucene40 tvf | 11279483 | 56640734 |
| Compressing tvx | 1116 | 78 |
| Compressing tvd | 7589550 | 44633841 |
This impl is 34% smaller than the Lucene40 one on small docs (mainly thanks to compression) and 21% smaller on large docs (mainly thanks to packed ints). If you have other ideas to improve this ratio, let me know! I still have to write more tests, clean up the patch, make reading term vectors more memory-efficient, and implement efficient merging... Compressed term vectors --- Key: LUCENE-4599 URL: https://issues.apache.org/jira/browse/LUCENE-4599 Project: Lucene - Core Issue Type: Task Components: core/codecs, core/termvectors Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 Attachments: LUCENE-4599.patch, LUCENE-4599.patch We should have codec-compressed term vectors similarly to what we have with stored fields. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536928#comment-13536928 ] Adrien Grand commented on LUCENE-4609: -- bq. Attached a PackedEncoder, which is based on PackedInts. Nice! You could probably improve the memory efficiency and speed of the decoder by using a ReaderIterator instead of a Reader:
* getReader: consumes the packed array stream and returns an in-memory packed array,
* getDirectReader: does not consume the whole stream and returns an impl that uses IndexInput.seek to look up values,
* getReaderIterator: returns a sequential iterator which bulk-decodes values (the mem parameter allows you to control the speed/memory-efficiency trade-off), so it will be much faster than iterating over the values of getReader.
For improved speed, getReaderIterator has the {{next(int count)}} method, which returns several values in a single call; this proved to be faster. Another option could be to directly use PackedInts.Encoder/Decoder similarly to Lucene41PostingsFormat (packed writers and reader iterators also use them under the hood). bq. This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings If the packed stream is very large, another option is to split it into blocks that all have the same number of values (but different numbers of bits per value). This should prevent the whole stream from growing because of rare extreme values. This is what the stored fields index (with blocks of 1024 values) and Lucene41PostingsFormat (with blocks of 128 values) do. Storing the min value at the beginning of the block and then only encoding deltas could help too. bq. The header is very large ... really you should only need 1) bpv, and 2) bytes.length (which I think you already have, via both payloads and DocValues). 
If the PackedInts API isn't flexible enough for you to feed it bpv and bytes.length then let's fix that! Most PackedInts methods have a *NoHeader variant that does the exact same job without relying on a header at the beginning of the stream (LUCENE-4161); I think this is what you are looking for. We should probably make this header stuff opt-in rather than opt-out (by replacing getWriter/Reader/ReaderIterator with the NoHeader methods and adding a method dedicated to reading/writing a header). Write a PackedIntsEncoder/Decoder for facets Key: LUCENE-4609 URL: https://issues.apache.org/jira/browse/LUCENE-4609 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Shai Erera Priority: Minor Attachments: LUCENE-4609.patch Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders. It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
[ https://issues.apache.org/jira/browse/LUCENE-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4634. -- Resolution: Fixed PackedInts: streaming API that supports variable numbers of bits per value -- Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4634.patch It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like:
{code}
Iterator {
  long next(int bitsPerValue);
}

Writer {
  void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue;
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
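The API sketched in that issue can be prototyped with plain bit twiddling. The following is a minimal, self-contained sketch under the same contract (`bitsRequired(value) <= bitsPerValue`); the class and field names are made up and this is not Lucene's implementation, which writes to IndexOutput rather than an in-memory array.

```java
import java.util.Arrays;

// Minimal sketch of a streaming packed-ints writer/iterator where each
// value may use a different, caller-supplied number of bits. Values are
// written most-significant bit first into a growable long[].
public class VariableBitStream {

    private long[] blocks = new long[1];
    private long writePos = 0; // next bit to write
    private long readPos = 0;  // next bit to read

    public void write(long value, int bitsPerValue) {
        // same contract as the proposed API: bitsRequired(value) <= bitsPerValue
        assert 64 - Long.numberOfLeadingZeros(value) <= bitsPerValue;
        for (int i = bitsPerValue - 1; i >= 0; i--) {
            int block = (int) (writePos >>> 6), bit = 63 - (int) (writePos & 63);
            if (block >= blocks.length) blocks = Arrays.copyOf(blocks, blocks.length * 2);
            blocks[block] |= ((value >>> i) & 1L) << bit;
            writePos++;
        }
    }

    public long next(int bitsPerValue) {
        long value = 0;
        for (int i = 0; i < bitsPerValue; i++) {
            int block = (int) (readPos >>> 6), bit = 63 - (int) (readPos & 63);
            value = (value << 1) | ((blocks[block] >>> bit) & 1L);
            readPos++;
        }
        return value;
    }

    public static void main(String[] args) {
        VariableBitStream s = new VariableBitStream();
        s.write(5, 3);   // 101
        s.write(1, 1);   // 1
        s.write(300, 9); // 100101100
        assert s.next(3) == 5;
        assert s.next(1) == 1;
        assert s.next(9) == 300;
    }
}
```

As the issue notes, decoding one bit at a time like this is much slower than fixed-size bulk decoding; the trade is flexibility for speed.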
[jira] [Created] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
Adrien Grand created LUCENE-4643: Summary: PackedInts: convenience classes to write blocks of packed ints Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
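The second bullet (better compression with rare extreme values) can be checked numerically. This sketch is illustrative, not the patch's code: it only compares bit budgets, with made-up class names, a hypothetical block size of 128, and no actual serialization.

```java
// Why fixed, independently compressed blocks help with rare extreme
// values: each block only pays for its own maximum, instead of the whole
// stream paying for the single largest value.
public class BlockedBpv {

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // total bits if the whole stream uses one bits-per-value
    public static long singleBlockCost(long[] values) {
        int bpv = 1;
        for (long v : values) bpv = Math.max(bpv, bitsRequired(v));
        return (long) bpv * values.length;
    }

    // total bits if split into fixed blocks, each with its own bits-per-value
    public static long blockedCost(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(start + blockSize, values.length);
            int bpv = 1;
            for (int i = start; i < end; i++) bpv = Math.max(bpv, bitsRequired(values[i]));
            total += (long) bpv * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[1024];
        java.util.Arrays.fill(values, 3); // small values: 2 bits each
        values[1000] = 1L << 40;          // one rare extreme value: 41 bits
        // single block: every value pays 41 bits; blocked: only one block does
        assert blockedCost(values, 128) < singleBlockCost(values);
    }
}
```

With the numbers above, a single block costs 41 bits for all 1024 values, while blocks of 128 pay 41 bits only in the one block containing the outlier. This also illustrates the first bullet: per-block bpv only requires buffering `blockSize` values, not the whole stream.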
[jira] [Updated] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4643: - Attachment: LUCENE-4643.patch Patch. This should be useful for LUCENE-4609 and LUCENE-4599, what do you think? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538143#comment-13538143 ] Adrien Grand commented on LUCENE-4609: -- Gilad, I created LUCENE-4643 which I assume should be better than PackedInts.Writer and PackedInts.ReaderIterator for your use-case? It doesn't write heavyweight headers (meaning that you need to know the PackedInts version and the size of the stream otherwise) and encodes data in fixed-size blocks. Write a PackedIntsEncoder/Decoder for facets Key: LUCENE-4609 URL: https://issues.apache.org/jira/browse/LUCENE-4609 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Shai Erera Priority: Minor Attachments: LUCENE-4609.patch Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders. It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4643: - Attachment: LUCENE-4643.patch Good point. I removed zig-zag encoding and modified the javadocs to say these classes only support positive values. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4656) Fix EmptyTokenizer
Adrien Grand created LUCENE-4656: Summary: Fix EmptyTokenizer Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4656: - Attachment: LUCENE-4656.patch Patch. I wasn't sure whether to add a CharTermAttribute to EmptyTokenizer or to try fixing BaseTokenStreamTestCase but I couldn't think of a non-trivial tokenizer that wouldn't have a CharTermAttribute so I left the assertion that checks that a token stream always has a CharTermAttribute. Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542899#comment-13542899 ] Adrien Grand commented on LUCENE-4656: -- bq. Why do we have that? It feels strange to me that a non-trivial TokenStream could have no CharTermAttribute? Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4656: - Attachment: LUCENE-4656.patch Alternative patch that fixes BaseTokenStreamTestCase. I needed to add a quick hack to add a TermToBytesRefAttribute when the tokenstream doesn't have one so that TermsHashPerField doesn't complain that it can't find this attribute when indexing. Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542995#comment-13542995 ] Adrien Grand commented on LUCENE-4656: -- bq. So we should fix IndexWriter to handle that case? How would IndexWriter handle token streams with no TermToBytesRefAttribute?
- fail if the token stream happens to have tokens? (incrementToken returns true at least once)
- index empty terms?
Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix IndexWriter working together with EmptyTokenizer and EmptyTokenStream (without CharTermAttribute), fix BaseTokenStreamTestCase
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543133#comment-13543133 ] Adrien Grand commented on LUCENE-4656: -- Uwe, I just ran all Lucene tests with your patch and they passed, so +1. +1 to removing EmptyTokenizer too. Fix IndexWriter working together with EmptyTokenizer and EmptyTokenStream (without CharTermAttribute), fix BaseTokenStreamTestCase -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 4.0 Reporter: Adrien Grand Assignee: Uwe Schindler Priority: Trivial Fix For: 4.1, 5.0 Attachments: LUCENE-4656_bttc.patch, LUCENE-4656-IW-bug.patch, LUCENE-4656-IW-fix.patch, LUCENE-4656-IW-fix.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545995#comment-13545995 ] Adrien Grand commented on LUCENE-4643: -- I made some tests with my compressed TermVectorsFormat and the problem is that it sometimes wastes space. For example, if all values in a block are between -1 and 6, the first patch would require 3 bits per value, whereas the 2nd one plus zig-zag encoding a level above would require 4, so I think I should rather commit the first patch? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
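The 3-vs-4-bit comparison in that comment can be verified directly. This sketch only illustrates the arithmetic; the class name is made up and the two strategies are simplified to the bit counts involved.

```java
// Worked check of the [-1, 6] example: storing deltas from a (zig-zag
// encoded) min value needs 3 bits per value, while zig-zag encoding every
// value before packing needs 4.
public class ZigZagVsMin {

    // standard zig-zag: maps ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ...
    public static long zigZag(long v) { return (v << 1) ^ (v >> 63); }

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    public static void main(String[] args) {
        long min = -1, max = 6;
        // first patch: zig-zag the min value only, store deltas from it
        int deltaBits = bitsRequired(max - min);    // deltas in [0, 7] -> 3 bits
        // second patch: zig-zag each value before feeding the writer
        int zigZagBits = bitsRequired(zigZag(max)); // 6 maps to 12 -> 4 bits
        assert deltaBits == 3;
        assert zigZagBits == 4;
    }
}
```

Zig-zagging each value interleaves negatives among the positives, so the largest encoded value (here 12) can exceed the spread of the original range (here 7), which is exactly the wasted bit described above.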
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546017#comment-13546017 ] Adrien Grand commented on LUCENE-4643: -- bq. actually i'm confused why we need it at all, since we are writing only positive numbers (deltas from minValue, which itself is the only one that need be negative). Oh! I think we misunderstood. The first patch uses zig-zag encoding for minValue only and the 2nd patch requires people to zig-zag encode before feeding the writer. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546038#comment-13546038 ] Adrien Grand commented on LUCENE-4643: -- All bits are currently used (one to say whether the minValue is 0 or not, and 7 for the number of bitsPerValue (0 <= bpv <= 64; 0 means all values are equal, similarly to the block PF)). But maybe we could:
1. add a constructor argument to say that all values are positive, and it won't zig-zag encode,
2. or disable either the 0 or the 64 bits-per-value case and add a sign bit?
I think the first option is better? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546051#comment-13546051 ] Adrien Grand commented on LUCENE-4643: -- bq. just because of the silliness in termvectors Actually, the ability to block-encode negative values can be useful for other use-cases, for example to encode the difference from an expected value (e.g. you can compute an expected offset from the position and the average number of chars per term). Another thing to know is that if all values are positive, minValue is likely to be 0. For example, let's say the actual min is 200 and the max is 2000. Given that encoding the [0-2000] range requires as many bits per value as encoding the [200-2000] range, I set minValue=0. This will require only one bit in the token instead of two bytes (a VInt, since 200 >= 2^7) for the minimum. So in the end, even if one bit is wasted for the minimum value because of zig-zag encoding, this is not too bad. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
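The minValue=0 reasoning in that comment checks out numerically. This sketch is illustrative only (made-up class name); the VInt byte count assumes the usual 7-payload-bits-per-byte variable-length encoding.

```java
// Check of the min=200, max=2000 example: deltas from 200 need the same
// 11 bits per value as the raw values, so declaring minValue=0 costs
// nothing per value and replaces a 2-byte VInt with a 1-bit flag.
public class MinValueZero {

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // bytes a non-negative value needs as a VInt (7 payload bits per byte)
    public static int vIntBytes(long v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) bytes++;
        return bytes;
    }

    public static void main(String[] args) {
        // encoding [0-2000] and [200-2000] both need 11 bits per value
        assert bitsRequired(2000) == 11;
        assert bitsRequired(2000 - 200) == 11;
        // storing minValue=200 as a VInt would cost 2 bytes, since 200 >= 2^7
        assert vIntBytes(200) == 2;
    }
}
```

So whenever the subtraction does not shrink the per-value bit count, writing minValue=0 is strictly cheaper: one flag bit in the token instead of a multi-byte VInt.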
[jira] [Created] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
Adrien Grand created LUCENE-4664: Summary: oal.codec.compressing: Make Compressor and Decompressor public Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
[ https://issues.apache.org/jira/browse/LUCENE-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4664: - Attachment: LUCENE-4664.patch Patch. I moved DummyCompressingCodec to oal.codecs.compressing.dummy to make sure all classes are visible enough. oal.codec.compressing: Make Compressor and Decompressor public -- Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4664.patch Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
[ https://issues.apache.org/jira/browse/LUCENE-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4664. -- Resolution: Fixed oal.codec.compressing: Make Compressor and Decompressor public -- Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4664.patch Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4643. -- Resolution: Fixed PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
Adrien Grand created LUCENE-4666: Summary: Simplify CompressingStoredFieldsFormat merging Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
[ https://issues.apache.org/jira/browse/LUCENE-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4666: - Attachment: LUCENE-4666.patch Patch. Simplify CompressingStoredFieldsFormat merging -- Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4666.patch Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead.
[jira] [Created] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
Adrien Grand created LUCENE-4667: Summary: Change TestRandomChains to replace the list of broken classes by a list of broken constructors Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Resolved] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
[ https://issues.apache.org/jira/browse/LUCENE-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4666. -- Resolution: Fixed Simplify CompressingStoredFieldsFormat merging -- Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4666.patch Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead.
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch Patch. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Commented] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548505#comment-13548505 ] Adrien Grand commented on LUCENE-4667: -- The test failed when I used an IdentityHashMap. Did I miss something, or can't constructors be compared using ==? Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
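The failure described above is reproducible with plain JDK reflection: {{Class.getConstructor}} returns a fresh {{Constructor}} copy on every call, so two lookups are never identical even though they compare equal — which is exactly why an IdentityHashMap keyed on constructors misses its entries. A minimal demonstration:

```java
import java.lang.reflect.Constructor;

// Shows that reflection hands out a fresh Constructor copy per lookup:
// == fails between two lookups of the same constructor, equals() succeeds.
public class ConstructorIdentity {

    public static boolean sameIdentity() throws Exception {
        Constructor<String> c1 = String.class.getConstructor();
        Constructor<String> c2 = String.class.getConstructor();
        return c1 == c2; // false: each getConstructor() call copies the root Constructor
    }

    public static boolean sameEquals() throws Exception {
        Constructor<String> c1 = String.class.getConstructor();
        Constructor<String> c2 = String.class.getConstructor();
        return c1.equals(c2); // true: equals() compares declaring class + signature
    }

    public static void main(String[] args) throws Exception {
        System.out.println("== : " + sameIdentity());
        System.out.println("equals: " + sameEquals());
    }
}
```

So a regular HashMap (equals-based) works for constructor keys where an IdentityHashMap does not.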
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch New patch that adds exceptions to TrimFilter and TypeTokenFilter as well and uses a constructor map for all components, following Uwe's advice. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Assigned] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4667: Assignee: Adrien Grand Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch bq. Maybe that's the case! Sorry. I was expecting that constructors are singletons like classes. No problem, I had the same expectation and was a little disappointed to see that it didn't work! bq. I think maybe the whole Predicate approach is too detailed? I think it's worth excluding with a predicate: for example this allows testing random chains with LimitTokenCountFilter(consumeAllTokens=true) (when consumeAllTokens=false, this filter is broken). bq. I would exclude all broken constructors with the ALWAYS predicate in beforeClass() Sounds good, I updated the patch. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
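The predicate approach discussed above can be sketched as a map from each broken Constructor to a predicate over its arguments, so a constructor is only excluded for the argument combinations that actually break it. This is a hypothetical sketch, not the TestRandomChains code; {{StringBuilder(int)}} stands in for a real analysis-component constructor, and the names {{ArgsPredicate}}, {{ALWAYS}}, and {{isBroken}} are made up for illustration:

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: exclude broken constructors per argument combination.
// A plain HashMap works as the key container because Constructor.equals()
// compares declaring class and signature (identity comparison would not work).
public class BrokenConstructors {

    interface ArgsPredicate { boolean test(Object[] args); }

    // Predicate for constructors that are broken for every argument combination.
    static final ArgsPredicate ALWAYS = args -> true;

    static final Map<Constructor<?>, ArgsPredicate> BROKEN = new HashMap<>();
    static {
        try {
            // Assumed example: StringBuilder(int) standing in for a token
            // filter constructor that is only broken for negative arguments.
            BROKEN.put(StringBuilder.class.getConstructor(int.class),
                       args -> ((Integer) args[0]) < 0);
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    static boolean isBroken(Constructor<?> ctor, Object... args) {
        ArgsPredicate p = BROKEN.get(ctor); // lookup succeeds via equals(), not ==
        return p != null && p.test(args);
    }

    public static void main(String[] args) throws Exception {
        Constructor<?> ctor = StringBuilder.class.getConstructor(int.class);
        System.out.println(isBroken(ctor, -1)); // excluded: breaks with this argument
        System.out.println(isBroken(ctor, 16)); // allowed: fine with this argument
    }
}
```

This is what makes the predicate worth the extra machinery: the same constructor can stay in the random pool for the argument values that work.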
[jira] [Commented] (LUCENE-4669) Document wrongly deleted from index
[ https://issues.apache.org/jira/browse/LUCENE-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548622#comment-13548622 ] Adrien Grand commented on LUCENE-4669: -- Hi Miguel, c has not been deleted, the problem is that you used IndexReader.numDocs instead of IndexReader.maxDoc. Given that you deleted a document, IndexReader.numDocs decreased from 3 to 2 but c still has docId==2 so your print(File) method doesn't display it. Document wrongly deleted from index --- Key: LUCENE-4669 URL: https://issues.apache.org/jira/browse/LUCENE-4669 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 4.0 Environment: OS = Mac OS X 10.7.5 Java = JVM 1.6 Reporter: Miguel Ferreira I'm trying to implement document deletion from an index. If I create an index with three documents (A, B and C) and then try to delete A, A gets marked as deleted but C is removed from the index. I've tried this with different numbers of documents and saw that it is always the last document that is removed. 
When I run the example unit test code below I get this output:
{code}
Before delete
Found 3 documents
Document at = 0; isDeleted = false; path = a;
Document at = 1; isDeleted = false; path = b;
Document at = 2; isDeleted = false; path = c;
After delete
Found 2 documents
Document at = 0; isDeleted = true; path = a;
Document at = 1; isDeleted = false; path = b;
{code}
Example unit test:
{code:title=ExampleUnitTest.java}
@Test
public void delete() throws Exception {
    File indexDir = FileUtils.createTempDir();
    IndexWriter writer = new IndexWriter(new NIOFSDirectory(indexDir),
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    String fieldName = "path";
    doc.add(new StringField(fieldName, "a", Store.YES));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new StringField(fieldName, "b", Store.YES));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new StringField(fieldName, "c", Store.YES));
    writer.addDocument(doc);
    writer.commit();
    System.out.println("Before delete");
    print(indexDir);
    writer.deleteDocuments(new Term(fieldName, "a"));
    writer.commit();
    System.out.println("After delete");
    print(indexDir);
}

public static void print(File indexDirectory) throws IOException {
    DirectoryReader reader = DirectoryReader.open(new NIOFSDirectory(indexDirectory));
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    int numDocs = reader.numDocs();
    System.out.println("Found " + numDocs + " documents");
    for (int i = 0; i < numDocs; i++) {
        Document document = reader.document(i);
        StringBuffer sb = new StringBuffer();
        sb.append("Document at = ").append(i);
        sb.append("; isDeleted = ").append(liveDocs != null ? !liveDocs.get(i) : false).append("; ");
        for (IndexableField field : document.getFields()) {
            String fieldName = field.name();
            for (String value : document.getValues(fieldName)) {
                sb.append(fieldName).append(" = ").append(value).append("; ");
            }
        }
        System.out.println(sb.toString());
    }
}
{code}
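The numDocs/maxDoc distinction behind this report can be modeled in plain Java (a toy model, not the Lucene API; the class and method names are made up for illustration): maxDoc counts every doc id ever assigned, numDocs only the live ones, so looping up to numDocs after a delete silently hides the highest doc ids.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the Lucene API) of live docs after a deletion:
// doc ids are dense [0, maxDoc); deleted docs keep their id but go dead.
public class LiveDocsModel {
    final boolean[] liveDocs; // liveDocs[i] == true -> doc i is live

    LiveDocsModel(boolean... liveDocs) { this.liveDocs = liveDocs; }

    int maxDoc() { return liveDocs.length; }       // ids ever assigned

    int numDocs() {                                // live docs only
        int n = 0;
        for (boolean live : liveDocs) if (live) n++;
        return n;
    }

    // Correct enumeration: scan the full id space and skip deleted docs.
    List<Integer> liveDocIds() {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) if (liveDocs[i]) ids.add(i);
        return ids;
    }

    public static void main(String[] args) {
        // Docs a(0), b(1), c(2); a was deleted.
        LiveDocsModel index = new LiveDocsModel(false, true, true);
        System.out.println("numDocs = " + index.numDocs());     // 2
        System.out.println("live ids = " + index.liveDocIds()); // [1, 2]
        // A loop "for (i = 0; i < numDocs(); i++)" visits ids 0 and 1 only,
        // hiding doc 2 ("c") exactly as in the report above.
    }
}
```

The fix in the reported test is therefore to iterate up to maxDoc and consult the live-docs bits, rather than iterating up to numDocs.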
[jira] [Resolved] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4667. -- Resolution: Fixed Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Created] (LUCENE-4670) Add TermVectorsWriter.finish{Doc,Field,Term} to make development of new formats easier
Adrien Grand created LUCENE-4670: Summary: Add TermVectorsWriter.finish{Doc,Field,Term} to make development of new formats easier Key: LUCENE-4670 URL: https://issues.apache.org/jira/browse/LUCENE-4670 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 This is especially useful to LUCENE-4599 where actions have to be taken after a doc/field/term has been added.
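The shape of such finish hooks can be sketched in plain Java. This is a hypothetical sketch, not the actual TermVectorsWriter API; all class and method names below are invented for illustration. The idea is that the indexing chain calls start* before and finish* after each unit, so a format can act once a doc/field/term is complete without buffering everything itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of start*/finish* callbacks in a push-style writer.
public class FinishHooksSketch {

    static abstract class Writer {
        abstract void startDocument();
        void finishDocument() {} // optional hooks default to no-op
        abstract void startField(String name);
        void finishField() {}
        abstract void addTerm(String term);
        void finishTerm() {}
    }

    // A toy format that records the order in which each hook fires.
    static class RecordingWriter extends Writer {
        final List<String> events = new ArrayList<>();
        void startDocument() { events.add("startDoc"); }
        void finishDocument() { events.add("finishDoc"); }
        void startField(String name) { events.add("startField:" + name); }
        void finishField() { events.add("finishField"); }
        void addTerm(String term) { events.add("term:" + term); }
        void finishTerm() { events.add("finishTerm"); }
    }

    // The "indexing chain": drives the writer and fires finish* callbacks.
    static void writeDoc(Writer w, String field, String... terms) {
        w.startDocument();
        w.startField(field);
        for (String t : terms) {
            w.addTerm(t);
            w.finishTerm();
        }
        w.finishField();
        w.finishDocument();
    }

    public static void main(String[] args) {
        RecordingWriter w = new RecordingWriter();
        writeDoc(w, "body", "hello", "world");
        System.out.println(w.events);
    }
}
```

Because the hooks default to no-ops, a format only overrides the ones where it needs to flush per-unit state.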