[jira] [Commented] (SOLR-2976) stats.facet no longer works on single valued trie fields that don't use precision step
[ https://issues.apache.org/jira/browse/SOLR-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585163#comment-13585163 ]

Adrien Grand commented on SOLR-2976:

bq. if precisionStep != 0, faceting on a single-valued numeric field builds an UninvertedField

I think the last commits on SOLR-3855 fix it (they even make faceting use the numeric field caches instead of the terms index).

stats.facet no longer works on single valued trie fields that don't use precision step

Key: SOLR-2976
URL: https://issues.apache.org/jira/browse/SOLR-2976
Project: Solr
Issue Type: Bug
Affects Versions: 3.5
Reporter: Hoss Man
Attachments: SOLR-2976_3.4_test.patch, SOLR-2976.patch

As reported on the mailing list, 3.5 introduced a regression that prevents single-valued Trie fields that don't use precision steps (to add coarse-grained terms) from being used in stats.facet.

Two immediately obvious problems:

1) in 3.5 the stats component checks whether isTokenized() is true for the field type (which is probably wise), but regardless of the precisionStep used, TrieField.isTokenized is hardcoded to return true
2) the 3.5 stats faceting will fail if the FieldType is multivalued - it doesn't check if the SchemaField is configured to be single valued (overriding the FieldType)

So even if a user has something like this in their schema...

{code}
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" />
<field name="ts" type="long" indexed="true" stored="true" required="true" multiValued="false" />
{code}

...stats.facet will not work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
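The two problems in SOLR-2976 can be pictured as a check that consults the wrong object. The following is a minimal sketch, not Solr code: `FieldTypeSketch`, `SchemaFieldSketch` and `usableForStatsFacet` are hypothetical names, and the rule it encodes (single-valued at the field level, no extra trie terms) is an assumption drawn from the issue description.

```java
/** Hypothetical stand-in for a field type; not Solr's FieldType. */
final class FieldTypeSketch {
    final boolean multiValuedByDefault;
    final int precisionStep; // 0 is taken to mean "no extra coarse-grained terms"

    FieldTypeSketch(boolean multiValuedByDefault, int precisionStep) {
        this.multiValuedByDefault = multiValuedByDefault;
        this.precisionStep = precisionStep;
    }

    /** Depends on precisionStep instead of being hardcoded to true. */
    boolean indexesExtraTerms() {
        return precisionStep != 0;
    }
}

/** Hypothetical stand-in for a schema field that may override its type. */
final class SchemaFieldSketch {
    final FieldTypeSketch type;
    final Boolean multiValuedOverride; // null when the field does not override the type

    SchemaFieldSketch(FieldTypeSketch type, Boolean multiValuedOverride) {
        this.type = type;
        this.multiValuedOverride = multiValuedOverride;
    }

    /** The effective setting: the field-level override wins over the type default. */
    boolean multiValued() {
        return multiValuedOverride != null ? multiValuedOverride : type.multiValuedByDefault;
    }
}

final class StatsFacetCheck {
    /** Asks the field, not the type, whether it is multi-valued. */
    static boolean usableForStatsFacet(SchemaFieldSketch field) {
        return !field.multiValued() && !field.type.indexesExtraTerms();
    }
}
```

With a multi-valued type and a field declared multiValued=false, this check accepts the field, which is the behavior the issue asks for.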
[jira] [Commented] (LUCENE-4792) Smaller doc maps
[ https://issues.apache.org/jira/browse/LUCENE-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585490#comment-13585490 ]

Adrien Grand commented on LUCENE-4792:

In case someone would like to use this class, I'd add that:
- the encoded sequence does not strictly need to be monotonic: it can encode any sequence of values but it compresses best when the stream contains monotonic sub-sequences of at least 1024 longs (for example it would have a good compression ratio if there are first 1 increasing values and then 5000 decreasing values),
- it can address up to 2^42 values,
- there are writer/reader equivalents called MonotonicBlockPackedWriter and MonotonicBlockPackedReader (which can either load values in memory or read from disk).

Smaller doc maps

Key: LUCENE-4792
URL: https://issues.apache.org/jira/browse/LUCENE-4792
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Fix For: 4.2
Attachments: LUCENE-4792.patch

MergeState.DocMap could leverage MonotonicAppendingLongBuffer to save memory.
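To illustrate why monotonic runs compress well, here is a rough sketch of one block that stores each value as its deviation from a straight line. It is not the Lucene class: `MonotonicBlockSketch` and its methods are hypothetical, the deltas are kept in a plain long[] rather than bit-packed, and overflow on extreme values is ignored.

```java
/** Illustrative only: one block of values stored as deviations from a line. */
final class MonotonicBlockSketch {
    final long base;     // first value of the block
    final double slope;  // average increase per position
    final long[] deltas; // value minus the point on the line base + slope * i

    MonotonicBlockSketch(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty block");
        }
        base = values[0];
        slope = values.length == 1
            ? 0d
            : (double) (values[values.length - 1] - base) / (values.length - 1);
        deltas = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - expected(i);
        }
    }

    private long expected(int i) {
        return base + (long) (slope * i);
    }

    /** Rebuilds the original value from the line and the stored deviation. */
    long get(int i) {
        return expected(i) + deltas[i];
    }

    /** Bits a packed encoding would need per delta (0 when all deltas are equal). */
    int bitsPerDelta() {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long d : deltas) {
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }
}
```

A perfectly linear block needs no bits per delta, while an arbitrary sequence still round-trips but needs more bits, which matches the remark that the sequence does not strictly need to be monotonic.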
[jira] [Commented] (LUCENE-4795) Add FacetsCollector based on SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585878#comment-13585878 ]

Adrien Grand commented on LUCENE-4795:

Not having to manage a taxonomy index is very appealing to me!

What about collecting based on segment ords and bulk translating these ords to the global ords in setNextReader and when the collection ends? This way ordinalMap.get would be called less often (once per value per segment instead of once per value per doc per segment) and in a sequential way so I assume it would be faster while remaining easy to implement?

Add FacetsCollector based on SortedSetDocValues

Key: LUCENE-4795
URL: https://issues.apache.org/jira/browse/LUCENE-4795
Project: Lucene - Core
Issue Type: Improvement
Components: modules/facet
Reporter: Michael McCandless
Attachments: LUCENE-4795.patch, pleaseBenchmarkMe.patch

Recently (LUCENE-4765) we added multi-valued DocValues field (SortedSetDocValuesField), and this can be used for faceting in Solr (SOLR-4490). I think we should also add support in the facet module?

It'd be an option with different tradeoffs. Eg, it wouldn't require the taxonomy index, since the main index handles label/ord resolving.

There are at least two possible approaches:

* On every reopen, build the seg -> global ord map, and then on every collect, get the seg ord, map it to the global ord space, and increment counts. This adds cost during reopen in proportion to number of unique terms ...
* On every collect, increment counts based on the seg ords, and then do a merge in the end just like distributed faceting does.

The first approach is much easier so I built a quick prototype using that. The prototype does the counting, but it does NOT do the top K facets gathering in the end, and it doesn't know parent/child ord relationships, so there's tons more to do before this is real. I also was unsure how to properly integrate it since the existing classes seem to expect that you use a taxonomy index to resolve ords.

I ran a quick performance test. base = trunk except I disabled the compute top-K in FacetsAccumulator to make the comparison fair; comp = using the prototype collector in the patch:

{noformat}
Task              QPS base   StdDev  QPS comp   StdDev  Pct diff
OrHighLow            18.79   (2.5%)     14.36   (3.3%)  -23.6% ( -28% - -18%)
HighTerm             21.58   (2.4%)     16.53   (3.7%)  -23.4% ( -28% - -17%)
OrHighMed            18.20   (2.5%)     13.99   (3.3%)  -23.2% ( -28% - -17%)
Prefix3              14.37   (1.5%)     11.62   (3.5%)  -19.1% ( -23% - -14%)
LowTerm             130.80   (1.6%)    106.95   (2.4%)  -18.2% ( -21% - -14%)
OrHighHigh            9.60   (2.6%)      7.88   (3.5%)  -17.9% ( -23% - -12%)
AndHighHigh          24.61   (0.7%)     20.74   (1.9%)  -15.7% ( -18% - -13%)
Fuzzy1               49.40   (2.5%)     43.48   (1.9%)  -12.0% ( -15% -  -7%)
MedSloppyPhrase      27.06   (1.6%)     23.95   (2.3%)  -11.5% ( -15% -  -7%)
MedTerm              51.43   (2.0%)     46.21   (2.7%)  -10.2% ( -14% -  -5%)
IntNRQ                4.02   (1.6%)      3.63   (4.0%)   -9.7% ( -15% -  -4%)
Wildcard             29.14   (1.5%)     26.46   (2.5%)   -9.2% ( -13% -  -5%)
HighSloppyPhrase      0.92   (4.5%)      0.87   (5.8%)   -5.4% ( -15% -   5%)
MedSpanNear          29.51   (2.5%)     27.94   (2.2%)   -5.3% (  -9% -   0%)
HighSpanNear          3.55   (2.4%)      3.38   (2.0%)   -4.9% (  -9% -   0%)
AndHighMed          108.34   (0.9%)    104.55   (1.1%)   -3.5% (  -5% -  -1%)
LowSloppyPhrase      20.50   (2.0%)     20.09   (4.2%)   -2.0% (  -8% -   4%)
LowPhrase            21.60   (6.0%)     21.26   (5.1%)   -1.6% ( -11% -  10%)
Fuzzy2               53.16   (3.9%)     52.40   (2.7%)   -1.4% (  -7% -   5%)
LowSpanNear           8.42   (3.2%)      8.45   (3.0%)    0.3% (  -5% -   6%)
Respell              45.17   (4.3%)     45.38   (4.4%)    0.5% (  -7% -   9%)
MedPhrase           113.93   (5.8%)    115.02   (4.9%)    1.0% (  -9% -  12%)
AndHighLow          596.42   (2.5%)    617.12   (2.8%)    3.5% (  -1% -   8%)
HighPhrase           17.30  (10.5%)     18.36   (9.1%)    6.2% ( -12% -  28%)
{noformat}
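The suggestion in the comment on LUCENE-4795 (count in segment ord space, then translate in bulk) could look roughly like the sketch below. It uses plain arrays only; `SegmentOrdCounting`, `countSegment` and `addToGlobal` are hypothetical names, and the `segToGlobal` array stands in for whatever ordinal map the real collector would consult.

```java
/** Illustrative only: per-segment counting followed by one bulk translation. */
final class SegmentOrdCounting {
    /** Counts occurrences of each segment-local ord; docOrds[doc] lists that doc's ords. */
    static int[] countSegment(int[][] docOrds, int numSegOrds) {
        int[] counts = new int[numSegOrds];
        for (int[] ords : docOrds) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /**
     * Folds one segment's counts into the global counts. The map is read once
     * per distinct ord and in increasing ord order, not once per doc per value.
     */
    static void addToGlobal(int[] segCounts, int[] segToGlobal, int[] globalCounts) {
        for (int segOrd = 0; segOrd < segCounts.length; segOrd++) {
            if (segCounts[segOrd] != 0) {
                globalCounts[segToGlobal[segOrd]] += segCounts[segOrd];
            }
        }
    }
}
```

In a collector, `countSegment` corresponds to the per-document work and `addToGlobal` to the step run when moving to the next segment and once more when collection ends.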
[jira] [Commented] (SOLR-4490) add support for multivalued docvalues
[ https://issues.apache.org/jira/browse/SOLR-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586349#comment-13586349 ]

Adrien Grand commented on SOLR-4490:

+1

add support for multivalued docvalues

Key: SOLR-4490
URL: https://issues.apache.org/jira/browse/SOLR-4490
Project: Solr
Issue Type: New Feature
Reporter: Robert Muir
Attachments: SOLR-4490.patch, SOLR-4490.patch

exposing LUCENE-4765 essentially. I think we don't need any new options, it just means doing the right thing when someone has docValues=true and multivalued=true.
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592107#comment-13592107 ]

Adrien Grand commented on LUCENE-4752:

I think a very simple first step could be to have an experimental IndexWriterConfig option to tell IndexWriter to provide an atomic sorted view (easy once LUCENE-3918 is committed) of the segments to merge to SegmentMerger instead of the segments themselves (see calls to SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments would remain unsorted, but the big ones, those that are interesting for both index compression and early query termination, would be sorted.

It can seem inefficient to sort segments over and over but I don't think we should worry too much:
- if we are merging initial segments (those created from IndexWriter.flush), they would be small so sorting/merging them would be fast?
- if we are merging big segments, I think that the following reasons could make merging slower than a regular merge:
1. computing the new doc ID mapping,
2. random I/O access,
3. not being able to use the specialized codec merging methods.

But if the segments to merge are sorted, computing the new doc ID mapping could be actually fast (some sorting algorithms such as [TimSort|http://en.wikipedia.org/wiki/Timsort] perform in O(n) when the input is a succession of sorted sequences), and the access patterns to the individual segments would be I/O cache-friendly (because each segment would be read sequentially). So I think this approach could be fast enough?

Merge segments to sort them

Key: LUCENE-4752
URL: https://issues.apache.org/jira/browse/LUCENE-4752
Project: Lucene - Core
Issue Type: New Feature
Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
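The claim that the doc ID mapping is cheap when the inputs are already sorted can be sketched with a k-way merge. Note that this swaps in a heap-based merge (O(n log k) for k segments) rather than the TimSort behavior mentioned in the comment, and that `SortedMergeDocMap` and its `long[][]` key layout are hypothetical, not part of Lucene.

```java
import java.util.PriorityQueue;

/** Illustrative only: new doc IDs for a merge of segments that are each already sorted. */
final class SortedMergeDocMap {
    /**
     * keys[seg][doc] is the sort key of a document; each keys[seg] must be ascending.
     * Returns oldToNew[seg][doc], the document's position in the merged order.
     */
    static int[][] newDocIds(long[][] keys) {
        // Heap of {segment, doc} cursors, ordered by key, ties broken by segment index.
        PriorityQueue<int[]> heads = new PriorityQueue<>((a, b) -> {
            int c = Long.compare(keys[a[0]][a[1]], keys[b[0]][b[1]]);
            return c != 0 ? c : Integer.compare(a[0], b[0]);
        });
        int[][] oldToNew = new int[keys.length][];
        for (int seg = 0; seg < keys.length; seg++) {
            oldToNew[seg] = new int[keys[seg].length];
            if (keys[seg].length > 0) {
                heads.add(new int[] {seg, 0});
            }
        }
        int next = 0;
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            int seg = head[0];
            int doc = head[1];
            oldToNew[seg][doc] = next++;
            // Each segment is consumed strictly front to back.
            if (doc + 1 < keys[seg].length) {
                heads.add(new int[] {seg, doc + 1});
            }
        }
        return oldToNew;
    }
}
```

Each segment is read strictly in order, which is the sequential access pattern the comment expects to be cache-friendly.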
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592366#comment-13592366 ]

Adrien Grand commented on LUCENE-4752:

bq. How can you early terminate a query for a single segment? [...] To early terminate efficiently, you must have the segments in a consistent order, e.g. S1 S2 S3.

I think this is just an API limitation? Segments being processed independently, we should be able to terminate collection on a per-segment basis?

bq. instead of stuffing into IWC what seems like a random setting (pick-segments-for-sorting), we should have something more generic, like AtomicReaderFactory

I didn't mean this should be a boolean. Of course it should be something more flexible/configurable! I'm very bad at picking names, but following your naming suggestion, we could have something like

{code}
abstract class AtomicReaderFactory {
  abstract List<AtomicReader> reorder(List<SegmentReader> segmentReaders);
}
{code}

? The default impl would be the identity whereas the sorting impl would return a singleton containing a sorted view over the segment readers?

bq. Also, a custom SegmentMerger to implement the zig-zag merge would help too.

This is another option. I actually started exploring this option when David opened this issue, but it can become complicated, especially for postings lists merging, whereas reusing the sorted view from LUCENE-3918 would make merging trivial.

Merge segments to sort them

Key: LUCENE-4752
URL: https://issues.apache.org/jira/browse/LUCENE-4752
Project: Lucene - Core
Issue Type: New Feature
Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs
[ https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594650#comment-13594650 ]

Adrien Grand commented on LUCENE-3918:

Thanks for your work Shai. Indeed it looks really good now! Here are a few suggestions/questions:
- Are there actual use-cases for sorting by stored fields or payloads? If not I think we should remove StoredFieldsSorter and PayloadSorter?
- Remove IndexSorter.java and make SortDoc package-private?

{code}
+ // we cannot reuse the given DocsAndPositionsEnum because we return our
+ // own wrapper, and not all Codecs like it.
{code}

Maybe we could check if the docs enum to reuse is an instance of SortingDocsEnum and reuse its wrapped DocEnum?

Port index sorter to trunk APIs

Key: LUCENE-3918
URL: https://issues.apache.org/jira/browse/LUCENE-3918
Project: Lucene - Core
Issue Type: Task
Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Fix For: 4.2, 5.0
Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch

LUCENE-2482 added an IndexSorter to 3.x, but we need to port this functionality to 4.0 apis.
[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs
[ https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594787#comment-13594787 ]

Adrien Grand commented on LUCENE-3918:

Regarding PayloadSorter and StoredFieldsSorter I'm just afraid that the fact that they exist might make users think these are viable options...

bq. IndexSorter is a convenient utility for sorting a Directory end-to-end. Why remove it?

I think taking an AtomicReader as an argument (instead of a Directory) and feeding an IndexWriter (instead of another Directory) would be much more flexible but then it would just be a call to IndexWriter.addIndexes... If we want a utility to sort indexes, maybe it should rather be something callable from command-line? (java oal.index.sorter.IndexSorter fromDir toDir sortField)

bq. Get rid of SortDoc. Sorter is now abstract class with a helper int[] compute(int[] docs, T[] values)

I think it's better! Maybe a List instead of an array would be even better so that NumericDocValuesSorter could use a view over the doc values instead of loading all of them into memory?

Reuse of DocsEnum looks great!

Port index sorter to trunk APIs

Key: LUCENE-3918
URL: https://issues.apache.org/jira/browse/LUCENE-3918
Project: Lucene - Core
Issue Type: Task
Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Fix For: 4.2, 5.0
Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch

LUCENE-2482 added an IndexSorter to 3.x, but we need to port this functionality to 4.0 apis.
[jira] [Updated] (LUCENE-3918) Port index sorter to trunk APIs
[ https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-3918:

Attachment: LUCENE-3918.patch

bq. I use two parallel arrays to sort the documents (docs and values)

I updated the patch to use doc IDs as ords so that values are never swapped (only doc IDs) and the numeric doc values don't need to be all loaded in memory.

bq. So one option is to remove the class, but still keep a test around which does the addIndexes to make sure it works.

+1

bq. I don't want however to add a main that is limited to NumericDV ... and I do think that stored fields / payload value are viable options.

I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

Port index sorter to trunk APIs

Key: LUCENE-3918
URL: https://issues.apache.org/jira/browse/LUCENE-3918
Project: Lucene - Core
Issue Type: Task
Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Fix For: 4.2, 5.0
Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch

LUCENE-2482 added an IndexSorter to 3.x, but we need to port this functionality to 4.0 apis.
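The "doc IDs as ords" idea from the update above can be sketched as sorting an array of doc IDs with a comparator that looks values up on demand, so the values themselves are never copied or swapped. This is not the patch's code: `DocIdSortSketch` is a hypothetical name and `IntToLongFunction` stands in for a view over numeric doc values.

```java
import java.util.Arrays;
import java.util.function.IntToLongFunction;

/** Illustrative only: sort doc IDs by value without materializing the values. */
final class DocIdSortSketch {
    /** Returns the old doc IDs in their new order, i.e. a new-to-old mapping. */
    static int[] sortedDocs(int maxDoc, IntToLongFunction values) {
        Integer[] docs = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            docs[i] = i;
        }
        // Only the doc IDs move; values are read through the view when compared.
        // Arrays.sort on objects is stable, so ties keep their original doc order.
        Arrays.sort(docs, (a, b) -> Long.compare(values.applyAsLong(a), values.applyAsLong(b)));
        int[] newToOld = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            newToOld[i] = docs[i];
        }
        return newToOld;
    }
}
```

The boxed Integer[] keeps the sketch short; a production version would more likely sort a primitive int[] with a custom sorter to avoid boxing.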
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598331#comment-13598331 ]

Adrien Grand commented on LUCENE-4752:

bq. the SortingSegmentMerger will accumulate the readers in add(SegmentReader) and open a SortingAtomicReader over a MultiReader of all SegReaders... what do you think?

I think this is a good idea! However, I don't understand this global sorting issue. What would it bring?

Merge segments to sort them

Key: LUCENE-4752
URL: https://issues.apache.org/jira/browse/LUCENE-4752
Project: Lucene - Core
Issue Type: New Feature
Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Created] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array
Adrien Grand created LUCENE-4830:

Summary: Sorter API: use an abstract doc map instead of an array
Key: LUCENE-4830
URL: https://issues.apache.org/jira/browse/LUCENE-4830
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Fix For: 4.3

The sorter API uses arrays to store the old->new and new->old doc IDs mappings. It should rather be an abstract class given that in some cases an array is not required at all (reverse mapping for example).
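A minimal sketch of the abstraction proposed in LUCENE-4830, assuming nothing about the committed API: `DocMapSketch` and its factory methods are hypothetical names. The reverse mapping needs no storage at all, while an arbitrary permutation still falls back to arrays.

```java
/** Illustrative only: a doc map that does not force every implementation to hold arrays. */
abstract class DocMapSketch {
    abstract int oldToNew(int docID);

    abstract int newToOld(int docID);

    abstract int size();

    /** Reverse order: computed arithmetically, no array required. */
    static DocMapSketch reverse(final int maxDoc) {
        return new DocMapSketch() {
            @Override int oldToNew(int docID) { return maxDoc - 1 - docID; }
            @Override int newToOld(int docID) { return maxDoc - 1 - docID; }
            @Override int size() { return maxDoc; }
        };
    }

    /** General permutation: backed by the given array plus its computed inverse. */
    static DocMapSketch fromNewToOld(final int[] newToOld) {
        final int[] oldToNew = new int[newToOld.length];
        for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
            oldToNew[newToOld[newDoc]] = newDoc;
        }
        return new DocMapSketch() {
            @Override int oldToNew(int docID) { return oldToNew[docID]; }
            @Override int newToOld(int docID) { return newToOld[docID]; }
            @Override int size() { return newToOld.length; }
        };
    }
}
```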
[jira] [Updated] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array
[ https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4830:

Attachment: LUCENE-4830.patch

Patch. I also changed SortingAtomicReader.liveDocs() to be a view over the original liveDocs.

Sorter API: use an abstract doc map instead of an array

Key: LUCENE-4830
URL: https://issues.apache.org/jira/browse/LUCENE-4830
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Fix For: 4.3
Attachments: LUCENE-4830.patch

The sorter API uses arrays to store the old->new and new->old doc IDs mappings. It should rather be an abstract class given that in some cases an array is not required at all (reverse mapping for example).
[jira] [Created] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
Adrien Grand created LUCENE-4833:

Summary: Fix default MergePolicy in IndexWriterConfig
Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Updated] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4833:

Attachment: LUCENE-4833.patch

Patch.

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602265#comment-13602265 ]

Adrien Grand commented on LUCENE-4833:

Good point. I copied the behavior of setCodec which throws a NPE although you are right that most methods seem to set the default value...

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602269#comment-13602269 ]

Adrien Grand commented on LUCENE-4833:

I'm not sure I like the fact that passing null to setXXX actually sets the default value, what do other committers think?

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602274#comment-13602274 ]

Adrien Grand commented on LUCENE-4833:

My point is that if someone wants to use the default value, all he has to do is to never call the setter? Moreover users can't pass null to methods that expect primitive types (such as setMaxBufferedDocs) so throwing an exception when encountering null would be more consistent?

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602285#comment-13602285 ]

Adrien Grand commented on LUCENE-4833:

bq. We throw IllegalArg in the other setters (which take primitives), so maybe throw that and not NPE?

+1 I'll update the patch.

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Updated] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4833:

Attachment: LUCENE-4833.patch

Updated patch. IndexWriterConfig.setXXX methods now throw an IllegalArgumentException when passed null instead of setting the default value. Tests pass.

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch, LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
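The setter behavior described in the updated patch can be sketched as follows. This is a stand-alone illustration, not IndexWriterConfig itself: `ConfigSketch`, `MergePolicySketch` and `TieredSketch` are hypothetical names, and the exception message is made up.

```java
/** Hypothetical stand-ins; not Lucene's MergePolicy hierarchy. */
interface MergePolicySketch {}

final class TieredSketch implements MergePolicySketch {}

final class ConfigSketch {
    // The default is assigned once, at construction; callers who want it never call the setter.
    private MergePolicySketch mergePolicy = new TieredSketch();

    /** Rejects null instead of silently falling back to some default. */
    ConfigSketch setMergePolicy(MergePolicySketch mergePolicy) {
        if (mergePolicy == null) {
            throw new IllegalArgumentException("mergePolicy must not be null");
        }
        this.mergePolicy = mergePolicy;
        return this;
    }

    MergePolicySketch getMergePolicy() {
        return mergePolicy;
    }
}
```

Keeping the default in exactly one place means the setter can never disagree with the constructor about what the default is, which is the kind of mismatch the issue reports.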
[jira] [Resolved] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-4833.

Resolution: Fixed

Fix default MergePolicy in IndexWriterConfig

Key: LUCENE-4833
URL: https://issues.apache.org/jira/browse/LUCENE-4833
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-4833.patch, LUCENE-4833.patch

Although the default merge policy is TieredMergePolicy (as documented in IndexWriterConfig constructor), setMergePolicy assumes that the default is LogByteSizeMergePolicy.
[jira] [Commented] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array
[ https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602775#comment-13602775 ]

Adrien Grand commented on LUCENE-4830:

bq. I think that we should make the DocMap impl final? Maybe it will encourage JIT ...

Looks like it doesn't help much? http://stackoverflow.com/questions/8354412/do-java-finals-help-the-compiler-create-more-efficient-bytecode

Sorter API: use an abstract doc map instead of an array

Key: LUCENE-4830
URL: https://issues.apache.org/jira/browse/LUCENE-4830
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Fix For: 4.3
Attachments: LUCENE-4830.patch

The sorter API uses arrays to store the old->new and new->old doc IDs mappings. It should rather be an abstract class given that in some cases an array is not required at all (reverse mapping for example).
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch I've tried playing with SegmentMerger to make it configurable. This could be used to reorder document IDs (if you look at the diff in LuceneTestCase, all that is needed to reorder doc IDs is to wrap the SlowCompositeReaderWrapper with a SortingAtomicReader). Do you think it is a step in the right direction? Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Resolved] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array
[ https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4830. -- Resolution: Fixed Thank you for the review, Shai! Sorter API: use an abstract doc map instead of an array --- Key: LUCENE-4830 URL: https://issues.apache.org/jira/browse/LUCENE-4830 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Attachments: LUCENE-4830.patch The sorter API uses arrays to store the old-new and new-old doc IDs mappings. It should rather be an abstract class given that in some cases an array is not required at all (reverse mapping for example). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs
Adrien Grand created LUCENE-4834: Summary: Sorter API: Make TermsEnum.docs accept any source of liveDocs Key: LUCENE-4834 URL: https://issues.apache.org/jira/browse/LUCENE-4834 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 TermsEnum.docs currently only works when liveDocs is null or the reader's liveDocs. This is enough for addIndexes but it would be cleaner to follow TermsEnum.docs contract and accept any source of liveDocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4834: - Attachment: LUCENE-4834.patch Patch. I'll commit soon. Sorter API: Make TermsEnum.docs accept any source of liveDocs - Key: LUCENE-4834 URL: https://issues.apache.org/jira/browse/LUCENE-4834 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4834.patch TermsEnum.docs currently only works when liveDocs is null or the reader's liveDocs. This is enough for addIndexes but it would be cleaner to follow TermsEnum.docs contract and accept any source of liveDocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4834. -- Resolution: Fixed Thanks Shai. Sorter API: Make TermsEnum.docs accept any source of liveDocs - Key: LUCENE-4834 URL: https://issues.apache.org/jira/browse/LUCENE-4834 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4834.patch TermsEnum.docs currently only works when liveDocs is null or the reader's liveDocs. This is enough for addIndexes but it would be cleaner to follow TermsEnum.docs contract and accept any source of liveDocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists
Adrien Grand created LUCENE-4839: Summary: Sorter API: Use TimSort to sort doc IDs and postings lists Key: LUCENE-4839 URL: https://issues.apache.org/jira/browse/LUCENE-4839 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln(n))). The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
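The claim about partially-sorted data can be checked with a small experiment. The sketch below is not part of any patch on this issue; it assumes a Java 7+ JVM with default settings, where Arrays.sort(T[], Comparator) is backed by TimSort, and simply counts comparator calls. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Random;

public class TimSortComparisons {

    /** Sorts a copy of the input and returns how many times the comparator was called. */
    public static long countComparisons(Integer[] input) {
        final long[] count = {0};
        Integer[] copy = input.clone();
        Arrays.sort(copy, new Comparator<Integer>() {
            @Override public int compare(Integer a, Integer b) {
                count[0]++;
                return a.compareTo(b);
            }
        });
        return count[0];
    }

    /** Returns n, n-1, ..., 1: a strictly descending array. */
    public static Integer[] descending(int n) {
        Integer[] values = new Integer[n];
        for (int i = 0; i < n; i++) {
            values[i] = n - i;
        }
        return values;
    }

    /** Returns the same values in pseudo-random order. */
    public static Integer[] shuffled(int n, long seed) {
        Integer[] values = descending(n);
        // Arrays.asList is a view, so shuffling it shuffles the array.
        Collections.shuffle(Arrays.asList(values), new Random(seed));
        return values;
    }

    public static void main(String[] args) {
        int n = 100000;
        System.out.println("reverse order: " + countComparisons(descending(n)) + " comparisons");
        System.out.println("shuffled:      " + countComparisons(shuffled(n, 42L)) + " comparisons");
    }
}
```

On a reverse-ordered input the count should stay close to n, while the shuffled input needs on the order of n log n comparisons.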
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604272#comment-13604272 ] Adrien Grand commented on LUCENE-4752: --
bq. Is it possible to make fieldInfos final?
Sure. I removed the final keyword because it was easier to hack up a quick patch, but this can definitely be fixed.
bq. Adrien, perhaps add a SortingSegmentMerger to the sorter package? Or at least add a test that verifies merges keep things sorted?
I'll do that in the next patches!
bq. And finally i think it would be way better to provide whatever 'hook' is needed for this kinda stuff rather than allow subclassing of segmentmerger.
I'm fine with that option too, I need to think more about how to name it and where to plug it. In addition to the API, I think something important to validate is whether sorting the segments to merge is viable and doesn't blow up memory or indexing time... I started working on this (LUCENE-4830 for memory and LUCENE-4839 for complexity) and will run some indexing benchmarks with the Wikipedia corpus to see how it behaves compared to natural merging.
Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Commented] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists
[ https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604279#comment-13604279 ] Adrien Grand commented on LUCENE-4839: -- One major difference with the original impl is that I reused the merge routine used by mergeSort instead of porting the original one which has a few optimizations to merge runs which have different lengths and/or some patterns (look for galloping in listsort.txt) but requires extra memory. This doesn't change the fact that this impl performs extremely well when data is partially sorted. Sorter API: Use TimSort to sort doc IDs and postings lists -- Key: LUCENE-4839 URL: https://issues.apache.org/jira/browse/LUCENE-4839 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4839.patch TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln(n))). The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists
[ https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4839: - Attachment: LUCENE-4839.patch
bq. Nice! Why do we need the private inner class TimSort?
It's not needed, but my first patch (not uploaded) did not use a helper class and was hard to read, so I think it is better this way?
bq. I would be happy to also add the timSort algorithm to ArrayUtils and CollectionUtils.
Done in the patch.
bq. The bonus would be: The extensive random tests in TestArrayUtils and TestCollectionUtils could be used for timSort, too (their existence is the reason why there is no TestSorterTemplate class in current code).
Done.
Sorter API: Use TimSort to sort doc IDs and postings lists -- Key: LUCENE-4839 URL: https://issues.apache.org/jira/browse/LUCENE-4839 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4839.patch, LUCENE-4839.patch TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln(n))). The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example.
[jira] [Commented] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists
[ https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604318#comment-13604318 ] Adrien Grand commented on LUCENE-4839: -- Thanks Uwe, I'll fix it before committing! Sorter API: Use TimSort to sort doc IDs and postings lists -- Key: LUCENE-4839 URL: https://issues.apache.org/jira/browse/LUCENE-4839 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4839.patch, LUCENE-4839.patch TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln(n))). The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example.
[jira] [Resolved] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists
[ https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4839. -- Resolution: Fixed Sorter API: Use TimSort to sort doc IDs and postings lists -- Key: LUCENE-4839 URL: https://issues.apache.org/jira/browse/LUCENE-4839 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4839.patch, LUCENE-4839.patch TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used by python and Java's Arrays.sort(Object[]) in particular) is a sorting algorithm that performs very well on partially-sorted data. Indeed, with TimSort, sorting an array which is in reverse order or a finite concatenation of sorted arrays is a linear operation (instead of O(n ln(n))). The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS or merging several sorted readers for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch
bq. i think it would be way better to provide whatever 'hook' is needed for this kinda stuff rather than allow subclassing of segmentmerger. like a proper pluggable api (e.g. codec is an example of this) versus letting people just subclass concrete things.
Here is a patch that allows for reordering via a simple hook instead of having to subclass a class that does concrete things like SegmentMerger. The hook is on MergePolicy because I felt it makes sense to think of doc ID reordering at merge time as part of a merge policy, but it could also be put somewhere else or have its own class. (The patch is just here to gather some API feedback, I haven't tried to run anything with it yet.) Does it look more reasonable?
Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604536#comment-13604536 ] Adrien Grand commented on LUCENE-4752: --
bq. This looks less invasive indeed, but I feel that MP.reorder() is kind of out of the blue. Maybe we should find a way to stuff it into OneMerge?
Indeed, I thought about OneMerge too and liked this option better, but I think it is a problem for addIndexes(IndexReader...): this method doesn't need to find merges and as a consequence doesn't manipulate OneMerge instances. How would we make addIndexes(IndexReader...) sort doc IDs?
Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604540#comment-13604540 ] Adrien Grand commented on LUCENE-4752: -- Good point! I'll update the patch! Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch Patch with tests that makes OneMerge responsible for reordering doc IDs. Thoughts? Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch
bq. But, since LTC is quite big, perhaps we can move these methods to a util, e.g. CompareIndexes?
Why is the size of the class a concern? I think it's more convenient to have all assert*Equals methods in the same class? (LuceneTestCase already has many assert*Equals methods inherited from Assert.) And it makes these methods easier to find when writing a test?
bq. Can we make OneMerge.readers private and add OneMerge.add(AtomicReader) for IW to use? It looks odd that IW manipulates OneMerge.readers directly, but then calls OneMerge.getMergeReaders()
I think it would be odd if getMergeReaders was just a getter, but it is more than that since it filters out empty readers and can even return an arbitrary view over the readers to merge. But here it is just a method that computes data based on the class members, like segString?
bq. Can we remove SegmentMerger.add()
Good point, I updated the patch.
Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Created] (LUCENE-4847) Sorter API: Fully reuse docs enums
Adrien Grand created LUCENE-4847: Summary: Sorter API: Fully reuse docs enums Key: LUCENE-4847 URL: https://issues.apache.org/jira/browse/LUCENE-4847 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 SortingAtomicReader reuses the filtered docs enums but not the wrapper. In the case of SortingAtomicReader this can be a problem because the wrappers are heavyweight (they load the whole postings list into memory), so an index with many terms with high freqs will make the JVM allocate a lot of memory when browsing the postings lists. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
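The reuse pattern described above can be sketched as follows. This is a simplified stand-in, not the SortingAtomicReader code: SortedDocs, its reset method and the allocations counter are hypothetical names, the postings are plain int arrays, and the doc ID mapping is hard-coded to a reverse order. The point is only that a wrapper which buffers a whole postings list should be handed back and refilled rather than re-created for every term.

```java
import java.util.Arrays;

public class ReusableSortedDocs {

    /** Sentinel returned when the enum is exhausted (same value as Integer.MAX_VALUE). */
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical heavyweight wrapper: it buffers the whole postings list in memory. */
    public static final class SortedDocs {
        private int[] docs = new int[0];
        private int size;
        private int upto;
        /** Number of times the internal buffer had to be allocated (for demonstration). */
        public int allocations;

        /** Refills this wrapper, keeping the existing buffer when it is large enough. */
        public SortedDocs reset(int[] oldDocs, int maxDoc) {
            if (docs.length < oldDocs.length) {
                docs = new int[oldDocs.length];
                allocations++;
            }
            size = oldDocs.length;
            for (int i = 0; i < size; i++) {
                docs[i] = maxDoc - 1 - oldDocs[i]; // example mapping: reverse doc order
            }
            Arrays.sort(docs, 0, size);
            upto = -1;
            return this;
        }

        /** Returns the next remapped doc ID in increasing order, or NO_MORE_DOCS. */
        public int nextDoc() {
            return ++upto < size ? docs[upto] : NO_MORE_DOCS;
        }
    }

    /** Mirrors the "pass the previous enum back in" idiom: reuse may be null. */
    public static SortedDocs docs(int[] postings, int maxDoc, SortedDocs reuse) {
        SortedDocs wrapper = reuse != null ? reuse : new SortedDocs();
        return wrapper.reset(postings, maxDoc);
    }

    public static void main(String[] args) {
        SortedDocs docs = docs(new int[] {0, 2, 3}, 5, null);
        docs = docs(new int[] {1}, 5, docs); // second term reuses the same wrapper and buffer
        System.out.println("buffer allocations: " + docs.allocations);
    }
}
```

With this shape, browsing many terms allocates at most as many buffers as it takes to reach the longest postings list seen so far.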
[jira] [Updated] (LUCENE-4847) Sorter API: Fully reuse docs enums
[ https://issues.apache.org/jira/browse/LUCENE-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4847: - Attachment: LUCENE-4847.patch Patch. Sorter API: Fully reuse docs enums -- Key: LUCENE-4847 URL: https://issues.apache.org/jira/browse/LUCENE-4847 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4847.patch SortingAtomicReader reuses the filtered docs enums but not the wrapper. In the case of SortingAtomicReader this can be a problem because the wrappers are heavyweight (they load the whole postings list into memory), so an index with many terms with high freqs will make the JVM allocate a lot of memory when browsing the postings lists. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4747) java7 as a minimum requirement for lucene 5
[ https://issues.apache.org/jira/browse/LUCENE-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605126#comment-13605126 ] Adrien Grand commented on LUCENE-4747: -- Maybe we should fix all places that should use Integer.compare/Long.compare/... too? java7 as a minimum requirement for lucene 5 --- Key: LUCENE-4747 URL: https://issues.apache.org/jira/browse/LUCENE-4747 Project: Lucene - Core Issue Type: Task Affects Versions: 5.0 Reporter: Robert Muir Assignee: Uwe Schindler Fix For: 5.0 Attachments: LUCENE-4747.patch, LUCENE-4747.patch Spinoff from LUCENE-4746. I propose we make this change on trunk only. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
Adrien Grand created LUCENE-4851: Summary: Use Java 7's {Integer,Long,Float,Double}.compare instead of branches Key: LUCENE-4851 URL: https://issues.apache.org/jira/browse/LUCENE-4851 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0 We can use those methods now that trunk is on Java 7. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
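For readers unfamiliar with the change, here is a small sketch of the kind of code this issue is about. It is not taken from the patch; the class and helper names are made up. It contrasts the branch-based comparison, the subtraction shortcut that is sometimes used instead (and which overflows), and Integer.compare.

```java
public class CompareDemo {

    /** Pre-Java 7 style: explicit branches. */
    public static int compareWithBranches(int a, int b) {
        if (a < b) {
            return -1;
        } else if (a > b) {
            return 1;
        } else {
            return 0;
        }
    }

    /** Tempting shortcut that is wrong when a - b overflows. */
    public static int compareBySubtraction(int a, int b) {
        return a - b;
    }

    public static void main(String[] args) {
        int a = Integer.MIN_VALUE;
        int b = 1;
        // All three should be negative since a < b, but the subtraction wraps around.
        System.out.println("branches:        " + compareWithBranches(a, b));
        System.out.println("subtraction:     " + compareBySubtraction(a, b));
        System.out.println("Integer.compare: " + Integer.compare(a, b));
    }
}
```

Integer.compare and Long.compare give the same sign as the branch-based version in every case, with less code to get wrong.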
[jira] [Updated] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
[ https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4851: - Attachment: LUCENE-4851.patch Patch. Most changes are in FieldComparator. Use Java 7's {Integer,Long,Float,Double}.compare instead of branches Key: LUCENE-4851 URL: https://issues.apache.org/jira/browse/LUCENE-4851 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0 Attachments: LUCENE-4851.patch We can use those methods now that trunk is on Java 7. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
[ https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605216#comment-13605216 ] Adrien Grand commented on LUCENE-4851: -- Good idea, I'll do it! Use Java 7's {Integer,Long,Float,Double}.compare instead of branches Key: LUCENE-4851 URL: https://issues.apache.org/jira/browse/LUCENE-4851 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0 Attachments: LUCENE-4851.patch We can use those methods now that trunk is on Java 7. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
[ https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4851: - Attachment: LUCENE-4851.patch It found two calls to signum in ConjunctionScorer and PostingsHighlighter. Use Java 7's {Integer,Long,Float,Double}.compare instead of branches Key: LUCENE-4851 URL: https://issues.apache.org/jira/browse/LUCENE-4851 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0 Attachments: LUCENE-4851.patch, LUCENE-4851.patch We can use those methods now that trunk is on Java 7. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
[ https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4851. -- Resolution: Fixed Use Java 7's {Integer,Long,Float,Double}.compare instead of branches Key: LUCENE-4851 URL: https://issues.apache.org/jira/browse/LUCENE-4851 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0 Attachments: LUCENE-4851.patch, LUCENE-4851.patch We can use those methods now that trunk is on Java 7. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4852) BaseStoredFieldsFormatTestCase
[ https://issues.apache.org/jira/browse/LUCENE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605466#comment-13605466 ] Adrien Grand commented on LUCENE-4852: -- Patch looks good! BaseStoredFieldsFormatTestCase -- Key: LUCENE-4852 URL: https://issues.apache.org/jira/browse/LUCENE-4852 Project: Lucene - Core Issue Type: Task Components: general/test Reporter: Robert Muir Attachments: LUCENE-4852.patch, LUCENE-4852_prototype.patch The idea is similar to Base[Postings/DocValues/TermVectors]TestCase. We ensure each codec has certain checks, and it's easier to maintain and also easier to ensure new impls are correct. For example, hunting around today, a lot of the best tests are actually tucked away in TestCompressingStoredFieldsFormat.
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: sorting_10M_ingestion.log natural_10M_ingestion.log LUCENE-4752.patch bq. Maybe just put a comment in IW where it calls merge.getReaders() why we don't access the readers list directly Done. bq. I started working on this (LUCENE-4830 for memory and LUCENE-4839 for complexity) and will run some indexing benchmarks with the Wikipedia corpus to see how it behaves compared to natural merging. Now that SortingAtomicReader uses TimSort to compute the doc ID mapping and sort postings lists, using SortingMergePolicy only increases the merge complexity by constant factors compared to a natural merge if the readers to merge are sorted (I'm assuming the number of segments to merge is bounded). I think this makes online sorting a viable option. I ran some indexing benchmarks to see how much slower indexing is with SortingMergePolicy. To do this I quickly patched luceneutil to add a random NumericDocValuesField to all documents and wrap the merge policy with SortingMergePolicy. Indexing 10M docs from the wikimedium collection was 2x slower with SortingMergePolicy (see ingestion rate logs attached). To measure pure merge performance, I ran a forceMerge(1) on those indexes and SortingMergePolicy made this forceMerge 3.5x slower (856415 ms vs 250054 ms). If you're curious, here is where the merging time is spent with SortingMergePolicy according to my profiler: - 32%: CompressingStoredField.visitDocument (vs. 1% when using a regular merge policy) - 17%: TimSort: to sort the doc mapping and postings lists - 6%: Sorter.DocMap.oldToNew: used by SortingDocsEnum to map the old IDs to the new ones Most of the time is spent not on actual sorting but in visitDocument, because the codec-specific merge routine can't be used, so the stored fields format decompresses every chunk multiple times (a few hundred times given that my docs are really small; this would be less noticeable with larger docs). I think it's close, what do you think? Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
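To make the "doc ID mapping" mentioned in this comment more concrete, here is a minimal standalone sketch of how such a mapping could be derived from a per-document sort value. It is not the Sorter.DocMap implementation from the patch: the class DocMapSketch and its methods are hypothetical, and it relies on java.util.Arrays.sort for objects, which the JDK documents as a stable, adaptive merge sort (so an already sorted input costs close to a single pass).

```java
import java.util.Arrays;
import java.util.Comparator;

final class DocMapSketch {
  // Returns newToOld: the old doc ID that ends up at each new position when
  // documents are ordered by sortValues. Ties keep their original order
  // because the sort is stable.
  static int[] newToOld(final long[] sortValues) {
    Integer[] docs = new Integer[sortValues.length];
    for (int i = 0; i < docs.length; i++) {
      docs[i] = i;
    }
    Arrays.sort(docs, new Comparator<Integer>() {
      @Override
      public int compare(Integer a, Integer b) {
        return Long.compare(sortValues[a], sortValues[b]);
      }
    });
    int[] result = new int[docs.length];
    for (int i = 0; i < result.length; i++) {
      result[i] = docs[i];
    }
    return result;
  }

  // Inverts the permutation: oldToNew[oldId] is the new doc ID of oldId.
  static int[] oldToNew(int[] newToOld) {
    int[] inverse = new int[newToOld.length];
    for (int newId = 0; newId < newToOld.length; newId++) {
      inverse[newToOld[newId]] = newId;
    }
    return inverse;
  }

  public static void main(String[] args) {
    long[] values = {30, 10, 20};
    int[] n2o = newToOld(values);
    System.out.println(Arrays.toString(n2o));           // [1, 2, 0]
    System.out.println(Arrays.toString(oldToNew(n2o))); // [2, 0, 1]
  }
}
```

A postings list remapped through oldToNew is generally no longer in increasing doc ID order, which is why the comment above lists sorting the postings lists as a separate cost.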
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch bq. I think these are not bad numbers. Me neither! I'm rather happy with them actually. bq. As for search, perhaps we can quickly hack up IndexSearcher to allow terminating per-segment and then compare two Collectors TopFields and TopSortedFields [...] but in order to do that, we must make sure that each segment is sorted (i.e. those that are not hit by MP are still in random order), or we somehow mark on each segment whether it's sorted or not The attached patch contains a different approach: the idea is to use SortingMergePolicy together with IndexWriterConfig.getMaxBufferedDocs, which guarantees that all segments whose size is above maxBufferedDocs are sorted. Then there is a new EarlyTerminationIndexSearcher that extends search to collect segments in random order normally and to terminate collection early on segments which are sorted. bq. Accessing close documents together ... we can make an artificial test which accesses documents with sort-by-value in a specific range. But that's a too artificial test, not sure what it will tell us. Yes, I think the important thing to validate here is that merging does not get exponentially slower as segments grow. Other checks are just a bonus. Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Created] (LUCENE-4858) Ability to terminate queries on a per-segment basis
Adrien Grand created LUCENE-4858: Summary: Ability to terminate queries on a per-segment basis Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752. When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607647#comment-13607647 ] Adrien Grand commented on LUCENE-4752: -- I opened LUCENE-4858 to deal with early query termination (as you suggested earlier) so that we can concentrate on sorting in this issue. bq. Adrien, perhaps in order to keep the patch small, commit separately the changes to LTC and TestDuelingCodec (as well as the SortingAtomicReader.wrap change) I'll do that soon if nobody objects. Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments to. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4858) Ability to terminate queries on a per-segment basis
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4858: - Description: Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. was: Spin-off of LUCENE-4752. When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. Ability to terminate queries on a per-segment basis --- Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4858) Ability to terminate queries on a per-segment basis
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607711#comment-13607711 ] Adrien Grand commented on LUCENE-4858: -- {quote} What in the patch guarantees that any segment with more than maxBufferedDocs is sorted? Perhaps I've missed it, but I looked for code which ensures every such segment gets picked up by SortingMP, however didn't find it. I don't think that in general we should make assumptions based on a maxBufferedDocs setting because the default setting in IWC is per RAM consumption and also it seems slightly unrelated. I.e. if a segment is sorted, but has deletions such that numDocs < maxBufferedDocs, we do full collection, while we can early terminate as usual?{quote} Indeed, I think that finding out which segments are sorted is the main issue. My idea was to say that if you want to use early query termination, you need to set maxBufferedDocs to a given limit (low values improve early query termination while high values improve indexing speed), so that large segments (the ones that are interesting for early query termination since they require time to collect) that have more than maxBufferedDocs documents (deleted or not) are known to be sorted, because they result from a merge. Of course, this could miss some small segments which are sorted, but since they are small, they're not as interesting for early query termination. What options do we have here? I think you mentioned tagging sorted segments, do you have an idea where/how we could do that? bq. And hopefully we can stuff the early termination logic down to IndexSearcher eventually. There are other scenarios for early termination, such as time limit, and therefore I think it's ok if we have an EarlyTerminationException which IndexSearcher responds to. Indeed, I think this makes sense.
Ability to terminate queries on a per-segment basis --- Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4862) Ability to terminate queries on a per-segment basis
Adrien Grand created LUCENE-4862: Summary: Ability to terminate queries on a per-segment basis Key: LUCENE-4862 URL: https://issues.apache.org/jira/browse/LUCENE-4862 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752. The idea is to add a marker exception that tells IndexSearcher to terminate the collection of the current segment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4858: - Summary: Early termination with SortingMergePolicy (was: Ability to terminate queries on a per-segment basis) Early termination with SortingMergePolicy - Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607816#comment-13607816 ] Adrien Grand commented on LUCENE-4858: -- bq. Can't we split this issue up? I think the current discussion is focused much on this sorted segments thing, but thats not the only possible implementation for this kind of thing. I created LUCENE-4862. Early termination with SortingMergePolicy - Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4862) Ability to terminate queries on a per-segment basis
[ https://issues.apache.org/jira/browse/LUCENE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4862: - Attachment: LUCENE-4862.patch Patch that adds a new CollectionTerminatedException. When thrown from Collector.collect, IndexSearcher swallows it and terminates collection of the current IndexReader leaf. Ability to terminate queries on a per-segment basis --- Key: LUCENE-4862 URL: https://issues.apache.org/jira/browse/LUCENE-4862 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4862.patch Spin-off of LUCENE-4752. The idea is to add a marker exception that tells IndexSearcher to terminate the collection of the current segment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
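The control flow described here (a collector throws a marker exception, the searcher swallows it and moves on to the next leaf) can be sketched without any Lucene dependency. Everything below is a hypothetical stand-in: SegmentDoneException plays the role the comment gives to CollectionTerminatedException, and search() plays the role of IndexSearcher's per-leaf loop. None of it is the code from LUCENE-4862.patch.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyTerminationSketch {
  // Marker exception: carries no data, it only signals "stop this segment".
  static final class SegmentDoneException extends RuntimeException {
  }

  // Hypothetical stand-in for a collector that is told which segment a doc is in.
  interface SegmentCollector {
    void collect(int segment, int doc);
  }

  // Stand-in for the searcher loop: the marker exception ends collection of
  // the current segment only, then the loop continues with the next one.
  static void search(int[][] segments, SegmentCollector collector) {
    for (int seg = 0; seg < segments.length; seg++) {
      try {
        for (int doc : segments[seg]) {
          collector.collect(seg, doc);
        }
      } catch (SegmentDoneException e) {
        // Swallowed on purpose: terminate this segment, keep searching.
      }
    }
  }

  // Collects at most n docs per segment, which is sufficient when every
  // segment is already sorted in the order the query asks for.
  static List<String> firstNPerSegment(int[][] segments, final int n) {
    final List<String> hits = new ArrayList<String>();
    search(segments, new SegmentCollector() {
      private int currentSegment = -1;
      private int collected = 0;

      @Override
      public void collect(int segment, int doc) {
        if (segment != currentSegment) {
          currentSegment = segment;
          collected = 0;
        }
        if (collected == n) {
          throw new SegmentDoneException();
        }
        hits.add(segment + ":" + doc);
        collected++;
      }
    });
    return hits;
  }

  public static void main(String[] args) {
    int[][] segments = {{1, 2, 3}, {4}, {5, 6, 7, 8}};
    System.out.println(firstNPerSegment(segments, 2)); // [0:1, 0:2, 1:4, 2:5, 2:6]
  }
}
```

Using an exception rather than a return value keeps the collect signature unchanged, which is presumably why the issue describes it as a marker exception; the cost is that every caller driving a collector has to know to catch it.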
[jira] [Resolved] (LUCENE-4847) Sorter API: Fully reuse docs enums
[ https://issues.apache.org/jira/browse/LUCENE-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4847. -- Resolution: Fixed Sorter API: Fully reuse docs enums -- Key: LUCENE-4847 URL: https://issues.apache.org/jira/browse/LUCENE-4847 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4847.patch SortingAtomicReader reuses the filtered docs enums but not the wrapper. In the case of SortingAtomicReader this can be a problem because the wrappers are heavyweight (they load the whole postings list into memory), so an index with many terms with high freqs will make the JVM allocate a lot of memory when browsing the postings lists. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: - Attachment: LUCENE-4752.patch New patch, focused on SortingMergePolicy, ready to be reviewed! Merge segments to sort them --- Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments to. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4867) SorterTemplate.merge is slow
Adrien Grand created LUCENE-4867: Summary: SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
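As a point of comparison for the "usual merge routine (that requires n/2 elements in memory)" mentioned above, here is a minimal sketch of a merge sort that copies the left half of each range into a scratch buffer before merging. It is written against a plain int[] and is not the hacked SorterTemplate from the benchmark; the class name BufferedMergeSketch is hypothetical.

```java
import java.util.Arrays;

final class BufferedMergeSketch {
  // Sorts the array with a top-down merge sort that uses a scratch buffer of
  // a.length / 2 elements, allocated once and shared by all merges.
  static void mergeSort(int[] a) {
    int[] buffer = new int[a.length / 2];
    sort(a, 0, a.length, buffer);
  }

  private static void sort(int[] a, int from, int to, int[] buffer) {
    if (to - from < 2) {
      return;
    }
    int mid = (from + to) >>> 1; // the left half is never longer than the right
    sort(a, from, mid, buffer);
    sort(a, mid, to, buffer);
    merge(a, from, mid, to, buffer);
  }

  private static void merge(int[] a, int from, int mid, int to, int[] buffer) {
    if (a[mid - 1] <= a[mid]) {
      return; // both halves are already in order
    }
    int leftLen = mid - from;
    // Only the left half is copied out; the right half is read in place.
    System.arraycopy(a, from, buffer, 0, leftLen);
    int i = 0;    // next unread element of the copied left half
    int j = mid;  // next unread element of the right half
    int k = from; // next write position; stays below j while i < leftLen
    while (i < leftLen && j < to) {
      // Take from the left on ties so the sort stays stable.
      a[k++] = (a[j] < buffer[i]) ? a[j++] : buffer[i++];
    }
    while (i < leftLen) {
      a[k++] = buffer[i++];
    }
    // Any remaining right-half elements are already in their final position.
  }

  public static void main(String[] args) {
    int[] data = {5, 1, 4, 2, 3};
    mergeSort(data);
    System.out.println(Arrays.toString(data)); // [1, 2, 3, 4, 5]
  }
}
```

Each element is moved at most twice per merge (once into the buffer, once back), whereas an in-place merge has to shuffle elements with swaps, which is the cost the issue attributes to SorterTemplate.merge.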
[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4867: - Attachment: SortBench.java Here is the program I used for testing. SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4867: - Attachment: LUCENE-4867.patch bq. If you want a faster algorithm, you have to move away from in-place. In that case, could we make SorterTemplate.merge overridable (protected) so that custom templates can use extra memory to merge? The attached patch modifies ArrayUtil to show how it could be used to implement a faster merge, which makes mergeSort more than 2x faster (~21s on average on my 50M array) although it only requires 1% of additional memory. What do you think? SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4867.patch, SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609014#comment-13609014 ] Adrien Grand commented on LUCENE-4867: -- bq. Or did you implement it separate to not allocate the extra array, if only quicksort is called? Exactly. SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4867.patch, SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609024#comment-13609024 ] Adrien Grand commented on LUCENE-4867: -- bq. Otherwise I am fine with doing it that way, if we do not enforce users to implement the merge code. OK. I'll update the patch to port the same behavior to CollectionUtil. SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4867.patch, SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4867: - Attachment: LUCENE-4867.patch Patch that makes SorterTemplate.merge protected and makes ArrayUtil and CollectionUtil use specialized SorterTemplate instances that use up to 1% extra memory for faster merge-based sorts. I'll open a separate issue to use the same optimizations for the sorter API's timsorts. SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4867.patch, LUCENE-4867.patch, SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4862) Ability to terminate queries on a per-segment basis
[ https://issues.apache.org/jira/browse/LUCENE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4862. -- Resolution: Fixed Thank you for the review Shai! Ability to terminate queries on a per-segment basis --- Key: LUCENE-4862 URL: https://issues.apache.org/jira/browse/LUCENE-4862 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4862.patch Spin-off of LUCENE-4752. The idea is to add a marker exception that tells IndexSearcher to terminate the collection of the current segment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609231#comment-13609231 ] Adrien Grand commented on LUCENE-4752: I plan to commit it tomorrow unless someone objects. Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610099#comment-13610099 ] Adrien Grand commented on LUCENE-4571: Agreed, these speedups are awesome! speedup disjunction with minShouldMatch Key: LUCENE-4571 URL: https://issues.apache.org/jira/browse/LUCENE-4571 Project: Lucene - Core Issue Type: Improvement Components: core/search Affects Versions: 4.1 Reporter: Mikhail Khludnev Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the whole disjunction and verifies the minShouldMatch condition [on every doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
{code}
public int nextDoc() throws IOException {
  assert doc != NO_MORE_DOCS;
  while (true) {
    // Move every sub-scorer positioned on the current doc to its next doc.
    while (subScorers[0].docID() == doc) {
      if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
        heapAdjust(0);
      } else {
        heapRemoveRoot();
        // Too few scorers left to ever satisfy minimumNrMatchers.
        if (numScorers < minimumNrMatchers) {
          return doc = NO_MORE_DOCS;
        }
      }
    }
    afterNext();
    // The minShouldMatch condition is only checked here, once per candidate doc.
    if (nrMatchers >= minimumNrMatchers) {
      break;
    }
  }
  return doc;
}
{code}
[~spo] proposes (as far as I understand it) to pop nrMatchers-1 scorers from the heap first, and then push them back advancing behind that top doc. For me, question no. 1 is whether there is a performance test for minShouldMatch-constrained disjunctions.
[jira] [Created] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum
Adrien Grand created LUCENE-4871: Summary: Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum Key: LUCENE-4871 URL: https://issues.apache.org/jira/browse/LUCENE-4871 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 SortingDocsAndPositionsEnum could easily save memory by using a Lucene40TCF-like compression method for positions, offsets and payloads: - delta-encode positions and startOffsets (with the previous end offset), - store the length of the tokens instead of their end offset (endOffset == startOffset + length), - use a single bit to say whether the token has a payload. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
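The three ideas listed in the description can be illustrated with a small round-trip sketch. It is not the code from LUCENE-4871.patch: the class TokenCodec, the int[] layout of three values per token, and the token representation {position, startOffset, endOffset, hasPayload} are all assumptions made for this example. A real implementation would presumably write variable-length integers into a byte buffer so that the small deltas actually save memory.

```java
// Hypothetical sketch; TokenCodec and its layout are illustrative assumptions.
final class TokenCodec {

  /** Each token is {position, startOffset, endOffset, hasPayload (0 or 1)}. */
  static int[] encode(int[][] tokens) {
    int[] out = new int[tokens.length * 3];
    int previousPosition = 0;
    int previousEndOffset = 0;
    int i = 0;
    for (int[] token : tokens) {
      // Positions never decrease within a document, so the delta is >= 0.
      int positionDelta = token[0] - previousPosition;
      // A single low bit says whether the token has a payload.
      out[i++] = (positionDelta << 1) | token[3];
      // Start offset is stored relative to the previous token's end offset.
      out[i++] = token[1] - previousEndOffset;
      // Token length instead of end offset: endOffset == startOffset + length.
      out[i++] = token[2] - token[1];
      previousPosition = token[0];
      previousEndOffset = token[2];
    }
    return out;
  }

  static int[][] decode(int[] encoded) {
    int[][] tokens = new int[encoded.length / 3][];
    int position = 0;
    int endOffset = 0;
    for (int i = 0, t = 0; i < encoded.length; i += 3, t++) {
      position += encoded[i] >>> 1;
      int hasPayload = encoded[i] & 1;
      int startOffset = endOffset + encoded[i + 1];
      endOffset = startOffset + encoded[i + 2];
      tokens[t] = new int[] {position, startOffset, endOffset, hasPayload};
    }
    return tokens;
  }
}
```

Note that the start-offset delta can be negative when tokens overlap (for example stacked synonyms), which is why this sketch keeps signed ints.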
[jira] [Updated] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum
[ https://issues.apache.org/jira/browse/LUCENE-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4871: - Attachment: LUCENE-4871.patch Patch. Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum -- Key: LUCENE-4871 URL: https://issues.apache.org/jira/browse/LUCENE-4871 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Attachments: LUCENE-4871.patch SortingDocsAndPositionsEnum could easily save memory by using a Lucene40TCF-like compression method for positions, offsets and payloads: - delta-encode positions and startOffsets (with the previous end offset), - store the length of the tokens instead of their end offset (endOffset == startOffset + length), - use a single bit to say whether the token has a payload. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4752. Resolution: Fixed bq. Adrien, you didn't put your name in the CHANGES entry. +1 to commit. Fixed and committed. Thank you Shai! Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Resolved] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum
[ https://issues.apache.org/jira/browse/LUCENE-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4871. -- Resolution: Fixed Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum -- Key: LUCENE-4871 URL: https://issues.apache.org/jira/browse/LUCENE-4871 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Attachments: LUCENE-4871.patch SortingDocsAndPositionsEnum could easily save memory by using a Lucene40TCF-like compression method for positions, offsets and payloads: - delta-encode positions and startOffsets (with the previous end offset), - store the length of the tokens instead of their end offset (endOffset == startOffset + length), - use a single bit to say whether the token has a payload. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4867) SorterTemplate.merge is slow
[ https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4867. -- Resolution: Fixed SorterTemplate.merge is slow Key: LUCENE-4867 URL: https://issues.apache.org/jira/browse/LUCENE-4867 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4867.patch, LUCENE-4867.patch, SortBench.java SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick benchmark that sorts an Integer[] array of 50M elements, and mergeSort was almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). This is even worse when the cost of a swap is higher (e.g. parallel arrays). This is due to SorterTemplate.merge. I first feared that this method might not be linear, but it is, so the slowness is due to the fact that this method needs to swap lots of values in order not to require extra memory. Could we make it faster? For reference, I hacked a SorterTemplate instance to use the usual merge routine (that requires n/2 elements in memory), and it was much faster: ~17s on average, so there is room for improvement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4874) Remove FilterTerms.intersect
Adrien Grand created LUCENE-4874: Summary: Remove FilterTerms.intersect Key: LUCENE-4874 URL: https://issues.apache.org/jira/browse/LUCENE-4874 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Priority: Minor Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends Terms to override intersect in order for this method to behave correctly. If FilterTerms did not override this method and used the default impl, we would not have this problem.
[jira] [Updated] (LUCENE-4874) Remove FilterTerms.intersect
[ https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4874: - Description: Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends FilterTerms to override intersect in order this method to have a correct behavior. If FilterTerms did not override this method and used the default impl, we would not have this problem. (was: Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends Terms to override intersect in order this method to have a correct behavior. If FilterTerms did not override this method and used the default impl, we would not have this problem.) Remove FilterTerms.intersect Key: LUCENE-4874 URL: https://issues.apache.org/jira/browse/LUCENE-4874 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Priority: Minor Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends FilterTerms to override intersect in order this method to have a correct behavior. If FilterTerms did not override this method and used the default impl, we would not have this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4874) Remove FilterTerms.intersect
[ https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611829#comment-13611829 ] Adrien Grand commented on LUCENE-4874: This makes sense. I found another bug in SortingAtomicReader, which doesn't override getCoreCacheKey; this could lead to very bad things if an atomic reader and its sorted view were both used with the same FieldCache instance. I've started looking at methods that override default impls and would like to have your opinion on some of them: - shouldn't IndexReader.hasDeletions return numDeletedDocs() > 0 by default instead of being abstract? - isn't the default impl of TermsEnum.termState dangerous? Shouldn't it throw an UnsupportedOperationException or be abstract instead? Remove FilterTerms.intersect Key: LUCENE-4874 URL: https://issues.apache.org/jira/browse/LUCENE-4874 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Priority: Minor Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends FilterTerms to override intersect in order for this method to behave correctly. If FilterTerms did not override this method and used the default impl, we would not have this problem.
[jira] [Created] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays
Adrien Grand created LUCENE-4875: Summary: Make SorterTemplate.mergeSort run in linear time on sorted arrays Key: LUCENE-4875 URL: https://issues.apache.org/jira/browse/LUCENE-4875 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Through minor modifications, SorterTemplate.mergeSort could run in linear time on sorted arrays, so I think we should do it? The idea is to modify merge so that it returns instantly when compare(pivot-1, pivot) <= 0.
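A rough sketch of the early-exit idea, assuming a plain top-down merge sort on an int[] with a temporary buffer. This is not SorterTemplate (which merges in place through compare/swap callbacks); the class EarlyExitMergeSort and its mergeCalls counter are hypothetical and exist only to show that no merge work happens when the two runs are already in order.

```java
// Hypothetical sketch; not the SorterTemplate implementation.
final class EarlyExitMergeSort {
  /** Number of merges that actually had to move elements. */
  int mergeCalls = 0;

  void sort(int[] a) {
    sort(a, 0, a.length, new int[a.length]);
  }

  private void sort(int[] a, int lo, int hi, int[] tmp) {
    if (hi - lo < 2) {
      return;
    }
    int pivot = (lo + hi) >>> 1;
    sort(a, lo, pivot, tmp);
    sort(a, pivot, hi, tmp);
    merge(a, lo, pivot, hi, tmp);
  }

  private void merge(int[] a, int lo, int pivot, int hi, int[] tmp) {
    // Early exit: the last element of the left run is not greater than the
    // first element of the right run, so the concatenation is already sorted.
    if (a[pivot - 1] <= a[pivot]) {
      return;
    }
    mergeCalls++;
    System.arraycopy(a, lo, tmp, lo, hi - lo);
    int i = lo, j = pivot, k = lo;
    while (i < pivot && j < hi) {
      // Take from the left run on ties to keep the sort stable.
      a[k++] = tmp[j] < tmp[i] ? tmp[j++] : tmp[i++];
    }
    while (i < pivot) {
      a[k++] = tmp[i++];
    }
    while (j < hi) {
      a[k++] = tmp[j++];
    }
  }
}
```

On sorted input each of the n-1 merge steps costs a single comparison, so the whole sort does a linear number of comparisons.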
[jira] [Updated] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays
[ https://issues.apache.org/jira/browse/LUCENE-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4875: Attachment: LUCENE-4875.patch Patch. I modified the test case to make sure merge is never called when the concatenation of the two runs to merge is already sorted. Make SorterTemplate.mergeSort run in linear time on sorted arrays Key: LUCENE-4875 URL: https://issues.apache.org/jira/browse/LUCENE-4875 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Attachments: LUCENE-4875.patch Through minor modifications, SorterTemplate.mergeSort could run in linear time on sorted arrays, so I think we should do it? The idea is to modify merge so that it returns instantly when compare(pivot-1, pivot) <= 0.
[jira] [Created] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
Adrien Grand created LUCENE-4876: Summary: IndexWriterConfig.clone should clone the MergeScheduler Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Fix For: 4.3 ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
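To make the sharing problem concrete, here is a minimal sketch with hypothetical stand-in classes (CloneSketch, Scheduler, Config); it does not use the real IndexWriterConfig or ConcurrentMergeScheduler. Object.clone copies field references, so unless the config also clones its scheduler, and the scheduler gives its clone a fresh list, both configs end up tracking threads in the same list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; not the Lucene classes.
final class CloneSketch {

  static final class Scheduler implements Cloneable {
    /** Mutable per-instance state, like the list of running merge threads. */
    List<String> mergeThreads = new ArrayList<>();

    @Override
    public Scheduler clone() {
      try {
        Scheduler copy = (Scheduler) super.clone();
        // super.clone() copied the list reference; give the clone its own list.
        copy.mergeThreads = new ArrayList<>();
        return copy;
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e); // cannot happen: the class is Cloneable
      }
    }
  }

  static final class Config implements Cloneable {
    Scheduler scheduler = new Scheduler();

    @Override
    public Config clone() {
      try {
        Config copy = (Config) super.clone();
        // Without this line both configs would share one scheduler instance.
        copy.scheduler = scheduler.clone();
        return copy;
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }
}
```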
[jira] [Reopened] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reopened LUCENE-4752: I just found what caused the last Jenkins failures: sometimes deletions happen concurrently with a merge. In this case, deletes are still applied to the old ReaderAndLiveDocs and, once the merge is finished, IndexWriter runs commitMergedDeletes to apply deletes to the new segment too. Since commitMergedDeletes assumes doc IDs are assigned sequentially, it doesn't work with SortingMergePolicy. (This also explains why the bug was hard to reproduce.) Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Assigned] (LUCENE-4874) Remove FilterTerms.intersect
[ https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4874: Assignee: Adrien Grand Remove FilterTerms.intersect Key: LUCENE-4874 URL: https://issues.apache.org/jira/browse/LUCENE-4874 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends FilterTerms to override intersect in order this method to have a correct behavior. If FilterTerms did not override this method and used the default impl, we would not have this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4874) Remove FilterTerms.intersect
[ https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612908#comment-13612908 ] Adrien Grand commented on LUCENE-4874: Although DocIdSetIterator.advance is abstract, its documentation describes a default implementation that many classes extending DocsEnum/DocsAndPositionsEnum duplicate. Maybe we should just provide a default implementation for advance; this would save copy-pasting. Remove FilterTerms.intersect Key: LUCENE-4874 URL: https://issues.apache.org/jira/browse/LUCENE-4874 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Terms.intersect is an optional method. The fact that it is overridden in FilterTerms forces any non-trivial class that extends FilterTerms to override intersect in order for this method to behave correctly. If FilterTerms did not override this method and used the default impl, we would not have this problem.
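For readers unfamiliar with the default behavior mentioned in the comment above, it amounts to a linear scan with nextDoc until the target is reached. The sketch below shows what a shared default could look like; SimpleDocIterator and ArrayDocIterator are hypothetical classes written for this example, not Lucene's DocIdSetIterator, and only the NO_MORE_DOCS sentinel mirrors the Lucene convention.

```java
// Hypothetical sketch of a base class that provides advance by default.
abstract class SimpleDocIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  abstract int docID();

  abstract int nextDoc();

  /** Default implementation: linear scan until a doc >= target is found. */
  int advance(int target) {
    int doc;
    while ((doc = nextDoc()) < target) {
      // keep scanning
    }
    return doc;
  }
}

/** Subclass that only implements docID and nextDoc and inherits advance. */
final class ArrayDocIterator extends SimpleDocIterator {
  private final int[] docs; // sorted doc IDs
  private int index = -1;

  ArrayDocIterator(int[] docs) {
    this.docs = docs;
  }

  @Override
  int docID() {
    if (index < 0) {
      return -1; // not positioned yet
    }
    return index < docs.length ? docs[index] : NO_MORE_DOCS;
  }

  @Override
  int nextDoc() {
    index = Math.min(index + 1, docs.length);
    return docID();
  }
}
```

Subclasses with a faster way to skip (for example skip lists) would still override advance; the default only removes the copy-pasted loop from the others.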
[jira] [Created] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
Adrien Grand created LUCENE-4888: Summary: SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 Key: LUCENE-4888 URL: https://issues.apache.org/jira/browse/LUCENE-4888 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although the behavior of this method is undefined in such cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
[ https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4888: - Attachment: LUCENE-4888.patch A patch that adds assertions to AssertingDocsAndPositionsEnum. You can reproduce the issue by applying this patch and running {{ant test -Dtestcase=TestSloppyPhraseQuery -Dtests.codec=Asserting}}. SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 -- Key: LUCENE-4888 URL: https://issues.apache.org/jira/browse/LUCENE-4888 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand Attachments: LUCENE-4888.patch SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although the behavior of this method is undefined in such cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4752: Attachment: LUCENE-4752-2.patch Patch: - fixes the issue by allowing OneMerges to return a doc map that translates doc IDs to their new value so that IndexWriter can commit merged deletes, - TestSortingMergePolicy has been modified to make deletions more likely to happen concurrently with a merge. Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
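The role of the doc map can be shown with a toy example. This is a sketch under assumptions, not the patch: DocMapSketch and applyDeletes are hypothetical names, and the merged segment is reduced to a boolean live-docs array. Deletes that arrive during the merge are computed as if merged doc IDs had been assigned sequentially, so each one is translated through the map before it is applied to the sorted segment.

```java
import java.util.Arrays;

// Hypothetical sketch; not the IndexWriter code.
final class DocMapSketch {

  /**
   * @param deletedSequentialIds IDs of deleted docs, computed as if the merged
   *        segment had assigned doc IDs sequentially
   * @param docMap docMap[sequentialId] is the doc's actual ID in the sorted segment
   * @return live docs of the sorted segment (false means deleted)
   */
  static boolean[] applyDeletes(int[] deletedSequentialIds, int[] docMap) {
    boolean[] liveDocs = new boolean[docMap.length];
    Arrays.fill(liveDocs, true);
    for (int sequentialId : deletedSequentialIds) {
      liveDocs[docMap[sequentialId]] = false;
    }
    return liveDocs;
  }
}
```

With an identity map this degenerates to the sequential behavior, which is why the problem only shows up with a merge policy that reorders documents.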
[jira] [Created] (SOLR-4647) Grouping is broken on docvalues-only fields
Adrien Grand created SOLR-4647: Summary: Grouping is broken on docvalues-only fields Key: SOLR-4647 URL: https://issues.apache.org/jira/browse/SOLR-4647 Project: Solr Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand There are a few places where grouping uses FieldType.toObject(SchemaField.createField(String, float)) to translate a String field value to an Object. The problem is that createField returns null when the field is neither stored nor indexed, even if it has doc values. An option to fix it could be to use the ValueSource instead to resolve the Object value (similarly to NumericFacets).
[jira] [Assigned] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4876: Assignee: Adrien Grand IndexWriterConfig.clone should clone the MergeScheduler Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
[jira] [Updated] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4876: Attachment: LUCENE-4876.patch Patch:
* MergeScheduler implements Cloneable
* IndexDeletionPolicy is now an abstract class (so that it can provide a default clone impl) and implements Cloneable
* InfoStream implements Cloneable (there is no need for this today, but I assumed that some people might be interested in displaying line numbers or other things that would require adding state to the InfoStream; I have no strong feeling about it and can remove it if you think it shouldn't implement Cloneable)
* MergeSchedulers and IndexDeletionPolicies have been fixed so that clones don't share state with the instance they've been cloned from
* IndexWriterConfig clones mergeScheduler and delPolicy (in addition to mergePolicy, flushPolicy and indexerThreadPool, which were already cloned)
* Most of the patch changes are due to the fact that many tests assumed that the IndexDeletionPolicy instance passed to IndexWriterConfig was the same one as the one used by IndexWriter (which is no longer true, since IndexWriter clones the provided config in its constructor and we now clone del policies in IndexWriterConfig.clone).
IndexWriterConfig.clone should clone the MergeScheduler Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
[jira] [Commented] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
[ https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615496#comment-13615496 ] Adrien Grand commented on LUCENE-4888: -- May someone confirm that the assertions I added to AssertingDocsAndPositionsEnum are correct (meaning there is actually a bug in SloppyPhraseScorer)? SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 -- Key: LUCENE-4888 URL: https://issues.apache.org/jira/browse/LUCENE-4888 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand Attachments: LUCENE-4888.patch SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although the behavior of this method is undefined in such cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615534#comment-13615534 ] Adrien Grand commented on LUCENE-4752: Thank you for the review Mike, I hope it will pass tests now! Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Resolved] (LUCENE-4752) Merge segments to sort them
[ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4752. Resolution: Fixed Merge segments to sort them Key: LUCENE-4752 URL: https://issues.apache.org/jira/browse/LUCENE-4752 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: David Smiley Assignee: Adrien Grand Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together. This often applies to documents near each other in time, but also spatially.
[jira] [Resolved] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays
[ https://issues.apache.org/jira/browse/LUCENE-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4875. -- Resolution: Fixed Make SorterTemplate.mergeSort run in linear time on sorted arrays - Key: LUCENE-4875 URL: https://issues.apache.org/jira/browse/LUCENE-4875 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.3 Attachments: LUCENE-4875.patch Through minor modifications, SorterTemplate.mergeSort could run in linear time on sorted arrays, so I think we should do it? The idea is to modify merge so that it returns instantly when compare(pivot-1, pivot) <= 0.
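The idea described in LUCENE-4875 can be sketched as follows. This is only an illustration on a plain int array, not the SorterTemplate code or the attached patch: the class name, the temporary buffer and the comparison counter are all assumptions made for the example. The merge step returns immediately when the last element of the left run is not greater than the first element of the right run, so an already sorted input costs one comparison per merge.

```java
// Hypothetical illustration; this is not Lucene's SorterTemplate.
final class EarlyExitMergeSort {
  // Number of element comparisons performed by the last call to sort().
  static long comparisons = 0;

  static void sort(int[] a) {
    comparisons = 0;
    sort(a, 0, a.length, new int[a.length]);
  }

  private static void sort(int[] a, int lo, int hi, int[] tmp) {
    if (hi - lo < 2) {
      return; // zero or one element: nothing to do
    }
    int pivot = (lo + hi) >>> 1;
    sort(a, lo, pivot, tmp);
    sort(a, pivot, hi, tmp);
    merge(a, lo, pivot, hi, tmp);
  }

  private static void merge(int[] a, int lo, int pivot, int hi, int[] tmp) {
    comparisons++;
    if (a[pivot - 1] <= a[pivot]) {
      return; // both runs are already in order: skip the merge entirely
    }
    System.arraycopy(a, lo, tmp, lo, hi - lo);
    int i = lo, j = pivot, k = lo;
    while (i < pivot && j < hi) {
      comparisons++;
      // Take from the right run only when strictly smaller, which keeps the sort stable.
      a[k++] = tmp[j] < tmp[i] ? tmp[j++] : tmp[i++];
    }
    while (i < pivot) {
      a[k++] = tmp[i++];
    }
    while (j < hi) {
      a[k++] = tmp[j++];
    }
  }
}
```

On a sorted array of n elements this sketch performs n - 1 comparisons, one per merge call, instead of the usual O(n log n).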
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615741#comment-13615741 ] Adrien Grand commented on LUCENE-4858: -- bq. I am thinking for some time on segment-level metadata. Something like SegmentInfo.attributes(). I agree that something like SegmentInfo.attributes would be helpful but why not SegmentInfo.attributes themselves? (I'm not trying to push for it, just curious what their use-cases are, they seem to be unused today?) Early termination with SortingMergePolicy - Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated.
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615752#comment-13615752 ] Adrien Grand commented on LUCENE-4858: -- bq. Why is additional metadata necessary? Isn't SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment was created via a flush or a merge... maybe a little evil but the data is already there. It looks good, I hadn't noticed that we store this information in the diagnostics, thanks! Early termination with SortingMergePolicy - Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated.
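The diagnostics check suggested above could look roughly like this. The sketch is self-contained and uses a plain Map in place of the map returned by SegmentInfo.getDiagnostics(); the "source" key comes from the comment, while the "merge" and "flush" values and the class and method names are assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical helper; the Map stands in for SegmentInfo.getDiagnostics().
final class SegmentSource {
  // Returns true when the diagnostics say the segment was produced by a merge.
  // Assumption: the "source" entry holds "merge" for merged segments and
  // another value such as "flush" for flushed ones.
  static boolean createdByMerge(Map<String, String> diagnostics) {
    return diagnostics != null && "merge".equals(diagnostics.get("source"));
  }
}
```

A collector that trusts SortingMergePolicy would only treat a segment as sorted when this check returns true, and would fall back to regular collection otherwise.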
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615839#comment-13615839 ] Adrien Grand commented on LUCENE-4876: -- bq. Does PersistentSnapshotDeletionPolicy need clone() too? At first, I thought about making its clone() method throw an exception but we can't because the IndexWriter constructor always clones the provided IndexWriterConfig. I'll add warnings about sharing in the javadocs. IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
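To illustrate why the scheduler has to be cloned along with the config, here is a minimal sketch with stand-in classes. Scheduler and Config are hypothetical and only mimic the shape of ConcurrentMergeScheduler and IndexWriterConfig as described in the issue; thread names are plain strings to keep the example runnable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not the Lucene classes.
final class CloneSketch {

  static final class Scheduler implements Cloneable {
    // Mutable state tracking running merges; must not be shared between clones.
    List<String> mergeThreads = new ArrayList<>();

    @Override
    public Scheduler clone() {
      try {
        Scheduler copy = (Scheduler) super.clone();
        copy.mergeThreads = new ArrayList<>(); // a clone starts with no running merges
        return copy;
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }

  static final class Config implements Cloneable {
    Scheduler scheduler = new Scheduler();

    @Override
    public Config clone() {
      try {
        Config copy = (Config) super.clone();
        // Without this line both configs would point at the same scheduler.
        copy.scheduler = scheduler.clone();
        return copy;
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }
}
```

With the deep clone in place, registering a merge thread on one config's scheduler leaves the other config's scheduler untouched.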
[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4858: - Attachment: LUCENE-4858.patch Here is a first patch:
* New convenient abstract collector class: EarlyTerminationCollector which makes no assumption about the readers it collects (it relies on sub-classes in order to know whether the collected context is sorted and how many docs should be collected at most).
* New collector: SortingMergePolicyCollector that assumes that segments that result from a merge are sorted (to do so it inspects the diagnostics of the SegmentInfo). I named it this way to make it clear it needs to be used with SortingMergePolicy.
* I made SegmentReader.getSegmentInfo public (instead of pkg-private) to be able to read the diagnostics. Is it OK to do so/Is there a cleaner way to expose diagnostics to high-level APIs?
Early termination with SortingMergePolicy - Key: LUCENE-4858 URL: https://issues.apache.org/jira/browse/LUCENE-4858 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 Attachments: LUCENE-4858.patch Spin-off of LUCENE-4752, see https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565 and https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282 When an index is sorted per-segment, queries that sort according to the index sort order could be early terminated.
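The early-termination idea can be sketched without any Lucene types. In the example below each segment is an int array of sort keys, a boolean flag says whether the segment is known to be sorted, and every document is assumed to match the query. The class name, the method and the visited counter are hypothetical and are not the collectors from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of per-segment early termination.
final class EarlyTerminationSketch {
  // Number of documents actually looked at by the last call to topN().
  static int visited;

  // Returns the n smallest keys across all segments, in ascending order.
  static List<Integer> topN(int[][] segments, boolean[] sorted, int n) {
    visited = 0;
    // Max-heap that retains the n smallest keys seen so far.
    PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
    for (int s = 0; s < segments.length; s++) {
      int collected = 0;
      for (int key : segments[s]) {
        if (sorted[s] && collected >= n) {
          break; // sorted segment: the remaining docs cannot be competitive
        }
        visited++;
        collected++;
        heap.add(key);
        if (heap.size() > n) {
          heap.poll(); // drop the current largest key
        }
      }
    }
    List<Integer> result = new ArrayList<>(heap);
    Collections.sort(result);
    return result;
  }
}
```

Sorted segments contribute at most n documents each, while unsorted (for instance freshly flushed) segments still have to be collected in full.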
[jira] [Updated] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4876: - Attachment: LUCENE-4876.patch New patch:
* Added CHANGES entries
* Added documentation to PersistentSnapshotDeletionPolicy to make clear that instances of this class must not be shared across IndexWriters
* Some Solr tests were failing because Solr expects SolrCore.solrDelPolicy to be the same instance as IndexWriter.getConfig().getIndexDeletionPolicy(). There is sensitive code relying on it (SnapShooter/ReplicationHandler in particular) so I preferred emulating the old behavior by making IndexDeletionPolicyWrapper.clone() return 'this' for the moment. This is not a problem because each core has its own private deletion policy and never opens more than one IndexWriter with it.
IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
[jira] [Assigned] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
[ https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4888: Assignee: Adrien Grand SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 -- Key: LUCENE-4888 URL: https://issues.apache.org/jira/browse/LUCENE-4888 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-4888.patch SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although the behavior of this method is undefined in such cases.
[jira] [Updated] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
[ https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4888: - Attachment: LUCENE-4888.patch Patch that combines the assertions from the previous patch with new bug fixes:
- SloppyPhraseScorer.advance
- MultiDocs(AndPositions)Enum.advance
- MultiSpansWrapper.skipTo
These three methods relied on the assumption that advance(target) is equivalent to nextDoc() when target is <= the current position (which is wrong, although all our impls behave this way). SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 -- Key: LUCENE-4888 URL: https://issues.apache.org/jira/browse/LUCENE-4888 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.2 Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-4888.patch, LUCENE-4888.patch SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although the behavior of this method is undefined in such cases.
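The contract at stake can be illustrated with a small stand-alone iterator. DocIterSketch below is a hypothetical class over a sorted int array, not DocsAndPositionsEnum: its advance() rejects any target that is not beyond the current document, and the advanceOrNext helper (also an assumption made for the example) shows how a caller can avoid relying on advance(target) behaving like nextDoc().

```java
// Hypothetical iterator; it only mimics the nextDoc()/advance() shape.
final class DocIterSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // doc IDs in increasing order
  private int idx = -1;
  private int doc = -1;     // -1 means "not positioned yet"

  DocIterSketch(int... sortedDocs) {
    this.docs = sortedDocs;
  }

  int docID() {
    return doc;
  }

  int nextDoc() {
    idx++;
    doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    return doc;
  }

  // Moves to the first doc >= target. Targets at or before the current doc
  // are rejected here to make the undefined case visible.
  int advance(int target) {
    if (target <= doc) {
      throw new IllegalArgumentException("target=" + target + " must be > current doc=" + doc);
    }
    int d;
    do {
      d = nextDoc();
    } while (d < target);
    return d;
  }

  // Caller-side guard: only use advance() when the target is really ahead.
  static int advanceOrNext(DocIterSketch it, int target) {
    return target > it.docID() ? it.advance(target) : it.nextDoc();
  }
}
```

With an unpositioned iterator, advanceOrNext(it, -1) falls back to nextDoc() instead of calling advance(-1).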
[jira] [Commented] (LUCENE-4877) Fix analyzer factories to throw exception when arguments are invalid
[ https://issues.apache.org/jira/browse/LUCENE-4877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617472#comment-13617472 ] Adrien Grand commented on LUCENE-4877: -- +1 Fix analyzer factories to throw exception when arguments are invalid Key: LUCENE-4877 URL: https://issues.apache.org/jira/browse/LUCENE-4877 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Attachments: LUCENE-4877_one_solution_prototype.patch Currently if someone typos an argument someParamater=xyz instead of someParameter=xyz, they get no exception and sometimes incorrect behavior. It would be way better if these factories threw an exception on unknown params, e.g. they removed the args they used and checked they were empty at the end.
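The "remove the args you used, then check nothing is left" approach from the description could be sketched like this. StrictFactorySketch, its maxLength parameter and the default value are hypothetical and chosen only for the example; they are not taken from the prototype patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory showing strict argument checking.
final class StrictFactorySketch {
  final int maxLength;

  StrictFactorySketch(Map<String, String> args) {
    // Work on a copy so the caller's map is left untouched.
    Map<String, String> remaining = new HashMap<>(args);

    // Consume every parameter this factory understands.
    String value = remaining.remove("maxLength");
    this.maxLength = value == null ? 255 : Integer.parseInt(value);

    // Anything still present was not recognized, e.g. a misspelled name.
    if (!remaining.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + remaining);
    }
  }
}
```

A misspelled key such as maxLenght then fails fast at construction time instead of being silently ignored.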
[jira] [Created] (SOLR-4654) Integrate Lucene's sorting and early query termination capabilities into Solr
Adrien Grand created SOLR-4654: -- Summary: Integrate Lucene's sorting and early query termination capabilities into Solr Key: SOLR-4654 URL: https://issues.apache.org/jira/browse/SOLR-4654 Project: Solr Issue Type: Improvement Reporter: Adrien Grand Priority: Trivial I think there would be some interesting work to do to integrate Lucene's sorting and early query termination capabilities into Solr, in particular (just ideas, maybe they're not all interesting/useful):
- configuring a SortingMergePolicy,
- figuring out when the sort order of queries matches the sort order of the index segments,
- giving the ability to get approximated results when the query is not sorted but only boosted by the sort order of the index,
- integration with TimeLimitingCollector: maybe it's better to collect only half of all segments than to fully collect half of the segments,
- approximation of the number of matches based on the ratio of collected documents,
- ...
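As a rough illustration of the last idea in the list, the total number of matches could be extrapolated from the fraction of documents that were actually visited before collection stopped. The class and method below are hypothetical, and the estimate assumes that matches are spread evenly across the index, which may not hold for a sorted index.

```java
// Hypothetical helper for extrapolating a hit count after early termination.
final class HitCountEstimate {
  // matchesCollected: hits seen before collection stopped
  // docsVisited: documents examined before collection stopped
  // totalDocs: number of documents in the index
  static long estimate(long matchesCollected, long docsVisited, long totalDocs) {
    if (docsVisited <= 0) {
      return 0; // nothing was visited, so there is nothing to extrapolate from
    }
    return Math.round((double) matchesCollected * totalDocs / docsVisited);
  }
}
```

For example, 50 matches found in the first 1,000 of 10,000 documents extrapolate to about 500 matches overall.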