[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Ignacio Vera commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields I think you are assuming that we always visit all terms in the index. The iteration might be driven by the result of a query that contains a subset of the documents. In pseudo-code would look like: DocIdSetIterator iterator = executeQuery(); SortedDocValues sortedDocValues = getSortedDocValues(); int doc = iterator.nextDoc(); while (doc != DocIdSetIterator.NO_MORE_DOCS) { if (sortedDocValues.advanceExact(doc)) { BytesRef bytesRef = sortedDocValues.lookupOrd(sortedDocValues.ordValue()); consume(bytesRef); // advance our iterator to the next document with different ordinal doc = iterator.advance(sortedDocValues.advanceOrd()); } else { doc = iterator.nextDoc(); } } I don't see how to do this efficiently with the inverted index only. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields In fact, i don't see any need to involve docvalues at all for this feature. You can go thru terms dict and just read the first docID() of the first posting for each term and then you know how the terms map to these docid "ranges". No need to read the whole postings list, just get the first doc, since the index is sorted. No need to consult docvalues at all. This could be another way to implement whatever it is you are doing, that's what i referred to as "XY" problem. It seems to me, for mapping terms to ranges of documents for a sorted index, that the inverted index is the correct data structure. But maybe again, we need to optimize something here so that getting that first doc is even faster for dense terms for this use case: e.g. inlining SingletonDocID (delta) for terms when the index is sorted on the field, so that its all super-efficient solely from the terms dictionary. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields I don't understand why you are starting from an ordinal at all? it seems a bit of an XY problem. Doesn't the task start with a term (e.g. bytes or text)? the two terms dictionaries are "aligned". if you are just gonna next() thru the terms of the terms dict, you can make int ordinal = 0; and just do ordinal++ after processing each term. then you don't need to do any ordinal lookups anywhere. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Ignacio Vera commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields More exactly, the SortedDocValues iterator might have been advanced to a different ordinal. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Ignacio Vera commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields > The slowness is probably the lookupOrd? Can you avoid this? Just next() the termsenum to move on to the next ord. Not really because the call to advance can move the SortedDocValues iterator to a different ordinal. We need to position our TermsEnum in each call. That's one of the advantage of my proposal as everything advances together. > I'd also modify the call to termsEnum.postings() to be termsEnum.postings(postingsEnum, PostingEnum.NONE) I will try that and I will report back. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields I'd also modify the call to termsEnum.postings() to be termsEnum.postings(postingsEnum, PostingEnum.NONE). Depending on your data, it might not do anything, but you don't need frequencies so it is ok to skip over them rather than decode them. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields The slowness is probably the lookupOrd? Can you avoid this? Just next() the termsenum to move on to the next ord. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Ignacio Vera commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields Here is the code I am using when using postings/TermsEnum from the inverted index which might be totally wrong / inefficient as unfortunately I am not an expert in this area: // position the terms enum on the current term termsEnum.seekExact(sortedDocValues.lookupOrd(sortedDocValues.ordValue())); // advance if (termsEnum.next() == null) { doc = sortedDocValues.advance(DocIdSetIterator.NO_MORE_DOCS); } else { termsEnum.postings(postingsEnum); doc = sortedDocValues.advance(postingsEnum.nextDoc()); } This code performs ok for lower cardinality but it becomes slow for high cardinality. Similar to what I have done in the linked PR, I have indexed 50 million documents in a sorted index. The documents contain a SortedDocValues with a 10 bytes term and the term is indexed using a StringField as well. I checked the index size and the speed of retrieving the first document per term with different cardinalities and the results looks like: 1000 cardinality INDEX SIZE: 6.110762596130371 MB Average: 0.00264192705 seconds 1 cardinality INDEX SIZE: 22.32368278503418 MB Average: 0.01891452695003 seconds 10 cardinality INDEX SIZE: 86.16338920593262 MB Average: 0.1449143188 seconds 50 cardinality INDEX SIZE: 108.61055660247803 MB Average: 0.4431338875005 seconds Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields Also if performance changes are minor between the two solutions, perhaps we could speed terms/postings up for the sorted case to close the gap. For example, in the sorted case we could consider always writing SingletonDocID, and because of sorting, it could be delta-encoded. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields do we have any idea of the comparison? I'm just curious because it seems like doing TermsEnum.next() and getting first doc ID should be relatively optimized. The nice thing about it, is that it can already be done today without adding additional datastructures and APIs. The docvalues advanceOrd is a bit of a mismatch for column data structure, it seems like an inverted structure is more appropriate for this. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Ignacio Vera commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields If I understand you correct, you mean leveraging the inverted index to get the first document per term. I tried that and my conclusion was that it was slower than manually iterate the doc values. Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
Title: Message Title Robert Muir commented on LUCENE-10396 Re: Automatically create sparse indexes for sort fields If you just need the first document with the each value, why not use postings/TermsEnum? Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558474#comment-17558474 ] Ignacio Vera commented on LUCENE-10396: --- I open a draft PR that shows the idea I exposed above. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > Time Spent: 10m > Remaining Estimate: 0h > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558103#comment-17558103 ] Ignacio Vera commented on LUCENE-10396: --- I have been thinking on the ability if visiting a single documents per field value as explained for Adrien above and I think we can implement it in a not intrusive way for SortedDocValues that have low cardinality. The idea is to add the following method to the SortedDocValues api with a default implementation: {code} /** * Advances the iterator the the next document whose ordinal is distinct to the current ordinal * and returns the document number itself. Exhausts the iterator and returns {@link * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one. * * The behaviour of this method is undefined when called with target current * , or after the iterator has exhausted. Both cases may result in unpredicted behaviour. * * The default implementation just iterates over the documents and manually checks if the ordinal has changed but * but some implementations are considerably more efficient than that. * */ public int advanceOrd() throws IOException { int doc = docID(); if (doc == DocIdSetIterator.NO_MORE_DOCS) { return doc; } final long ord = ordValue(); do { doc = nextDoc(); } while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord); assert doc == docID(); return doc; } {code} When consuming the doc values, if the field is the primary sort of he index and the cardinality is low (average of documents per field >64?), then we will use a DirectMonotonicWriter to write the offset for each ord which should not use too much disk space. When producing the doc value, we will override this method with a faster DirectMonotonicReader implementation. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551429#comment-17551429 ] Adrien Grand commented on LUCENE-10396: --- Another potential use-case we are interested in for Elasticsearch would be to have the ability to visit a single document per field value for a field that is the primary index sort. This has a few applications, one of them is to compute the number of unique values of this primary sort field for documents that match a query. The collector could implement {{LeafCollector#competitiveIterator}} by using the sparse index to skip all documents that have the same value as the last collected hit. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485378#comment-17485378 ] Adrien Grand commented on LUCENE-10396: --- I'm also wondering if this could help better support pre-filtering with NN vectors. We could use these sparse indexes to produce Bits representations of the set of matching doc IDs that consist of a union of ranges, which we could then pass to the HNSW search logic as live docs? > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485355#comment-17485355 ] Adrien Grand commented on LUCENE-10396: --- To clarify, I was thinking of the case when a user sorts on a single field at search time, which happens to be the secondary sort field on the index sort. Good question about range facets. Maybe it could help a bit in the sense that we could use this data structure to know that all docs within a range of doc IDs would go to a given bucket, and then we'd just need to count these documents, we wouldn't even have to look at their associated value. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485292#comment-17485292 ] Robert Muir commented on LUCENE-10396: -- thanks [~jpountz]. I like that there is a sort use-case as well, although it is still an advanced/complex case (multi-field sort). What about facets (range or otherwise)? > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485229#comment-17485229 ] Adrien Grand commented on LUCENE-10396: --- Actually I think it would help make things faster not slower. IndexSortSortedNumericDocValuesRangeQuery already outperforms PointRangeQuery for ranges that match lots of documents, even more if the range query doesn't run on its own but within a conjunction. I would like to generalize this mechanism for secondary and tertiary sorts when the previous sort fields have a low cardinality (e.g. an e-commerce index sorted by category then price). And fold this logic into our query factory methods and top field collectors so that range queries would automatically take advantage of index sorting when they see it. I think we'd almost be guaranteed to make queries faster by leveraging this sparse index in the following cases: - If a range query targets the primary sort field, then use IndexSortSortedNumericDocValuesRangeQuery (hopefully enhanced with this sparse index to be less I/O-intensive). - If a range query targets the N-th sort field (N > 0), then return the usual IndexOrDocValuesQuery where the doc values query takes advantage of the sparse index to jump over ranges of doc IDs that have no chance of matching. - If the query is sorted by the N-th sort field (N > 0), use the sparse index to produce a good {{LeafCollector#competitiveIterator()}} that jumps over range of doc IDs that have no chance of matching because their sort value is below the current top value of the heap. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485178#comment-17485178 ] Robert Muir commented on LUCENE-10396: -- an entire new file format abstraction for this one slow range query? is there any other use-case for this data? I'm concerned about complexity, this seems VERY specialized. > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields
[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484830#comment-17484830 ] Adrien Grand commented on LUCENE-10396: --- One interesting question about this will be the read API. What I have in mind for now looks something like this: {code:java} class SparseIndexProducer { DocIdRangeIterator getIterator() } class DocIdRange { int first, last; } class DocIdRangeIterator { /** * Return a range of doc IDs that may contain {@code value} as the value of the sortPos-th sort field. * All documents between target included and {@code range.first} excluded are guaranteed to not * contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used to signal that no more * documents may contain {@code value}. */ DocIdRange nextRangeNumeric(int target, int sortPos, long value); /** * Likewise for SORTED doc values. */ DocIdRange nextRangeSorted(int target, int sortPos, long ord); /** * Likewise for BINARY doc values. */ DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue); } {code} > Automatically create sparse indexes for sort fields > --- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org