[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Ignacio Vera (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Ignacio Vera commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 I think you are assuming that we always visit all terms in the index. The iteration might be driven by the result of a query that contains a subset of the documents. In pseudo-code would look like: 

 

DocIdSetIterator iterator = executeQuery();
SortedDocValues sortedDocValues = getSortedDocValues();
int doc = iterator.nextDoc();
while (doc != DocIdSetIterator.NO_MORE_DOCS) {
if (sortedDocValues.advanceExact(doc)) {
BytesRef bytesRef = sortedDocValues.lookupOrd(sortedDocValues.ordValue());
consume(bytesRef);
// advance our iterator to the next document with different ordinal
doc = iterator.advance(sortedDocValues.advanceOrd());
} else {
doc = iterator.nextDoc();
}
}
 

 I don't see how to do this efficiently with the inverted index only.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 In fact, i don't see any need to involve docvalues at all for this feature. You can go thru terms dict and just read the first docID() of the first posting for each term and then you know how the terms map to these docid "ranges". No need to read the whole postings list, just get the first doc, since the index is sorted. No need to consult docvalues at all. This could be another way to implement whatever it is you are doing, that's what i referred to as "XY" problem. It seems to me, for mapping terms to ranges of documents for a sorted index, that the inverted index is the correct data structure. But maybe again, we need to optimize something here so that getting that first doc is even faster for dense terms for this use case: e.g. inlining SingletonDocID (delta) for terms when the index is sorted on the field, so that its all super-efficient solely from the terms dictionary.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 I don't understand why you are starting from an ordinal at all? it seems a bit of an XY problem. Doesn't the task start with a term (e.g. bytes or text)? the two terms dictionaries are "aligned". if you are just gonna next() thru the terms of the terms dict, you can make int ordinal = 0; and just do ordinal++ after processing each term. then you don't need to do any ordinal lookups anywhere.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Ignacio Vera (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Ignacio Vera commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 More exactly, the SortedDocValues iterator might have been advanced to a different ordinal.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Ignacio Vera (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Ignacio Vera commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 > The slowness is probably the lookupOrd? Can you avoid this? Just next() the termsenum to move on to the next ord. Not really because the call to advance can move the SortedDocValues iterator to a different ordinal. We need to position our TermsEnum in each call. That's one of the advantage of my proposal as everything advances together. > I'd also modify the call to termsEnum.postings() to be termsEnum.postings(postingsEnum, PostingEnum.NONE) I will try that and I will report back.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 I'd also modify the call to termsEnum.postings() to be termsEnum.postings(postingsEnum, PostingEnum.NONE). Depending on your data, it might not do anything, but you don't need frequencies so it is ok to skip over them rather than decode them.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 The slowness is probably the lookupOrd? Can you avoid this? Just next() the termsenum to move on to the next ord.   
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Ignacio Vera (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Ignacio Vera commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 Here is the code I am using when using postings/TermsEnum from the inverted index which might be totally wrong / inefficient as unfortunately I am not an expert in this area: 

 

// position the terms enum on the current term
termsEnum.seekExact(sortedDocValues.lookupOrd(sortedDocValues.ordValue()));
// advance
if (termsEnum.next() == null) {
doc = sortedDocValues.advance(DocIdSetIterator.NO_MORE_DOCS);
} else {
termsEnum.postings(postingsEnum);
doc = sortedDocValues.advance(postingsEnum.nextDoc());
}

 

 This code performs ok for lower cardinality but it becomes slow for high cardinality. Similar to what I have done in the linked PR, I have indexed 50 million documents in a sorted index. The documents contain a SortedDocValues with a 10 bytes term and the term is indexed using a StringField as well. I checked the index size and the speed of retrieving the first document per term with different cardinalities and the results looks like:  1000 cardinality INDEX SIZE: 6.110762596130371 MB Average: 0.00264192705 seconds  1 cardinality INDEX SIZE: 22.32368278503418 MB Average: 0.01891452695003 seconds  10 cardinality INDEX SIZE: 86.16338920593262 MB Average: 0.1449143188 seconds  50 cardinality INDEX SIZE: 108.61055660247803 MB Average: 0.4431338875005 seconds  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 

[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 Also if performance changes are minor between the two solutions, perhaps we could speed terms/postings up for the sorted case to close the gap. For example, in the sorted case we could consider always writing SingletonDocID, and because of sorting, it could be delta-encoded.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 do we have any idea of the comparison? I'm just curious because it seems like doing TermsEnum.next() and getting first doc ID should be relatively optimized. The nice thing about it, is that it can already be done today without adding additional datastructures and APIs. The docvalues advanceOrd is a bit of a mismatch for column data structure, it seems like an inverted structure is more appropriate for this.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Ignacio Vera (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Ignacio Vera commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 If I understand you correct, you mean leveraging the inverted index to get the first document per term.  I tried that and my conclusion was that it was slower than manually iterate the doc values.  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-28 Thread Robert Muir (Jira)
Title: Message Title


 
 
 
 

 
 
 

 
   
 Robert Muir commented on  LUCENE-10396  
 

  
 
 
 
 

 
 
  
 
 
 
 

 
  Re: Automatically create sparse indexes for sort fields   
 

  
 
 
 
 

 
 If you just need the first document with the each value, why not use postings/TermsEnum?  
 

  
 
 
 
 

 
 
 

 
 
 Add Comment  
 

  
 

  
 
 
 
  
 

  
 
 
 
 

 
 This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)  
 
 

 
   
 

  
 

  
 

   



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-24 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558474#comment-17558474
 ] 

Ignacio Vera commented on LUCENE-10396:
---

I open a draft PR that shows the idea I exposed above.

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-23 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558103#comment-17558103
 ] 

Ignacio Vera commented on LUCENE-10396:
---

I have been thinking on the ability if visiting a single documents per field 
value as explained for Adrien above and I think we can implement it in a not 
intrusive way for SortedDocValues that have low cardinality. The idea is to add 
the following method to the SortedDocValues api with a default implementation:

{code}
  /**
   * Advances the iterator the the next document whose ordinal is distinct to 
the current ordinal
   * and returns the document number itself. Exhausts the iterator and returns 
{@link
   * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one.
   *
   * The behaviour of this method is undefined when called with 
 target  current
   * , or after the iterator has exhausted. Both cases may result in 
unpredicted behaviour.
   *  
   * The default implementation just iterates over the documents and manually 
checks if the ordinal has changed but
   *  but some implementations are  considerably more efficient than that.
   *
   */
  public int advanceOrd() throws IOException {
int doc = docID();
if (doc == DocIdSetIterator.NO_MORE_DOCS) {
  return doc;
}
final long ord = ordValue();
do  {
  doc = nextDoc();
} while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
assert doc == docID();
return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of he index and 
the cardinality is low (average of documents per field >64?), then we will use 
a DirectMonotonicWriter to write the offset for each ord which should not use 
too much disk space. When producing the doc value, we will override this method 
with a faster DirectMonotonicReader implementation.




> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-08 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551429#comment-17551429
 ] 

Adrien Grand commented on LUCENE-10396:
---

Another potential use-case we are interested in for Elasticsearch would be to 
have the ability to visit a single document per field value for a field that is 
the primary index sort. This has a few applications, one of them is to compute 
the number of unique values of this primary sort field for documents that match 
a query. The collector could implement {{LeafCollector#competitiveIterator}} by 
using the sparse index to skip all documents that have the same value as the 
last collected hit.

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-02-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485378#comment-17485378
 ] 

Adrien Grand commented on LUCENE-10396:
---

I'm also wondering if this could help better support pre-filtering with NN 
vectors. We could use these sparse indexes to produce Bits representations of 
the set of matching doc IDs that consist of a union of ranges, which we could 
then pass to the HNSW search logic as live docs?

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-02-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485355#comment-17485355
 ] 

Adrien Grand commented on LUCENE-10396:
---

To clarify, I was thinking of the case when a user sorts on a single field at 
search time, which happens to be the secondary sort field on the index sort.

Good question about range facets. Maybe it could help a bit in the sense that 
we could use this data structure to know that all docs within a range of doc 
IDs would go to a given bucket, and then we'd just need to count these 
documents, we wouldn't even have to look at their associated value.

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-02-01 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485292#comment-17485292
 ] 

Robert Muir commented on LUCENE-10396:
--

thanks [~jpountz]. I like that there is a sort use-case as well, although it is 
still an advanced/complex case (multi-field sort). What about facets (range or 
otherwise)?

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-02-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485229#comment-17485229
 ] 

Adrien Grand commented on LUCENE-10396:
---

Actually I think it would help make things faster not slower. 
IndexSortSortedNumericDocValuesRangeQuery already outperforms PointRangeQuery 
for ranges that match lots of documents, even more if the range query doesn't 
run on its own but within a conjunction. I would like to generalize this 
mechanism for secondary and tertiary sorts when the previous sort fields have a 
low cardinality (e.g. an e-commerce index sorted by category then price). And 
fold this logic into our query factory methods and top field collectors so that 
range queries would automatically take advantage of index sorting when they see 
it.

I think we'd almost be guaranteed to make queries faster by leveraging this 
sparse index in the following cases:
 - If a range query targets the primary sort field, then use 
IndexSortSortedNumericDocValuesRangeQuery (hopefully enhanced with this sparse 
index to be less I/O-intensive).
 - If a range query targets the N-th sort field (N > 0), then return the usual 
IndexOrDocValuesQuery where the doc values query takes advantage of the sparse 
index to jump over ranges of doc IDs that have no chance of matching.
 - If the query is sorted by the N-th sort field (N > 0), use the sparse index 
to produce a good {{LeafCollector#competitiveIterator()}} that jumps over range 
of doc IDs that have no chance of matching because their sort value is below 
the current top value of the heap.

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-02-01 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485178#comment-17485178
 ] 

Robert Muir commented on LUCENE-10396:
--

an entire new file format abstraction for this one slow range query? is there 
any other use-case for this data? I'm concerned about complexity, this seems 
VERY specialized.

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-01-31 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484830#comment-17484830
 ] 

Adrien Grand commented on LUCENE-10396:
---

One interesting question about this will be the read API. What I have in mind 
for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator()
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return a range of doc IDs that may contain {@code value} as the value of 
the sortPos-th sort field.
   * All documents between target included and {@code range.first} excluded are 
guaranteed to not
   * contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used 
to signal that no more
   * documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /**
   * Likewise for SORTED doc values.
   */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /**
   * Likewise for BINARY doc values.
   */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);

}
{code}

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org