[ 
https://issues.apache.org/jira/browse/LUCENE-7618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-7618:
-----------------------------
    Attachment: LUCENE-7618.patch


*TL;DR:* based on some half-assed benchmarking using junit timings, the results 
did *NOT* show any improvements -- for some seeds (on my laptop) the new code 
was actually slower at times, leading me to suspect I may have a bug somewhere 
in either the test or the code, but I haven't had time to dig into it.

----

*Background...*

Just before I went on vacation, I had a conversation with someone who asked a 
question that can be summarized as:

bq. If I already know the smallest possible value ({{minF}}) in the field 
{{foo}}, which is faster: {{foo:\[\* TO X\]}} or {{foo:\[minF TO X\]}} ?  
_(where X will be diff in every query)_

My answer, based on my recollection of how simple term-indexed range queries 
used to work, was that any difference should be negligible -- in the {{\*}} case 
lucene can skip the call to {{seekTo(minF)}}, but otherwise the execution 
would be identical after that.  For DocValues, I speculated that using {{\*}} 
should theoretically be faster, because as the code iterates over the DocValues, 
it would only have to compare each doc's value to {{X}}, and not to {{minF}}.

When I dug into the code, I found that wasn't the case: while query rewriting 
has a special case optimization for {{foo:\[\* TO \*\]}}, there are no special 
case optimizations for a range query open on one end -- instead the 
TwoPhaseIterator created for SortedNumericDocValues assigns 
Long.MIN_VALUE/Long.MAX_VALUE as needed anytime min/max are null, and does 2 
comparisons per doc value.  Likewise the codepath for SortedSetDocValues 
assigns minOrd ({{0}}) and maxOrd ({{values.getValueCount() - 1}}) when the 
user-specified min/max are null, and again does 2 comparisons per value.
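
To illustrate the current behavior described above, here is a hypothetical, stripped-down sketch (not the actual Lucene source; the class and method names are made up) of how a null bound gets widened to Long.MIN_VALUE/Long.MAX_VALUE, so every doc value is still compared against *both* ends of the range even when one end is open:

```java
// Hypothetical sketch of the "widen null bounds" approach described above.
// Names are invented for illustration; this is NOT the Lucene code itself.
public class RangeCheckSketch {
    // mimics how DocValuesRangeQuery substitutes extreme values for null bounds
    public static boolean inRange(long value, Long min, Long max) {
        long lo = (min == null) ? Long.MIN_VALUE : min;
        long hi = (max == null) ? Long.MAX_VALUE : max;
        // two comparisons per value, even when one bound is open
        return lo <= value && value <= hi;
    }

    public static void main(String[] args) {
        System.out.println(inRange(5L, null, 10L));  // open lower bound
        System.out.println(inRange(15L, null, 10L));
    }
}
```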

*Theory...*

Adding special case handling for open ended range queries, so the returned 
TwoPhaseIterators only do one comparison in these special cases, 
seem(ed|s) like an easy win?  In particular for the case of SortedSetDocValues: 
Even if the original query has non-null min & max values, after calling 
{{lookupTerm(...)}} on each we might find that one of them is completely out of 
range _for individual segments_ and optimize away one comparison on a 
per-segment level.  (an existing optimization already handled the case where 
the min/max were both completely below/above the range of values in the segment)
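
A rough sketch of the special-casing I had in mind (hypothetical code, not from the patch verbatim): pick a predicate up front based on which bounds are null, so the open-ended cases pay for only one comparison per value instead of two:

```java
import java.util.function.LongPredicate;

// Hypothetical sketch of the proposed optimization; names are invented.
public class OpenRangeSketch {
    public static LongPredicate rangePredicate(Long min, Long max) {
        if (min == null && max == null) {
            return v -> true;                            // fully open: match all
        }
        if (min == null) {
            final long hi = max;
            return v -> v <= hi;                         // one comparison
        }
        if (max == null) {
            final long lo = min;
            return v -> lo <= v;                         // one comparison
        }
        final long lo = min, hi = max;
        return v -> lo <= v && v <= hi;                  // general two-sided case
    }
}
```

The same idea would apply to the SortedSet ord case: if {{lookupTerm(...)}} shows a bound is at or beyond minOrd/maxOrd for a given segment, that segment could use one of the single-comparison predicates.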

*Attempt...*

Step #1 was writing a {{TestDocValuesRangeQueryOptimizations}} test (see patch) 
against the existing code that built up an index with identical values in 3 
different fields (SortedNumeric, SortedSet, and Points) and then did a lot of 
randomized, open ended, range queries against that index, one field per test 
method.  

Once I had the basics of the test written, I then ran the test over and over 
with the same seed and noted down the junit test times for each.

I then added what seemed like the most straight forward optimizations to 
{{DocValuesRangeQuery}}, and re-ran the test a few times with the same seed.
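
In hindsight, junit wall-clock times are a pretty noisy measurement; something like the sketch below (hypothetical helper, not in the patch) that takes the best of several repetitions would at least reduce the noise from other stuff running on the machine:

```java
// Hypothetical micro-timing helper; takes the minimum over several runs
// to dampen scheduling/GC noise.  A real benchmark should use JMH instead.
public class TimingSketch {
    public static long bestOfNanos(Runnable work, int reps) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            work.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }
}
```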

*Results...*

The results were not good.  Looking back, I didn't even bother to save the tmp 
file where I had been pasting the junit times -- that's how not good they were. 
 Even accounting for potential "noise" in the results from other stuff running 
on my laptop, there was no inkling of any speedup in the new code (as compared 
to the test against the points field, which didn't involve any modified code 
paths) and IIRC the results were occasionally slower.


I set all this aside to come back to later to try and figure out if I'd made 
any obvious mistakes, but skimming the code today, and writing up my 
recollections, nothing jumps out at me.  My best guess is that HotSpot is 
already optimizing away some of these comparisons, and that the new code just 
complicates things for HotSpot to the point that it's sometimes slower.  I 
don't plan on taking this any further, but I wanted to open the issue to track 
the idea in case anyone else sees value in pursuing it further.






> Hypothetical perf improvements in DocValuesRangeQuery: reducing comparisons 
> for some queries/segments
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7618
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7618
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Hoss Man
>         Attachments: LUCENE-7618.patch
>
>
> In reviewing the DocValuesRangeQuery code, it occurred to me that there 
> _might_ be some potential performance optimizations possible in a few cases 
> relating to queries that involve explicitly specified open ranges (ie: min or 
> max are null) or, in the case of SortedSet, range queries that are 
> *effectively* open ended on particular segments, because the min/max are 
> below/above the minOrd/maxOrd for the segment.
> Since these seemed like semi-common situations (open ended range queries are 
> fairly common in my experience; I'm not sure about the secondary SortedSet 
> "ord" case, but it seemed potentially promising, particularly for fields like 
> incrementing ids, or timestamps, where values are added sequentially and 
> likely to be clustered together) I did a bit of experimenting and wanted to 
> post my findings in jira -- patch & details to follow in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
