[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries

2021-03-07 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296964#comment-17296964
 ] 

David Smiley commented on SOLR-14166:
-

CC [~yonik] [~jbernste] [~hossman] as possible reviewers for the attached PR. 
It is rather technical and touches code that few people have worked on, though 
all three of you have in some shape or form.  Please review the issue 
description and take a look at the PR.  Each commit in the PR is well isolated 
to what its commit message says, so you may prefer to go commit-by-commit, or 
you could just look at the thing as a whole.  In a comment above I pondered 
"Maybe we could make a wrapping query that wraps the underlying 
TPI.matchCost"; as you'll see in the PR, I did that.  The test validates that 
match() isn't called more often than it needs to be.  It used to be called 
more often, which is verifiable by copying the test to the 8x line (if I 
recall, it was called two additional times).  I suspect the test doesn't 
verify that MatchCostQuery is having an effect... I may need to think a bit 
more on how to do that.

I suspect someone will ask me if I did some performance tests.  No, I did not.  
My goal is the removal of tech debt (Filter), and in the process I expect some 
performance improvements that Filter was blocking.  So with this issue, anyone 
with non-cached filter queries may see a benefit, especially when those queries 
have TwoPhaseIterators (phrase queries, frange, spatial, and more).  The 
benefit may be more pronounced if the main query also has TPIs, because Lucene 
cleverly sees through the boolean queries to group the TPIs of required clauses 
in the tree.
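
To make the expected benefit concrete, here is a minimal sketch of two-phase 
conjunction in plain Java.  It does not use Lucene; the Filter class, its 
fields, and the example predicates are hypothetical stand-ins, and the doc-by-doc 
loop is a simplification of how a real conjunction advances its iterators.  The 
point is only the ordering: every cheap approximation must agree on a doc 
first, and only then are the expensive checks run, cheapest matchCost first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

final class TwoPhaseSketch {
    /** Hypothetical filter: a cheap approximation plus a costly confirmation step. */
    static final class Filter {
        final IntPredicate approximation; // cheap candidate check
        final IntPredicate confirm;       // expensive check, analogous to a TPI's match step
        final float matchCost;            // estimated cost of one confirm call
        int confirmCalls = 0;             // instrumentation for the example

        Filter(IntPredicate approximation, IntPredicate confirm, float matchCost) {
            this.approximation = approximation;
            this.confirm = confirm;
            this.matchCost = matchCost;
        }

        boolean matches(int doc) {
            confirmCalls++;
            return confirm.test(doc);
        }
    }

    /** Returns the docs accepted by all filters, confirming only agreed candidates. */
    static List<Integer> conjunction(int maxDoc, List<Filter> filters) {
        // Order the expensive checks so the cheapest one gets to reject first.
        List<Filter> byCost = new ArrayList<>(filters);
        byCost.sort(Comparator.comparingDouble(f -> f.matchCost));

        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            // Phase 1: all approximations must accept the doc.
            boolean candidate = true;
            for (Filter f : filters) {
                if (!f.approximation.test(doc)) { candidate = false; break; }
            }
            if (!candidate) continue;

            // Phase 2: confirm in matchCost order, stopping at the first rejection.
            boolean accepted = true;
            for (Filter f : byCost) {
                if (!f.matches(doc)) { accepted = false; break; }
            }
            if (accepted) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up predicates; the names only hint at the query types mentioned above.
        Filter phrase = new Filter(d -> d % 2 == 0, d -> d % 4 == 0, 50f);
        Filter frange = new Filter(d -> true, d -> d % 10 == 0, 100f);
        List<Integer> hits = conjunction(100, Arrays.asList(frange, phrase));
        System.out.println(hits + " phrase=" + phrase.confirmCalls
                + " frange=" + frange.confirmCalls);
    }
}
```

In this toy setup the costlier check runs on 25 docs rather than on all 100, 
because the cheaper check has already rejected the rest.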

> Use TwoPhaseIterator for non-cached filter queries
> --
>
> Key: SOLR-14166
> URL: https://issues.apache.org/jira/browse/SOLR-14166
> Project: Solr
> Issue Type: Sub-task
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> "fq" filter queries that have cache=false and which aren't processed as a 
> PostFilter (thus either aren't a PostFilter or have a cost < 100) are 
> processed in SolrIndexSearcher using a custom Filter thingy which uses a 
> cost-ordered series of DocIdSetIterators.  This is not TwoPhaseIterator 
> aware, and thus the match() method may be called on docs that ideally would 
> have been filtered by lower-cost filter queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries

2020-01-05 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008503#comment-17008503
 ] 

David Smiley commented on SOLR-14166:
-

I filed LUCENE-9114 to add matchCost to the ValueSource/FunctionValues API.  
It's perhaps a "required" dependency for getting reasonable performance.

> Use TwoPhaseIterator for non-cached filter queries
> --
>
> Key: SOLR-14166
> URL: https://issues.apache.org/jira/browse/SOLR-14166
> Project: Solr
> Issue Type: Sub-task
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> "fq" filter queries that have cache=false and which aren't processed as a 
> PostFilter (thus either aren't a PostFilter or have a cost < 100) are 
> processed in SolrIndexSearcher using a custom Filter thingy which uses a 
> cost-ordered series of DocIdSetIterators.  This is not TwoPhaseIterator 
> aware, and thus the match() method may be called on docs that ideally would 
> have been filtered by lower-cost filter queries.






[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries

2020-01-05 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008410#comment-17008410
 ] 

David Smiley commented on SOLR-14166:
-

The PR has the code details, but I want to mention some of the bigger picture here.

I have this as a sub-task of Remove/refactor Filter because this reduces the 
use of the old Filter abstraction.  SolrIndexSearcher.ProcessedFilter.filter is 
now declared as a Query.  SolrIndexSearcher no longer has FilterImpl.  Because 
pf.filter is now a Query, SolrIndexSearcher.getDocSet(List fqs) could be made 
simpler, and I was able to remove the similar getDocSetScore.

So how is TwoPhaseIterator used efficiently, you may ask?  BooleanQuery's 
FILTER clauses use it internally via ConjunctionDISI.  I modified 
SolrIndexSearcher.getProcessedFilter to create a BooleanQuery with FILTER 
clauses for the non-cached queries.
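
As a rough illustration of which filter queries end up as FILTER clauses, here 
is a small sketch of the routing rule as the issue description states it.  The 
Fq class, the Route enum, and the method names are hypothetical and are not 
taken from SolrIndexSearcher; only the rule itself (cache=false, and either not 
a PostFilter or cost < 100) comes from the description.  In real Lucene code 
the selected queries would then be added to a BooleanQuery.Builder with 
BooleanClause.Occur.FILTER.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterRouting {
    /** Where a single fq ends up; the names are illustrative only. */
    enum Route { CACHED, POST_FILTER, FILTER_CLAUSE }

    /** Hypothetical description of one fq, reduced to the facts the issue mentions. */
    static final class Fq {
        final String text;
        final boolean cache;        // the cache local param
        final boolean isPostFilter; // whether the query implements PostFilter
        final int cost;             // the cost local param

        Fq(String text, boolean cache, boolean isPostFilter, int cost) {
            this.text = text;
            this.cache = cache;
            this.isPostFilter = isPostFilter;
            this.cost = cost;
        }
    }

    /** Applies the rule from the issue description to one fq. */
    static Route route(Fq fq) {
        if (fq.cache) return Route.CACHED;
        if (fq.isPostFilter && fq.cost >= 100) return Route.POST_FILTER;
        return Route.FILTER_CLAUSE; // non-cached and not run as a PostFilter
    }

    /** Collects the fqs that would become FILTER clauses of one boolean query. */
    static List<Fq> filterClauses(List<Fq> fqs) {
        List<Fq> clauses = new ArrayList<>();
        for (Fq fq : fqs) {
            if (route(fq) == Route.FILTER_CLAUSE) clauses.add(fq);
        }
        return clauses;
    }
}
```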

Unfortunately, the "cost" param on these non-cached filter queries loses its 
meaning.  Instead, the Queries themselves and any TPIs they may have ought to 
have suitable costs, and they are not externally configurable.  Maybe we could 
make a wrapping query that wraps the underlying TPI.matchCost... or just not 
bother, letting the queries themselves actually compute an internal cost that 
is perhaps better than whatever the user supplies.  I lean toward not 
bothering; less complexity.  Unfortunately, ValueSourceScorer's TPI matchCost 
is a constant 100 instead of varying based on the particular FunctionValues 
implementation.  That should be its own issue to address.
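
For the wrapping idea, a minimal sketch of what overriding the cost could look 
like, assuming a tiny hypothetical TwoPhase interface in place of Lucene's 
TwoPhaseIterator.  This is not the code from the PR; it only shows the shape of 
a wrapper that leaves matching untouched and reports a caller-supplied cost.

```java
final class MatchCostOverride {
    /** Hypothetical, minimal stand-in for the two methods of interest. */
    interface TwoPhase {
        boolean matches(int doc);
        float matchCost();
    }

    /** Delegates matches() unchanged but reports the supplied cost instead. */
    static TwoPhase withCost(TwoPhase delegate, float cost) {
        return new TwoPhase() {
            @Override public boolean matches(int doc) { return delegate.matches(doc); }
            @Override public float matchCost() { return cost; }
        };
    }

    /** Stand-in for a function-based check that always reports a constant cost of 100. */
    static TwoPhase constantCostCheck() {
        return new TwoPhase() {
            @Override public boolean matches(int doc) { return doc % 3 == 0; } // made-up predicate
            @Override public float matchCost() { return 100f; }
        };
    }
}
```

With something like withCost(constantCostCheck(), 5f), the conjunction would 
see a cost of 5 and could order this check ahead of costlier ones, while the 
matching behavior stays the same.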



