[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries
[ https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296964#comment-17296964 ]

David Smiley commented on SOLR-14166:
-------------------------------------

CC [~yonik] [~jbernste] [~hossman] as possible reviewers for the attached PR. It is rather technical and touches code that few people have worked on, though all three of you have in some shape or form. Please review the issue description and take a look at the PR. Each commit in the PR is well isolated to what its commit message says, so you may prefer to go commit by commit, or you could just look at the whole thing.

In a comment above I pondered "Maybe we could make a wrapping query that wraps the underlying TPI.matchCost"; as you'll see in the PR, I did that. The test validates that match() isn't called more than it needs to be. It used to be called more often, which can be verified by copying the test to the 8x line (if I recall, it was called two additional times). I suspect the test doesn't verify that MatchCostQuery is having an effect... I may need to think a bit more about how to do that.

I suspect someone will ask me whether I ran performance tests. No, I did not. My goal is removal of tech debt (Filter), and in the process I expect some performance improvements that Filter was blocking. So with this issue, anyone with non-cached filter queries may see a benefit, especially when those queries have TwoPhaseIterators (phrase queries, frange, spatial, and more). The benefit may be more pronounced if the main query also has TPIs, because Lucene cleverly sees through the boolean queries to group the TPIs of required clauses in the tree.
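
To make the two ideas in this comment concrete (overriding a match cost through a wrapper, and counting match() calls in a test), here is a minimal sketch in plain Java. It does not use Lucene and is not the code from the PR: TwoPhaseCheck, CostOverride and CountingCheck are hypothetical names invented for illustration, and the real MatchCostQuery may be structured quite differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical minimal two-phase contract; NOT Lucene's TwoPhaseIterator class. */
interface TwoPhaseCheck {
  /** Expensive confirmation that a candidate document really matches. */
  boolean matches(int doc);

  /** Estimated cost of one matches() call; lower means "verify me first". */
  float matchCost();
}

/** Delegates matches() unchanged but reports an externally supplied cost. */
final class CostOverride implements TwoPhaseCheck {
  private final TwoPhaseCheck delegate;
  private final float cost;

  CostOverride(TwoPhaseCheck delegate, float cost) {
    this.delegate = delegate;
    this.cost = cost;
  }

  @Override
  public boolean matches(int doc) {
    return delegate.matches(doc);
  }

  @Override
  public float matchCost() {
    return cost; // the only behavior that differs from the delegate
  }
}

/** Counts matches() calls so a test can assert an upper bound on them. */
final class CountingCheck implements TwoPhaseCheck {
  private final TwoPhaseCheck delegate;
  final AtomicInteger calls = new AtomicInteger();

  CountingCheck(TwoPhaseCheck delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean matches(int doc) {
    calls.incrementAndGet();
    return delegate.matches(doc);
  }

  @Override
  public float matchCost() {
    return delegate.matchCost();
  }
}
```

A counter like this only shows how often match() ran. Showing that an overridden cost changes the order of verification would need a second assertion, for example on which of two counted checks was invoked for a document that the cheaper one rejects.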

> Use TwoPhaseIterator for non-cached filter queries
> --------------------------------------------------
>
> Key: SOLR-14166
> URL: https://issues.apache.org/jira/browse/SOLR-14166
> Project: Solr
> Issue Type: Sub-task
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> "fq" filter queries that have cache=false and which aren't processed as a
> PostFilter (thus either aren't a PostFilter or have a cost < 100) are
> processed in SolrIndexSearcher using a custom Filter thingy which uses a
> cost-ordered series of DocIdSetIterators. This is not TwoPhaseIterator
> aware, and thus the match() method may be called on docs that ideally would
> have been filtered by lower-cost filter queries.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries
[ https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008503#comment-17008503 ]

David Smiley commented on SOLR-14166:
-------------------------------------

I filed LUCENE-9114 to expose matchCost on the ValueSource/FunctionValues API. It is perhaps a "required" dependency for getting reasonable performance here.

[jira] [Commented] (SOLR-14166) Use TwoPhaseIterator for non-cached filter queries
[ https://issues.apache.org/jira/browse/SOLR-14166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008410#comment-17008410 ]

David Smiley commented on SOLR-14166:
-------------------------------------

The PR has the code details, but I want to mention some of the bigger picture here. I have this as a sub-task of "Remove/refactor Filter" because it reduces the use of the old Filter abstraction. SolrIndexSearcher.ProcessedFilter.filter is now declared as a Query, and SolrIndexSearcher no longer has FilterImpl. With pf.filter being a Query, SolrIndexSearcher.getDocSet(List fqs) became simpler and I could remove the similar getDocSetScore.

So how is TwoPhaseIterator used efficiently, you may ask? BooleanQuery's FILTER clauses use it internally via ConjunctionDISI. I modified SolrIndexSearcher.getProcessedFilter to create a BooleanQuery with FILTER clauses for the non-cached queries.

Unfortunately, the "cost" param on these non-cached filter queries loses its meaning. Instead, the Queries themselves and any TPIs they may have ought to have suitable costs, and those are not externally configurable. Maybe we could make a wrapping query that wraps the underlying TPI.matchCost... or just not bother, letting the queries themselves compute an internal cost that is perhaps better than whatever the user supplies. I lean toward the latter; less complexity.

Unfortunately, ValueSourceScorer's TPI matchCost is a constant 100 instead of varying with the particular FunctionValues implementation. That should be its own issue to address.
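
As a rough illustration of why a two-phase-aware conjunction helps, the sketch below models the idea in plain Java without Lucene. It is a simplification under stated assumptions, not ConjunctionDISI itself: TwoPhaseFilter and TwoPhaseConjunction are hypothetical names, the approximation is a sorted int array, and documents are scanned linearly rather than by leapfrogging iterators. What it does show is the ordering: every cheap approximation must accept a document before any expensive match() call is made, and match() calls then run in ascending matchCost order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a filter with a cheap approximation and a costly match() check. */
final class TwoPhaseFilter {
  private final int[] approxDocs;      // sorted doc ids accepted by the cheap approximation
  private final IntPredicate verifier; // expensive per-document confirmation
  final float matchCost;               // estimated cost of one matches() call
  int matchCalls = 0;                  // instrumentation: how often matches() ran

  TwoPhaseFilter(int[] approxDocs, IntPredicate verifier, float matchCost) {
    this.approxDocs = approxDocs.clone();
    Arrays.sort(this.approxDocs);
    this.verifier = verifier;
    this.matchCost = matchCost;
  }

  /** Phase 1: cheap test, may return false positives but never false negatives. */
  boolean approximates(int doc) {
    return Arrays.binarySearch(approxDocs, doc) >= 0;
  }

  /** Phase 2: expensive confirmation, only meaningful for approximated docs. */
  boolean matches(int doc) {
    matchCalls++;
    return verifier.test(doc);
  }
}

final class TwoPhaseConjunction {
  /** Returns the doc ids in [0, maxDoc) accepted by every filter. */
  static List<Integer> search(int maxDoc, List<TwoPhaseFilter> filters) {
    // Verify cheapest filters first so costly ones see as few docs as possible.
    List<TwoPhaseFilter> byCost = new ArrayList<>(filters);
    byCost.sort(Comparator.comparingDouble((TwoPhaseFilter f) -> f.matchCost));

    List<Integer> hits = new ArrayList<>();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!allApproximate(byCost, doc)) {
        continue; // rejected cheaply; no matches() call was spent on this doc
      }
      if (allMatch(byCost, doc)) {
        hits.add(doc);
      }
    }
    return hits;
  }

  private static boolean allApproximate(List<TwoPhaseFilter> filters, int doc) {
    for (TwoPhaseFilter f : filters) {
      if (!f.approximates(doc)) {
        return false;
      }
    }
    return true;
  }

  private static boolean allMatch(List<TwoPhaseFilter> filters, int doc) {
    for (TwoPhaseFilter f : filters) {
      if (!f.matches(doc)) {
        return false; // short-circuit: costlier filters are never consulted
      }
    }
    return true;
  }
}
```

In this model, a filter's matches() is skipped both when another filter's approximation rejects the document and when a cheaper filter's matches() already returned false. That is the saving the old cost-ordered series of DocIdSetIterators could not provide, since it was not aware of the second phase.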