[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573156#comment-17573156 ]

ASF subversion and git services commented on LUCENE-10559:
----------------------------------------------------------

Commit 33d5ab96f266447b228833baee3489b12bdc3a68 in lucene's branch refs/heads/branch_9x from Kaival Parikh
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=33d5ab96f26 ]

LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)

Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester to make it possible to compare pre- and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of passing docs, which are randomly selected. We store these in a FixedBitSet and use it both to compute true KNN results and in HNSW search.

In the post-filter case, we over-select results as `topK / filterSelectivity` to get a final hit count close to the requested `topK`. In the pre-filter case, we wrap the FixedBitSet in a query and pass it as the prefilter argument to KnnVectorQuery.

> Add preFilter/postFilter options to KnnGraphTester
> --------------------------------------------------
>
> Key: LUCENE-10559
> URL: https://issues.apache.org/jira/browse/LUCENE-10559
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Priority: Major
> Fix For: 9.4
> Time Spent: 6h 40m
> Remaining Estimate: 0h
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery:
> if you (say) want the top K nearest neighbors subject to a constraint Q, are
> you better off over-selecting (say 2K) top hits and *then* filtering
> (post-filtering), or incorporating the filtering into the query
> (pre-filtering)? How does it depend on the selectivity of the filter?
>
> I think we can get a reasonable testbed by generating a uniform random filter
> with some selectivity (that is consistent and repeatable). Possibly we'd also
> want to try filters that are correlated with index order, but it seems they'd
> be unlikely to be correlated with vector values in a way that the graph
> structure would notice, so random is a pretty good starting point for this.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
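The over-selection arithmetic described in the commit message can be sketched in plain Java. This is a minimal, self-contained illustration, not the actual KnnGraphTester code; the class and method names (`OverSelect`, `effectiveTopK`) are hypothetical, and rounding `topK / filterSelectivity` to the nearest integer is an assumption about how the tester computes the over-selected hit count.

```java
// Sketch: how many hits to request in each filtering mode.
// OverSelect and effectiveTopK are hypothetical names, not KnnGraphTester API.
public class OverSelect {

    // For post-filtering we ask for more hits up front so that, after
    // dropping docs that fail the filter, roughly topK hits survive.
    static int effectiveTopK(int topK, double filterSelectivity) {
        return (int) Math.round(topK / filterSelectivity);
    }

    public static void main(String[] args) {
        int topK = 100;
        // A 50% selective filter needs ~2x over-selection.
        System.out.println(effectiveTopK(topK, 0.5));  // 200
        // A 10% selective filter needs ~10x over-selection.
        System.out.println(effectiveTopK(topK, 0.1));  // 1000
    }
}
```

In the pre-filter case no over-selection is needed: the filter is applied inside the graph search itself, so the query is issued with the original `topK`.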
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573103#comment-17573103 ]

ASF subversion and git services commented on LUCENE-10559:
----------------------------------------------------------

Commit 1ad28a3136fceb248ef55f2a09e77e7797bef51e in lucene's branch refs/heads/main from Kaival Parikh
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1ad28a3136f ]

LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541282#comment-17541282 ]

Kaival Parikh commented on LUCENE-10559:
----------------------------------------

The graph construction parameters were: docs = path_to_vec_file, ndoc = 100, dim = 100, fanout = 0, maxConn = 150, beamWidthIndex = 300.

All of these were the same at search time, with the additional: search = path_to_query_file, niter = 1000, selectivity = (as required, 0.01 ~ 0.8), prefilter (as required).

Also, you were right about the search vectors: there was an overlap with the training set. I created a fresh query file excluding trained vectors and re-ran the utility:

||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.965|1.57|0.976|1.61|
|0.6|166|0.959|2.07|0.981|2.00|
|0.4|250|0.962|2.71|0.986|2.65|
|0.2|500|0.958|4.80|0.992|4.51|
|0.1|1000|0.954|8.61|0.994|7.74|
|0.01|10000|0.971|58.78|1.000|9.44|

The recall and time seem to be in the same range as before. The high recall for selective queries (selectivity = 0.01, prefilter, recall = 1.000) may be due to performing an exact search when the visited-nodes limit is reached.
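The recall figures in the tables above compare the filtered HNSW results against the filtered exact ("true") KNN results computed from the same bitset. A minimal sketch of that comparison in plain Java follows; the `Recall` class and its method are hypothetical helpers for illustration, not the actual tester code.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: recall = fraction of the true nearest neighbors that the
// approximate (HNSW) search also returned. Hypothetical helper class.
public class Recall {

    static double recall(int[] approxDocs, int[] exactDocs) {
        Set<Integer> truth = new HashSet<>();
        for (int doc : exactDocs) {
            truth.add(doc);   // ground-truth doc ids from brute-force KNN
        }
        int matched = 0;
        for (int doc : approxDocs) {
            if (truth.contains(doc)) {
                matched++;    // approximate result confirmed by ground truth
            }
        }
        return (double) matched / exactDocs.length;
    }

    public static void main(String[] args) {
        int[] exact = {1, 2, 3, 4};   // true top-4 under the filter
        int[] approx = {1, 2, 3, 9};  // HNSW top-4 under the filter
        System.out.println(recall(approx, exact)); // 0.75
    }
}
```

A recall of 1.000, as in the selectivity = 0.01 pre-filter row, means the approximate search returned exactly the true neighbors, which is expected if the search fell back to exact scoring.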
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539583#comment-17539583 ]

Michael Sokolov commented on LUCENE-10559:
------------------------------------------

I think it makes sense to use a fixed bit set so that we can test HNSW performance with filtering independently of the cost of the filter Query.

I think your test seems to be demonstrating that for similar latencies (~cost) we can achieve significantly higher recall with pre-filtering? I wonder if we could also demonstrate the converse: what effective topK is required when post-filtering to drive recall to be the same as pre-filtering?

Also, these recall numbers seem curiously high, higher than we usually see. Could you publish the graph construction and HNSW search-time parameters you used? I'm also wondering whether perhaps you tested with vectors from the training set?
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539350#comment-17539350 ]

Kaival Parikh commented on LUCENE-10559:
----------------------------------------

We can maintain a *FixedBitSet* to record passing docs, and randomly *set* bits based on a *selectivity* argument, so that over a large set of docs we get a close proportion of passing docs.

We can use this bitset both while finding brute-force NN in *computeNN* and HNSW kNN in *testSearch*. To incorporate pre-filtering, we can take an additional argument, *prefilter*, which wraps our bitset into a query and passes it to *doKnnVectorQuery* to give pre-filtered results.

For post-filtering, we can request *topK/selectivity* results, to get close to *topK* hits after filtering out non-passing docs.

I ran some tests with these changes (topK = 100, numDocs = 1M (wikimedia dataset), numIter = 1K):

||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.974|1.67|0.987|1.71|
|0.6|166|0.965|2.06|0.989|2.09|
|0.4|250|0.967|2.85|0.991|2.83|
|0.2|500|0.941|5.06|0.996|4.85|
|0.1|1000|0.972|8.81|0.997|7.99|
|0.01|10000|0.953|59.37|1.000|9.76|

Note: I have directly updated the bitset in *BitSetCollector* to work around the arbitrary time it takes to *collect* results by iterating over the entire bitset (as it would be different for different types of queries, and _has to be considered separately_).

The changes are available [here|https://github.com/kaivalnp/lucene/commit/c985fad694dfff4e9341649228b3a08d64c98a7f] in a separate branch. Looking for suggestions/comments on this approach.
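The proposed filter — a bitset whose bits are randomly set at a given selectivity — can be sketched in plain, self-contained Java. Here `java.util.BitSet` and `java.util.Random` stand in for Lucene's FixedBitSet and the tester's actual randomness source, and the `RandomFilter`/`makeFilter` names are hypothetical; a fixed seed keeps the filter consistent and repeatable across runs, as the issue description asks for.

```java
import java.util.BitSet;
import java.util.Random;

// Sketch: build a repeatable random filter in which each doc passes
// with probability ~= selectivity. java.util.BitSet stands in for
// Lucene's FixedBitSet; RandomFilter/makeFilter are hypothetical names.
public class RandomFilter {

    static BitSet makeFilter(int numDocs, double selectivity, long seed) {
        Random random = new Random(seed); // fixed seed => repeatable filter
        BitSet bits = new BitSet(numDocs);
        for (int doc = 0; doc < numDocs; doc++) {
            if (random.nextDouble() < selectivity) {
                bits.set(doc);            // this doc passes the filter
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet filter = makeFilter(1_000_000, 0.2, 42L);
        // Over a large doc set the observed passing proportion is
        // close to the requested selectivity of 0.2.
        double observed = filter.cardinality() / 1_000_000.0;
        System.out.println(observed);
    }
}
```

The same bitset can then drive both sides of the benchmark: intersected with the brute-force results to compute true KNN, and (in Lucene itself) wrapped in a query as the prefilter for KnnVectorQuery.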
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531972#comment-17531972 ]

Julie Tibshirani commented on LUCENE-10559:
-------------------------------------------

Big +1 for this addition to KnnGraphTester. I modified KnnGraphTester on a branch to incorporate a random filter when I ran the experiments here: [https://github.com/apache/lucene/pull/656#issuecomment-1032109021]. It's important for everyone to be able to reproduce those experiments, and it'd be good to add kNN with filtering to luceneutil as well.

As an aside, I can imagine scenarios where filters are correlated or anti-correlated with proximity to the query vector. For example, maybe you're looking for a similar-looking product (vector proximity), but in a certain price range (filter). Or you're looking for similar news headlines (vector proximity), but within a certain time range (filter).