[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-07-29 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573156#comment-17573156
 ] 

ASF subversion and git services commented on LUCENE-10559:
--

Commit 33d5ab96f266447b228833baee3489b12bdc3a68 in lucene's branch 
refs/heads/branch_9x from Kaival Parikh
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=33d5ab96f26 ]

LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)

Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester to be able
to compare pre- and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
passing docs, which are randomly selected. We store these in a FixedBitSet and
use it both to calculate the true KNN results and during HNSW search.

In the post-filter case, we over-select results as `topK / filterSelectivity` so
that the final hit count comes close to the actually requested `topK`. In the
pre-filter case, we wrap the FixedBitSet in a query and pass it as the prefilter
argument to KnnVectorQuery.
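Roughly, the two modes compare as in the following sketch (an illustration only, not the actual KnnGraphTester code; method names are made up, doc ids are assumed to be global as in a single-segment index, and the bitset is assumed to be exposed as a Lucene Query for the pre-filter case):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.FixedBitSet;

class FilterModesSketch {
  // Pre-filter: the filter is handed to KnnVectorQuery itself, so the graph search
  // only returns passing docs and k can stay at the requested topK.
  static TopDocs preFilter(IndexSearcher searcher, String field, float[] target,
                           int topK, Query filterQuery) throws IOException {
    return searcher.search(new KnnVectorQuery(field, target, topK, filterQuery), topK);
  }

  // Post-filter: over-select topK / filterSelectivity hits from an unfiltered kNN
  // search, then keep only passing docs, up to topK of them.
  static List<ScoreDoc> postFilter(IndexSearcher searcher, String field, float[] target,
                                   int topK, double filterSelectivity,
                                   FixedBitSet passingDocs) throws IOException {
    int overSelectK = (int) Math.ceil(topK / filterSelectivity);
    TopDocs hits = searcher.search(new KnnVectorQuery(field, target, overSelectK), overSelectK);
    List<ScoreDoc> kept = new ArrayList<>();
    for (ScoreDoc hit : hits.scoreDocs) {
      if (passingDocs.get(hit.doc)) {
        kept.add(hit);
        if (kept.size() == topK) {
          break;
        }
      }
    }
    return kept;
  }
}
{code}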

> Add preFilter/postFilter options to KnnGraphTester
> --
>
> Key: LUCENE-10559
> URL: https://issues.apache.org/jira/browse/LUCENE-10559
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: 
> if you (say) want the top K nearest neighbors subject to a constraint Q, are 
> you better off over-selecting (say 2K) top hits and *then* filtering 
> (post-filtering), or incorporating the filtering into the query 
> (pre-filtering). How does it depend on the selectivity of the filter?
> I think we can get a reasonable testbed by generating a uniform random filter 
> with some selectivity (that is consistent and repeatable). Possibly we'd also 
> want to try filters that are correlated with index order, but it seems they'd 
> be unlikely to be correlated with vector values in a way that the graph 
> structure would notice, so random is a pretty good starting point for this.





[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-07-29 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573103#comment-17573103
 ] 

ASF subversion and git services commented on LUCENE-10559:
--

Commit 1ad28a3136fceb248ef55f2a09e77e7797bef51e in lucene's branch 
refs/heads/main from Kaival Parikh
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1ad28a3136f ]

LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)

Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester to be able
to compare pre- and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
passing docs, which are randomly selected. We store these in a FixedBitSet and
use it both to calculate the true KNN results and during HNSW search.

In the post-filter case, we over-select results as `topK / filterSelectivity` so
that the final hit count comes close to the actually requested `topK`. In the
pre-filter case, we wrap the FixedBitSet in a query and pass it as the prefilter
argument to KnnVectorQuery.






[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-23 Thread Kaival Parikh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541282#comment-17541282
 ] 

Kaival Parikh commented on LUCENE-10559:


The graph construction parameters were:

docs = path_to_vec_file, ndoc = 100, dim = 100, fanout = 0, maxConn = 150, 
beamWidthIndex = 300

All of these were the same at search time, with the following additional parameters:

search = path_to_query_file, niter = 1000, selectivity = (as required, 0.01 ~ 
0.8), prefilter (as required)

 

Also, you were right about the search vectors: there was an overlap with the
training set. I created a fresh query file excluding training vectors and re-ran
the utility:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.965|1.57|0.976|1.61|
|0.6|166|0.959|2.07|0.981|2.00|
|0.4|250|0.962|2.71|0.986|2.65|
|0.2|500|0.958|4.80|0.992|4.51|
|0.1|1000|0.954|8.61|0.994|7.74|
|0.01|10000|0.971|58.78|1.000|9.44|

The recall and time seem to be in the same range as before. The high recall for
selective queries (selectivity = 0.01, prefilter, recall = 1.000) may be due to
falling back to an exact search when the visited-nodes limit is reached.
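To illustrate that fallback (a conceptual sketch with made-up method names, not the actual Lucene code): the graph search is capped at roughly the number of filter-passing docs, and once that cap is reached every passing doc is scored exactly, which is why recall can reach 1.000.

{code:java}
import java.io.IOException;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;
import org.apache.lucene.util.FixedBitSet;

// Conceptual sketch only; approximateSearch/exactSearch stand in for the HNSW graph
// search and the brute-force scan over filter-passing docs.
abstract class FallbackSketch {
  abstract TopDocs approximateSearch(float[] target, int k, FixedBitSet passingDocs,
                                     int visitedLimit) throws IOException;

  abstract TopDocs exactSearch(float[] target, int k, FixedBitSet passingDocs)
      throws IOException;

  TopDocs searchWithFilter(float[] target, int k, FixedBitSet passingDocs) throws IOException {
    // Cap graph traversal at roughly the number of passing docs: visiting more nodes
    // than that would already cost as much as scoring every passing doc directly.
    int visitedLimit = passingDocs.cardinality();
    TopDocs approx = approximateSearch(target, k, passingDocs, visitedLimit);
    if (approx.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
      // The limit was hit before the graph search completed: fall back to exact scoring.
      return exactSearch(target, k, passingDocs);
    }
    return approx;
  }
}
{code}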







[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539583#comment-17539583
 ] 

Michael Sokolov commented on LUCENE-10559:
--

I think it makes sense to use a fixed bit set so that we can test HNSW
performance with filtering independently of the cost of the filter Query. Your
test seems to demonstrate that for similar latencies (~cost) we can achieve
significantly higher recall with pre-filtering? I wonder if we could also
demonstrate the converse -- what effective topK is required when post-filtering
to drive recall to match pre-filtering?

Also, these recall numbers seem curiously high, higher than we usually see.
Could you publish the graph construction and HNSW search-time parameters you
used? I'm also wondering whether perhaps you tested with vectors from the
training set?







[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-19 Thread Kaival Parikh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539350#comment-17539350
 ] 

Kaival Parikh commented on LUCENE-10559:


We can maintain a *FixedBitSet* to record passing docs, and randomly *set* bits
based on a *selectivity* argument, so that over a large set of docs the
proportion of passing docs comes out close to the requested selectivity.

We can use this bitset both while finding brute-force NN in *computeNN* and HNSW
kNN in *testSearch*.

To incorporate pre-filtering, we can take an additional argument *prefilter*,
which wraps our bitset into a query and passes it to *doKnnVectorQuery* to give
pre-filtered results.

For post-filtering, we can request *topK/selectivity* results, to end up close
to *topK* after filtering out non-passing docs.
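A minimal sketch of how such a random, repeatable filter could be generated (names are illustrative only; the actual changes are in the branch linked below):

{code:java}
import java.util.Random;
import org.apache.lucene.util.FixedBitSet;

final class RandomFilterSketch {
  // Each doc passes independently with probability `selectivity`, so over a large
  // index the proportion of passing docs is close to the requested value; a fixed
  // seed keeps the filter consistent and repeatable across runs.
  static FixedBitSet randomBitSet(int maxDoc, double selectivity, long seed) {
    Random random = new Random(seed);
    FixedBitSet bits = new FixedBitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (random.nextDouble() < selectivity) {
        bits.set(doc);
      }
    }
    return bits;
  }
}
{code}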

I ran some tests with these changes (topK = 100, numDocs = 1M (wikimedia 
dataset), numIter = 1K):
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.974|1.67|0.987|1.71|
|0.6|166|0.965|2.06|0.989|2.09|
|0.4|250|0.967|2.85|0.991|2.83|
|0.2|500|0.941|5.06|0.996|4.85|
|0.1|1000|0.972|8.81|0.997|7.99|
|0.01|10000|0.953|59.37|1.000|9.76|

Note: I have directly updated the bitset in *BitSetCollector* to work around the
arbitrary time it takes to *collect* results by iterating over the entire bitset
(it would differ across query types, and _has to be considered separately_).
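For illustration, a collector along those lines might look like this (a sketch only; the actual BitSetCollector in the branch may differ):

{code:java}
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.util.FixedBitSet;

final class BitSetCollectorSketch extends SimpleCollector {
  private final FixedBitSet bits; // sized to the index's maxDoc
  private int docBase;            // doc id offset of the current segment

  BitSetCollectorSketch(FixedBitSet bits) {
    this.bits = bits;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) {
    // Set the bit for the global doc id directly, instead of materializing hits
    // and iterating over a separate result set afterwards.
    bits.set(docBase + doc);
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}
{code}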

The changes are available
[here|https://github.com/kaivalnp/lucene/commit/c985fad694dfff4e9341649228b3a08d64c98a7f]
in a separate branch. Looking for suggestions/comments on this approach.







[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester

2022-05-04 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531972#comment-17531972
 ] 

Julie Tibshirani commented on LUCENE-10559:
---

Big +1 for this addition to KnnGraphTester. I modified KnnGraphTester on a 
branch to incorporate a random filter when I ran the experiments here: 
[https://github.com/apache/lucene/pull/656#issuecomment-1032109021]. It's 
important for everyone to be able to reproduce those experiments, and it'd be 
good to add kNN with filtering to luceneutil as well.

As an aside, I can imagine scenarios where filters are correlated or 
anti-correlated with proximity to the query vector. For example, maybe you're 
looking for a similar-looking product (vector proximity), but in a certain 
price range (filter). Or you're looking for similar news headlines (vector 
proximity), but within a certain time range (filter).



