[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296873#comment-17296873
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 606cea94d76ffeb978fb23c32dd4baf848a36baf in lucene-solr's branch 
refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=606cea9 ]

LUCENE-9322: trivial fix in documentation.


> Discussing a unified vectors format API
> ---
>
> Key: LUCENE-9322
> URL: https://issues.apache.org/jira/browse/LUCENE-9322
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Julie Tibshirani
> Priority: Major
> Fix For: master (9.0)
>
> Time Spent: 11h
> Remaining Estimate: 0h
>
> Two different approximate nearest neighbor approaches are currently being 
> developed, one based on HNSW (LUCENE-9004) and another based on coarse 
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
> that could support both approaches. The two ANN strategies give different 
> trade-offs in terms of speed, memory, and complexity, and it’s likely that 
> we’ll want to support both. Vector search is also an active research area, 
> and it would be great to be able to prototype and incorporate new approaches 
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The 
> prototype for coarse quantization 
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
> soon (this depends on everyone's feedback of course). The approach is simple 
> and shows solid search performance, as seen 
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
>  I think this API discussion is an important step in moving that 
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, 
> return the indexed vectors that are closest to it.
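
The two goals above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `VectorStoreSketch` and its methods are hypothetical and are not the API proposed in this issue, and the search is an exact brute-force scan rather than HNSW or coarse quantization. It only shows the contract an ANN implementation would approximate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory stand-in for the two goals; this is not Lucene's API. */
final class VectorStoreSketch {
    private final List<float[]> vectors = new ArrayList<>();

    /** Goal 1: store a float vector and return its ordinal. */
    int add(float[] vector) {
        vectors.add(vector.clone());
        return vectors.size() - 1;
    }

    /** Goal 1: retrieve a stored vector by ordinal. */
    float[] get(int ord) {
        return vectors.get(ord).clone();
    }

    /** Goal 2, exact baseline: ordinals of the k stored vectors closest to the query. */
    int[] search(float[] query, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            ords.add(i);
        }
        // Sort every ordinal by distance to the query; an ANN index would avoid this full scan.
        ords.sort(Comparator.comparingDouble(
                (Integer ord) -> squaredDistance(query, vectors.get(ord))));
        int n = Math.max(0, Math.min(k, ords.size()));
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = ords.get(i);
        }
        return result;
    }

    /** Squared Euclidean distance; both vectors must have the same dimension. */
    static double squaredDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

An approximate implementation would keep the same `search` contract but trade some recall for speed by visiting only part of the stored vectors.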



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-02-16 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285214#comment-17285214
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 4cdfbbb95be1b25adb839ee8d1fe61052a53a4a3 in lucene-solr's branch 
refs/heads/master from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4cdfbbb ]

LUCENE-9322: Lucene90VectorReader can leak open files (#2371)




[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-02-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284738#comment-17284738
 ] 

Michael Sokolov commented on LUCENE-9322:
-

I'm not actively working on it; I just recorded it here for reference. If 
you're able to fix it, that would be great!


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-02-15 Thread Ignacio Vera (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284711#comment-17284711
 ] 

Ignacio Vera commented on LUCENE-9322:
--

Oh yes, let me know if you are working on the fix or want me to do it.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-02-15 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284707#comment-17284707
 ] 

Michael Sokolov commented on LUCENE-9322:
-

Looks like the new random testing kicked out another bug related to properly 
closing files on exception, this time in Lucene90VectorReader: 
[https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/29452/testReport/junit/org.apache.lucene.codecs.lucene90/TestLucene90VectorFormat/testRandomExceptions/]
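
For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern, assuming a reader that opens two files in its constructor. The class `TwoFileReaderSketch`, the `OPEN` bookkeeping list, and the `failOnData` switch are all hypothetical; this is not the actual Lucene90VectorReader code or its fix. If the second open throws, the first resource has to be closed before the exception propagates, otherwise its file handle leaks.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader that opens two resources; not the actual Lucene90VectorReader. */
final class TwoFileReaderSketch implements Closeable {
    /** Names of the hypothetical files currently open, so that a leak is observable. */
    static final List<String> OPEN = new ArrayList<>();

    private final Closeable meta;
    private final Closeable data;

    TwoFileReaderSketch(boolean failOnData) throws IOException {
        Closeable openedMeta = null;
        Closeable openedData = null;
        boolean success = false;
        try {
            openedMeta = open("meta", false);
            openedData = open("data", failOnData);
            success = true;
        } finally {
            if (!success) {
                // Constructor is failing: release whatever was already opened.
                closeQuietly(openedMeta);
                closeQuietly(openedData);
            }
        }
        this.meta = openedMeta;
        this.data = openedData;
    }

    /** Simulates opening a file; throws before registering it when fail is true. */
    private static Closeable open(String name, boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated failure opening " + name);
        }
        OPEN.add(name);
        return () -> OPEN.remove(name);
    }

    /** Closes a resource if present and swallows any secondary failure. */
    private static void closeQuietly(Closeable c) {
        if (c != null) {
            try {
                c.close();
            } catch (IOException ignored) {
                // The original exception is the one that matters here.
            }
        }
    }

    @Override
    public void close() throws IOException {
        try {
            meta.close();
        } finally {
            data.close();
        }
    }
}
```

Randomized exception tests inject failures at arbitrary I/O calls, which is why a cleanup path missing from a constructor like this tends to surface there first.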


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-02-11 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283110#comment-17283110
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 683a9bd78abcf486a668881bc3294847ce5d5d1a in lucene-solr's branch 
refs/heads/master from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=683a9bd ]

LUCENE-9322:  Add Vectors format to CodecReader accounting methods (#2353)




[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2021-01-26 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272303#comment-17272303
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 188728047511cc32e313a96bcc77dd88adeba287 in lucene-solr's branch 
refs/heads/master from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1887280 ]

LUCENE-9322: Move old field infos format to backwards-codecs. (#2245)

We introduced a new `Lucene90FieldInfosFormat`, so the old
`Lucene60FieldInfosFormat` should live in backwards-codecs.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-11-10 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229058#comment-17229058
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 514c363f1d82b801234b16ef16804f08da86dc7a in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=514c363 ]

LUCENE-9322: Move Solr to Lucene90Codec.

And drop configurability of Lucene87Codec since it shouldn't be used for 
writing anymore.



[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-11-09 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228951#comment-17228951
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 42c5206cea5c85d486813d42f7d52e44a5a695ba in lucene-solr's branch 
refs/heads/master from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=42c5206 ]

LUCENE-9322: Some fixes to SimpleTextVectorFormat. (#2071)

* Make sure the file extensions are unique.

* Fix bug in vector reading.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-11-09 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228742#comment-17228742
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit ec9a659845973a0dd0ee7c04e0075db818ed118d in lucene-solr's branch 
refs/heads/master from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec9a659 ]

LUCENE-9322: fix minor cosmetic refactoring error in logging string in 
IndexWriter's infoStream logging. It was always printing 'vector values' for 
all merging times instead of the other parts of Lucene index ('doc values', 
'stored fields', etc.)



[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-26 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220639#comment-17220639
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit 37c7d156ab54ce9baae08bebb76eebe4da2e5b81 in lucene-solr's branch 
refs/heads/master from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=37c7d15 ]

LUCENE-9322: Make sure to account for vectors in SortingCodecReader. (#2028)




[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-24 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220065#comment-17220065
 ] 

Michael Sokolov commented on LUCENE-9322:
-

Ooh, a fun failure - ah I see it's because we got an index with multiple 
segments, but we only expect one. I'll add a forceMerge(1) here and in a few 
other tests that have similar assumptions


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-24 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220027#comment-17220027
 ] 

Tomoko Uchida commented on LUCENE-9322:
---

I encountered a test failure in TestVectorValues that can be reproduced with 
this seed on my Linux PC (Fedora) and Java 11:
{code}
$ ./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestVectorValues" -Ptests.seed=BA5BA0B8B98813F2
org.apache.lucene.index.TestVectorValues > testSortedIndex FAILED
    java.lang.AssertionError: expected:<3> but was:<2>
        at __randomizedtesting.SeedInfo.seed([BA5BA0B8B98813F2:CDC4C64E81716A26]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.lucene.index.TestVectorValues.testSortedIndex(TestVectorValues.java:583)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
        at java.base/java.lang.Thread.run(Thread.java:834)
{code}


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-19 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216503#comment-17216503
 ] 

Julie Tibshirani commented on LUCENE-9322:
--

Thanks @msokolov for the great PR! I left some belated thoughts on the PR 
regarding ScoreFunction.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-18 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216237#comment-17216237
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit dbcbcd0ee8bc2dc2ba49b12b9b7d45c33baad061 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=dbcbcd0 ]

Add CHANGES entry for LUCENE-9322



[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-10-18 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216214#comment-17216214
 ] 

ASF subversion and git services commented on LUCENE-9322:
-

Commit c02f07f2d5db5c983c2eedf71febf9516189595d in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c02f07f ]

LUCENE-9322: Add Lucene90 codec, including VectorFormat

This commit adds support for dense floating point VectorFields.
The new VectorValues class provides access to the indexed vectors.



[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-09-29 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204091#comment-17204091
 ] 

Michael Sokolov commented on LUCENE-9322:
-

I posted a PR that builds on the discussion and earlier PRs from 
[~jtibshirani] and [~tomoko], and would appreciate your review if you have time. 
Just to address some of the recent discussion here:

1. This is for dense vectors only. I think handling sparse vectors is 
potentially interesting, but it would require a completely different approach, 
so I think it should be done separately.
2. I would like to see if we can completely hide the ANN implementation behind 
the vector API, as Julie initially proposed, making the selection of an 
algorithm a simple parameter of VectorValues. In the soon-to-come NSW graph 
implementation I have in mind, there is no new graph format, just another 
auxiliary index file inside the vector format. To that end, I included both L2 
and dot-product distances with the idea of maintaining something in the API 
that enables control over the underlying KNN implementation. E.g., could we 
overload ScoreFunction with the graph algorithm? Maybe it's too much; I'd like 
feedback on this part.
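
To make the idea of a score-function parameter concrete, here is a minimal sketch of an enum that selects between L2 and dot-product scoring. The enum name `ScoreFunctionSketch`, its constants, and the `score` method are hypothetical illustrations and are not taken from the PR; the real API may look quite different.

```java
/** Hypothetical sketch only; not the API from the PR or from Lucene. */
enum ScoreFunctionSketch {
    /** Negated squared L2 distance, so that a larger score means "closer". */
    EUCLIDEAN {
        @Override
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float d = a[i] - b[i];
                sum += d * d;
            }
            return -sum;
        }
    },
    /** Plain dot product; a larger value means "more similar". */
    DOT_PRODUCT {
        @Override
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum;
        }
    };

    /** Scores two vectors of equal length (the caller is assumed to check lengths). */
    abstract float score(float[] a, float[] b);
}
```

A caller would pick one constant per field and use it for both indexing and search, for example `ScoreFunctionSketch.DOT_PRODUCT.score(query, candidate)`. Whether such an enum should also carry the choice of graph algorithm is exactly the open question raised in the comment.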



[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-07-20 Thread Alex Klibisz (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161700#comment-17161700
 ] 

Alex Klibisz commented on LUCENE-9322:
--

Very briefly, I just remembered another thing you might consider if you are 
considering storing both dense vectors and sparse vectors. There are two 
optimizations for sparse vectors at the storage level:
 # Very obvious, just store the "present/true/positive" indices instead of the 
full vector.
 # Maybe less obvious, if you store the indices in sorted order, you can 
compute intersections more efficiently, which is useful for some similarity 
functions. For example `int size_of_intersection((0,1,2,3),(2,3,4)) = 2` can be 
computed with only an int counter and no other intermediate data structures. 
Whereas, `int size_of_intersection((0,2,1,3),(2,3,4)) = 2` requires converting 
one of the arrays to a hashset, which adds up at scale. The sorted intersection 
algo is pretty obvious but here it is in case you need it: 
[https://github.com/alexklibisz/elastiknn/blob/74815f2613653e2c266bf7eb56b020943dd80b9a/core/src/main/java/com/klibisz/elastiknn/utils/ArrayUtils.java#L10-L36]

- Ak
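
For readers who want the sorted-intersection idea spelled out, here is a minimal two-pointer sketch. It is not the linked elastiknn code; the class and method names are hypothetical, and it assumes both arrays are sorted ascending and contain no duplicates.

```java
/** Hypothetical sketch of counting the intersection of two sorted int arrays. */
final class SortedIntersectionSketch {

    /**
     * Returns the number of values present in both arrays.
     * Assumes xs and ys are sorted ascending with no duplicates.
     */
    static int sizeOfIntersection(int[] xs, int[] ys) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < xs.length && j < ys.length) {
            if (xs[i] < ys[j]) {
                i++;            // xs[i] cannot appear later in ys
            } else if (xs[i] > ys[j]) {
                j++;            // ys[j] cannot appear later in xs
            } else {
                count++;        // match: advance both cursors
                i++;
                j++;
            }
        }
        return count;
    }
}
```

The loop touches each element at most once and needs only the two cursors and the counter, which is the saving over building a hash set for unsorted input.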


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-07-20 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161697#comment-17161697
 ] 

Julie Tibshirani commented on LUCENE-9322:
--

Hello everyone, I'm sorry for the very late response. Thank you for your 
comments on the proposal! And thanks [~alexklibisz] for the suggestions, I will 
take a look at the link.
{quote}Personally I would prefer an unified file format for vectors since it is 
(theoretically) independent from higher level ANN algorithms. Could we expose 
just one "Lucene90VectorsFormat" and low-level I/O, and make only higher logic 
(o.a.l.a.index/document/search) to be customizable? Forward iteration is 
encouraged anyway... 
{quote}
I'm not sure how we could have a completely unified `VectorsFormat`, because 
different ANN algorithms require building and maintaining customized data 
structures like nearest-neighbor graphs. However, it would be great to share the 
logic for writing/reading the original vectors if possible.
{quote}What about different distance metrics like angular and L1 distance? JFYI 
I previously implemented switchable distance function on the HNSW branch, if 
you have not noticed it…
{quote}
I have the same intuition as Mayya that it’s nice to keep the design simple at 
first and just use Euclidean distance in the first iteration. It’s possible to 
rank based on angular distance using Euclidean distance by first normalizing 
the document and query vectors to unit length. However, I could certainly see 
support for maximum inner product search being useful in the future. 
{quote}Query part would also need some abstraction and there are many things to 
be well thought..., so could we discuss about it in another dedicated issue, to 
keep the scope here small ?
{quote}
Right, perhaps we can focus on moving the current proposal forward before 
nailing down how it will integrate with `Query`. It will be an interesting 
follow-up discussion!
{quote}How would we feel to break this part and commit it separately ? 
{quote}
Personally I would be okay with committing basic vector support first, but with 
solid APIs/plugin points for ANN as well. My motivation for considering both 
vectors and ANN was to make sure the APIs and codec design could accommodate all 
the functionality we think is important.
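
As an aside on the normalization point above: for unit-length vectors a and b, the squared Euclidean distance equals 2 - 2*cos(a, b), so sorting by Euclidean distance ascending gives the same order as sorting by cosine similarity descending. The sketch below checks that identity numerically; the class and method names are hypothetical and not part of any proposed API.

```java
/** Hypothetical sketch: angular ranking via Euclidean distance on unit vectors. */
final class AngularViaEuclideanSketch {

    /** Returns a unit-length copy of v. Assumes v is not the zero vector. */
    static float[] normalize(float[] v) {
        double sumSquares = 0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumSquares);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Cosine similarity between two non-zero vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

In practice the normalization would be done once at index time for documents and once per query, so the search itself only ever computes Euclidean distances.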


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-07-16 Thread Alex Klibisz (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159599#comment-17159599
 ] 

Alex Klibisz commented on LUCENE-9322:
--

Hi all. Some great discussion here and in #9004 and #9136.

I've been working on an Elasticsearch plugin for ANN for about 8 months now: 
[http://elastiknn.klibisz.com/]. It obviously uses Lucene under the hood, but 
I'm definitely more fluent in Elasticsearch concepts than Lucene internals.

Figured I would mention: one of the early bottlenecks was vector serialization 
(using BinaryDocValues to store the vectors). I did extensive benchmarking to 
figure out the fastest way to de-/serialize `float[]` and `int[]` arrays 
to/from byte arrays. In the end I found that `sun.misc.Unsafe` beat all the 
others. Here's the Java utility class that I'm using for de-/serialization in 
my plugin: 
[https://github.com/alexklibisz/elastiknn/blob/adf8262907093315d772ae524e822a1152b0e929/core/src/main/java/com/klibisz/elastiknn/storage/UnsafeSerialization.java]

Maybe it can be helpful.
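
For context, this is roughly what a plain `ByteBuffer` round trip for a `float[]` looks like. It is deliberately not the `sun.misc.Unsafe` approach from the linked class, only a safe baseline that such an approach would be benchmarked against; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical baseline sketch: float[] to byte[] and back using ByteBuffer. */
final class FloatArrayCodecSketch {

    /** Encodes the vector as big-endian IEEE 754 floats, 4 bytes per element. */
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
        buffer.asFloatBuffer().put(vector);
        return buffer.array();
    }

    /** Decodes bytes produced by encode; assumes the length is a multiple of 4. */
    static float[] decode(byte[] bytes) {
        float[] vector = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
        return vector;
    }
}
```

Both directions use the buffer's default big-endian byte order, so the encoded bytes are portable across platforms.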

 


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-06-17 Thread Varun Thacker (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138707#comment-17138707
 ] 

Varun Thacker commented on LUCENE-9322:
---

JDK 

 
{code:java}
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.242-b08, mixed mode)
{code}
 

 

This is my first time trying out JMH. I compared the encoding approach we used 
in VectorField with the one taken by DenseVectorField (in SOLR-14397).

The VectorField encoding approach is much faster than Base64 encoding (about 
123 ops/s versus about 35 ops/s in the runs below).

 

 
{code:java}
@Benchmark
public void testVectorFieldEncoding() {
  float[] vector = new float[512];
  for (int i = 0; i < 512; i++) {
    vector[i] = i + i / 1000f;
  }

  for (int i = 0; i < 10_000; i++) {
    ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
    buffer.asFloatBuffer().put(vector);
    buffer.array();
  }
}
{code}
 

JMH output

 
{code:java}
Result: 123.116 ±(99.9%) 2.671 ops/s [Average]
  Statistics: (min, avg, max) = (95.557, 123.116, 143.097), stdev = 11.310
  Confidence interval (99.9%): [120.445, 125.787]




# Run complete. Total time: 00:08:07


Benchmark                                 Mode  Samples    Score  Score error  Units
o.e.MyBenchmark.testVectorFieldEncoding  thrpt      200  123.116        2.671  ops/s
{code}
 

 

 
{code:java}
@Benchmark
public void testBase64Encoding() {
  float[] vector = new float[512];
  for (int i = 0; i < 512; i++) {
    vector[i] = i + i / 1000f;
  }

  for (int i = 0; i < 10_000; i++) {
    ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
    for (float value : vector) {
      buffer.putFloat(value);
    }
    buffer.rewind();

    java.util.Base64.getEncoder().encode(buffer).array();
  }
}
{code}
 

JMH output
{code:java}
Result: 35.069 ±(99.9%) 0.745 ops/s [Average]
  Statistics: (min, avg, max) = (25.792, 35.069, 41.335), stdev = 3.154
  Confidence interval (99.9%): [34.324, 35.814]




# Run complete. Total time: 00:08:06


Benchmark                            Mode  Samples   Score  Score error  Units
o.e.MyBenchmark.testBase64Encoding  thrpt      200  35.069        0.745  ops/s
{code}
 


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-06-16 Thread Varun Thacker (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137909#comment-17137909
 ] 

Varun Thacker commented on LUCENE-9322:
---

I've taken only the VectorField parts from your PR in 
[https://github.com/apache/lucene-solr/pull/1584]. This was mostly me trying 
to see if it makes sense to break out the work.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-06-16 Thread Varun Thacker (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136805#comment-17136805
 ] 

Varun Thacker commented on LUCENE-9322:
---

Hello [~jtibshirani]! Thanks for tackling this.

> Support for storing and retrieving individual float vectors.

How would we feel about breaking this part out and committing it separately? I 
believe this is the part that adds the VectorField field? The PR on SOLR-14397 
also added a DenseVectorField (a Solr field), so maybe we could reuse 
VectorField (although there is some nuance, since DenseVectorField currently 
supports string and vector encoding, and a code comment mentions bfloat16 as 
well).


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-04-24 Thread Mayya Sharipova (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091684#comment-17091684
 ] 

Mayya Sharipova commented on LUCENE-9322:
-

> It is implemented by enum with {{distance()}} function. Also, I think it 
>would be good to persist (in the codec) which distance metric we use for the 
>field.

 

Maybe for now it is worth keeping the API simple and using only euclidean 
distance. Both ANN approaches we would like to pursue, HNSW and the 
clustering-based approach, use euclidean distance.


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-04-15 Thread Tomoko Uchida (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084204#comment-17084204
 ] 

Tomoko Uchida commented on LUCENE-9322:
---

Hi [~jtibshirani],

thank you for your hard work on this!

 
{code:java}
TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor) throws IOException;
{code}
I like this interface. {{recallFactor}} could perhaps be an interface rather 
than an int, for further flexibility, but it's just an idea. 

 
{quote}Why do we have different implementations of `VectorsFormat`, couldn’t we 
just add an enum to the field info like `Strategy.HNSW` and 
`Strategy.COARSE_QUANTIZATION`?
{quote}
Personally I would prefer a unified file format for vectors, since it is 
(theoretically) independent of the higher-level ANN algorithms. Could we expose 
just one "Lucene90VectorsFormat" and low-level I/O, and make only the 
higher-level logic (o.a.l.index/document/search) customizable? Forward 
iteration is encouraged anyway...
  
{quote}What about different distance metrics like angular and L1 distance?
{quote}
JFYI, I previously implemented a switchable distance function on the HNSW 
branch, in case you have not noticed it: 
[https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/index/VectorValues.java].
 It is implemented as an enum with a {{distance()}} function. Also, I think it 
would be good to persist (in the codec) which distance metric we use for the field.
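
To illustrate the enum-with-{{distance()}} idea, here is a minimal sketch. It is not the code from the HNSW branch: the name `DistanceFunction`, the `MANHATTAN` constant, the `id` field and `fromId` are all assumptions made for this example. The `id` shows one way a codec could persist the chosen metric per field.

```java
/** Hypothetical sketch of a switchable distance metric; not the enum on the HNSW branch. */
enum DistanceFunction {
  EUCLIDEAN(0) {
    @Override
    float distance(float[] a, float[] b) {
      checkSameLength(a, b);
      double sum = 0;
      for (int i = 0; i < a.length; i++) {
        double diff = a[i] - b[i];
        sum += diff * diff;
      }
      return (float) Math.sqrt(sum);
    }
  },
  MANHATTAN(1) {
    @Override
    float distance(float[] a, float[] b) {
      checkSameLength(a, b);
      double sum = 0;
      for (int i = 0; i < a.length; i++) {
        sum += Math.abs(a[i] - b[i]);
      }
      return (float) sum;
    }
  };

  /** Stable id (an assumption) that a codec could write next to the field metadata. */
  final int id;

  DistanceFunction(int id) {
    this.id = id;
  }

  abstract float distance(float[] a, float[] b);

  /** Resolves the metric from a persisted id. */
  static DistanceFunction fromId(int id) {
    for (DistanceFunction f : values()) {
      if (f.id == id) {
        return f;
      }
    }
    throw new IllegalArgumentException("unknown distance function id: " + id);
  }

  static void checkSameLength(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
    }
  }
}
```

Persisting an explicit id rather than the enum ordinal would keep the on-disk value stable if constants are later reordered.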
{quote}How exactly is this used in a search? Where are the `Query` classes? 
This would be the next part of the API to design/ discuss.
{quote}
We could follow the approach of o.a.l.index.PointValues, in other words, 
concrete field classes with newXXXQuery() methods? 
[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java]
 The query part would also need some abstraction, and there are many things to 
think through carefully, so could we discuss it in another dedicated issue, to 
keep the scope here small?


[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

2020-04-14 Thread Julie Tibshirani (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083332#comment-17083332
 ] 

Julie Tibshirani commented on LUCENE-9322:
--

I pushed a proposal for the API here: 
https://github.com/jtibshirani/lucene-solr/pull/2

It adds a new format `VectorsFormat`, which can read and write vectors. This 
format and all associated classes would be experimental. Each approach would 
have its own format implementation, for example `HNSWVectorsFormat` or 
`CoarseQuantizationFormat`.
Given a field, the vectors reader returns a `VectorValues` object. This object 
supports the following:
* Retrieving a vector value for each document. This capability is currently 
exposed through a `DocIdSetIterator`, so forward iteration is encouraged. The 
simple coarse quantization approach could choose to store the vectors as binary 
doc values, while HNSW could use a new storage format with more explicit 
support for random access (which it needs to efficiently support ANN). Perhaps 
in the future we’ll just settle on a single way of storing the vectors that all 
implementations can use.
* Finding `k` approximate nearest neighbors to a query vector through the 
`findNearestVectors` method.

{code:java}
/**
 * For the given query vector, finds an approximate set of nearest neighbors.
 *
 * @param queryVector the query vector.
 * @param k the number of nearest neighbors to return.
 * @param recallFactor a parameter which controls the recall of the search. Higher values
 *     correspond to better recall at the expense of more distance computations. The exact
 *     meaning of this parameter depends on the underlying nearest neighbor implementation.
 */
TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor) throws IOException;
{code}
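
As an exact baseline for the approximate method above, here is a minimal sketch of brute-force k-nearest-neighbor search that only ever moves forward over the documents. `ExactNearestNeighbors` and its array-based input are hypothetical stand-ins chosen to keep the example self-contained; the proposed API would iterate a `VectorValues` and return `TopDocs` instead.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical exact k-NN baseline; vectorsByDoc[doc] stands in for the per-document vector. */
final class ExactNearestNeighbors {

  /** Returns the doc ids of the k vectors closest to the query, nearest first. */
  static int[] search(float[][] vectorsByDoc, float[] query, int k) {
    if (k <= 0) {
      return new int[0];
    }
    // Max-heap on distance: the head is the worst of the best k seen so far.
    // Each entry is {docId, squaredDistance}.
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[1]).reversed());

    // Single forward pass over the documents, as a DocIdSetIterator would allow.
    for (int doc = 0; doc < vectorsByDoc.length; doc++) {
      double dist = squaredDistance(vectorsByDoc[doc], query);
      if (heap.size() < k) {
        heap.add(new double[] {doc, dist});
      } else if (dist < heap.peek()[1]) {
        heap.poll();
        heap.add(new double[] {doc, dist});
      }
    }

    // Polling yields the farthest entry first, so fill the result from the back.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[0];
    }
    return result;
  }

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

Squared distance is used because it ranks neighbors in the same order as euclidean distance and avoids the square root.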

Each format can use its dedicated data structures to perform ANN, without 
exposing the details externally. Adding this method to `VectorValues` lets us 
avoid having a different query type per ANN strategy, like `HNSWQuery`. Note 
that it’s a bit tricky to define a unified method, because each algorithm has 
different search parameters -- HNSW has `ef` to control the size of the 
candidate set, whereas coarse quantization has `numCentroids` to specify the 
number of nearest clusters that should be considered. This proposal takes a 
simple strategy: the implementation is allowed one tuning parameter to control 
recall, and the meaning of the parameter depends on the implementation.
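
The single-tuning-parameter idea can be sketched as follows. Everything here is an assumption made for illustration: `AnnSearcher` is a hypothetical stand-in for the search side of `VectorValues` (returning doc ids instead of `TopDocs`), and `ToyClusterSearcher` is a toy coarse-quantization searcher, not the prototype from LUCENE-9136. It interprets `recallFactor` as the number of nearest clusters to scan; an HNSW implementation could interpret the same argument as `ef`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical unified search method; each implementation defines what recallFactor means. */
interface AnnSearcher {
  int[] findNearestVectors(float[] queryVector, int k, int recallFactor);
}

/** Toy coarse quantization: recallFactor is the number of closest clusters that get scanned. */
final class ToyClusterSearcher implements AnnSearcher {
  private final float[][] centroids;
  private final float[][] vectors; // indexed by doc id
  private final int[] assignment; // doc id -> index of its nearest centroid

  ToyClusterSearcher(float[][] centroids, float[][] vectors) {
    this.centroids = centroids;
    this.vectors = vectors;
    this.assignment = new int[vectors.length];
    for (int doc = 0; doc < vectors.length; doc++) {
      assignment[doc] = rankCentroids(vectors[doc])[0];
    }
  }

  @Override
  public int[] findNearestVectors(float[] queryVector, int k, int recallFactor) {
    // Mark the recallFactor clusters whose centroids are closest to the query.
    Integer[] order = rankCentroids(queryVector);
    int probes = Math.min(Math.max(recallFactor, 1), centroids.length);
    boolean[] probed = new boolean[centroids.length];
    for (int i = 0; i < probes; i++) {
      probed[order[i]] = true;
    }
    // Only documents in probed clusters are scored, which is where recall can be lost.
    List<Integer> candidates = new ArrayList<>();
    for (int doc = 0; doc < vectors.length; doc++) {
      if (probed[assignment[doc]]) {
        candidates.add(doc);
      }
    }
    candidates.sort(
        Comparator.comparingDouble((Integer doc) -> squaredDistance(vectors[doc], queryVector)));
    return candidates.stream().limit(Math.max(k, 0)).mapToInt(Integer::intValue).toArray();
  }

  /** Returns centroid indices sorted by increasing distance to the target. */
  private Integer[] rankCentroids(float[] target) {
    Integer[] order = new Integer[centroids.length];
    for (int c = 0; c < order.length; c++) {
      order[c] = c;
    }
    Arrays.sort(
        order, Comparator.comparingDouble((Integer c) -> squaredDistance(centroids[c], target)));
    return order;
  }

  private static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

With a small `recallFactor` the true nearest neighbor can be missed when it lives in a cluster that was not probed; raising the value trades more distance computations for better recall, without the caller knowing which algorithm is underneath.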

On the write side, we would need to add the ability to buffer + write vectors 
in `DefaultIndexingChain`. This logic would be shared, but the flush and merge 
calls would be delegated to the format. So each implementation could build + 
write the specialized data structures it needs, and define its own way of 
performing merges.
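
A rough sketch of that split between shared buffering and format-specific flushing is below. The names `VectorBuffer` and `VectorsFlushHandler` are hypothetical and do not come from the proposal or from `DefaultIndexingChain`; doc ids and merging are left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-format hook: receives every buffered vector of a field at flush time. */
interface VectorsFlushHandler {
  void flush(String field, List<float[]> vectorsInDocOrder);
}

/** Shared buffering logic that does not depend on the ANN strategy. */
final class VectorBuffer {
  private final String field;
  private final int dimension;
  private final List<float[]> pending = new ArrayList<>();

  VectorBuffer(String field, int dimension) {
    this.field = field;
    this.dimension = dimension;
  }

  /** Buffers one vector; a copy is stored so later changes by the caller have no effect. */
  void add(float[] vector) {
    if (vector.length != dimension) {
      throw new IllegalArgumentException(
          "expected dimension " + dimension + " but got " + vector.length);
    }
    pending.add(vector.clone());
  }

  int size() {
    return pending.size();
  }

  /** Hands the buffered vectors to the format, which builds and writes its own structures. */
  void flush(VectorsFlushHandler format) {
    format.flush(field, new ArrayList<>(pending));
    pending.clear();
  }
}
```

An HNSW format could build its graph inside `flush`, while a coarse quantization format could run clustering there; the buffering code stays the same for both.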

Questions that may come up:
* _Why do we have different implementations of `VectorsFormat`, couldn’t we 
just add an enum to the field info like `Strategy.HNSW` and 
`Strategy.COARSE_QUANTIZATION`?_ It seemed cleaner to keep each strategy’s 
writing/reading logic separate. The current design also makes it possible to 
plug in a custom implementation of `VectorsFormat`.
* _What about different distance metrics like angular and L1 distance?_ This is 
an important aspect to consider. I think that even if we just supported 
euclidean distance at first, it would be really useful. Along with angular 
distance, it’s the distance metric I’ve seen used most frequently in 
applications. And euclidean distance produces the same nearest-neighbor ranking 
as angular distance if the vectors are normalized to unit length.
* _How exactly is this used in a search? Where are the `Query` classes?_ This 
would be the next part of the API to design/ discuss. I think the current 
format proposal could support a few different options.
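
On the distance metric question: for unit-length vectors a and b, ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) = 2 - 2 cos(theta), so squared euclidean distance decreases exactly as cosine similarity increases and both give the same neighbor ordering. The sketch below checks that identity numerically; the class `UnitVectorDistances` and its helper names are made up for this example.

```java
/** Hypothetical helper that checks the euclidean/angular relationship on unit vectors. */
final class UnitVectorDistances {

  /** Returns a copy of v scaled to unit length. */
  static float[] normalize(float[] v) {
    double norm = Math.sqrt(dot(v, v));
    if (norm == 0) {
      throw new IllegalArgumentException("cannot normalize the zero vector");
    }
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  static double squaredEuclidean(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Cosine of the angle between a and b; neither may be the zero vector. */
  static double cosine(float[] a, float[] b) {
    return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
  }

  private static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] a = {3f, 4f};
    float[] b = {1f, 0f};
    double lhs = squaredEuclidean(normalize(a), normalize(b));
    double rhs = 2 - 2 * cosine(a, b);
    // The two values agree up to float rounding.
    System.out.println(lhs + " vs " + rhs);
  }
}
```

So an index that only supports euclidean distance can still serve angular-distance use cases if vectors are normalized before indexing and querying.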

Thanks, and looking forward to your thoughts!
