[jira] [Updated] (LUCENE-10577) Enable quantization of HNSW vectors to 8 bits

2022-08-10 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov updated LUCENE-10577:
-
Summary: Enable quantization of HNSW vectors to 8 bits  (was: Quantize 
vector values)

> Enable quantization of HNSW vectors to 8 bits
> -
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} API handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function.
> Even still, there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts between byte and float when computing dot-product instead of 
> computing directly on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
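
As a rough illustration of the scaling + quantization idea described above, 
here is a minimal sketch (not the committed implementation; names are made up):

{code:java}
// Map floats in [-scale, scale] onto signed bytes in [-127, 127], where
// 'scale' would ideally be max(abs(min-value), abs(max-value)) over the data.
static byte[] quantize(float[] vector, float scale) {
  byte[] out = new byte[vector.length];
  for (int i = 0; i < vector.length; i++) {
    int q = Math.round(vector[i] * (127f / scale));
    out[i] = (byte) Math.max(-127, Math.min(127, q)); // clamp outliers
  }
  return out;
}

// Dot product computed directly on the byte values, avoiding the byte->float
// conversion noted as drawback (2) above.
static int dotProduct(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
{code}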






[GitHub] [lucene] rmuir commented on pull request #1057: LUCENE-10670: Add a codec class to track merge time of each index part

2022-08-10 Thread GitBox


rmuir commented on PR #1057:
URL: https://github.com/apache/lucene/pull/1057#issuecomment-1211278416

   This will actually slow down merges heavily, by preventing things like 
optimized bulk merges of stored fields.
   
   I really don't think we should be doing this with a codec wrapper. You can 
get this data already from InfoStream!
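   
   For reference, a minimal sketch of wiring up InfoStream to see merge 
activity (assuming an existing `Directory dir`; `PrintStreamInfoStream` is the 
stock console implementation, and merge messages show up under components such 
as IW, MS, and MP):
   
   ```java
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.util.PrintStreamInfoStream;
   
   // Route Lucene's diagnostics, including per-merge messages, to stdout.
   IndexWriterConfig iwc = new IndexWriterConfig();
   iwc.setInfoStream(new PrintStreamInfoStream(System.out));
   IndexWriter writer = new IndexWriter(dir, iwc);
   ```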





[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-08-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578150#comment-17578150
 ] 

ASF subversion and git services commented on LUCENE-10577:
--

Commit a693fe819b04f07942bb1bcbc28169838f1becfc in lucene's branch 
refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a693fe819b0 ]

LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)

* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 
bit precision







[GitHub] [lucene] msokolov merged pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-10 Thread GitBox


msokolov merged PR #1054:
URL: https://github.com/apache/lucene/pull/1054





[jira] [Commented] (LUCENE-10678) computing the partition point on a BKD tree merge can overflow

2022-08-10 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578098#comment-17578098
 ] 

Ignacio Vera commented on LUCENE-10678:
---

The error from this bug looks like:
{code:java}
: partitionPoint must be >= from
    at __randomizedtesting.SeedInfo.seed([FD9FF7A242381652:E43C04A83ACC9B76]:0)
    at 
org.apache.lucene.util.bkd.BKDRadixSelector.checkArgs(BKDRadixSelector.java:140)
    at 
org.apache.lucene.util.bkd.BKDRadixSelector.select(BKDRadixSelector.java:107)
    at org.apache.lucene.util.bkd.BKDWriter.build(BKDWriter.java:2033)
    at org.apache.lucene.util.bkd.BKDWriter.finish(BKDWriter.java:974)
 {code}
 







[GitHub] [lucene] iverase opened a new pull request, #1065: LUCENE-10678: Fix possible overflow when computing the partition point on the BKD tree

2022-08-10 Thread GitBox


iverase opened a new pull request, #1065:
URL: https://github.com/apache/lucene/pull/1065

   We currently compute the partition point for a set of points by multiplying 
the number of nodes that need to be on the left of the BKD tree by 
maxPointsInLeafNode. This multiplication is done in integer space, so if the 
partition point is bigger than Integer.MAX_VALUE it will overflow.
   
   This may happen in high-dimension cases (numDims > 1) and when documents 
are multi-valued.
   
   This PR moves the multiplication to long space so it doesn't overflow.
   
   In order to test it, I modified `Test2BBKDPoints` to index 4 billion 
points instead, and the test is renamed accordingly. That should be fine: this 
test was developed before we improved the efficiency of the tree, so CI should 
be OK running it in Monster runs.
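   
   The essence of the fix, sketched with hypothetical variable names (not the 
exact `BKDWriter` code):
   
   ```java
   // Illustrative only: the int*int product wraps before being widened.
   int leftLeaves = 3_000_000;
   int maxPointsInLeafNode = 1_024;
   
   long overflowed = leftLeaves * maxPointsInLeafNode;            // -1222967296
   
   // Fixed: widen one operand so the multiplication happens in long space.
   long partitionPoint = (long) leftLeaves * maxPointsInLeafNode; // 3072000000
   ```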
   





[jira] [Created] (LUCENE-10678) computing the partition point on a BKD tree merge can overflow

2022-08-10 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-10678:
-

 Summary: computing the partition point on a BKD tree merge can 
overflow
 Key: LUCENE-10678
 URL: https://issues.apache.org/jira/browse/LUCENE-10678
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ignacio Vera


I just discovered a bad bug in the BKD tree when doing merges. Before calling 
the BKDRadixSelector we need to compute the partition point, which is done by 
multiplying two integers. If the partition point is > Integer.MAX_VALUE then it 
will overflow.

https://github.com/apache/lucene/blob/35ca2d79f73c6dfaf5e648fe241f7e0b37084a90/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java#L2021

 






[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-08-10 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577970#comment-17577970
 ] 

Michael Sokolov commented on LUCENE-10471:
--

> Maybe I do not understand the code base of Lucene well enough, but wouldn't 
> it be possible to have a default limit of 1024 or 2048, and allow setting a 
> different limit programmatically on the IndexWriter/Reader/Searcher?

I think the idea is to protect ourselves from accidental booboos; this could 
eventually get exposed in some shared configuration file, and then if somebody 
passes MAX_INT it could lead to allocating huge buffers somewhere and taking 
down a service shared by many people/groups? Hypothetical, but it's basically 
following the principle that we should be strict to help stop people shooting 
themselves and others in the feet. We may also want to preserve our ability to 
introduce optimizations that rely on some limits to the size, which would 
become difficult if usage of larger sizes became entrenched. (We can't so 
easily take it back once it's out there.) Having said that, I still feel a 16K 
limit, while allowing for models that are beyond reasonable, wouldn't cause any 
of these sorts of issues, so that's the number I'm advocating.

> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple of well-known models that produce vectors with > 1024 
> dimensions (e.g. 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` will satisfy these use cases.
> I am wondering if anybody has strong objections against this.






[jira] [Comment Edited] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-10 Thread David Turner (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577853#comment-17577853
 ] 

David Turner edited comment on LUCENE-10677 at 8/10/22 9:42 AM:


> I'm opposed to the use of string.intern by the lucene library here. It is 
> inappropriate for a library (versus an app)

I think that's reasonable, `String#intern` is a pretty blunt tool to be using 
here. And yet it does seem awfully wasteful to burn so much heap on these 
things. "Buy more RAM" is not a great answer (implicitly this means "... or go 
and find a cheaper alternative elsewhere" and folks are indeed willing to do 
that). The next scaling limit in this dimension appears to be quite far off 
which is why we think this is worth addressing. (edit to add: these strings 
appear to roughly double the heap needed for each `SegmentReader` object)

Are there any other approaches you'd suggest? It looks like we might be able to 
intercept the relevant calls to `DataInput#readString` ourselves, although 
adding support for compound segments introduces an enormous amount of extra 
complexity to that approach. Would it work to introduce some simpler way for an 
application to hook in some kind of string deduplication mechanism even if it 
goes unused in pure Lucene by default?


was (Author: david turner):
> I'm opposed to the use of string.intern by the lucene library here. It is 
> inappropriate for a library (versus an app)

I think that's reasonable, `String#intern` is a pretty blunt tool to be using 
here. And yet it does seem awfully wasteful to burn so much heap on these 
things. "Buy more RAM" is not a great answer (implicitly this means "... or go 
and find a cheaper alternative elsewhere" and folks are indeed willing to do 
that). The next scaling limit in this dimension appears to be quite far off 
which is why we think this is worth addressing.

Are there any other approaches you'd suggest? It looks like we might be able to 
intercept the relevant calls to `DataInput#readString` ourselves, although 
adding support for compound segments introduces an enormous amount of extra 
complexity to that approach. Would it work to introduce some simpler way for an 
application to hook in some kind of string deduplication mechanism even if it 
goes unused in pure Lucene by default?

> Duplicate strings in FieldInfo#attributes contribute significantly to heap 
> usage at scale
> -
>
> Key: LUCENE-10677
> URL: https://issues.apache.org/jira/browse/LUCENE-10677
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 9.3
>Reporter: Armin Braun
>Priority: Minor
>  Labels: heap, scalability
> Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process 
> with thousands of fields across many indexes will lead to a lot of duplicate 
> strings retained as keys and values in the `attributes` map. This can amount 
> to GBs of heap for thousands of fields across a few thousand segments. The 
> strings in the below heap dump analysis account for more than half (roughly 
> 2/3, and the field names are somewhat unusually long in this example) of the 
> duplicate strings from `FieldInfo` instances.
> If we could deduplicate these obvious known strings when reading `FieldInfo` 
> we could save GBs of heap for use cases like this.






[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

2022-08-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577859#comment-17577859
 ] 

Dawid Weiss commented on LUCENE-10677:
--

> It looks like we might be able to intercept the relevant calls to 
> `DataInput#readString` ourselves, although adding support for compound 
> segments introduces an enormous amount of extra complexity to that approach. 

With the right tools it shouldn't be a problem. A hot-mode aspectj aspect that 
would deduplicate those strings selectively, where it matters, comes to mind.

This said, perhaps there are cleaner solutions to solve this elegantly. Feel 
free to propose a patch (but no String.intern, please...).
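
For illustration, a deduplication hook could be as simple as the following 
sketch (purely hypothetical, not an existing Lucene API):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Returns a canonical instance for each distinct string, so duplicate
// FieldInfo attribute keys/values can share storage without String.intern.
public final class StringDeduplicator {
  private final Map<String, String> canonical = new ConcurrentHashMap<>();

  public String dedup(String s) {
    String prev = canonical.putIfAbsent(s, s);
    return prev != null ? prev : s;
  }
}
{code}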









[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-08-10 Thread Marcus Eagan (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577849#comment-17577849
 ] 

Marcus Eagan commented on LUCENE-10471:
---

[~michi] You are free to increase the dimension limit, as it is a static 
variable and Lucene is your oyster. However, [~ehatcher] has seared into my 
mind that a long-term fork of Lucene is a bad idea for many reasons.

[~rcmuir] I agree with you on "whatever shitty models." They are here, and 
more are coming. With respect to the vector API, Oracle is doing an interesting 
bit of work in OpenJDK 17 to improve their vector API. They've added support 
for Intel's short vector math library, which should improve performance. The 
folks at OpenJDK exploit the Panama APIs. There are several hardware 
accelerations they have yet to exploit, and many operations will fall back to 
scalar code.
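
As a rough sketch of what the incubating Panama Vector API looks like in JDK 
17 (requires --add-modules jdk.incubator.vector; note the scalar fallback for 
the loop tail):

{code:java}
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class VectorDot {
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Dot product over as many full SIMD lanes as fit, then a scalar tail.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    int i = 0;
    for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
    }
    for (; i < a.length; i++) {
      sum += a[i] * b[i]; // unsupported operations fall back to scalar code
    }
    return sum;
  }
}
{code}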

My argument for increasing the dimension limit is not to suggest that there is 
a better fulcrum in the performance tradeoff balancer, but that more users 
testing Lucene is good for improving the feature.

OpenAI's Da Vinci is one such model, but not the only one.

I've had customers ask for 4096 based on the performance they observe with 
question answering. I'm waiting on the model and will share when I know. If 
customers want to introduce rampant numerical errors in their systems, there is 
little we can do for them. Don't take my word on any of this yet. I need to 
bring data and complete evidence. I'm asking my customers why they cannot do 
dimensionality reduction.







[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-08-10 Thread Michael Wechner (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577811#comment-17577811
 ] 

Michael Wechner commented on LUCENE-10471:
--

Maybe I do not understand the code base of Lucene well enough, but wouldn't it 
be possible to have a default limit of 1024 or 2048, and allow setting a 
different limit programmatically on the IndexWriter/Reader/Searcher?
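
Something along these lines, for example (an entirely hypothetical sketch; 
today the maximum is a hard-coded constant, not a configurable setting):

{code:java}
// Hypothetical sketch of a programmable limit; not an existing Lucene API.
public final class HypotheticalVectorLimits {
  public static final int DEFAULT_MAX_DIMENSIONS = 1024;
  private static volatile int maxDimensions = DEFAULT_MAX_DIMENSIONS;

  // An application that understands its memory budget could raise this.
  public static void setMaxDimensions(int dims) {
    if (dims <= 0) {
      throw new IllegalArgumentException("dims must be positive: " + dims);
    }
    maxDimensions = dims;
  }

  public static int getMaxDimensions() {
    return maxDimensions;
  }
}
{code}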



