Closing the poll after one week, these are the results:
Option 2-4: 9 votes
(make the limit configurable, potentially moving the limit to the appropriate place)
Option 3: 5 votes
(keep it as it is (1024) but move it to a lower level, into the HNSW-specific implementation)
Option 1: 0 votes
(keep it as it is)
I vote for option 3.
Then, as follow-up work, we could have a simple extension codec in the "codecs"
package which (1) is not backward compatible, and (2) has a higher or
configurable limit. That way users can use this codec directly without any
additional code.
Thanks to everyone involved so far!
I confirm that the proper subject should have been [POLL] rather than [VOTE];
apologies for the confusion.
We are in the middle of the poll and this is the summary so far (ordered by
preference):
Option 2-4: 9 votes
(make the limit configurable, potentially moving the limit to the appropriate place)
It's difficult to keep up with this topic when it's spread across issues, PRs,
and mailing lists. My poll response is option 3. -1 to option 2; I think the
configuration should be moved to the HNSW-specific implementation. At this
point of technical maturity, it doesn't make sense (to me) to have the
On 18.05.23 at 12:22, Michael McCandless wrote:
I love all the energy and passion going into debating all the ways to
poke at this limit, but please let's also spend some of this passion
on actually improving the scalability of our aKNN implementation!
E.g. Robert opened an exciting
It is basically the code which Michael Sokolov posted at
https://markmail.org/message/kf4nzoqyhwacb7ri
except that
- I have replaced KnnVectorField with KnnFloatVectorField, because
KnnVectorField is deprecated.
- I don't hard-code the dimension as 2048 and the metric as
EUCLIDEAN, but
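For readers following along, the approach described above can be sketched roughly as below. This is an unverified sketch against the Lucene 9.x API (class and helper names here are illustrative, not from the original posting): instead of going through FieldType.setVectorAttributes(), which enforces MAX_DIMENSIONS, an anonymous FieldType subclass overrides the accessors so the check never runs, and the dimension and metric are parameters rather than hard-coded.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

// Illustrative helper (assumes Lucene 9.x APIs; not part of Lucene itself).
public final class HighDimVectorFields {

  // Build a FieldType that reports the requested dimension directly,
  // bypassing the MAX_DIMENSIONS check in setVectorAttributes().
  static FieldType vectorFieldType(int dims, VectorSimilarityFunction sim) {
    FieldType ft =
        new FieldType() {
          @Override
          public int vectorDimension() {
            return dims; // e.g. 1536 for text-embedding-ada-002
          }

          @Override
          public VectorEncoding vectorEncoding() {
            return VectorEncoding.FLOAT32;
          }

          @Override
          public VectorSimilarityFunction vectorSimilarityFunction() {
            return sim;
          }
        };
    ft.freeze();
    return ft;
  }

  // Dimension and similarity are taken from the data, not hard-coded.
  static KnnFloatVectorField vectorField(
      String name, float[] vector, VectorSimilarityFunction sim) {
    return new KnnFloatVectorField(name, vector, vectorFieldType(vector.length, sim));
  }
}
```

The key point is the KnnFloatVectorField constructor that accepts a caller-supplied FieldType, which is what makes the limit circumventable at the API level.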
This isn't really a VOTE (no specific code change is being proposed), but
rather a poll?
Anyway, I would prefer Option 3: put the limit check into the HNSW
algorithm itself. This is the right place for the limit check, since HNSW
has its own scaling behaviour. It might have other limits, like
That's great and a good plan B, but let's try to keep this thread focused on
collecting votes for a week (let's keep discussions on the nice PR opened
by David, or in the discussion thread we already have on the mailing list :)
On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
wrote:
That sounds promising, Michael. Can you share scripts/steps/code to
reproduce this?
On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
wrote:
I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and it works very
well :-)
Thanks
Michael
On 18.05.23 at 00:29, Michael Wechner wrote:
IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using float as vector values, right?
On 17.05.23 at 16:41, Michael Sokolov wrote:
see https://markmail.org/message/kf4nzoqyhwacb7ri
On Wed, May 17, 2023 at 10:09 AM David Smiley wrote:
Thanks Michael for sharing your code snippet on how to circumvent the
limit. My reaction to this is the same as Alessandro.
I just created a PR to make the limit configurable:
https://github.com/apache/lucene/pull/12306
If there is to be a veto presented to the PR, it should include technical
I'm trying to better understand the code. So IIUC, vector MAX_DIMENSIONS is
currently used inside:
lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java
Alessandro,
Thanks for raising the code of conduct; it is very discouraging and
intimidating to participate in discussions where such language is used
especially by senior members.
Michael S.,
thanks for your suggestion; that's what we used in Elasticsearch to
raise the dims limit. And Alessandro,
Thanks, Michael,
that example backs even more strongly the need to clean this up and make
the limit configurable without the need for custom field types, I guess (I
was taking a look at the code again, and it seems the limit is also checked
twice: in
see https://markmail.org/message/kf4nzoqyhwacb7ri
On Wed, May 17, 2023 at 10:09 AM David Smiley wrote:
> easily be circumvented by a user
This is a revelation to me and others, if true. Michael, please then point
to a test or code snippet that shows the Lucene user community what they
want to see so they are unblocked from their explorations of vector search.
~ David Smiley
Apache Lucene/Solr
I think I've said before on this list we don't actually enforce the limit
in any way that can't easily be circumvented by a user. The codec already
supports any size vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int sized vectors
As a reminder this isn't the Disney Plus channel and I'll use strong
language if I fucking want to.
On Wed, May 17, 2023, 4:45 AM Alessandro Benedetti
wrote:
Robert,
A gentle reminder of the
https://www.apache.org/foundation/policies/conduct.html.
I've read many e-mails about this topic that ended up in a tone that is not
up to the standard of a healthy community.
To be specific and pragmatic: how you addressed Gus here, how you addressed
the rest of
We agree backwards compatibility with the index should be maintained and
that checkIndex should work. And we agree on a number of other things, but
I want to focus on configurability.
As long as the index contains the number of dimensions actually used in a
specific segment & field, why couldn't
Hi Robert,
If you read the issue I opened more carefully you'll see I had all the
service loading stuff sorted just fine. It's the silent eating of the
security exceptions by URLClassPath that I think is a useful thing to point
out. If anything, that ticket is more about being surprised by
My problem is that it impacts the default codec, which has been supported by
our backwards-compatibility policy for many years. We can't just let the user
determine backwards compatibility with a sysprop. How will checkIndex work?
We have to have bounds and also allow for more performant implementations
Robert, I have not heard from you (or anyone) an argument against System
property based configurability (as I described in Option 4 via a System
property). Uwe notes wisely some care must be taken to ensure it actually
works. Sure, of course. What concerns do you have with this?
~ David Smiley
By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the
HNSW-specific code.
This way, someone can write an alternative codec with vectors using some
other, completely different approach that incorporates a different, more
appropriate limit (maybe lower, maybe higher) depending upon their
Gus, I think I explained myself multiple times in issues and in this
thread. The performance is unacceptable, everyone knows it, but nobody is
talking about it.
I don't need to explain myself time and time again here.
You don't seem to understand the technical issues (at least you sure as
fuck don't
Hi all,
Great to have this discussion!
My votes are for 2 and 4!
Best,
Pandu
On 2023/05/16 08:50:24 Alessandro Benedetti wrote:
> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
My non-binding vote:
Option 2 = Option 4 > Option 1 > Option 3
Explanation: Lucene's somewhat arbitrary limit of 1024 does not currently
affect the raw, low-level HNSW, which is what I am plugging into
Cassandra. The only option that would break this code is option 3.
P.S. I mentioned this in
Even if the options can basically be summarised into two groups (make it
configurable vs. not making it configurable and leaving it be), when I
collected the options from people I ended up with these four, and I didn't
want to collapse any of them (potentially making the proposers feel
diminished).
Actually, I had wondered whether this is a proper vote thread or not;
normally those are yes/no on a single option.
On Tue, May 16, 2023 at 10:47 AM Alessandro Benedetti
wrote:
Hi Marcus,
I am afraid at this stage Robert's opinion counts just as any other
opinion, a single vote for option 1.
We are collecting the community's feedback here; we are not changing any code
nor voting yes/no.
Once the voting is finished, we'll take action depending on the
community's
Given that Robert has put in his veto, aren't we clear on what we need to
do for him to change his mind? He's been pretty clear, and the rules of veto
are cut and dried.
Most of the people that have contributed to kNN vectors recently are not
even on the thread. I think improving the feature should
+1 on the combination of #3 and #4.
Also good things to make sure of Uwe, thanks for calling those out.
(Especially about the limit only being used on write, not on read).
- Houston
On Tue, May 16, 2023 at 9:57 AM Uwe Schindler wrote:
I agree with Dawid,
I am +1 for those two options in combination:
* option 3 (make the limit an HNSW-specific thing). New formats may use
other limits (lower or higher).
* option 4 (make a system property with an HNSW prefix). Adding the
system property must be done in the same way as new
I'm for option 3 (limit at algorithm level), with the default there tunable
via property (option 4).
I understand Robert's concerns and I'd love to contribute a faster
implementation but the reality is - I can't do it at the moment. I feel
like experiments are good though and we shouldn't just
My vote is for option 3. It prevents Lucene from having the limit increased,
while allowing others who implement a different codec to set a limit of their
choosing.
Though I don't know the historical reasons for putting specific
configuration items at the codec level. This limit is performance-related
and
+1 to Gus' reply.
I think that Robert's veto or anyone else's veto is fair enough, but I
also think that anyone who is vetoing should be very clear about the
objectives / goals to be achieved, in order to get a +1.
If no clear objectives / goals can be defined and agreed on, then the
whole
Robert,
Can you explain in clear technical terms the standard that must be met for
performance? A benchmark that must run in X time on Y hardware for example
(and why that test is suitable)? Or some other reproducible criteria? So
far I've heard you give an *opinion* that it's unusable, but
My non-binding vote goes to Option 2 and, respectively, Option 4.
Thanks
Michael Wechner
On 16.05.23 at 10:51, Alessandro Benedetti wrote:
I still feel -1 (veto) on increasing this limit. Sending more emails does
not change the technical facts or make the veto go away.
On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
wrote:
For simplicity's sake, let's consider Options 2 and 4 as equivalent, as they
are not mutually exclusive and just differ on a minor implementation
detail.
On Tue, 16 May 2023, 10:24 Alessandro Benedetti,
wrote:
Option 4 also aims to refactor the limit into an appropriate place in the
code (short answer: yes, implementation details).
Cheers
On Tue, 16 May 2023, 10:04 Michael Wechner,
wrote:
Hi Alessandro
Thank you very much for summarizing and starting the vote.
I am not sure whether I really understand the difference between Option
2 and Option 4, or is it just about implementation details?
Thanks
Michael
On 16.05.23 at 10:50, Alessandro Benedetti wrote:
My vote goes to *Option 4*.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io