Hi Adrien, I thought we had another week? I looked back at old emails and
thought you had targeted Sep 22 for feature freeze?
On Fri, Sep 13, 2024, 7:45 AM Adrien Grand wrote:
> Hello everyone,
>
> As previously discussed, I plan on feature freezing Lucene 9.12 and Lucene
> 10.0 next week. Prac
Hi, I've been looking into Adrien's suggestion to migrate
(Byte/Float)VectorValues to an unabashedly random-access API. We can
easily enough support iteration on top of that (which we use
extensively during indexing). I think this would represent a great
simplification; preliminary implementation s
Maybe getSlices has some side effect that messes up createWeight?
On Fri, Aug 16, 2024, 7:10 AM Michael Sokolov wrote:
> That is super weird. I wonder if changing the names of variables will make
> a difference. Have you verified that this effect is observable during all
> lunar phas
That is super weird. I wonder if changing the names of variables will make
a difference. Have you verified that this effect is observable during all
lunar phases?
I assume we looked at any profiler diffs we could get our hands on? If
not, maybe something would show up there.
On Thu, Aug 15, 2024,
(
TooComplexToDeterminizeException.class,
() -> {
new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
});
}
On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov wrote:
>
> Yes, I think degenerate regexes like *a* are potentially costly.
> Actually some
Yes, I think degenerate regexes like *a* are potentially costly.
Actually something like *Ⱗ* is probably worse since yeah it would need
to scan the entire FST (which probably has some a's in it?)
I don't see any way around that aside from: (1) telling user don't do
that, or (2) putting some accoun
Welcome Armin!
On Fri, Jul 26, 2024 at 7:24 PM Greg Miller wrote:
>
> Welcome Armin!
>
> On Fri, Jul 26, 2024 at 10:51 AM Patrick Zhai wrote:
>>
>> Congrats and welcome, Armin!
>>
>>
>> On Fri, Jul 26, 2024, 10:30 Vigya Sharma wrote:
>>>
>>> Congratulations and welcome, Armin! Volunteering as a
ah that helps, thanks
On Tue, Jul 2, 2024 at 2:41 PM Robert Muir wrote:
>
> On Tue, Jul 2, 2024 at 1:59 PM Michael Sokolov wrote:
> >
> > Hi all - I wonder if anyone else is observing weird email behavior
> > from Github. I'm starting to see emails generated fro
Hi all - I wonder if anyone else is observing weird email behavior
from Github. I'm starting to see emails generated from PRs and issues
that are wildly out of date. Like one dated yesterday that was
generated from a comment that is weeks old. And I am missing many
current updates -- as if there is
SUCCESS! [0:55:48.190137]
(tested w/Corretto JDK)
+1
On Mon, Jun 24, 2024 at 8:01 AM Benjamin Trent wrote:
>
> SUCCESS! [0:40:46.898514]
>
> +1
>
> On Mon, Jun 24, 2024 at 1:29 AM Ignacio Vera wrote:
> >
> > Please vote for release candidate 1 for Lucene 9.11.1
> >
> >
> > The artifacts can be
Thanks for digging into this Dawid - I think it's important to keep an
IDE dev path pretty clear of underbrush in order to encourage new
joiners, even if it is not the primary or best means of building and
testing
On Thu, Jun 13, 2024 at 2:01 PM Dawid Weiss wrote:
>
>
> Hi Mike,
>
> Just FYI - I
then re-scan to do the actual quantization?
>
> I am not sure what you mean here by "merge the float vectors". If you
> mean simply reading the individual float vector files and combining
> them into a single file, we already do that separately from
> quantizing.
>
>
Hi folks. I've been experimenting with our new scalar quantization
support - yay, thanks for adding it! I'm finding that when I index a
large number of large vectors, enabling quantization (vs simply
indexing the full-width floats) requires more heap - I keep getting
OOMs and have to increase heap
If I set IJ build/test to "gradle" and then right click on "core" in
the Project tab -- it gives an option like "run tests in
lucene-root.lucene.core" which works. At the very top (lucene
[lucene-root]) of the hierarchy you can right-click and select "run
all tests", but this fails with "Error runn
>
> Yet I feel certain I have been able to run all tests in IJ before.
>
>
>
> I don't think this was ever the case with intellij. Or maybe you ran those
> tests via gradle?
When I say "run in IJ" I mean I right clicked a button somewhere and said
"run all tests" :) I expect it was with the gradl
OK, I can see how the directory structure might be at odds
w/intellij's view of the world. Yet I feel certain I have been able to
run all tests in IJ before.
Just to disconfirm my insanity I tried again building and running all
tests in core on branch_9x/main using both intellij and gradle
build/te
hould work.
>
> Running via gradle is slow for me not just with Lucene but also with other
> projects... I can take a look but I'm pessimistic I can do any wonders here.
>
> Dawid
>
> On Fri, Jun 7, 2024 at 6:06 PM Michael Sokolov wrote:
>>
>
ule permissions thing
controlling the visibility of these symbols?
On Fri, Jun 7, 2024 at 11:53 AM Michael Sokolov wrote:
>
> hm I found FakeCharFilterFactory in src/test/META-INF.services -- it's
> in a "test sources root" folder and won't allow itself to be set as
ssing. This
can't be this hard!
On Fri, Jun 7, 2024 at 11:44 AM Michael Sokolov wrote:
>
> hmm so after playing around with this Intellij build for a bit I ran
> into some trouble -- all the tests relying on SPI seemed to start
> failing. So then I switched back to build with G
n 7, 2024 at 10:40 AM Michael Sokolov wrote:
>
> ok, life must be scary for developers on windows!
>
> On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss wrote:
> >
> >
> > Certain regenerate tasks do require perl and python indeed.
> >
> > On Fri, Jun 7, 2024 a
ok, life must be scary for developers on windows!
On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss wrote:
>
>
> Certain regenerate tasks do require perl and python indeed.
>
> On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov wrote:
>>
>> While editing this CONTRIBUTI
While editing this CONTRIBUTING.md I found the following statement:
Some build tasks (in particular `./gradlew check`) require Perl
and Python 3.
Is it actually true that we require Perl?
On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov wrote:
>
> So I'm glad we have a fix for thi
me problem and it seems better now. Thank you, Dawid!
>
> On Thu, 6 Jun 2024 at 12:20, Michael Sokolov wrote:
>>
>> Oh! TIL! so much better, thanks. And now I have the "Repeat" option
>> back in the test runner
>>
>> On Thu, Jun 6, 2024 at
ly. Switch it to compile and run using its
> own built-in method - much faster.
>
>
>
> Dawid
>
> On Thu, Jun 6, 2024 at 12:10 PM Michael Sokolov wrote:
>>
>> Hi, I wonder how many of us are using intellij to run Lucene tests, and if
>> you are, have you notic
Hi, I wonder how many of us are using intellij to run Lucene tests, and if
you are, have you noticed it having gotten really quite slow? It seems to
take a long time doing... Something... Before the test starts running. I
have a suspicion that we are using gradle in a way that forces it to
rebuild
+1
(tested w/Amazon Corretto JVM)
SUCCESS! [0:46:40.066524]
On Mon, Jun 3, 2024 at 7:30 AM Benjamin Trent wrote:
>
> Please vote for release candidate 1 for Lucene 9.11.0
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.0-RC1-rev-d433394b292e3
I misread this as "Lucene 911" as in "Lucene Emergency!!!" -- might
not land for everyone - someday we will have Lucene 11.2? But ... no
concerns from me aside from the things you mentioned - thanks for
pushing, Ben
On Tue, May 28, 2024 at 9:58 AM Benjamin Trent wrote:
>
> Hey y'all,
>
> I am pla
I'm pretty sure it's only in core that we follow the no dependencies rule.
On Sat, May 18, 2024, 11:25 AM Bruno Roustant
wrote:
> The facet module has a dependency on com.carrotsearch:hppc.
>
> Is it possible to add the same dependency to the join module ? What is the
> rule ?
>
> Thanks
>
> Bru
We use it at Amazon. I can't really read it so I'm not sure, but I think
it's used to encode terms that come up that aren't handled well by the
standard dictionary.
On Sat, May 18, 2024 at 8:39 AM Bruno Roustant wrote:
>
> Hi,
>
> While looking at the various usages of Map with Integer keys, I found
Thanks for the explanation. It makes sense that we start with a given
seed and then each iteration is different because it re-uses the same
Random instance (or whatever static state?) without re-initialization?
On Wed, Apr 3, 2024 at 6:09 PM Dawid Weiss wrote:
>
>
>> Now I just need to understand
>> <https://github.com/apache/lucene/blob/main/gradle/testing/beasting.gradle#L62-L66>
>> in beasting.gradle
>> <https://github.com/apache/lucene/blob/main/gradle/testing/beasting.gradle>
>> .
>>
>> - Shubham
>>
>> On Wed, Apr 3, 2024 at 1:49 AM Mi
14 PM Michael Sokolov wrote:
>
> Is there a convenient way to run a test multiple times with different
> seeds? Do I need to write my own script? I feel like I used to be able
> to do this in IntelliJ, but that option seems to have vanished, and I
> don't see any such option in
Is there a convenient way to run a test multiple times with different
seeds? Do I need to write my own script? I feel like I used to be able
to do this in IntelliJ, but that option seems to have vanished, and I
don't see any such option in gradle testOpts either. I tried
-tests.iter but that seems
This TestBooleanMinShouldMatch.testRandomQueries failure did not
reproduce for me on branch_9x, with JDK 11 or JDK 17 or JDK 21. I ran
it a few times.
TestByteVectorSimilarityQuery.testSomeDeletes reproduces reliably -
I'll see if I can find out why it's unstable
On Mon, Apr 1, 2024 at 9:50 AM Po
timing makes sense to me. +1 for having a deadline to reduce
procrastination, but Adrien I don't honestly believe anyone who is
paying attention thinks that is what you have been doing!
On Wed, Mar 13, 2024 at 10:40 AM Adrien Grand wrote:
>
> Hello everyone!
>
> It's been ~2.5 years since we rele
Chrome on a Macbook, it's super dark. I can make
> it out but I gotta stare for a bit ... do they make light and dark mode
> .ico files in one!?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Feb 25, 2024 at 6:05 PM Michael Sokolov
> wrote:
>
Welcome and congratulations, Chao!
On Sat, Feb 24, 2024 at 8:51 PM Christian Moen wrote:
>
> Congrats, Chao!
>
> On Wed, Feb 21, 2024 at 2:28 AM Adrien Grand wrote:
>>
>> I'm pleased to announce that Zhang Chao has accepted the PMC's
>> invitation to become a committer.
>>
>> Chao, the tradition
+1
On Fri, Feb 23, 2024 at 7:08 PM Stefan Vodita wrote:
>
> +1
>
> On Fri, 23 Feb 2024 at 11:24, Chris Hegarty
> wrote:
>>
>> Hi,
>>
>> Since the discussion on bumping the Lucene main branch to Java 21 is winding
>> down, let's hold a vote on this important change.
>>
>> Once bumped, the next
here is a favicon you might want to try: I cropped the "VL" from the
Apache Lucene logo (ok I guess it's an AL) -- if you save it as
favicon.ico in the root of your website (ie as url /favicon.ico) it
should show up in bookmarks, browser toolbars, etc as a handy memory
aid. Of course you might have
I love the gray all-text UI. Don't change it! But I wonder if it's time for
a favicon?
On Tue, Feb 20, 2024, 4:40 AM Adrien Grand wrote:
> Very cool, thank you Mike!
>
> On Mon, Feb 19, 2024 at 5:40 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> ~1.5 years ago (A
Hello Stefan, welcome!
On Fri, Jan 19, 2024 at 10:41 AM Martin Gainty wrote:
> Congratulations Stefan!
>
> I look forward to reading your posts
>
> ~martin
> --
> *From:* Michael McCandless
> *Sent:* Thursday, January 18, 2024 10:53 AM
> *To:* dev@lucene.apache.org
+1
SUCCESS! [0:50:50.776559]
Note: we did get some test fails on the mailing list this morning, but I
believe they are not real bugs and will be resolved by tightening up our
test assumptions
On Thu, Dec 14, 2023 at 7:08 AM Guo Feng wrote:
> +1
>
> SUCCESS! [3:38:43.833896]
>
> On 2023/12/14 1
SUCCESS! [0:46:20.693134]
+1
On Thu, Nov 30, 2023 at 5:50 PM Tomás Fernández Löbbe
wrote:
> SUCCESS! [0:52:49.337126]
>
> +1
>
> On Thu, Nov 30, 2023 at 12:05 PM Benjamin Trent
> wrote:
>
>> SUCCESS! [0:44:05.132154]
>>
>> +1
>>
>> On Thu, Nov 30, 2023 at 1:09 PM Chris Hegarty
>> wrote:
>>
>>
for the sake of posterity, I did get a successful smoketest:
SUCCESS! [1:00:06.512261]
but +0 to release I guess since it's moot...
On Thu, Nov 30, 2023 at 10:38 AM Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Thu, Nov 30, 2023 at 9:56 AM Chris Hegarty
> wrote:
>
> P.S. I’m less
Another way is to ensure that all documents get updated on a regular
cadence whether there are changes in the underlying data or not. Or,
regenerating the index from scratch all the time. Of course these
approaches might be more costly for an index that has intrinsically low
update rates, but they
+1 thanks for volunteering!
Hijacking the thread a bit, sorry, I started looking into whether this is a
good time to start looking ahead to 10? I know we had some rumblings about
releasing that so we can start requiring newer JDKs. But looking at CHANGES
it feels like we already back-ported most o
did you add to the sandbox META-INF file? It looks like maybe sandbox is
not included in the scope of the test, but you didn't say which test it
was. Is the test also in the sandbox module?
On Mon, Nov 20, 2023 at 6:56 PM Dongyu Xu wrote:
> Hi devs,
>
> I tried to plug in my experimental Posting
Welcome, Patrick!
On Sun, Nov 12, 2023, 2:12 AM Ignacio Vera wrote:
> Welcome Patrick!
>
> On Sat, Nov 11, 2023 at 3:29 PM Uwe Schindler wrote:
>
>> Welcome Patrick!
>>
>> Uwe
>>
>>
>> Am 10. November 2023 21:04:32 MEZ schrieb Michael McCandless <
>> luc...@mikemccandless.com>:
>>
>>> I'm happy
Can you require the user to specify missing: true or missing: false
semantics? With that you can decide what to do with the missing values.
On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev wrote:
> Hello Michael.
> This optimization "NOT the less common value" assumes that boolean field
> is require
It's not just you - we have an internal JDK11 fork at BIG COMPANY for some
folks that can't get off the stick. To be fair it's challenging because
they have to shift all their dependencies. I think Spark was the one
mentioned by one group, but there is a JDK17-based release of Spark, so
clearly not
Personally for me it's about how meaningful the commit messages (and
contents) are vs whether we use merge commits or not. If it's a long series
of "fixed bug" "reformatted" "did stuff" "more stuff" "it finally works"
and so on ... that doesn't smell good to me, but you know we all have done
that f
Welcome, gf2121!
On Wed, Oct 25, 2023, 3:03 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:
> Congratulations and welcome, Feng!
>
> On Tue, 24 Oct 2023 at 22:35, Adrien Grand wrote:
>
>> I'm pleased to announce that Guo Feng has accepted an invitation to join
>> the Lucene PMC!
>>
>
Congratulations and welcome, Luca!
On Sun, Oct 22, 2023 at 1:42 PM Julie Tibshirani wrote:
>
> Congratulations Luca!!
>
> On Fri, Oct 20, 2023 at 1:45 AM Bruno Roustant
> wrote:
>>
>> Welcome, congratulations!
>>
>> Le ven. 20 oct. 2023 à 10:02, Dawid Weiss a écrit :
>>>
>>>
>>> Congratulation
moment.
>
> Uwe
>
> Am 22.10.2023 um 01:37 schrieb Michael Sokolov:
> > Thanks for digging into this. I do think it will be helpful for
> > developers that blithely access the IndexInput from multiple threads
> > :)
> >
> > On Sat, Oct 21, 2023 at 3:53
Thanks for digging into this. I do think it will be helpful for
developers that blithely access the IndexInput from multiple threads
:)
On Sat, Oct 21, 2023 at 3:53 PM Chris Hostetter
wrote:
>
>
> Uwe: In your PR, you should add these details to the javadocs of
> ByteBufferIndexInput.alreadyClose
I was messing around with something that was resulting in
AlreadyClosedException being thrown and I noticed that we weren't
tracking the exception that caused it. I found this in
ByteBufferIndexInput:
// the unused parameter is just to silence javac about unused variables
AlreadyClosedExcept
ityManager has done everything it should do: It detected an
> illegal access. Mission achieved! You have to report this issue and patch
> your tool so it works correctly with SecurityManager.
>
> Uwe
>
> Am 24.09.2023 um 23:52 schrieb Michael Sokolov:
>
> I ran the s
ok, I re-ran without the pesky log4j-thingy running and
SUCCESS! [0:55:54.865250]
+1
On Sun, Sep 24, 2023 at 5:52 PM Michael Sokolov wrote:
>
> I ran the smoketester and had a failure. It seems related to some
> log4j hot patch script we are required to run at work which i
I ran the smoketester and had a failure. It seems related to some
log4j hot patch script we are required to run at work which is somehow
conflicting with the security manager? I'm killing that and trying
again, but I wonder if this is going to cause problems at runtime as
well? How do we enable the
+1 for a release soon, and thanks for volunteering, Patrick!
On Tue, Sep 12, 2023 at 2:08 AM Patrick Zhai wrote:
>
> Hi all,
> It's been a while since the last release and we have quite a few good changes
> including new APIs, improvements and bug fixes. Should we release the 9.8?
>
> If there's
I have /tmp symlinked to /local/tmp (to get more space) and this seems
to cause some issue:
On Thu, Jun 22, 2023 at 7:07 PM Michael Sokolov wrote:
>
> +0
>
> I had some test failures. Maybe a problem with my setup? I'll see if I can
> repro
>
> gradlew :luce
+0
I had some test failures. Maybe a problem with my setup? I'll see if I can repro
gradlew :lucene:replicator:test --tests
"org.apache.lucene.replicator.nrt.TestNRTReplication.testCrashPrimary1"
-Ptests.jvms=8 "-Ptests.jv
margs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC
-XX:ActiveProcessorCount=
Welcome Chris!
On Mon, Jun 19, 2023, 7:31 AM Michael McCandless
wrote:
> Welcome aboard Chris!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Jun 19, 2023 at 7:16 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Congratulations Chris!
>>
>> On Mon, 19 Jun,
community what
> they want to see so they are unblocked from their explorations of vector
> search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
> wrote:
>
>
I think I've said before on this list we don't actually enforce the limit
in any way that can't easily be circumvented by a user. The codec already
supports any size vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int sized vectors
a
e you need to specify the development lucene version differently
>> than other dependencies...
>>
>> - Houston
>>
>> On Sat, May 13, 2023 at 10:14 AM Michael Sokolov wrote:
>>>
>>> doh I actually read your email and you said you already checked tha
doh I actually read your email and you said you already checked that -
I'm going to send out one of those "sokolov would like to retract the
previous email" emails. Does GMail even pretend to do that? I don't
know what's going on there! sorry
On Sat, May 13, 2023 at
sorry - META-INF not WEB-INF
On Sat, May 13, 2023 at 10:12 AM Michael Sokolov wrote:
>
> You are probably missing the contents of WEB-INF in your custom jar?
> Roughly speaking the files in there define run-time-bound "services"
> that are looked up by name by the JD
You are probably missing the contents of WEB-INF in your custom jar?
Roughly speaking the files in there define run-time-bound "services"
that are looked up by name by the JDK's service-loader API.
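The lookup described above can be sketched with the JDK's `java.util.ServiceLoader`. This is only an illustration, not Lucene code: `Greeter` is a hypothetical service interface, and the sketch assumes no registration file for it exists under `META-INF/services` (the directory named in the follow-up correction). With the registration file missing, the loader simply finds no providers, which is roughly what a custom jar without those files would run into.

```java
import java.util.ServiceLoader;

final class ServiceLookupDemo {
  /** Hypothetical service interface, used only for this illustration. */
  interface Greeter {
    String greet();
  }

  /** Counts the providers of {@code service} that the JDK ServiceLoader can see. */
  static <S> int countProviders(Class<S> service) {
    int n = 0;
    for (S ignored : ServiceLoader.load(service)) {
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    // Providers are registered by a file named after the interface's binary name
    // under META-INF/services. None is assumed to exist here, so none are found.
    System.out.println("providers found: " + countProviders(Greeter.class));
  }
}
```

Adding a jar whose `META-INF/services` entry names an implementation class would make the same call return that provider, with no change to the calling code.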
On Sat, May 13, 2023 at 9:33 AM Gus Heck wrote:
>
> Cross posting to lucene on the possibility that
ieldWriter, is that handled somewhere else? Or is it just up to the user to
> make sure no documents end up with duplicate vectors?
>
> On Wed, Apr 19, 2023 at 5:07 AM Michael Sokolov wrote:
>>
>> Oh identical vectors. Basically unsupported. If you create a large index
>
I think that in BooleanQuery and related classes we mostly aggregate
child scores by summing (although there is DisjunctionMaxScorer which
doesn't exactly take the max?). I have a use case where I want to take
the min score from a bunch of required terms. To do this I had to
write a new query and f
he constructor does not need to contain any values up front. Specifically,
> Lucene95HnswVectorsWriter.FieldWriter adds vectors incrementally to the RAVV
> that it gives to the builder as addValue is called.
>
> On Wed, Apr 19, 2023 at 1:37 PM Michael Sokolov wrote:
>>
>>
g at the paper by Malkov and Yashunin, it looks like the algorithm
> allows for building the hnsw graph incrementally. Why does our
> implementation require specifying all the vectors up front to
> HnswGraphBuilder.create?
>
> On Wed, Apr 19, 2023 at 3:04 AM Michael Sokolov wrote
Yes, thanks Alan!
On Wed, Apr 19, 2023 at 3:41 PM Michael Wechner
wrote:
>
> +1
>
> Thanks!
>
> Michael
>
> Am 19.04.23 um 18:09 schrieb Benjamin Trent:
>
> +1 !
>
> You rock Alan!
>
> On Wed, Apr 19, 2023, 9:54 AM Ignacio Vera wrote:
>>
>> +1
>>
>> Thanks Alan!
>>
>> On Wed, Apr 19, 2023 at 1:2
Oh identical vectors. Basically unsupported. If you create a large index
filled with identical vectors it leads to pathological behavior. Seems to
be a weakness in the algorithm. If you have any idea how to improve that,
it would be welcome. But in real world scenarios, it doesn't seem to arise?
O
These vector values have internal buffers they use to return the vectors.
In order to compare two vectors we need to use two independent sources so
that one doesn't overwrite this internal state when fetching the second
vector.
Sorry I forgot the second question and can't see it on my phone. Brb
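The buffer-reuse point above can be shown with a small stand-in. This is a minimal sketch, not Lucene's vector values API: `BufferedVectorSource`, `vectorValue`, and `copy` are hypothetical names chosen for the example, and the only assumption is that each source returns its one internal buffer on every call.

```java
/** Hypothetical stand-in for a vector source that reuses one internal buffer. */
final class BufferedVectorSource {
  private final float[][] stored; // backing data, one row per vector
  private final float[] buffer;   // reused for every vectorValue() call

  BufferedVectorSource(float[][] stored) {
    this.stored = stored;
    this.buffer = new float[stored[0].length];
  }

  /** Copies vector {@code ord} into the shared buffer and returns that same buffer. */
  float[] vectorValue(int ord) {
    System.arraycopy(stored[ord], 0, buffer, 0, buffer.length);
    return buffer;
  }

  /** Returns an independent view over the same data with its own buffer. */
  BufferedVectorSource copy() {
    return new BufferedVectorSource(stored);
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[][] data = {{1f, 0f}, {0f, 1f}};
    BufferedVectorSource src = new BufferedVectorSource(data);

    // Wrong: both references point at the same buffer, which now holds vector 1.
    float[] a = src.vectorValue(0);
    float[] b = src.vectorValue(1);
    System.out.println(dot(a, b)); // 1.0, i.e. vector 1 compared with itself

    // Right: fetch the second vector from an independent source.
    float[] c = src.vectorValue(0);
    float[] d = src.copy().vectorValue(1);
    System.out.println(dot(c, d)); // 0.0, the true dot product
  }
}
```

The design trade-off is that a reused buffer avoids allocating a new array per fetch, at the cost of requiring a second source whenever two vectors must be held at once.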
03795722872018814
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>> -0.007012288551777601,-0.02666585
>>> year, that might be worth it... (hyperbole of course).
>>>
>>> Things I would consider to have technical merit that I don't hear:
>>>
>>> Impact on the speed of **other** indexing operations. (devaluation of other
>>> functionality)
>>>
ingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco
I did see some other larger-dimensional model, but they all seem to
involve images+text.
On Mon, Apr 10, 2023 at 9:54 AM Michael Sokolov wrote:
>
> I think concatenating word-embedding vectors is a reasonable thing to
> do. It c
I think concatenating word-embedding vectors is a reasonable thing to
do. It captures information about the sequence of tokens which is
being lost by the current approach (summing them). Random article I
found in a search
https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54
://github.com/apache/lucene/pull/11905
> > >>
> > >> Attacking me isn't helping the situation.
> > >>
> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> > >> any kind of demeaning fashion really. I meant
the current
>>> state of usability with respect to indexing a few million docs with
>>> high dimensions. You can scroll up the thread and see that at least
>>> one other committer on the project experienced similar pain as me.
>>> Then, think about users wh
What you said about increasing dimensions requiring a bigger ram buffer on
merge is wrong. That's the point I was trying to make. Your concerns about
merge costs are not wrong, but your conclusion that we need to limit
dimensions is not justified.
You complain that hnsw sucks and doesn't scale, but
one more data point:
32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fiel
I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable; M is limited to <= 512, and maximum segment size also
helps limit merge costs
On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov wrote:
>
> Thanks
el Wechner
>> wrote:
>>>
>>> Great, thank you!
>>>
>>> How much RAM; etc. did you run this test on?
>>>
>>> Do the vectors really have to be based on real data for testing the
>>> indexing?
>>> I understand, if you want
I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to
> Thanks
>
> Michael
>
>
>
> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> > re: how does this HNSW stuff scale - I think people are calling out
> > indexing memory usage here, so let's discuss some facts. During
> > initial indexing we hold in RAM all the v
re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter'
RE_DOCS in my wrapping Query to
assert this, and I can see it has some effect.
Anyway I am seeing *some* skipping, which is tantalizing.
On Sat, Apr 1, 2023 at 10:00 AM Michael Sokolov wrote:
>
> Hi, I've been working on seeing whether we can make use of impacts in
> Amazon search a
Hi, I've been working on seeing whether we can make use of impacts in
Amazon search and I have some questions. To date, we haven't used
Lucene's scoring APIs at all; all of our queries are constant score,
we early terminate based on a sorted index rank and then re-rank using
custom non-Lucene ranki
I'm also in favor of raising this limit. We do see some datasets with
higher than 1024 dims. I also think we need to keep a limit. For example we
currently need to keep all the vectors in RAM while indexing and we want to
be able to support reasonable numbers of vectors in an index segment. Also
we
Using directio with nfs makes no sense at all to me; I think that is the
problem in a nutshell. Directio tries to bypass the operating system's
buffers, but that's not going to play nicely with nfs.
On Wed, Mar 22, 2023, 4:38 PM david-sitsky (via GitHub)
wrote:
>
> david-sitsky commented on issue
Welcome, Ben! Congratulations
On Fri, Jan 27, 2023 at 4:52 PM Anshum Gupta wrote:
>
> Congratulations and welcome, Ben!
>
> On Fri, Jan 27, 2023 at 7:18 AM Adrien Grand wrote:
>>
>> I'm pleased to announce that Ben Trent has accepted the PMC's
>> invitation to become a committer.
>>
>> Ben, the
+1 trying to coordinate multiple writers running independently will
not work. My 2c for availability: you can have a single primary active
writer with a backup one waiting, receiving all the segments from the
primary. Then if the primary goes down, the secondary one has the most
recent commit repli
ted signature, but it seems
> like it's due to "can't connect to the agent: IPC connect call failed"
> actually, which suggests an issue with the GPG agent?
>
> On Fri, Nov 18, 2022 at 3:00 PM Michael Sokolov wrote:
>>
>> I got this message when in
What I have in mind would be to implement entirely in the
KnnVectorQuery. Since results are sorted by score, they can easily be
post-filtered there: no need to implement anything at the codec layer
I think.
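The post-filtering idea above might look roughly like this. It is a sketch under stated assumptions, not KnnVectorQuery's implementation: the hits are represented as a plain array of scores already sorted in descending order, and `keepAtLeast` and `minScore` are hypothetical names. Because the input is sorted, the filter only has to find the first score below the threshold and cut there.

```java
import java.util.Arrays;

final class ScorePostFilter {
  /**
   * Keeps the leading hits whose score is at least {@code minScore}.
   * The input must already be sorted by descending score.
   */
  static float[] keepAtLeast(float[] scoresDescending, float minScore) {
    int end = 0;
    while (end < scoresDescending.length && scoresDescending[end] >= minScore) {
      end++;
    }
    return Arrays.copyOf(scoresDescending, end);
  }

  public static void main(String[] args) {
    float[] hits = {0.9f, 0.7f, 0.4f, 0.1f};
    // Only the two hits scoring at least 0.5 survive the cut.
    System.out.println(Arrays.toString(keepAtLeast(hits, 0.5f))); // [0.9, 0.7]
  }
}
```

Note that a cut like this can only discard hits from the top-k already collected; it cannot surface additional matches beyond k.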
On Thu, Nov 17, 2022 at 10:10 AM GitBox wrote:
>
>
> rmuir commented on PR #11946:
> URL:
I got this message when initially downloading the artifacts:
Downloading
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
File:
/tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucen
>>> it would be hard to predict whether a given radius would actually match a
>>> small set of vectors. Should the query still require a `k` value in
>>> addition to the radius to make sure it doesn't go wild?
>>>
>>> On Tue, Nov 8, 2022 at 7:26 AM Alexey Go
+1 makes sense. I do think given this is the second similar-flavored
bug we've found that we should be thorough and try to get them all
rather than having a 9.4.3 ...
On Wed, Nov 9, 2022 at 10:25 AM Julie Tibshirani wrote:
>
> +1 from me for a bugfix release once we've solidified testing. Thanks