Re: Feature freeze for Lucene 9.12 and Lucene 10.0

2024-09-13 Thread Michael Sokolov
Hi Adrien, I thought we had another week? I looked back at old emails and
thought you had targeted Sep 22 for the feature freeze?

On Fri, Sep 13, 2024, 7:45 AM Adrien Grand  wrote:

> Hello everyone,
>
> As previously discussed, I plan on feature freezing Lucene 9.12 and Lucene
> 10.0 next week. Practically, this means that:
>  - Our main branch will become 11.x.
>  - I will sunset branch_9x the same way we did with branch_8x.
>  - Branches branch_10x, branch_10_0 and branch_9_12 will be created.
>
> I will send an email when these branches get created.
>
> The intention is to release 9.12 the week of September 23rd and 10.0 the
> week of September 30th if all goes well.
>
> There are two changes that I am tracking and plan on allowing to merge
> after feature freeze:
>  - Remove 8-bit quantization for HNSW indexing:
> https://github.com/apache/lucene/pull/13767,
>  - Give VectorValues a random-access API:
> https://github.com/apache/lucene/pull/13779.
>
> Safe bug fixes are obviously welcome as well. Let me know if there are
> other changes that I should be aware of for inclusion in 9.12 or 10.0.
>
> --
> Adrien
>


Re: Lucene 10.0 and 9.12 blockers

2024-09-09 Thread Michael Sokolov
Hi, I've been looking into Adrien's suggestion to migrate
(Byte/Float)VectorValues to an unabashedly random-access API. We can
easily enough support iteration on top of that (which we use
extensively during indexing). I think this would represent a great
simplification; preliminary implementation shows a big reduction in
boilerplate code and awkward casting, extra interfaces, and so on. And
it would make binary-partitioning of HNSW graphs more straightforward.
But I doubt this would be ready in time for 10.0 so I wouldn't wait
for it. Just wanted to let you all know since it would be a breaking
change.
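For anyone curious what such a migration could look like, here is a minimal, hypothetical sketch — the names are illustrative only, not the actual Lucene API — of a random-access vector abstraction with iteration derived on top of it, as the message describes:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class VectorValuesSketch {
  // Random access is the primitive; iteration is derived from it.
  interface RandomAccessVectorValues {
    int size();
    float[] vectorValue(int ord);

    // Iteration supported on top of random access, e.g. for indexing.
    default Iterator<float[]> iterator() {
      return new Iterator<>() {
        private int ord = 0;
        public boolean hasNext() { return ord < size(); }
        public float[] next() {
          if (!hasNext()) throw new NoSuchElementException();
          return vectorValue(ord++);
        }
      };
    }
  }

  // Trivial in-memory implementation for demonstration.
  static RandomAccessVectorValues of(float[][] data) {
    return new RandomAccessVectorValues() {
      public int size() { return data.length; }
      public float[] vectorValue(int ord) { return data[ord]; }
    };
  }

  public static void main(String[] args) {
    RandomAccessVectorValues values =
        of(new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}});
    System.out.println(values.vectorValue(2)[0]); // random access: 5.0
    int count = 0;
    for (Iterator<float[]> it = values.iterator(); it.hasNext(); it.next()) {
      count++; // iteration layered on top
    }
    System.out.println(count); // 3
  }
}
```

With random access as the base interface, the iterator adapter removes the need for separate iterator-style and random-access interfaces — the "reduction in boilerplate code and awkward casting" mentioned above.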

On Mon, Sep 9, 2024 at 3:18 PM Luca Cavanna  wrote:
>
> The intra-segment concurrency PR has been pretty close for quite a few days 
> already. I ran benchmarks last week, made adjustments, and just finished 
> addressing comments from Mike's review. My plan would be to merge it tomorrow 
> unless there are objections.
>
> Regarding the usage of the deprecated search method, I agree that we should 
> not delay the release for that, yet I hope that we get that done. I will try 
> to tackle the remaining issues, and Greg has been helping there too, thanks a 
> lot! For anyone else  interested, the description of the issue lists a number 
> of remaining items that need fixing: 
> https://github.com/apache/lucene/issues/12892 .
>
> Cheers
> Luca
>
> On Mon, Sep 9, 2024 at 7:46 PM Adrien Grand  wrote:
>>
>> Thanks to all who replied to this thread and worked on getting these 
>> blockers addressed. In particular I see that support for JDK 23, backporting 
>> the Arena work, and the removal of CollectorOwner are merged.
>>
>> I just reviewed the humongous PR that migrates more classes to records, it 
>> looks pretty good to me. If someone can look at my comments (hopefully much 
>> quicker than reviewing the whole PR!), I would appreciate it.
>>
>> To be transparent, the more usage of the deprecated 
>> IndexSearcher#search(Query, Collector) we can remove, the better, but I 
>> don't plan on delaying the release if it's not finished. Likewise, I don't 
>> plan on delaying 10.0 if support for native dot product is not merged, it 
>> could make it to a minor release later on if needed.
>>
>> I haven't taken a deep look at RaBitQ. If it's ready in time for 9.12 and 
>> 10.0, I'm fine with it getting merged, but as a new feature that doesn't 
>> look like it requires breaking changes to our public API, it could be 
>> introduced in a minor release later on, so I don't plan to treat it as a 
>> blocker.
>>
>> I would like to get support for intra-segment search concurrency in, as it 
>> is breaking enough that we could not easily introduce it in a minor later 
>> on. It seems to be almost ready, so hopefully it will get merged before 
>> feature freeze next week?
>>
>> I'm not clear if someone is actively looking into the recall issues with 
>> 8-bit quantization?
>>
>>
>> On Thu, Sep 5, 2024 at 10:14 PM Shubham Chaudhary  
>> wrote:
>>>
>>> >  Maybe mark it blocker so we don't lose track?
>>>
>>> Hi Mike, it's already linked to the 10.0 milestone. Is there some way to 
>>> mark or track it as a blocker for 10.0? It would be great if I could get 
>>> some reviews on it. The PR has been accumulating merge conflicts over time. 
>>> I'm happy to address the comments and iterate on it to get this done for 
>>> the 10.0 release.
>>>
>>> - Shubham
>>>
>>> On Wed, Sep 4, 2024 at 7:20 PM Michael McCandless 
>>>  wrote:

 On Sat, Aug 31, 2024 at 2:00 PM Shubham Chaudhary  
 wrote:
>
> Hi, regarding the 10.0 release, should we also consider 
> https://github.com/apache/lucene/pull/13328. It was planned for 10.0 
> (https://github.com/apache/lucene/issues/13207) and is waiting on review, 
> so I think it'll be good if we could consider it. Looking forward to 
> views and seeing if there are any concerns with the change I'm unaware of.


 +1

 It looks like this one is super close?  A couple of rounds of feedback 
 from Uwe, folded into the PR.  Maybe mark it blocker so we don't lose 
 track?

 Thanks Shubham.

 Mike McCandless

 http://blog.mikemccandless.com

>
> - Shubham
>
> On Thu, Aug 8, 2024 at 10:20 PM Adrien Grand  wrote:
>>
>> Hello everyone,
>>
>> As previously discussed, I plan on releasing 9.last and 10.0 under the 
>> following timeline:
>> - ~September 15th: 10.0 feature freeze - main becomes 11.0
>> - ~September 22nd: 9.last release,
>> - ~October 1st: 10.0 release.
>>
>> Unless someone shortly volunteers to do a 9.x release, this 9.last 
>> release will likely be 9.12.
>>
>> As these dates are coming shortly, I would like to start tracking 
>> blockers. Please reply to this thread with issues that you know about 
>> that should delay the 9.last or 10.0 releases.
>>
>> Chris, Uwe: I also wanted to check with you if this timeline works well 

Re: Baffling performance regression measured by luceneutil

2024-08-16 Thread Michael Sokolov
Maybe getSlices has some side effect that messes up create Weight?

On Fri, Aug 16, 2024, 7:10 AM Michael Sokolov  wrote:

> That is super weird. I wonder if changing the names of variables will make
> a difference. Have you verified that this effect is observable during all
> lunar phases?
>
> I assume we looked at any profiler output we could get our hands on? If
> not, maybe something would show up there.
>
> On Thu, Aug 15, 2024, 7:22 PM Greg Miller  wrote:
>
>> Hi folks-
>>
>> Egor Potemkin and I have been digging into a baffling performance
>> regression we're seeing in response to a one-line change that doesn't
>> rationally seem like it should have any performance impact what-so-ever.
>> There's more background on why we're trying to understand this, but I'll
>> save the broader context for now and just focus on the confusing issue
>> we're trying to understand.
>>
>> Inside IndexSearcher, we've staged a change that initializes an ArrayList
>> of Collectors slightly earlier than what we do today (see:
>> https://github.com/apache/lucene/pull/13657/files). We end up with code
>> that looks like this (note the isolated line that's initializing
>> `collectors`):
>>
>> ```
>>   public  T search(Query query,
>> CollectorManager collectorManager)
>>   throws IOException {
>> final LeafSlice[] leafSlices = getSlices();
>> final C firstCollector = collectorManager.newCollector();
>> query = rewrite(query, firstCollector.scoreMode().needsScores());
>> final Weight weight = createWeight(query, firstCollector.scoreMode(),
>> 1);
>>
>> final List collectors = new ArrayList<>(leafSlices.length);
>>
>> return search(weight, collectorManager, firstCollector, collectors,
>> leafSlices);
>>   }
>> ```
>>
>> What's baffling is that if we initialize the `collectors` list _after_
>> the call to `createWeight` (as shown here), there's no performance impact
>> at all (as expected). But if all we do is initialize `collectors` _before_
>> the call to `createWeight`, we see a very significant regression on
>> LowTerm, MedTerm, HighTerm tasks in luceneutil (e.g., 15% - 30%). At the
>> other end, we see a significant improvement to OrHighNotLow, OrHighNotMed,
>> OrHighNotHigh (e.g., 7% - 15%). (This is running wikimedium10m on an
>> x86-based AWS ec2 host, but results reproduced separately for Egor and in
>> our nightly benchmark runs; full luceneutil output at the bottom of this
>> email [1]). Some additional context and conversation is captured in this
>> "demo" PR: https://github.com/apache/lucene/pull/13657.
>>
>> My only hunch here is this has something to do with hotspot's decision
>> making or some other such runtime optimization, but I'm getting out of my
>> depth and hoping someone in this community will have ideas on ways to
>> continue this investigation. Anyone have a clue what might be going on? Or
>> any suggestions on other things to look at? This isn't a purely academic
>> exercise for what it's worth. This oddity has caused us to duplicate some
>> code in IndexSearcher to work with a new sandbox faceting module, so it
>> would be nice to figure this out so we can remove the code duplication.
>> (The code duplication is pretty minor, but it's still really frustrating
>> and it's a trap waiting to be hit by someone in the future that tries to
>> consolidate the code duplication and runs into this)
>>
>> Thanks for reading, and thanks in advance for any ideas!
>>
>> Cheers,
>> -Greg
>>
>>
>> [1] Full Lucene util output:
>> ```
>> Task                         QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
>>                   MedTerm          513.21  (4.9%)   369.43                   (4.8%)   -28.0% ( -35% -  -19%)  0.000
>>                  HighTerm          523.20  (6.9%)   402.11                   (5.0%)   -23.1% ( -32% -  -12%)  0.000
>>                   LowTerm          837.70  (3.9%)   715.94                   (3.9%)   -14.5% ( -21% -   -6%)  0.000
>> BrowseDayOfYearSSDVFacets           11.97 (18.9%)    11.31                  (11.9%)    -5.5% ( -30% -   31%)  0.273
>>      MedTermDayTaxoFacets           23.03  (4.9%)    21.95                   (6.4%)    -4.7% ( -15% -    6%)  0.009
>>                HighPhrase          143.93  (8.3%)   139.35                   (4.7%)    -3.2% ( -14% -   10%)  0.136
>>                    Fuzzy2           53.03  (9.0%)    51.50
>>

Re: Baffling performance regression measured by luceneutil

2024-08-16 Thread Michael Sokolov
That is super weird. I wonder if changing the names of variables will make
a difference. Have you verified that this effect is observable during all
lunar phases?

I assume we looked at any profiler output we could get our hands on? If
not, maybe something would show up there.

On Thu, Aug 15, 2024, 7:22 PM Greg Miller  wrote:

> Hi folks-
>
> Egor Potemkin and I have been digging into a baffling performance
> regression we're seeing in response to a one-line change that doesn't
> rationally seem like it should have any performance impact what-so-ever.
> There's more background on why we're trying to understand this, but I'll
> save the broader context for now and just focus on the confusing issue
> we're trying to understand.
>
> Inside IndexSearcher, we've staged a change that initializes an ArrayList
> of Collectors slightly earlier than what we do today (see:
> https://github.com/apache/lucene/pull/13657/files). We end up with code
> that looks like this (note the isolated line that's initializing
> `collectors`):
>
> ```
>   public  T search(Query query,
> CollectorManager collectorManager)
>   throws IOException {
> final LeafSlice[] leafSlices = getSlices();
> final C firstCollector = collectorManager.newCollector();
> query = rewrite(query, firstCollector.scoreMode().needsScores());
> final Weight weight = createWeight(query, firstCollector.scoreMode(),
> 1);
>
> final List collectors = new ArrayList<>(leafSlices.length);
>
> return search(weight, collectorManager, firstCollector, collectors,
> leafSlices);
>   }
> ```
>
> What's baffling is that if we initialize the `collectors` list _after_ the
> call to `createWeight` (as shown here), there's no performance impact at
> all (as expected). But if all we do is initialize `collectors` _before_ the
> call to `createWeight`, we see a very significant regression on LowTerm,
> MedTerm, HighTerm tasks in luceneutil (e.g., 15% - 30%). At the other end,
> we see a significant improvement to OrHighNotLow, OrHighNotMed,
> OrHighNotHigh (e.g., 7% - 15%). (This is running wikimedium10m on an
> x86-based AWS ec2 host, but results reproduced separately for Egor and in
> our nightly benchmark runs; full luceneutil output at the bottom of this
> email [1]). Some additional context and conversation is captured in this
> "demo" PR: https://github.com/apache/lucene/pull/13657.
>
> My only hunch here is this has something to do with hotspot's decision
> making or some other such runtime optimization, but I'm getting out of my
> depth and hoping someone in this community will have ideas on ways to
> continue this investigation. Anyone have a clue what might be going on? Or
> any suggestions on other things to look at? This isn't a purely academic
> exercise for what it's worth. This oddity has caused us to duplicate some
> code in IndexSearcher to work with a new sandbox faceting module, so it
> would be nice to figure this out so we can remove the code duplication.
> (The code duplication is pretty minor, but it's still really frustrating
> and it's a trap waiting to be hit by someone in the future that tries to
> consolidate the code duplication and runs into this)
>
> Thanks for reading, and thanks in advance for any ideas!
>
> Cheers,
> -Greg
>
>
> [1] Full Lucene util output:
> ```
> Task                         QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
>                   MedTerm          513.21  (4.9%)   369.43                   (4.8%)   -28.0% ( -35% -  -19%)  0.000
>                  HighTerm          523.20  (6.9%)   402.11                   (5.0%)   -23.1% ( -32% -  -12%)  0.000
>                   LowTerm          837.70  (3.9%)   715.94                   (3.9%)   -14.5% ( -21% -   -6%)  0.000
> BrowseDayOfYearSSDVFacets           11.97 (18.9%)    11.31                  (11.9%)    -5.5% ( -30% -   31%)  0.273
>      MedTermDayTaxoFacets           23.03  (4.9%)    21.95                   (6.4%)    -4.7% ( -15% -    6%)  0.009
>                HighPhrase          143.93  (8.3%)   139.35                   (4.7%)    -3.2% ( -14% -   10%)  0.136
>                    Fuzzy2           53.03  (9.0%)    51.50                   (7.3%)    -2.9% ( -17% -   14%)  0.265
>               MedSpanNear           50.70  (5.1%)    49.26                   (3.0%)    -2.8% ( -10% -    5%)  0.032
>                 LowPhrase           70.38  (4.9%)    68.60                   (5.3%)    -2.5% ( -12% -    8%)  0.118
>                 MedPhrase           88.15  (5.2%)    86.03                   (4.2%)    -2.4% ( -11% -    7%)  0.105
>    OrHighMedDayTaxoFacets            7.01  (5.5%)     6.86                   (5.4%)    -2.0% ( -12% -    9%)  0.237
>              HighSpanNear           28.95  (2.7%)    28.42                   (2.9%)    -1.8% (  -7% -    3%)  0.043
>           MedSloppyPhrase          201.71  (3.3%)   198.58                   (3.1%)    -1.6% (  -7% -    4%)  0.124
>      BrowseDateTaxoFacets           23.97 (28.7%)    23.62                  (22.8%)    -1.5% ( -41% -   70%)  0.858
> 

Re: AbstractMultiTermQueryConstantScoreWrapper cost estimates (https://github.com/apache/lucene/issues/13029)

2024-08-06 Thread Michael Sokolov
But actually Patrick Zhai added support for nondeterministic regexes
that might help with cases like that?  There is this in
TestRegexpQuery:

  /** Test worst-case for getCommonSuffix optimization */
  public void testSlowCommonSuffix() throws Exception {
    expectThrows(
        TooComplexToDeterminizeException.class,
        () -> {
          new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
        });
  }

On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov  wrote:
>
> Yes, I think degenerate regexes like *a* are potentially costly.
> Actually something like *Ⱗ* is probably worse since yeah it would need
> to scan the entire FST (which probably has some a's in it?)
>
> I don't see any way around that aside from: (1) telling user don't do
> that, or (2) putting some accounting on FST so it can early-terminate
>
> On Fri, Aug 2, 2024 at 8:17 PM Michael Froh  wrote:
> >
> > Incidentally, speaking as someone with only a superficial understanding of 
> > how the FSTs work, I'm wondering if there is risk of cost in expanding the 
> > first few terms.
> >
> > Say we have a million terms, but only one contains an 'a'. If someone 
> > searches for '*a*', does that devolve into a term scan? Or can the FST 
> > efficiently identify an arc with an 'a' and efficiently identify terms 
> > containing that arc?
> >
> > Thanks,
> > Froh
> >
> > On Fri, Aug 2, 2024 at 3:50 PM Michael Froh  wrote:
> >>
> >> Exactly!
> >>
> >> My initial implementation added some potential cost. (I think I enumerated 
> >> up to 128 terms before giving up.) Now that Mayya moved the (probably 
> >> tiny) cost of expanding the first 16 terms upfront, my change is 
> >> theoretically "free".
> >>
> >> Froh
> >>
> >> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller  wrote:
> >>>
> >>> Hey Froh-
> >>>
> >>> I got some time to look through your PR (most of the time was actually 
> >>> refreshing my memory on the change history leading up to your PR and 
> >>> digesting the issue described). I think this makes a ton of sense. If I'm 
> >>> understanding properly, the latest version of your PR essentially takes 
> >>> advantage of Mayya's recent change 
> >>> (https://github.com/apache/lucene/pull/13454) in the score supplier 
> >>> behavior that is now doing _some_ up-front work to iterate the first <= 
> >>> 16 terms when building the scoreSupplier and computes a more 
> >>> accurate/reasonable cost based on that already-done work. Am I getting 
> >>> this right? If so, this seems like it has no downsides and all upside.
> >>>
> >>> I'll do a proper pass through the PR here shortly, but I love the idea 
> >>> (assuming I'm understanding it properly on a Friday afternoon after a 
> >>> long-ish week...).
> >>>
> >>> Cheers,
> >>> -Greg
> >>>
> >>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller  wrote:
> >>>>
> >>>> Hi Froh-
> >>>>
> >>>> Thanks for raising this and sorry I missed your tag in GH#13201 back in 
> >>>> June (had some vacation and was generally away). I'd be interested to 
> >>>> see what others think as well, but I'll at least commit to looking 
> >>>> through your PR tomorrow or Monday to get a better handle on what's 
> >>>> being proposed. We went through a few iterations of this originally 
> >>>> before we landed on the current version. One promising approach was to 
> >>>> have a more intelligent query that would load some number of terms 
> >>>> up-front to get a better cost estimate before making a decision, but it 
> >>>> required a custom query implementation that generally didn't get 
> >>>> favorable feedback (it's nice to be able to use the existing 
> >>>> IndexOrDocValuesQuery abstraction instead). I can dig up some of that 
> >>>> conversation if it's helpful, but I'll better understand what you've got 
> >>>> in mind first.
> >>>>
> >>>> Unwinding a bit though, I'm also in favor in general that we should be 
> >>>> able to do a better job estimating cost here. I think the tricky part is 
> >>>> how we go about doing that effectively. Thanks again for kicking off 
> >>>> this thread!
>

Re: AbstractMultiTermQueryConstantScoreWrapper cost estimates (https://github.com/apache/lucene/issues/13029)

2024-08-06 Thread Michael Sokolov
Yes, I think degenerate regexes like *a* are potentially costly.
Actually something like *Ⱗ* is probably worse since yeah it would need
to scan the entire FST (which probably has some a's in it?)

I don't see any way around that aside from: (1) telling user don't do
that, or (2) putting some accounting on FST so it can early-terminate
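Option (2) can be sketched generically. The example below is illustrative only — not Lucene's FST code — showing the shape of the idea: charge each unit of work against a budget and abort the scan once the budget is exhausted, instead of letting a degenerate pattern walk the whole structure:

```java
// Hedged sketch of "accounting + early termination" over an expensive scan.
public class WorkBudgetSketch {
  static class BudgetExceededException extends RuntimeException {}

  // Count terms containing ch, but give up after maxWork steps.
  static int countMatches(String[] terms, char ch, int maxWork) {
    int work = 0, matches = 0;
    for (String t : terms) {
      if (++work > maxWork) throw new BudgetExceededException();
      if (t.indexOf(ch) >= 0) matches++;
    }
    return matches;
  }

  public static void main(String[] args) {
    String[] terms = {"banana", "cherry", "fig", "plum"};
    System.out.println(countMatches(terms, 'a', 10)); // 1 (only "banana")
    try {
      countMatches(terms, 'a', 2); // budget too small for the scan
    } catch (BudgetExceededException e) {
      System.out.println("too much work, terminated early");
    }
  }
}
```

The real accounting would live inside the FST traversal rather than a flat loop, but the contract is the same: a bounded amount of work, with a clear exception when the bound is hit (analogous to TooComplexToDeterminizeException on the regex side).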

On Fri, Aug 2, 2024 at 8:17 PM Michael Froh  wrote:
>
> Incidentally, speaking as someone with only a superficial understanding of 
> how the FSTs work, I'm wondering if there is risk of cost in expanding the 
> first few terms.
>
> Say we have a million terms, but only one contains an 'a'. If someone 
> searches for '*a*', does that devolve into a term scan? Or can the FST 
> efficiently identify an arc with an 'a' and efficiently identify terms 
> containing that arc?
>
> Thanks,
> Froh
>
> On Fri, Aug 2, 2024 at 3:50 PM Michael Froh  wrote:
>>
>> Exactly!
>>
>> My initial implementation added some potential cost. (I think I enumerated 
>> up to 128 terms before giving up.) Now that Mayya moved the (probably tiny) 
>> cost of expanding the first 16 terms upfront, my change is theoretically 
>> "free".
>>
>> Froh
>>
>> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller  wrote:
>>>
>>> Hey Froh-
>>>
>>> I got some time to look through your PR (most of the time was actually 
>>> refreshing my memory on the change history leading up to your PR and 
>>> digesting the issue described). I think this makes a ton of sense. If I'm 
>>> understanding properly, the latest version of your PR essentially takes 
>>> advantage of Mayya's recent change 
>>> (https://github.com/apache/lucene/pull/13454) in the score supplier 
>>> behavior that is now doing _some_ up-front work to iterate the first <= 16 
>>> terms when building the scoreSupplier and computes a more 
>>> accurate/reasonable cost based on that already-done work. Am I getting this 
>>> right? If so, this seems like it has no downsides and all upside.
>>>
>>> I'll do a proper pass through the PR here shortly, but I love the idea 
>>> (assuming I'm understanding it properly on a Friday afternoon after a 
>>> long-ish week...).
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller  wrote:

 Hi Froh-

 Thanks for raising this and sorry I missed your tag in GH#13201 back in 
 June (had some vacation and was generally away). I'd be interested to see 
 what others think as well, but I'll at least commit to looking through 
 your PR tomorrow or Monday to get a better handle on what's being 
 proposed. We went through a few iterations of this originally before we 
 landed on the current version. One promising approach was to have a more 
 intelligent query that would load some number of terms up-front to get a 
 better cost estimate before making a decision, but it required a custom 
 query implementation that generally didn't get favorable feedback (it's 
 nice to be able to use the existing IndexOrDocValuesQuery abstraction 
 instead). I can dig up some of that conversation if it's helpful, but I'll 
 better understand what you've got in mind first.

 Unwinding a bit though, I'm also in favor in general that we should be 
 able to do a better job estimating cost here. I think the tricky part is 
 how we go about doing that effectively. Thanks again for kicking off this 
 thread!

 Cheers,
 -Greg

 On Thu, Aug 1, 2024 at 5:58 PM Michael Froh  wrote:
>
> Hi there,
>
> For a few months, some of us have been running into issues with the cost 
> estimate from AbstractMultiTermQueryConstantScoreWrapper. 
> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300)
>
> In https://github.com/apache/lucene/issues/13029, the problem was raised 
> in terms of queries not being cached, because the estimated cost was too 
> high.
>
> We've also run into problems in OpenSearch, since we started wrapping 
> MultiTermQueries in IndexOrDocValueQuery. The MTQ gets an exaggerated 
> cost estimate, so IndexOrDocValueQuery decides it should be a DV query, 
> even though the MTQ would really only match a handful of docs (and should 
> be lead iterator).
>
> I opened a PR back in March (https://github.com/apache/lucene/pull/13201) 
> to try to handle the case where a MultiTermQuery matches a small number 
> of terms. Since Mayya pulled the rewrite logic that expands up to 16 
> terms (to rewrite as a Boolean disjunction) earlier in the workflow (in 
> https://github.com/apache/lucene/pull/13454), we get the better cost 
> estimate for MTQs on few terms "for free".
>
> What do folks think?
>
> Thanks,
> Froh
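As a rough illustration of the estimation strategy described above — expand up to a small fixed number of terms, and if the enumeration is exhausted use the exact summed doc frequency as the cost — here is a hedged sketch (the names and threshold are illustrative, not the actual Lucene implementation):

```java
import java.util.Iterator;

public class MtqCostSketch {
  static final int TERM_EXPANSION_LIMIT = 16; // assumed threshold

  // docFreqs enumerates the doc frequency of each matching term.
  static long estimateCost(Iterator<Long> docFreqs, long fallbackCost) {
    long sum = 0;
    int expanded = 0;
    while (docFreqs.hasNext() && expanded < TERM_EXPANSION_LIMIT) {
      sum += docFreqs.next();
      expanded++;
    }
    if (!docFreqs.hasNext()) {
      return sum; // few terms: exact (and typically small) cost
    }
    return fallbackCost; // many terms: conservative estimate
  }

  public static void main(String[] args) {
    // An MTQ matching only two terms gets an exact cost of 3 + 5 = 8
    // instead of an exaggerated estimate.
    java.util.List<Long> few = java.util.List.of(3L, 5L);
    System.out.println(estimateCost(few.iterator(), 1_000_000L)); // 8
  }
}
```

The key point from the thread: since the first <= 16 terms are already expanded up front when building the score supplier, this better estimate comes "for free" — no extra term enumeration is added on top of work that is already done.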

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Armin Braun as Lucene comitter

2024-07-27 Thread Michael Sokolov
Welcome Armin!

On Fri, Jul 26, 2024 at 7:24 PM Greg Miller  wrote:
>
> Welcome Armin!
>
> On Fri, Jul 26, 2024 at 10:51 AM Patrick Zhai  wrote:
>>
>> Congrats and welcome, Armin!
>>
>>
>> On Fri, Jul 26, 2024, 10:30 Vigya Sharma  wrote:
>>>
>>> Congratulations and welcome, Armin! Volunteering as a firefighter is 
>>> amazing, respect!
>>>
>>> On Fri, Jul 26, 2024 at 1:46 AM Ignacio Vera  wrote:

 Welcome Armin!

 On Fri, Jul 26, 2024 at 10:16 AM Chris Hegarty
  wrote:
 >
 > Welcome Armin!
 >
 > -Chris.
 >
 > > On 26 Jul 2024, at 05:24, Anshum Gupta  wrote:
 > >
 > > Congratulations and welcome, Armin!
 > >
 > > On Thu, Jul 25, 2024 at 2:10 AM Luca Cavanna  
 > > wrote:
 > > I'm pleased to announce that Armin Braun has accepted the PMC's 
 > > invitation to become a Lucene committer.
 > >
 > > Armin, the tradition is that new committers introduce themselves with 
 > > a brief bio.
 > >
 > > Thanks for your contributions so far and looking forward to the 
 > > upcoming ones :)
 > >
 > > Congratulations and welcome!
 > >
 > >
 > > --
 > > Anshum Gupta
 >
 >
 >


>>>
>>>
>>> --
>>> - Vigya




Re: github notification delay

2024-07-02 Thread Michael Sokolov
ah that helps, thanks

On Tue, Jul 2, 2024 at 2:41 PM Robert Muir  wrote:
>
> On Tue, Jul 2, 2024 at 1:59 PM Michael Sokolov  wrote:
> >
> > Hi all - I wonder if anyone else is observing weird email behavior
> > from Github. I'm starting to see emails generated from PRs and issues
> > that are wildly out of date. Like one dated yesterday that was
> > generated from a comment that is weeks old. And I am missing many
> > current updates -- as if there is a giant lossy email backlog
> > somewhere that is spewing out occasional misdated updates?!
> >
>
> I'm seeing this too. I'm using https://github.com/notifications in my
> browser as a workaround.
>
>




github notification delay

2024-07-02 Thread Michael Sokolov
Hi all - I wonder if anyone else is observing weird email behavior
from Github. I'm starting to see emails generated from PRs and issues
that are wildly out of date. Like one dated yesterday that was
generated from a comment that is weeks old. And I am missing many
current updates -- as if there is a giant lossy email backlog
somewhere that is spewing out occasional misdated updates?!




Re: [VOTE] Release Lucene 9.11.1 RC1

2024-06-24 Thread Michael Sokolov
SUCCESS! [0:55:48.190137]

(tested w/Corretto JDK)

+1

On Mon, Jun 24, 2024 at 8:01 AM Benjamin Trent  wrote:
>
> SUCCESS! [0:40:46.898514]
>
> +1
>
> On Mon, Jun 24, 2024 at 1:29 AM Ignacio Vera  wrote:
> >
> > Please vote for release candidate 1 for Lucene 9.11.1
> >
> >
> > The artifacts can be downloaded from:
> >
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.1-RC1-rev-0c087dfdd10e0f6f3f6faecc6af4415e671a9e69
> >
> >
> > You can run the smoke tester directly with this command:
> >
> >
> > python3 -u dev-tools/scripts/smokeTestRelease.py \
> >
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.1-RC1-rev-0c087dfdd10e0f6f3f6faecc6af4415e671a9e69
> >
> >
> > The vote will be open for at least 72 hours i.e. until 2024-06-27 07:00 UTC.
> >
> >
> > [ ] +1  approve
> >
> > [ ] +0  no opinion
> >
> > [ ] -1  disapprove (and reason why)
> >
> >
> > Here is my +1
>
>




Re: Intellij build/test times

2024-06-13 Thread Michael Sokolov
Thanks for digging into this Dawid - I think it's important to keep an
IDE dev path pretty clear of underbrush in order to encourage new
joiners, even if it is not the primary or best means of building and
testing

On Thu, Jun 13, 2024 at 2:01 PM Dawid Weiss  wrote:
>
>
> Hi Mike,
>
> Just FYI - I confirm something is odd with the configuration evaluation. The 
> times vary wildly on my machine. I don't know why it's the case and I 
> couldn't pinpoint a clear cause. Once the daemon is running, things are 
> faster - perhaps you should increase the default daemon timeout (it also 
> applies to the IDE, I think):
>
> # timeout after 15 mins of inactivity.
> org.gradle.daemon.idletimeout=900000
>
> I'll try to improve things by refreshing some of the build scripts. I really 
> liked gradle when it started - mostly for its simplicity. I don't like how it 
> turned from a build system to a distributed cache of prebuilt artefacts... eh.
>
> Dawid




Re: scalar quantization heap usage during merge

2024-06-12 Thread Michael Sokolov
 Empirically I thought I saw the need to increase JVM heap with this,
but let me do some more testing to narrow down what is going on. It's
possible the same heap requirements exist for the non-quantized case
and I am just seeing some random vagary of the merge process happening
to tip over a limit. It's also possible I messed something up in
https://github.com/apache/lucene/pull/13469 which I am trying to use
in order to index quantized vectors without building an HNSW graph.
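For context, the quantization being discussed is conceptually simple. The sketch below is illustrative only — not Lucene's implementation — mapping each float dimension to a byte using min/max bounds, which in practice are derived by sampling the vectors (the "quantization settings" the thread refers to):

```java
public class ScalarQuantizeSketch {
  // Quantize one vector into bytes given precomputed bounds.
  static byte[] quantize(float[] v, float min, float max) {
    byte[] out = new byte[v.length];
    float scale = 127f / (max - min);
    for (int i = 0; i < v.length; i++) {
      // Clamp out-of-range values, then map linearly into [0, 127].
      float clamped = Math.min(max, Math.max(min, v[i]));
      out[i] = (byte) Math.round((clamped - min) * scale);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] q = quantize(new float[] {0f, 0.5f, 1f}, 0f, 1f);
    System.out.println(q[0] + " " + q[1] + " " + q[2]); // prints: 0 64 127
  }
}
```

Because the bounds depend on the data distribution, merging segments can require recomputing them from the merged vectors and requantizing — which is where the question about heap usage and a possible two-pass, off-heap approach comes in.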

On Wed, Jun 12, 2024 at 10:24 AM Benjamin Trent  wrote:
>
> Heya Michael,
>
> > the first one I traced was referenced by vector writers involved in a merge 
> > (Lucene99FlatVectorsWriter.FieldsWriter.vectors). Is this expected?
>
> Yes, that is holding the raw floats before flush. You should see
> nearly the exact same overhead there as you would indexing raw
> vectors. I would be surprised if there is a significant memory usage
> difference due to Lucene99FlatVectorsWriter when using quantized vs.
> not.
>
> The flow is this:
>
>  - Lucene99FlatVectorsWriter gets the float[] vector and makes a copy
> of it (does this no matter what) and passes on to the next part of the
> chain
>  - If quantizing, the next part of the chain is
> Lucene99ScalarQuantizedVectorsWriter.FieldsWriter, which only keeps a
> REFERENCE to the array, it doesn't copy it. The float vector array is
> then passed to the HNSW indexer (if its being used), which also does
> NOT copy, but keeps a reference.
>  - If not quantizing but indexing, Lucene99FlatVectorsWriter will pass
> it directly to the hnsw indexer, which does not copy it, but does add
> it to the HNSW graph
>
> > I wonder if there is an opportunity to move some of this off-heap?
>
> I think we could do some things off-heap in the ScalarQuantizer. Maybe
> even during "flush", but we would have to adjust the interfaces some
> so that the scalarquantizer can know where the vectors are being
> stored after the initial flush. Right now there is no way to know the
> file nor file handle.
>
> > I can imagine that when we requantize we need to scan all the vectors to 
> > determine the new quantization settings?
>
> We shouldn't be scanning every vector. We do take a sampling, though
> that sampling can be large. There is here an opportunity for off-heap
> action if possible. Though I don't know how we could do that before
> flush. I could see the off-heap idea helping on merge.
>
> > Maybe we could do two passes - merge the float vectors while recalculating, 
> > and then re-scan to do the actual quantization?
>
> I am not sure what you mean here by "merge the float vectors". If you
> mean simply reading the individual float vector files and combining
> them into a single file, we already do that separately from
> quantizing.
>
> Thank you for digging into this. Glad others are experimenting!
>
> Ben
>
> On Wed, Jun 12, 2024 at 8:57 AM Michael Sokolov  wrote:
> >
> > Hi folks. I've been experimenting with our new scalar quantization
> > support - yay, thanks for adding it! I'm finding that when I index a
> > large number of large vectors, enabling quantization (vs simply
> > indexing the full-width floats) requires more heap - I keep getting
> > OOMs and have to increase heap size. I took a heap dump, and not
> > surprisingly I found some big arrays of floats and bytes, and the
> > first one I traced was referenced by vector writers involved in a
> > merge (Lucene99FlatVectorsWriter.FieldsWriter.vectors). Is this
> > expected? I wonder if there is an opportunity to move some of this
> > off-heap?  I can imagine that when we requantize we need to scan all
> > the vectors to determine the new quantization settings?  Maybe we
> > could do two passes - merge the float vectors while recalculating, and
> > then re-scan to do the actual quantization?
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>




scalar quantization heap usage during merge

2024-06-12 Thread Michael Sokolov
Hi folks. I've been experimenting with our new scalar quantization
support - yay, thanks for adding it! I'm finding that when I index a
large number of large vectors, enabling quantization (vs simply
indexing the full-width floats) requires more heap - I keep getting
OOMs and have to increase heap size. I took a heap dump, and not
surprisingly I found some big arrays of floats and bytes, and the
first one I traced was referenced by vector writers involved in a
merge (Lucene99FlatVectorsWriter.FieldsWriter.vectors). Is this
expected? I wonder if there is an opportunity to move some of this
off-heap?  I can imagine that when we requantize we need to scan all
the vectors to determine the new quantization settings?  Maybe we
could do two passes - merge the float vectors while recalculating, and
then re-scan to do the actual quantization?
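
[Editorial note] The two-pass idea floated here can be sketched as below. This is hypothetical, not Lucene code: pass 1 scans the merged float vectors only to recompute the quantization settings, and pass 2 re-scans them to emit the quantized bytes, so neither pass needs the full float set resident on heap if the vectors are streamed from disk. A naive global min/max stands in for Lucene's sampled quantile computation.

```java
// Hedged sketch of a two-pass merge-then-quantize (hypothetical, not Lucene
// code): pass 1 derives quantization settings, pass 2 applies them.
public class TwoPassQuantize {
    public static byte[][] merge(float[][] mergedVectors) {
        // Pass 1: derive settings - here a naive global min/max instead of
        // Lucene's sampled confidence-interval quantiles.
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (float[] v : mergedVectors) {
            for (float x : v) {
                min = Math.min(min, x);
                max = Math.max(max, x);
            }
        }
        float scale = (max > min) ? 127f / (max - min) : 0f;

        // Pass 2: re-scan and quantize each float to a byte in [0, 127].
        byte[][] out = new byte[mergedVectors.length][];
        for (int i = 0; i < mergedVectors.length; i++) {
            float[] v = mergedVectors[i];
            byte[] q = new byte[v.length];
            for (int j = 0; j < v.length; j++) {
                q[j] = (byte) Math.round((v[j] - min) * scale);
            }
            out[i] = q;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] q = merge(new float[][] {{0f, 0.5f, 1f}});
        System.out.println(q[0][0] + " " + q[0][1] + " " + q[0][2]); // prints: 0 64 127
    }
}
```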




Re: Intellij build/test times

2024-06-10 Thread Michael Sokolov
If I set IJ build/test to "gradle" and then right click on "core" in
the Project tab -- it gives an option like "run tests in
lucene-root.lucene.core" which works. At the very top (lucene
[lucene-root]) of the hierarchy you can right-click and select "run
all tests", but this fails with "Error running 'All in lucene-root':
No junit.jar". I thought this had once worked, but maybe I was only
running tests in core?

On Mon, Jun 10, 2024 at 9:37 AM Dawid Weiss  wrote:
>>
>> When I say "run in IJ" I mean I right clicked a button somewhere and said 
>> "run all tests" :) I expect it was with the gradle runner selected.
>
>
> When you find that button, let me know. It's probably right next to the Holy 
> Grail. ;)
>
> Dawid




Re: Intellij build/test times

2024-06-10 Thread Michael Sokolov
>
> Yet I feel certain I have been able to run all tests in IJ before.
>
>
>
> I don't think this was ever the case with intellij. Or maybe you ran those
> tests via gradle?


When I say "run in IJ" I mean I right clicked a button somewhere and said
"run all tests" :) I expect it was with the gradle runner selected.


On Mon, Jun 10, 2024 at 6:38 AM Dawid Weiss  wrote:

>
> Yet I feel certain I have been able to run all tests in IJ before.
>>
>
> I don't think this was ever the case with intellij. Or maybe you ran those
> tests via gradle?
>
>
>> There are a few oddities that happen in intellij that require you to
>> fiddle with the build in odd ways, but I wonder if these will be
>> reproducible or if they maybe happen because there is some bad state:
>>
>
> Intellij changes from version to version so there is no "one" version,
> unfortunately. Also, sometimes
> some of the settings intellij sets up on the initial import persist and
> 'reloading' the gradle plugin does not
> help to update them. An occasional import from scratch is a good way to
> check if something like this happens.
>
> 1. Building branch_9x with intellij builder selected in the gradle
>> settings failed to build the benchmark module due to some modules not
>> being visible to it (e.g. icu). So I "unload module benchmark"
>> effectively skipping building that, and then I am able to build the
>> rest of lucene. YMMV
>>
>
> I can't reproduce this on main, haven't tried 9x.
>
>
>> 2. After switching back to main branch, I got a build failure  "error:
>> Annotation generator had thrown the exception.
>> javax.annotation.processing.FilerException: Attempt to recreate a file
>> for type
>> org.apache.lucene.benchmark.jmh.jmh_generated.ExpressionsBenchmark_expression_jmhTest".
>> I see there are some generated classes in
>> lucene/benchmark_jmh/src/java/generated, that show up in git status,
>> so I remove that folder and then everything is fine - some cruft left
>> from a previous build?
>>
>
> This is intellij's compiler emitting annotation processor output (jmh) to
> an incorrect location.
> It's javac's '-s' option. Not sure how to configure this option so that
> intellij picks it up from the gradle build model.
>
>
>> Side note: when running all tests in "intellij" mode you cannot do it
>> by selecting the "core" module - you have to navigate down to the
>> "tests" folder.
>
>
> Correct. This is the classpath-container unit which contains tests.
>
>
>> Also I observed that when running tests in "gradle"
>> mode I no longer observed the slow startup times? Really unsure what
>> that means. Maybe some networking thing?
>>
>
> Or the daemon starting for the first time - this is relatively expensive.
> Once the daemon is up, launch times should
> be faster.
>
>
>> But the main thing I learned is that while running tests using
>> intellij builder mostly works, MemorySegmentIndexInputProvider fails
>> to get loaded and any test using MMapDirectory will fail, regardless
>> of whether I run a single test or a whole suite. This is true on both
>> 9x and main branches and causes 1/3-1/2 of tests to fail in core.
>>
>
> This class is in src/java21 - it's not picked up as a source folder by
> intellij. And if you add it manually, you'll get errors related to
> compilation because of the way the gradle build "cheats" javac and bypasses
> explicitly importing
> jdk.incubator.vector and not declaring --enable-preview...
>
> In short: yes, it'll be difficult to work around that, especially with an
> automatic project import in intellij (perhaps you could hand-craft
> configuration files so that it works, I'm not sure).
>
>
>> At this point I'm reluctant to recommend using the intellij build
>> mode. Maybe it will become viable again if we can figure out how to
>> get MMapDirectory tests to work with it?
>>
>
> I use it because it's much faster. Whenever I need something more
> complex, I set up a dedicated gradle launch configuration for that - like
> so:
>
> [image: image.png]
>
> I'm neither gradle nor intellij expert though, it's mostly a
> trial-and-error of what works and what doesn't...
>
> D.
>


Re: Intellij build/test times

2024-06-09 Thread Michael Sokolov
OK, I can see how the directory structure might be at odds
w/intellij's view of the world. Yet I feel certain I have been able to
run all tests in IJ before.

Just to disconfirm my insanity I tried again building and running all
tests in core on branch_9x/main using both intellij and gradle
build/test options (both from within intellij).

There are a few oddities that happen in intellij that require you to
fiddle with the build in odd ways, but I wonder if these will be
reproducible or if they maybe happen because there is some bad state:

1. Building branch_9x with intellij builder selected in the gradle
settings failed to build the benchmark module due to some modules not
being visible to it (e.g. icu). So I "unload module benchmark"
effectively skipping building that, and then I am able to build the
rest of lucene. YMMV
2. After switching back to main branch, I got a build failure  "error:
Annotation generator had thrown the exception.
javax.annotation.processing.FilerException: Attempt to recreate a file
for type 
org.apache.lucene.benchmark.jmh.jmh_generated.ExpressionsBenchmark_expression_jmhTest".
I see there are some generated classes in
lucene/benchmark_jmh/src/java/generated, that show up in git status,
so I remove that folder and then everything is fine - some cruft left
from a previous build?

Side note: when running all tests in "intellij" mode you cannot do it
by selecting the "core" module - you have to navigate down to the
"tests" folder. Also I observed that when running tests in "gradle"
mode I no longer observed the slow startup times? Really unsure what
that means. Maybe some networking thing?

But the main thing I learned is that while running tests using
intellij builder mostly works, MemorySegmentIndexInputProvider fails
to get loaded and any test using MMapDirectory will fail, regardless
of whether I run a single test or a whole suite. This is true on both
9x and main branches and causes 1/3-1/2 of tests to fail in core.

At this point I'm reluctant to recommend using the intellij build
mode. Maybe it will become viable again if we can figure out how to
get MMapDirectory tests to work with it?

On Sat, Jun 8, 2024 at 4:06 PM Dawid Weiss  wrote:
>
>
>> By the way, the
>> classpath problems seem to occur with either method (gradle or
>> intellij) when running entire suite - I just confused while switching
>> back and forth. This is on main, haven't tried 9x recently
>
>
> Some of these headaches are caused by Lucene's folder structure and have been 
> there forever - resources mixed with source classes. I don't know if you can 
> make intellij use a folder as a resource and as source directory at the same 
> time - I don't think it's possible. If so, tests that rely on these resources 
> will fail. It's been this way since I remember - nothing has changed here.
>
> There is also a lot of trickery involving modular paths etc. I don't think 
> it'll be easy to simulate this in Intellij. Then, 99% of test cases will run 
> just fine from intellij without any special hacks (I think)...
>
> I'd say - run individual tests from intellij, add a test launching config 
> redirecting to gradle for the whole suite - it should also be faster this way 
> since tests will run in parallel between modules.
>
> D.




Re: Intellij build/test times

2024-06-08 Thread Michael Sokolov
Indeed, it's when I run multiple tests that I see the problems.
Running single test classes seems to work OK. In the past I have been
able to run the entire test suite, but I agree this is less critical
than being able to debug single tests. Cursory internet search
indicates the problem is widespread and others propose using the same
plan - don't use gradle test runner in intellij. By the way, the
classpath problems seem to occur with either method (gradle or
intellij) when running the entire suite - I just got confused while switching
back and forth. This is on main, haven't tried 9x recently

On Fri, Jun 7, 2024 at 4:05 PM Dawid Weiss  wrote:
>
>
> Hi Mike,
>
> Are you trying to run all the tests from Lucene from IntelliJ? I admit I 
> haven't tried that... :) I usually use intellij for running/ debugging 
> isolated classes, then rerun the full suite from command line (increased 
> parallelism). I don't think everything will work - if something needs a 
> specific setup done by gradle tasks or has resources under src, where they're 
> not seed as resources by intellij and thus not copied - tough luck. But most 
> stuff should work.
>
> Running via gradle is slow for me not just with Lucene but also with other 
> projects... I can take a look but I'm pessimistic I can do any wonders here.
>
> Dawid
>
> On Fri, Jun 7, 2024 at 6:06 PM Michael Sokolov  wrote:
>>
>> I'm also getting errors like:
>>
>> Caused by: java.lang.ExceptionInInitializerError: Exception
>> java.lang.LinkageError: MemorySegmentIndexInputProvider is missing in
>> Lucene JAR file [in thread
>> "TEST-TestDemo.testDemo-seed#[872544629C2881C6]"]
>>
>> I wonder if this is due to some kind of module permissions thing
>> controlling the visibility of these symbols?
>>
>> On Fri, Jun 7, 2024 at 11:53 AM Michael Sokolov  wrote:
>> >
>> > hm I found FakeCharFilterFactory in src/test/META-INF.services -- it's
>> > in a "test sources root" folder and won't allow itself to be set as a
>> > resources folder? hm even after fiddling with this - I finally get to
>> > mark it as "test resources root" my test is still not passing. This
>> > can't be this hard!
>> >
>> > On Fri, Jun 7, 2024 at 11:44 AM Michael Sokolov  wrote:
>> > >
>> > > hmm so after playing around with this Intellij build for a bit I ran
>> > > into some trouble -- all the tests relying on SPI seemed to start
>> > > failing. So then I switched back to build with Gradle and rebuild the
>> > > project and these tests passed. Just to double check there wasn't some
>> > > strange stale build problem, I think switched back again to IntelliJ
>> > > builder and I still see the same failures; example is like:
>> > >
>> > > NOTE: reproduce with: gradlew test --tests
>> > > TestAnalysisSPILoader.testLookupCharFilter
>> > > -Dtests.seed=88A2DA17C6510A33 -Dtests.locale=en-PR
>> > > -Dtests.timezone=Etc/GMT-9 -Dtests.asserts=true
>> > > -Dtests.file.encoding=UTF-8
>> > >
>> > > java.lang.IllegalArgumentException: A SPI class of type
>> > > org.apache.lucene.analysis.CharFilterFactory with name 'Fake' does not
>> > > exist. You need to add the corresponding JAR file supporting this SPI
>> > > to your classpath. The current classpath supports the following names:
>> > > []
>> > >
>> > > I guess there must be some setup required in order to expose the SPI
>> > > resource files to the build? So I checked some of the resources
>> > > folders like lucene/analysis/common/src/resources and sure enough it
>> > > is labeled as a resources folder in intellij UI. So ... what am I
>> > > missing?
>> > >
>> > > On Fri, Jun 7, 2024 at 10:40 AM Michael Sokolov  
>> > > wrote:
>> > > >
>> > > > ok, life must be scary for developers on windows!
>> > > >
>> > > > On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss  
>> > > > wrote:
>> > > > >
>> > > > >
>> > > > > Certain regenerate tasks do require perl and python indeed.
>> > > > >
>> > > > > On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov  
>> > > > > wrote:
>> > > > >>
>> > > > >> While editing this CONTRIBUTING.md I found the following statement:
>> > > > >>
>> > > > >> Some build 

Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
I'm also getting errors like:

Caused by: java.lang.ExceptionInInitializerError: Exception
java.lang.LinkageError: MemorySegmentIndexInputProvider is missing in
Lucene JAR file [in thread
"TEST-TestDemo.testDemo-seed#[872544629C2881C6]"]

I wonder if this is due to some kind of module permissions thing
controlling the visibility of these symbols?

On Fri, Jun 7, 2024 at 11:53 AM Michael Sokolov  wrote:
>
> hm I found FakeCharFilterFactory in src/test/META-INF.services -- it's
> in a "test sources root" folder and won't allow itself to be set as a
> resources folder? hm even after fiddling with this - I finally get to
> mark it as "test resources root" my test is still not passing. This
> can't be this hard!
>
> On Fri, Jun 7, 2024 at 11:44 AM Michael Sokolov  wrote:
> >
> > hmm so after playing around with this Intellij build for a bit I ran
> > into some trouble -- all the tests relying on SPI seemed to start
> > failing. So then I switched back to build with Gradle and rebuild the
> > project and these tests passed. Just to double check there wasn't some
> > strange stale build problem, I think switched back again to IntelliJ
> > builder and I still see the same failures; example is like:
> >
> > NOTE: reproduce with: gradlew test --tests
> > TestAnalysisSPILoader.testLookupCharFilter
> > -Dtests.seed=88A2DA17C6510A33 -Dtests.locale=en-PR
> > -Dtests.timezone=Etc/GMT-9 -Dtests.asserts=true
> > -Dtests.file.encoding=UTF-8
> >
> > java.lang.IllegalArgumentException: A SPI class of type
> > org.apache.lucene.analysis.CharFilterFactory with name 'Fake' does not
> > exist. You need to add the corresponding JAR file supporting this SPI
> > to your classpath. The current classpath supports the following names:
> > []
> >
> > I guess there must be some setup required in order to expose the SPI
> > resource files to the build? So I checked some of the resources
> > folders like lucene/analysis/common/src/resources and sure enough it
> > is labeled as a resources folder in intellij UI. So ... what am I
> > missing?
> >
> > On Fri, Jun 7, 2024 at 10:40 AM Michael Sokolov  wrote:
> > >
> > > ok, life must be scary for developers on windows!
> > >
> > > On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss  wrote:
> > > >
> > > >
> > > > Certain regenerate tasks do require perl and python indeed.
> > > >
> > > > On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov  
> > > > wrote:
> > > >>
> > > >> While editing this CONTRIBUTING.md I found the following statement:
> > > >>
> > > >> Some build tasks (in particular `./gradlew check`) require Perl
> > > >> and Python 3.
> > > >>
> > > >> Is it actually true that we require Perl?
> > > >>
> > > >> On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov  
> > > >> wrote:
> > > >> >
> > > >> > So I'm glad we have a fix for this, but it's making me realize that
> > > >> > any new joiner that uses intellij (probably most of them?) will have
> > > >> > this problem and have no idea what to do about it. They will just
> > > >> > conclude - running Lucene tests in intellij sucks. If we revived that
> > > >> > intellij target maybe that would help - but .. you would have to know
> > > >> > to run it! So then I went to look at our project web page to see what
> > > >> > kind of developer docs we have that a new contributor might find.
> > > >> >
> > > >> > The first place Google sent me was to our github page
>> > >> > https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
> > > >> > some very brief description about how to build, but nothing about
> > > >> > intellij. It does have a prominent link to "Developer documentation"
> > > >> > which is here: https://github.com/apache/lucene/tree/main/dev-docs 
> > > >> > but
> > > >> > that folder is mostly empty; it has a few somewhat esoteric bits of
> > > >> > info, but again nothing basic about building and testing; no
> > > >> > discussion of all the myriad gradle tasks and deep help info that
> > > >> > exists there.
> > > >> >
> > > >> > Next I tried looking on apache.org, but actually it is quite hard to
> > > >> > find an

Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
hm I found FakeCharFilterFactory in src/test/META-INF.services -- it's
in a "test sources root" folder and won't allow itself to be set as a
resources folder? hm even after fiddling with this - I finally get to
mark it as "test resources root" my test is still not passing. This
can't be this hard!

On Fri, Jun 7, 2024 at 11:44 AM Michael Sokolov  wrote:
>
> hmm so after playing around with this Intellij build for a bit I ran
> into some trouble -- all the tests relying on SPI seemed to start
> failing. So then I switched back to build with Gradle and rebuild the
> project and these tests passed. Just to double check there wasn't some
> strange stale build problem, I think switched back again to IntelliJ
> builder and I still see the same failures; example is like:
>
> NOTE: reproduce with: gradlew test --tests
> TestAnalysisSPILoader.testLookupCharFilter
> -Dtests.seed=88A2DA17C6510A33 -Dtests.locale=en-PR
> -Dtests.timezone=Etc/GMT-9 -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
>
> java.lang.IllegalArgumentException: A SPI class of type
> org.apache.lucene.analysis.CharFilterFactory with name 'Fake' does not
> exist. You need to add the corresponding JAR file supporting this SPI
> to your classpath. The current classpath supports the following names:
> []
>
> I guess there must be some setup required in order to expose the SPI
> resource files to the build? So I checked some of the resources
> folders like lucene/analysis/common/src/resources and sure enough it
> is labeled as a resources folder in intellij UI. So ... what am I
> missing?
>
> On Fri, Jun 7, 2024 at 10:40 AM Michael Sokolov  wrote:
> >
> > ok, life must be scary for developers on windows!
> >
> > On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss  wrote:
> > >
> > >
> > > Certain regenerate tasks do require perl and python indeed.
> > >
> > > On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov  wrote:
> > >>
> > >> While editing this CONTRIBUTING.md I found the following statement:
> > >>
> > >> Some build tasks (in particular `./gradlew check`) require Perl
> > >> and Python 3.
> > >>
> > >> Is it actually true that we require Perl?
> > >>
> > >> On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov  
> > >> wrote:
> > >> >
> > >> > So I'm glad we have a fix for this, but it's making me realize that
> > >> > any new joiner that uses intellij (probably most of them?) will have
> > >> > this problem and have no idea what to do about it. They will just
> > >> > conclude - running Lucene tests in intellij sucks. If we revived that
> > >> > intellij target maybe that would help - but .. you would have to know
> > >> > to run it! So then I went to look at our project web page to see what
> > >> > kind of developer docs we have that a new contributor might find.
> > >> >
> > >> > The first place Google sent me was to our github page
> > >> > https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
> > >> > some very brief description about how to build, but nothing about
> > >> > intellij. It does have a prominent link to "Developer documentation"
> > >> > which is here: https://github.com/apache/lucene/tree/main/dev-docs but
> > >> > that folder is mostly empty; it has a few somewhat esoteric bits of
> > >> > info, but again nothing basic about building and testing; no
> > >> > discussion of all the myriad gradle tasks and deep help info that
> > >> > exists there.
> > >> >
> > >> > Next I tried looking on apache.org, but actually it is quite hard to
> > >> > find any info about Lucene there - Apache just has too many projects.
> > >> > I did finally find this page though
> > >> > https://projects.apache.org/project.html?lucene-core and it links to
> > >> > https://lucene.apache.org/core/. From there, I see a "Developer" link,
> > >> > again this page has a paucity of info; basically it links you to
> > >> > github, jenkins, and to the wiki. The "wiki" link actually just takes
> > >> > you to a different github page -- and *this* one actually has some
> > >> > useful info on how to build -- I think it's our best "intro" page for
> > >> > a new developer. However all it says about IntelliJ is: "Int

Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
hmm so after playing around with this Intellij build for a bit I ran
into some trouble -- all the tests relying on SPI seemed to start
failing. So then I switched back to building with Gradle and rebuilt the
project and these tests passed. Just to double-check there wasn't some
strange stale build problem, I then switched back again to IntelliJ
builder and I still see the same failures; example is like:

NOTE: reproduce with: gradlew test --tests
TestAnalysisSPILoader.testLookupCharFilter
-Dtests.seed=88A2DA17C6510A33 -Dtests.locale=en-PR
-Dtests.timezone=Etc/GMT-9 -Dtests.asserts=true
-Dtests.file.encoding=UTF-8

java.lang.IllegalArgumentException: A SPI class of type
org.apache.lucene.analysis.CharFilterFactory with name 'Fake' does not
exist. You need to add the corresponding JAR file supporting this SPI
to your classpath. The current classpath supports the following names:
[]

I guess there must be some setup required in order to expose the SPI
resource files to the build? So I checked some of the resources
folders like lucene/analysis/common/src/resources and sure enough it
is labeled as a resources folder in intellij UI. So ... what am I
missing?
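
[Editorial note] The failure above comes down to the standard Java SPI mechanism, which the following sketch illustrates with a JDK-provided provider interface (the class name here is invented for illustration). ServiceLoader only discovers implementations whose class names appear in a META-INF/services/<interface-name> resource on the runtime classpath; if an IDE build does not copy those resource folders, every lookup comes back empty, matching the "current classpath supports the following names: []" message in the exception.

```java
import java.nio.charset.spi.CharsetProvider;
import java.util.ServiceLoader;

// Illustrative sketch of SPI lookup: ServiceLoader reads provider names from
// META-INF/services resources on the classpath. Without those resources it
// silently finds nothing - just as Lucene's CharFilterFactory lookup found
// no names when the resource folders were not copied by the IDE build.
public class SpiLookupSketch {
    public static void main(String[] args) {
        ServiceLoader<CharsetProvider> loader = ServiceLoader.load(CharsetProvider.class);
        int found = 0;
        for (CharsetProvider provider : loader) {
            System.out.println("provider: " + provider.getClass().getName());
            found++;
        }
        // With no third-party charset providers registered, the loop finds nothing.
        System.out.println("providers found: " + found);
    }
}
```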

On Fri, Jun 7, 2024 at 10:40 AM Michael Sokolov  wrote:
>
> ok, life must be scary for developers on windows!
>
> On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss  wrote:
> >
> >
> > Certain regenerate tasks do require perl and python indeed.
> >
> > On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov  wrote:
> >>
> >> While editing this CONTRIBUTING.md I found the following statement:
> >>
> >> Some build tasks (in particular `./gradlew check`) require Perl
> >> and Python 3.
> >>
> >> Is it actually true that we require Perl?
> >>
> >> On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov  wrote:
> >> >
> >> > So I'm glad we have a fix for this, but it's making me realize that
> >> > any new joiner that uses intellij (probably most of them?) will have
> >> > this problem and have no idea what to do about it. They will just
> >> > conclude - running Lucene tests in intellij sucks. If we revived that
> >> > intellij target maybe that would help - but .. you would have to know
> >> > to run it! So then I went to look at our project web page to see what
> >> > kind of developer docs we have that a new contributor might find.
> >> >
> >> > The first place Google sent me was to our github page
> >> > https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
> >> > some very brief description about how to build, but nothing about
> >> > intellij. It does have a prominent link to "Developer documentation"
> >> > which is here: https://github.com/apache/lucene/tree/main/dev-docs but
> >> > that folder is mostly empty; it has a few somewhat esoteric bits of
> >> > info, but again nothing basic about building and testing; no
> >> > discussion of all the myriad gradle tasks and deep help info that
> >> > exists there.
> >> >
> >> > Next I tried looking on apache.org, but actually it is quite hard to
> >> > find any info about Lucene there - Apache just has too many projects.
> >> > I did finally find this page though
> >> > https://projects.apache.org/project.html?lucene-core and it links to
> >> > https://lucene.apache.org/core/. From there, I see a "Developer" link,
> >> > again this page has a paucity of info; basically it links you to
> >> > github, jenkins, and to the wiki. The "wiki" link actually just takes
> >> > you to a different github page -- and *this* one actually has some
> >> > useful info on how to build -- I think it's our best "intro" page for
> >> > a new developer. However all it says about IntelliJ is: "IntelliJ -
> >> > IntelliJ idea can import and build gradle-based projects out of the
> >> > box." true, sort of.
> >> >
> >> > So I think I will (1) add a note about this IJ build setting to that
> >> > page, and (2) consolidate some of the other links to go here instead
> >> > of routing folks through a twisty maze of web pages
> >> >
> >> > On Fri, Jun 7, 2024 at 7:45 AM Stefan Vodita  
> >> > wrote:
> >> > >
> >> > > +1, I had the same problem and it seems better now. Thank you, Dawid!
> >> > >
> >> > > On Thu, 6 Jun 2024 at 12:20, Michael Sokolov  
> >> > > wrote:
> >> > >>
> >> > >> Oh! 

Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
ok, life must be scary for developers on windows!

On Fri, Jun 7, 2024 at 10:33 AM Dawid Weiss  wrote:
>
>
> Certain regenerate tasks do require perl and python indeed.
>
> On Fri, Jun 7, 2024 at 2:23 PM Michael Sokolov  wrote:
>>
>> While editing this CONTRIBUTING.md I found the following statement:
>>
>> Some build tasks (in particular `./gradlew check`) require Perl
>> and Python 3.
>>
>> Is it actually true that we require Perl?
>>
>> On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov  wrote:
>> >
>> > So I'm glad we have a fix for this, but it's making me realize that
>> > any new joiner that uses intellij (probably most of them?) will have
>> > this problem and have no idea what to do about it. They will just
>> > conclude - running Lucene tests in intellij sucks. If we revived that
>> > intellij target maybe that would help - but .. you would have to know
>> > to run it! So then I went to look at our project web page to see what
>> > kind of developer docs we have that a new contributor might find.
>> >
>> > The first place Google sent me was to our github page
>> > https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
>> > some very brief description about how to build, but nothing about
>> > intellij. It does have a prominent link to "Developer documentation"
>> > which is here: https://github.com/apache/lucene/tree/main/dev-docs but
>> > that folder is mostly empty; it has a few somewhat esoteric bits of
>> > info, but again nothing basic about building and testing; no
>> > discussion of all the myriad gradle tasks and deep help info that
>> > exists there.
>> >
>> > Next I tried looking on apache.org, but actually it is quite hard to
>> > find any info about Lucene there - Apache just has too many projects.
>> > I did finally find this page though
>> > https://projects.apache.org/project.html?lucene-core and it links to
>> > https://lucene.apache.org/core/. From there, I see a "Developer" link,
>> > again this page has a paucity of info; basically it links you to
>> > github, jenkins, and to the wiki. The "wiki" link actually just takes
>> > you to a different github page -- and *this* one actually has some
>> > useful info on how to build -- I think it's our best "intro" page for
>> > a new developer. However all it says about IntelliJ is: "IntelliJ -
>> > IntelliJ idea can import and build gradle-based projects out of the
>> > box." true, sort of.
>> >
>> > So I think I will (1) add a note about this IJ build setting to that
>> > page, and (2) consolidate some of the other links to go here instead
>> > of routing folks through a twisty maze of web pages
>> >
>> > On Fri, Jun 7, 2024 at 7:45 AM Stefan Vodita  
>> > wrote:
>> > >
>> > > +1, I had the same problem and it seems better now. Thank you, Dawid!
>> > >
>> > > On Thu, 6 Jun 2024 at 12:20, Michael Sokolov  wrote:
>> > >>
>> > >> Oh! TIL! so much better, thanks. And now I have the "Repeat" option
>> > >> back in the test runner
>> > >>
>> > >> On Thu, Jun 6, 2024 at 6:18 AM Dawid Weiss  
>> > >> wrote:
>> > >> >
>> > >> >
>> > >> > Don't know what's causing this... but I never run IntelliJ builds or 
>> > >> > tests through its gradle launcher, actually. Switch it to compile and 
>> > >> > run using its own built-in method - much faster.
>> > >> >
>> > >> >
>> > >> >
>> > >> > Dawid
>> > >> >
>> > >> > On Thu, Jun 6, 2024 at 12:10 PM Michael Sokolov  
>> > >> > wrote:
>> > >> >>
>> > >> >> Hi, I wonder how many of us are using intellij to run Lucene tests, 
>> > >> >> and if you are, have you noticed it having gotten really quite slow? 
>> > >> >> It seems to take a long time doing... Something... Before the test 
>> > >> >> starts running. I have a suspicion that we are using gradle in a way 
>> > >> >> that forces it to rebuild its cache every time or something like 
>> > >> >> that. Once upon a time we had an intellij build setup target that 
>> > >> >> set things up in a more intellij friendly way, according gradle, 
>> > >> >> didn't we? Does that still exist?
>> > >>
>> > >> -
>> > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> > >>
>>
>>




Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
While editing this CONTRIBUTING.md I found the following statement:

Some build tasks (in particular `./gradlew check`) require Perl
and Python 3.

Is it actually true that we require Perl?
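One quick way to sanity-check that claim (a sketch only — run from a Lucene
checkout, and the paths searched are my guess, not a definitive list):

```python
# Sketch: report build files that mention "perl", to gauge whether the Perl
# requirement stated in CONTRIBUTING.md is still real. The searched paths
# ("gradle/" plus top-level *.gradle files) are an assumption.
from pathlib import Path

candidates = list(Path("gradle").rglob("*.gradle")) + list(Path(".").glob("*.gradle"))
hits = [p for p in candidates if "perl" in p.read_text(errors="ignore").lower()]
print(hits if hits else "no perl references found")
```

If nothing turns up outside of release tooling, the CONTRIBUTING.md note may
just be stale.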

On Fri, Jun 7, 2024 at 8:11 AM Michael Sokolov  wrote:
>
> So I'm glad we have a fix for this, but it's making me realize that
> any new joiner that uses intellij (probably most of them?) will have
> this problem and have no idea what to do about it. They will just
> conclude - running Lucene tests in intellij sucks. If we revived that
> intellij target maybe that would help - but .. you would have to know
> to run it! So then I went to look at our project web page to see what
> kind of developer docs we have that a new contributor might find.
>
> The first place Google sent me was to our github page
> https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
> some very brief description about how to build, but nothing about
> intellij. It does have a prominent link to "Developer documentation"
> which is here: https://github.com/apache/lucene/tree/main/dev-docs but
> that folder is mostly empty; it has a few somewhat esoteric bits of
> info, but again nothing basic about building and testing; no
> discussion of all the myriad gradle tasks and deep help info that
> exists there.
>
> Next I tried looking on apache.org, but actually it is quite hard to
> find any info about Lucene there - Apache just has too many projects.
> I did finally find this page though
> https://projects.apache.org/project.html?lucene-core and it links to
> https://lucene.apache.org/core/. From there, I see a "Developer" link,
> again this page has a paucity of info; basically it links you to
> github, jenkins, and to the wiki. The "wiki" link actually just takes
> you to a different github page -- and *this* one actually has some
> useful info on how to build -- I think it's our best "intro" page for
> a new developer. However all it says about IntelliJ is: "IntelliJ -
> IntelliJ idea can import and build gradle-based projects out of the
> box." true, sort of.
>
> So I think I will (1) add a note about this IJ build setting to that
> page, and (2) consolidate some of the other links to go here instead
> of routing folks through a twisty maze of web pages
>
> On Fri, Jun 7, 2024 at 7:45 AM Stefan Vodita  wrote:
> >
> > +1, I had the same problem and it seems better now. Thank you, Dawid!
> >
> > On Thu, 6 Jun 2024 at 12:20, Michael Sokolov  wrote:
> >>
> >> Oh! TIL! so much better, thanks. And now I have the "Repeat" option
> >> back in the test runner
> >>
> >> On Thu, Jun 6, 2024 at 6:18 AM Dawid Weiss  wrote:
> >> >
> >> >
> >> > Don't know what's causing this... but I never run IntelliJ builds or 
> >> > tests through its gradle launcher, actually. Switch it to compile and 
> >> > run using its own built-in method - much faster.
> >> >
> >> >
> >> >
> >> > Dawid
> >> >
> >> > On Thu, Jun 6, 2024 at 12:10 PM Michael Sokolov  
> >> > wrote:
> >> >>
> >> >> Hi, I wonder how many of us are using intellij to run Lucene tests, and 
> >> >> if you are, have you noticed it having gotten really quite slow? It 
> >> >> seems to take a long time doing... Something... Before the test starts 
> >> >> running. I have a suspicion that we are using gradle in a way that 
> >> >> forces it to rebuild its cache every time or something like that. Once 
> >> >> upon a time we had an intellij build setup target that set things up in 
> >> >> a more intellij friendly way, according to gradle, didn't we? Does that 
> >> >> still exist?
> >>
> >>




Re: Intellij build/test times

2024-06-07 Thread Michael Sokolov
So I'm glad we have a fix for this, but it's making me realize that
any new joiner that uses intellij (probably most of them?) will have
this problem and have no idea what to do about it. They will just
conclude - running Lucene tests in intellij sucks. If we revived that
intellij target maybe that would help - but .. you would have to know
to run it! So then I went to look at our project web page to see what
kind of developer docs we have that a new contributor might find.

The first place Google sent me was to our github page
https://github.com/apache/lucene/?tab=readme-ov-file -- that one has
some very brief description about how to build, but nothing about
intellij. It does have a prominent link to "Developer documentation"
which is here: https://github.com/apache/lucene/tree/main/dev-docs but
that folder is mostly empty; it has a few somewhat esoteric bits of
info, but again nothing basic about building and testing; no
discussion of all the myriad gradle tasks and deep help info that
exists there.

Next I tried looking on apache.org, but actually it is quite hard to
find any info about Lucene there - Apache just has too many projects.
I did finally find this page though
https://projects.apache.org/project.html?lucene-core and it links to
https://lucene.apache.org/core/. From there, I see a "Developer" link,
again this page has a paucity of info; basically it links you to
github, jenkins, and to the wiki. The "wiki" link actually just takes
you to a different github page -- and *this* one actually has some
useful info on how to build -- I think it's our best "intro" page for
a new developer. However all it says about IntelliJ is: "IntelliJ -
IntelliJ idea can import and build gradle-based projects out of the
box." true, sort of.

So I think I will (1) add a note about this IJ build setting to that
page, and (2) consolidate some of the other links to go here instead
of routing folks through a twisty maze of web pages

On Fri, Jun 7, 2024 at 7:45 AM Stefan Vodita  wrote:
>
> +1, I had the same problem and it seems better now. Thank you, Dawid!
>
> On Thu, 6 Jun 2024 at 12:20, Michael Sokolov  wrote:
>>
>> Oh! TIL! so much better, thanks. And now I have the "Repeat" option
>> back in the test runner
>>
>> On Thu, Jun 6, 2024 at 6:18 AM Dawid Weiss  wrote:
>> >
>> >
>> > Don't know what's causing this... but I never run IntelliJ builds or tests 
>> > through its gradle launcher, actually. Switch it to compile and run using 
>> > its own built-in method - much faster.
>> >
>> >
>> >
>> > Dawid
>> >
>> > On Thu, Jun 6, 2024 at 12:10 PM Michael Sokolov  wrote:
>> >>
>> >> Hi, I wonder how many of us are using intellij to run Lucene tests, and 
>> >> if you are, have you noticed it having gotten really quite slow? It seems 
>> >> to take a long time doing... Something... Before the test starts running. 
>> >> I have a suspicion that we are using gradle in a way that forces it to 
>> >> rebuild its cache every time or something like that. Once upon a time we 
>> >> had an intellij build setup target that set things up in a more intellij 
>> >> friendly way, according to gradle, didn't we? Does that still exist?
>>
>>




Re: Intellij build/test times

2024-06-06 Thread Michael Sokolov
Oh! TIL! so much better, thanks. And now I have the "Repeat" option
back in the test runner

On Thu, Jun 6, 2024 at 6:18 AM Dawid Weiss  wrote:
>
>
> Don't know what's causing this... but I never run IntelliJ builds or tests 
> through its gradle launcher, actually. Switch it to compile and run using its 
> own built-in method - much faster.
>
>
>
> Dawid
>
> On Thu, Jun 6, 2024 at 12:10 PM Michael Sokolov  wrote:
>>
>> Hi, I wonder how many of us are using intellij to run Lucene tests, and if 
>> you are, have you noticed it having gotten really quite slow? It seems to 
>> take a long time doing... Something... Before the test starts running. I 
>> have a suspicion that we are using gradle in a way that forces it to rebuild 
>> its cache every time or something like that. Once upon a time we had an 
>> intellij build setup target that set things up in a more intellij friendly 
>> way, according to gradle, didn't we? Does that still exist?




Intellij build/test times

2024-06-06 Thread Michael Sokolov
Hi, I wonder how many of us are using intellij to run Lucene tests, and if
you are, have you noticed it having gotten really quite slow? It seems to
take a long time doing... Something... Before the test starts running. I
have a suspicion that we are using gradle in a way that forces it to
rebuild its cache every time or something like that. Once upon a time we
had an intellij build setup target that set things up in a more intellij
friendly way, according to gradle, didn't we? Does that still exist?


Re: [VOTE] Release Lucene 9.11.0 RC1

2024-06-03 Thread Michael Sokolov
+1

(tested w/Amazon Corretto JVM)
SUCCESS! [0:46:40.066524]

On Mon, Jun 3, 2024 at 7:30 AM Benjamin Trent  wrote:
>
> Please vote for release candidate 1 for Lucene 9.11.0
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.0-RC1-rev-d433394b292e3562e0bb34222f7dd4f307e2b8ca
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.0-RC1-rev-d433394b292e3562e0bb34222f7dd4f307e2b8ca
>
> The vote will be open for at least 72 hours i.e. until 2024-06-06 12:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
>
> Thanks!
>
> Ben Trent




Re: Lucene 9.11

2024-05-28 Thread Michael Sokolov
I misread this as "Lucene 911" as in "Lucene Emergency!!!" -- might
not land for everyone - someday we will have Lucene 11.2? But ... no
concerns from me aside from the things you mentioned - thanks for
pushing, Ben

On Tue, May 28, 2024 at 9:58 AM Benjamin Trent  wrote:
>
> Hey y'all,
>
> I am planning on starting the release process tomorrow (May 29).
>
> I am in the Eastern USA time zone, so I will start the process around noon 
> UTC.
>
> I noticed one PR from Stefan. I can wait for that one if I need to.
>
> Did we figure out the hppc concerns? I saw some PR activity, wanted to make 
> sure we are all still good with starting the release process this week.
>
> Anything else I should be aware of or wait for?
>
> Thanks!
>
> Ben Trent
>
> On Wed, May 15, 2024, 3:58 AM Chris Hegarty 
>  wrote:
>>
>> +1
>>
>> -Chris.
>>
>> > On 14 May 2024, at 16:10, Adrien Grand  wrote:
>> >
>> > +1 the 9.11 changelog looks great!
>> >
>> > On Tue, May 14, 2024 at 4:50 PM Benjamin Trent  
>> > wrote:
>> > Hey y'all,
>> >
>> > Looking at changes for 9.11, we are building a significant list. I propose 
>> > we do a release in the next couple of weeks.
>> >
>> > While this email is a little early (I am about to go on vacation for a 
>> > bit), I volunteer myself as release manager.
>> >
>> > Unless there are objections, I plan on kicking off the release process May 
>> > 28th.
>> >
>> > Thanks!
>> >
>> > Ben
>> >
>> >
>> > --
>> > Adrien
>>
>>
>>




Re: Join module dependency

2024-05-19 Thread Michael Sokolov
I'm pretty sure it's only in core that we follow the no dependencies rule.

On Sat, May 18, 2024, 11:25 AM Bruno Roustant 
wrote:

> The facet module has a dependency on com.carrotsearch:hppc.
>
> Is it possible to add the same dependency to the join module ? What is the
> rule ?
>
> Thanks
>
> Bruno
>


Re: How much is ja.dict.UserDictionary used?

2024-05-18 Thread Michael Sokolov
We use it at Amazon. I can't really read it, so I'm not sure, but I think
it's used to encode terms that come up that aren't handled well by the
standard dictionary.

On Sat, May 18, 2024 at 8:39 AM Bruno Roustant  wrote:
>
> Hi,
>
> While looking at the various usages of Map with Integer keys, I found 
> ja.dict.UserDictionary with its lookup() method where there is a TODO: can we 
> avoid this treemap/toIndexArray?
>
> I could propose something, but I would like to know how much it is used, and 
> if it is worth improving it.
>
> Thanks
>
> Bruno




Re: beasting tests

2024-04-04 Thread Michael Sokolov
Thanks for the explanation. It makes sense that we start with a given
seed and then each iteration is different because it re-uses the same
Random instance (or whatever static state?) without re-initialization?
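As a rough mental model of the behavior Dawid describes below (this is NOT
randomizedtesting's actual algorithm, just an illustrative sketch): deriving
each iteration's seed from the master seed plus the test name gives runs that
differ per iteration yet remain fully reproducible from the one printed seed:

```python
# Illustrative sketch only -- not randomizedtesting's real derivation.
# Each iteration's seed is a pure function of (master seed, test name,
# iteration), so iterations get different randomness but the whole run
# can be replayed from the single master seed.
import hashlib

def derive_seed(master_seed: int, test_name: str, iteration: int) -> int:
    key = f"{master_seed:x}:{test_name}:{iteration}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

seeds = [derive_seed(0xDEADBEEF, "TestSearch.testSearch", i) for i in range(3)]
print([f"{s:X}" for s in seeds])  # three distinct, reproducible seeds
```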

On Wed, Apr 3, 2024 at 6:09 PM Dawid Weiss  wrote:
>
>
>> Now I just need to understand why the test failure is no longer reproducing 
>> lol.
>
>
> This is indeed the hard part!
>
>>
>> Also it's mildly confusing that when you specify tests.iters it prints a 
>> single test seed if it is actually going to use many different ones?
>
>
> It prints a single seed because it starts from that seed (the static 
> initialization, that is). But each test has its own starting point derived 
> from the main seed and the test name (if I recall right). So when you pass 
> tests.iters=100 and run a single test, any random call in that test 
> (excluding static hooks and static blocks) should be different on each 
> iteration. You can try it by adding:
>
> assumeTrue("", RandomizedTest.randomBoolean());
>
> For example (I added it to TestSearch.java):
>
> ./gradlew -p lucene/core test --tests TestSearch -Ptests.iters=100
> ...
> :lucene:core:test (SUCCESS): 100 test(s), 52 skipped
>
> if you modify this to assertTrue, you'll get to see the hierarchical seed 
> chain for each test that failed - note the first part is constant, the second 
> is derived for each test iteration:
>
> ./gradlew -p lucene/core test --tests TestSearch -Ptests.iters=100 
> -Ptests.seed=deadbeef
>
> and two example failures:
>
>   - org.apache.lucene.TestSearch.testSearch 
> {seed=[DEADBEEF:3EDB7869EFFD5034]} (:lucene:core)
> Test output: 
> /Users/dweiss/work/lucene/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.TestSearch.txt
> Reproduce with: gradlew :lucene:core:test --tests 
> "org.apache.lucene.TestSearch.testSearch {seed=[DEADBEEF:3EDB7869EFFD5034]}" 
> -Ptests.jvms=4 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC 
> -XX:ActiveProcessorCount=1" -Ptests.seed=deadbeef -Ptests.iters=100 
> -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=512
>
>   - org.apache.lucene.TestSearch.testSearch 
> {seed=[DEADBEEF:F44F3D10E8B98D27]} (:lucene:core)
> Test output: 
> /Users/dweiss/work/lucene/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.TestSearch.txt
> Reproduce with: gradlew :lucene:core:test --tests 
> "org.apache.lucene.TestSearch.testSearch {seed=[DEADBEEF:F44F3D10E8B98D27]}" 
> -Ptests.jvms=4 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC 
> -XX:ActiveProcessorCount=1" -Ptests.seed=deadbeef -Ptests.iters=100 
> -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=512
>
> If you'd like to repeat tests with *the same* starting seed for each test, 
> you need to pass the full chain, including the second part of the seed. For 
> example, this will fail 100 times (and not approximately 50% of the times):
>
> ./gradlew -p lucene/core test --tests TestSearch -Ptests.iters=100 
> -Ptests.seed=deadbeef:F44F3D10E8B98D27
>
> It may seem a bit complicated but it really isn't... I hope!  And for 99% of 
> tests, you'd probably rerun with the first part of the seed and it'd be 
> sufficient to locate the problem.
>
> The 'beast' task is a bit different because it physically re-launches the 
> test infrastructure so if you don't fix the initial seed, each started JVM 
> will have a different "starting" seed for static initializers and hooks. This 
> may matter for locale randomization, jvm issues or static initializers that 
> rely on randomness. But most isolated test methods should only rely on their 
> starting seed (not the "global starting seed").
>
> Dawid




Re: beasting tests

2024-04-02 Thread Michael Sokolov
Thank you! Now I just need to understand why the test failure is no longer
reproducing lol. Also it's mildly confusing that when you specify
tests.iters it prints a single test seed if it is actually going to use
many different ones? Anyway I will read more docs I am probably still
confusing beast and test?

On Tue, Apr 2, 2024, 6:27 PM Dawid Weiss  wrote:

>
> This section of the help file for testing explains the difference between
> 'beast', 'test' and various reiteration methods -
>
> https://github.com/apache/lucene/blob/main/help/tests.txt#L89-L123
>
> In *most* cases, tests.iters will be just as good as beasting (and much
> faster). The only difference is when you want class-level settings to be
> randomized differently (static initializers, for example).
>
> D.
>
> On Tue, Apr 2, 2024 at 10:54 PM Shubham Chaudhary 
> wrote:
>
>> I think you could try this:
>>
>> ./gradlew -p lucene/core beast -Ptests.dups=10 --tests
>> TestByteVectorSimilarityQuery
>>
>> I confirmed it uses a different seed (long value) for each run by
>> printing the seed here
>> <https://github.com/apache/lucene/blob/main/gradle/testing/beasting.gradle#L62-L66>
>> in beasting.gradle
>> <https://github.com/apache/lucene/blob/main/gradle/testing/beasting.gradle>
>> .
>>
>> - Shubham
>>
>> On Wed, Apr 3, 2024 at 1:49 AM Michael Sokolov 
>> wrote:
>>
>>> oh! I overlooked tests.dups -- but it doesn't seem to be doing what I
>>> expected. EG I tried
>>>
>>> ./gradlew -p lucene/core test --tests TestByteVectorSimilarityQuery
>>> -Ptests.dups=1000  -Ptests.multiplier=3
>>>
>>> and it completes very quickly reporting having run only 13 tests
>>>
>>> On Tue, Apr 2, 2024 at 4:14 PM Michael Sokolov 
>>> wrote:
>>> >
>>> > Is there  a convenient way to run a test multiple times with different
>>> > seeds? Do I need to write my own script? I feel like I used to be able
>>> > to do this in IntelliJ, but that option seems to have vanished, and I
>>> > don't see any such option in gradle testOpts either. I tried
>>> > -tests.iter but that seems to run the same test multiple times with
>>> > the same seed,
>>>
>>>
>>>


Re: beasting tests

2024-04-02 Thread Michael Sokolov
oh! I overlooked tests.dups -- but it doesn't seem to be doing what I
expected. EG I tried

./gradlew -p lucene/core test --tests TestByteVectorSimilarityQuery
-Ptests.dups=1000  -Ptests.multiplier=3

and it completes very quickly reporting having run only 13 tests

On Tue, Apr 2, 2024 at 4:14 PM Michael Sokolov  wrote:
>
> Is there  a convenient way to run a test multiple times with different
> seeds? Do I need to write my own script? I feel like I used to be able
> to do this in IntelliJ, but that option seems to have vanished, and I
> don't see any such option in gradle testOpts either. I tried
> -tests.iter but that seems to run the same test multiple times with
> the same seed,




beasting tests

2024-04-02 Thread Michael Sokolov
Is there  a convenient way to run a test multiple times with different
seeds? Do I need to write my own script? I feel like I used to be able
to do this in IntelliJ, but that option seems to have vanished, and I
don't see any such option in gradle testOpts either. I tried
-tests.iter but that seems to run the same test multiple times with
the same seed,




Re: [JENKINS] Lucene-9.x-Linux (64bit/hotspot/jdk-17.0.9) - Build # 15969 - Unstable!

2024-04-01 Thread Michael Sokolov
This TestBooleanMinShouldMatch.testRandomQueries failure did not
reproduce for me on branch_9x, with JDK 11 or JDK 17 or JDK 21. I ran
it a few times.

TestByteVectorSimilarityQuery.testSomeDeletes reproduces reliably -
I'll see if I can find out why it's unstable
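For context on the score mismatch in the report below (7.422538 vs 7.4225388):
floating-point addition is not associative, so summing clause scores in a
different order can legitimately change the last digit. A tiny standalone
illustration (the numbers are made up, not taken from the test):

```python
# Float addition is not associative: regrouping the same addends changes
# the low-order bits, which is enough to flip the last printed digit of
# a score. (Illustrative values; unrelated to the actual query.)
s1 = (0.1 + 0.2) + 0.3
s2 = 0.1 + (0.2 + 0.3)
print(s1, s2)  # 0.6000000000000001 0.6
assert s1 != s2
```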

On Mon, Apr 1, 2024 at 9:50 AM Policeman Jenkins Server
 wrote:
>
> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/15969/
> Java: 64bit/hotspot/jdk-17.0.9 -XX:+UseCompressedOops -XX:+UseG1GC
>
> 2 tests failed.
> FAILED:  org.apache.lucene.search.TestBooleanMinShouldMatch.testRandomQueries
>
> Error Message:
> java.lang.AssertionError: Doc 6 scores don't match
> TopDocs totalHits=6 hits top=6
> 0) doc=4 score=9.264864
> 1) doc=0 score=9.2647295
> 2) doc=6 score=7.4225388
> 3) doc=1 score=7.019951
> 4) doc=7 score=7.019951
> 5) doc=3 score=7.019923
> TopDocs totalHits=6 hits top=6
> 0) doc=4 score=9.264864
> 1) doc=0 score=9.2647295
> 2) doc=6 score=7.422538
> 3) doc=1 score=7.019951
> 4) doc=7 score=7.019951
> 5) doc=3 score=7.0199237
> for query:((data:6 data:foo data:Z data:3) -data:w* data:5 data:"Z A"~10 
> data:w* data:6 data:6 -data:Z)~2 expected:<7.422538> but was:<7.4225388>
>
> Stack Trace:
> java.lang.AssertionError: Doc 6 scores don't match
> TopDocs totalHits=6 hits top=6
> 0) doc=4 score=9.264864
> 1) doc=0 score=9.2647295
> 2) doc=6 score=7.4225388
> 3) doc=1 score=7.019951
> 4) doc=7 score=7.019951
> 5) doc=3 score=7.019923
> TopDocs totalHits=6 hits top=6
> 0) doc=4 score=9.264864
> 1) doc=0 score=9.2647295
> 2) doc=6 score=7.422538
> 3) doc=1 score=7.019951
> 4) doc=7 score=7.019951
> 5) doc=3 score=7.0199237
> for query:((data:6 data:foo data:Z data:3) -data:w* data:5 data:"Z A"~10 
> data:w* data:6 data:6 -data:Z)~2 expected:<7.422538> but was:<7.4225388>
> at 
> __randomizedtesting.SeedInfo.seed([D2206C8810CE6B9D:8C0BDC6428144603]:0)
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:577)
> at 
> org.apache.lucene.search.TestBooleanMinShouldMatch.assertSubsetOfSameScores(TestBooleanMinShouldMatch.java:384)
> at 
> org.apache.lucene.search.TestBooleanMinShouldMatch.testRandomQueries(TestBooleanMinShouldMatch.java:357)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at 
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>

Re: Lucene 10

2024-03-14 Thread Michael Sokolov
timing makes sense to me. +1 for having a deadline to reduce
procrastination, but Adrien I don't honestly believe anyone who is
paying attention thinks that is what you have been doing!

On Wed, Mar 13, 2024 at 10:40 AM Adrien Grand  wrote:
>
> Hello everyone!
>
> It's been ~2.5 years since we released Lucene 9.0 (December 2021) and I'd 
> like us to start working towards Lucene 10.0. I'm volunteering for being the 
> release manager and propose the following timeline:
>  - ~September 15th: main gets bumped to 11.x, branch_10x gets created
>  - ~September 22nd: Do a last 9.x minor release.
>  - ~October 1st: Release 10.0.
>
> This may sound like a long notice period. My motivation is that there are a 
> few changes I have on my mind that are likely worthy of a major release, and 
> I plan on taking advantage of a date being set to stop procrastinating and 
> finally start moving these enhancements forward. These are not blockers, only 
> my wish list for Lucene 10.0, if they are not ready in time we can have 
> discussions about letting them slip until the next major.
>  - Greater I/O concurrency. Can Lucene better utilize modern disks that are 
> plenty concurrent?
>  - Decouple search concurrency from index geometry. Can Lucene better utilize 
> modern CPUs that are plenty concurrent?
>  - "Sparse indexing" / "zone indexing" for sorted indexes. This is one of the 
> most efficient techniques that OLAP databases take advantage of to make 
> search fast. Let's bring it to Lucene.
>
> This list isn't meant to be an exhaustive list of release highlights for 
> Lucene 10, feel free to add your own. There are also a number of cleanups we 
> may want to consider. I wanted to share this list for visibility though in 
> case you have thoughts on these enhancements and/or would like to help.
>
> --
> Adrien




Re: Announcing githubsearch!

2024-02-27 Thread Michael Sokolov
No, I think you only get one version. Maybe we can try adding the green
background, or making it gray, and keeping the transparent background?

On Mon, Feb 26, 2024, 2:53 PM Michael McCandless 
wrote:

> Done!  Deployed!  Thank you Mike S.
>
> Though on my "dark mode" Chrome on a Macbook, it's super dark.  I can make
> it out but I gotta stare for a bit ... do they make light and dark mode
> .ico files in one!?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Feb 25, 2024 at 6:05 PM Michael Sokolov 
> wrote:
>
>> here is a favicon you might want to try: I cropped the "VL" from the
>> Apache Lucene logo (ok I guess it's an AL) -- if you save it as
>> favicon.ico in the root of your website (ie as url /favicon.ico) it
>> should show up in bookmarks, browser toolbars, etc as a handy memory
>> aid. Of course you might have other ideas for a picture - it's
>> actually pretty easy to make the favicon once you have a picture you
>> like; I followed the instructions here
>>
>> https://www.logikfabrik.se/blog/how-to-create-a-multisize-favicon-using-gimp/
>>
>> On Thu, Feb 22, 2024 at 10:48 AM Zhang Chao <80152...@qq.com.invalid>
>> wrote:
>> >
>> > Great job! Thanks Mike!
>> >
> On 22 Feb 2024, at 22:31, Alessandro Benedetti  wrote:
>> >
>> > That's cool Mike! Well done!
>> >
>> > On Wed, 21 Feb 2024, 22:02 Anshum Gupta, 
>> wrote:
>> >>
>> >> This is great! Like always, thank you Mike!
>> >>
>> >> On Mon, Feb 19, 2024 at 8:40 AM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>> >>>
>> >>> Hi Team,
>> >>>
>> >>> ~1.5 years ago (August 2022) we migrated our Lucene issue tracking
>> from Jira to GitHub. Thank you Tomoko for all the hard work doing such a
>> complex, multi-phased, high-fidelity migration!
>> >>>
>> >>> I finally finished also migrating jirasearch to GitHub:
>> githubsearch.mikemccandless.com. It was tricky because GitHub issues/PRs
>> are fundamentally more complex than Jira's data model, and the GitHub REST
>> API is also quite rich / heavily normalized. All of the source code for
>> githubsearch lives here. The UI remains its barebones self ;)
>> >>>
>> >>> Githubsearch is dog food for us: it showcases Lucene (currently
>> 9.8.0), and many of its fun features like infix autosuggest, block join
>> queries (each comment is a sub-document on the issue/PR), DrillSideways
>> faceting, near-real-time indexing/searching, synonyms (try “oome”),
>> expressions, non-relevance and blended-relevance sort, etc.  (This old blog
>> post goes into detail.)  Plus, it’s meta-fun to use Lucene to search its
>> own issues, to help us be more productive in improving Lucene!  Nicely
>> recursive.
>> >>>
>> >>> In addition to good ol’ searching by text, githubsearch has some
>> new/fun features:
>> >>>
>> >>> Drill down to just PRs or issues
>> >>> Filter by “review requested” for a given user: poor Adrien has 8
>> (open) now (sorry)! Or see your mentions (Robert is mentioned in 27 open
>> issues/PRs). Or PRs that you reviewed (Uwe has reviewed 9 still-open PRs).
>> Or issues and PRs where a user has had any involvement at all (Dawid has
>> interacted on 197 issues/PRs).
>> >>> Find still-open PRs that were created by a New Contributor (an author
>> who has no changes merged into our repository) or Contributor
>> (non-committer who has had some changes merged into our repository) or
>> Member
>> >>> Here are the uber-stale (last touched more than a month ago) open PRs
>> by outside contributors. We should ideally keep this at 0, but it’s 83 now!
>> >>> “Link to this search” to get a short-er, more permanent URL (it is
>> NOT a URL shortener, though!)
>> >>> Save named searches you frequently run (they just save to local
>> cookie state on that one browser)
>> >>>
>> >>> I’m sure there are exciting bugs, feedback/patches welcome!  If you
>> see problems, please reply to this email or file an issue here.
>> >>>
>> >>> Note that jirasearch remains running, to search Solr, Tika and Infra
>> issues.
>> >>>
>> >>> Happy Searching,
>> >>>
>> >>> Mike McCandless
>> >>>
>> >>> http://blog.mikemccandless.com
>> >>
>> >>
>> >>
>> >> --
>> >> Anshum Gupta
>> >
>> >
>>
>
>


Re: Welcome Zhang Chao as Lucene committer

2024-02-25 Thread Michael Sokolov
Welcome and congratulations, Chao!

On Sat, Feb 24, 2024 at 8:51 PM Christian Moen  wrote:
>
> Congrats, Chao!
>
> On Wed, Feb 21, 2024 at 2:28 AM Adrien Grand  wrote:
>>
>> I'm pleased to announce that Zhang Chao has accepted the PMC's
>> invitation to become a committer.
>>
>> Chao, the tradition is that new committers introduce themselves with a
>> brief bio.
>>
>> Congratulations and welcome!
>>
>> --
>> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Vote] Bump the Lucene main branch to Java 21

2024-02-25 Thread Michael Sokolov
+1

On Fri, Feb 23, 2024 at 7:08 PM Stefan Vodita  wrote:
>
> +1
>
> On Fri, 23 Feb 2024 at 11:24, Chris Hegarty 
>  wrote:
>>
>> Hi,
>>
>> Since the discussion on bumping the Lucene main branch to Java 21 is winding 
>> down, let's hold a vote on this important change.
>>
>> Once bumped, the next major release of Lucene (whenever that will be) will 
>> require a version of Java greater than or equal to Java 21.
>>
>> The vote will be open for at least 72 hours (and allow some additional time 
>> for the weekend) i.e. until 2024-02-28 12:00 UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>> -Chris.
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Announcing githubsearch!

2024-02-25 Thread Michael Sokolov
here is a favicon you might want to try: I cropped the "VL" from the
Apache Lucene logo (ok I guess it's an AL) -- if you save it as
favicon.ico in the root of your website (ie as url /favicon.ico) it
should show up in bookmarks, browser toolbars, etc as a handy memory
aid. Of course you might have other ideas for a picture - it's
actually pretty easy to make the favicon once you have a picture you
like; I followed the instructions here
https://www.logikfabrik.se/blog/how-to-create-a-multisize-favicon-using-gimp/

On Thu, Feb 22, 2024 at 10:48 AM Zhang Chao <80152...@qq.com.invalid> wrote:
>
> Great job! Thanks Mike!
>
> On 22 Feb 2024, at 22:31, Alessandro Benedetti wrote:
>
> That's cool Mike! Well done!
>
> On Wed, 21 Feb 2024, 22:02 Anshum Gupta,  wrote:
>>
>> This is great! Like always, thank you Mike!
>>
>> On Mon, Feb 19, 2024 at 8:40 AM Michael McCandless 
>>  wrote:
>>>
>>> Hi Team,
>>>
>>> ~1.5 years ago (August 2022) we migrated our Lucene issue tracking from 
>>> Jira to GitHub. Thank you Tomoko for all the hard work doing such a 
>>> complex, multi-phased, high-fidelity migration!
>>>
>>> I finally finished also migrating jirasearch to GitHub: 
>>> githubsearch.mikemccandless.com. It was tricky because GitHub issues/PRs 
>>> are fundamentally more complex than Jira's data model, and the GitHub REST 
>>> API is also quite rich / heavily normalized. All of the source code for 
>>> githubsearch lives here. The UI remains its barebones self ;)
>>>
>>> Githubsearch is dog food for us: it showcases Lucene (currently 9.8.0), and 
>>> many of its fun features like infix autosuggest, block join queries (each 
>>> comment is a sub-document on the issue/PR), DrillSideways faceting, 
>>> near-real-time indexing/searching, synonyms (try “oome”), expressions, 
>>> non-relevance and blended-relevance sort, etc.  (This old blog post goes 
>>> into detail.)  Plus, it’s meta-fun to use Lucene to search its own issues, 
>>> to help us be more productive in improving Lucene!  Nicely recursive.
>>>
>>> In addition to good ol’ searching by text, githubsearch has some new/fun 
>>> features:
>>>
>>> Drill down to just PRs or issues
>>> Filter by “review requested” for a given user: poor Adrien has 8 (open) now 
>>> (sorry)! Or see your mentions (Robert is mentioned in 27 open issues/PRs). 
>>> Or PRs that you reviewed (Uwe has reviewed 9 still-open PRs). Or issues and 
>>> PRs where a user has had any involvement at all (Dawid has interacted on 
>>> 197 issues/PRs).
>>> Find still-open PRs that were created by a New Contributor (an author who 
>>> has no changes merged into our repository) or Contributor (non-committer 
>>> who has had some changes merged into our repository) or Member
>>> Here are the uber-stale (last touched more than a month ago) open PRs by 
>>> outside contributors. We should ideally keep this at 0, but it’s 83 now!
>>> “Link to this search” to get a short-er, more permanent URL (it is NOT a 
>>> URL shortener, though!)
>>> Save named searches you frequently run (they just save to local cookie 
>>> state on that one browser)
>>>
>>> I’m sure there are exciting bugs, feedback/patches welcome!  If you see 
>>> problems, please reply to this email or file an issue here.
>>>
>>> Note that jirasearch remains running, to search Solr, Tika and Infra issues.
>>>
>>> Happy Searching,
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>
>>
>>
>> --
>> Anshum Gupta
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Announcing githubsearch!

2024-02-20 Thread Michael Sokolov
I love the gray all text UI. Don't change it! But I wonder if it's time for
a favicon?

On Tue, Feb 20, 2024, 4:40 AM Adrien Grand  wrote:

> Very cool, thank you Mike!
>
> On Mon, Feb 19, 2024 at 5:40 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> ~1.5 years ago (August 2022) we migrated our Lucene issue tracking from
>> Jira to GitHub. Thank you Tomoko for all the hard work doing such a
>> complex, multi-phased, high-fidelity migration!
>>
>> I finally finished also migrating jirasearch to GitHub:
>> githubsearch.mikemccandless.com. It was tricky because GitHub issues/PRs
>> are fundamentally more complex than Jira's data model, and the GitHub REST
>> API is also quite rich / heavily normalized. All of the source code for
>> githubsearch lives here
>> .
>> The UI remains its barebones self ;)
>>
>> Githubsearch
>> 
>> is dog food for us: it showcases Lucene (currently 9.8.0), and many of its
>> fun features like infix autosuggest, block join queries (each comment is a
>> sub-document on the issue/PR), DrillSideways faceting, near-real-time
>> indexing/searching, synonyms (try “oome
>> ”),
>> expressions, non-relevance and blended-relevance sort, etc.  (This old
>> blog post
>> 
>>  goes
>> into detail.)  Plus, it’s meta-fun to use Lucene to search its own issues,
>> to help us be more productive in improving Lucene!  Nicely recursive.
>>
>> In addition to good ol’ searching by text, githubsearch
>>  has some new/fun features:
>>
>>- Drill down to just PRs or issues
>>- Filter by “review requested” for a given user: poor Adrien has 8
>>(open) now
>>
>> 
>>(sorry)! Or see your mentions (Robert is mentioned in 27 open
>>issues/PRs
>>
>> ).
>>Or PRs that you reviewed (Uwe has reviewed 9 still-open PRs
>>
>> ).
>>Or issues and PRs where a user has had any involvement at all (Dawid
>>has interacted on 197 issues/PRs
>>
>> 
>>).
>>- Find still-open PRs that were created by a New Contributor
>>
>> 
>>(an author who has no changes merged into our repository) or
>>Contributor
>>
>> 
>>(non-committer who has had some changes merged into our repository) or
>>Member
>>
>> 
>>- Here are the uber-stale (last touched more than a month ago) open
>>PRs by outside contributors
>>
>> .
>>We should ideally keep this at 0, but it’s 83 now!
>>- “Link to this search” to get a short-er, more permanent URL (it is
>>NOT a URL shortener, though!)
>>- Save named searches you frequently run (they just save to local
>>cookie state on that one browser)
>>
>> I’m sure there are exciting bugs, feedback/patches welcome!  If you see
>> problems, please reply to this email or file an issue here
>> .
>>
>> Note that jirasearch 
>> remains running, to search Solr, Tika and Infra issues.
>>
>> Happy Searching,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>
>
> --
> Adrien
>


Re: Welcome Stefan Vodita as Lucene committer

2024-01-19 Thread Michael Sokolov
Hello Stefan, welcome!

On Fri, Jan 19, 2024 at 10:41 AM Martin Gainty  wrote:

> Congratulations Stefan!
>
> I look forward to reading your posts
>
> ~martin
> --
> *From:* Michael McCandless 
> *Sent:* Thursday, January 18, 2024 10:53 AM
> *To:* dev@lucene.apache.org 
> *Subject:* Welcome Stefan Vodita as Lucene committer
>
> Hi Team,
>
> I'm pleased to announce that Stefan Vodita has accepted the Lucene PMC's
> invitation to become a committer!
>
> Stefan, the tradition is that new committers introduce themselves with a
> brief bio.
>
> Congratulations, welcome, and thank you for all your improvements to
> Lucene and our community,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


Re: [VOTE] Release Lucene 9.9.1 RC1

2023-12-14 Thread Michael Sokolov
+1

SUCCESS! [0:50:50.776559]

Note: we did get some test fails on the mailing list this morning, but I
believe they are not real bugs and will be resolved by tightening up our
test assumptions

On Thu, Dec 14, 2023 at 7:08 AM Guo Feng  wrote:

> +1
>
> SUCCESS! [3:38:43.833896]
>
> On 2023/12/14 10:44:18 Michael McCandless wrote:
> > +1
> >
> > SUCCESS! [0:14:52.296147]
> >
> >
> > I also cracked a bit of rust off our Monster tests and all but one
> passed:
> > https://github.com/apache/lucene/pull/12942
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Wed, Dec 13, 2023 at 4:24 PM Benjamin Trent 
> > wrote:
> >
> > > SUCCESS! [1:06:02.232333]
> > >
> > > + 1!
> > >
> > > On Wed, Dec 13, 2023 at 3:26 PM Greg Miller 
> wrote:
> > >
> > >> SUCCESS! [2:27:01.875939]
> > >>
> > >> +1
> > >>
> > >> Thanks!
> > >> -Greg
> > >>
> > >> On Wed, Dec 13, 2023 at 3:58 AM Chris Hegarty
> > >>  wrote:
> > >>
> > >>> And (short) release note:
> > >>>
> > >>>
> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_1
> > >>>
> > >>> -Chris.
> > >>>
> > >>> > On 13 Dec 2023, at 11:55, Chris Hegarty <
> > >>> christopher.hega...@elastic.co> wrote:
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > Please vote for release candidate 1 for Lucene 9.9.1
> > >>> >
> > >>> > The artifacts can be downloaded from:
> > >>> >
> > >>>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
> > >>> >
> > >>> > You can run the smoke tester directly with this command:
> > >>> >
> > >>> > python3 -u dev-tools/scripts/smokeTestRelease.py \
> > >>> >
> > >>>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
> > >>> >
> > >>> > The vote will be open for at least 72 hours i.e. until 2023-12-16
> > >>> 12:00 UTC.
> > >>> >
> > >>> > [ ] +1  approve
> > >>> > [ ] +0  no opinion
> > >>> > [ ] -1  disapprove (and reason why)
> > >>> >
> > >>> > Here is my +1
> > >>> >
> > >>> > -Chris.
> > >>> >
> > >>>
> > >>>
> > >>> -
> > >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> > >>>
> > >>>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Release Lucene 9.9.0 RC2

2023-11-30 Thread Michael Sokolov
SUCCESS! [0:46:20.693134]

+1

On Thu, Nov 30, 2023 at 5:50 PM Tomás Fernández Löbbe 
wrote:

> SUCCESS! [0:52:49.337126]
>
> +1
>
> On Thu, Nov 30, 2023 at 12:05 PM Benjamin Trent 
> wrote:
>
>> SUCCESS! [0:44:05.132154]
>>
>> +1
>>
>> On Thu, Nov 30, 2023 at 1:09 PM Chris Hegarty
>>  wrote:
>>
>>> Please vote for release candidate 2 for Lucene 9.9.0
>>>
>>>
>>> The artifacts can be downloaded from:
>>>
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>>>
>>>
>>> You can run the smoke tester directly with this command:
>>>
>>>
>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>>>
>>>
>>> The vote will be open for at least 72 hours, and given the weekend in
>>> between, let’s keep it open until 2023-12-04 12:00 UTC.
>>>
>>> [ ] +1  approve
>>>
>>> [ ] +0  no opinion
>>>
>>> [ ] -1  disapprove (and reason why)
>>>
>>>
>>> Here is my +1
>>>
>>>
>>> -Chris.
>>>
>>>


Re: [VOTE] Release Lucene 9.9.0 RC1

2023-11-30 Thread Michael Sokolov
for the sake of posterity, I did get a successful smoketest:

SUCCESS! [1:00:06.512261]

but +0 to release I guess since it's moot...

On Thu, Nov 30, 2023 at 10:38 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, Nov 30, 2023 at 9:56 AM Chris Hegarty
>  wrote:
>
> P.S. I’m less sure about this, but the RC 2 starts a 72hr voting time
>> again? (Just so I know what TTL to put on that)
>>
>
> Yeah a new 72 hour clock starts with each new RC :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


Re: GDPR compliance

2023-11-29 Thread Michael Sokolov
Another way is to ensure that all documents get updated on a regular
cadence whether there are changes in the underlying data or not. Or,
regenerating the index from scratch all the time. Of course these
approaches might be more costly for an index that has intrinsically low
update rates, but they do keep the index fresh without the need for any
special tracking.

On Tue, Nov 28, 2023, 8:45 PM Patrick Zhai  wrote:

> It's not that insane; it's about several weeks. However, the big segment can
> stay there for quite a long time if there aren't enough updates for a merge policy
> to pick it up.
>
> On Tue, Nov 28, 2023, 17:14 Dongyu Xu  wrote:
>
>> What is the expected grace time for the data-deletion request to take
>> place?
>>
>> I'm no expert on the policy, but I think something like "I need my
>> data to be gone in the next 2 seconds" is unreasonable.
>>
>> Tony X
>>
>> --
>> *From:* Robert Muir 
>> *Sent:* Tuesday, November 28, 2023 11:52 AM
>> *To:* dev@lucene.apache.org 
>> *Subject:* Re: GDPR compliance
>>
>> I don't think there's any problem with GDPR, and I don't think users
>> should be running unnecessary "optimize". GDRP just says data should
>> be erased without "undue" delay. waiting for a merge to nuke the
>> deleted docs isn't "undue", there is a good reason for it.
>>
>> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>> >
>> > Hi Folks,
>> > In LinkedIn we need to comply with GDPR for a large part of our data,
>> and an important part of it is that we need to be sure we have completely
>> deleted the data the user requested to delete within a certain period of
>> time.
>> > The way we have come up with so far is to:
>> > 1. Record the segment creation time somewhere (not decided yet, maybe
>> index commit userinfo, maybe some other place outside of lucene)
>> > 2. Create a new merge policy which delegate most operations to a normal
>> MP, like TieredMergePolicy, and then add extra single-segment (merge from 1
>> segment to 1 segment, basically only do deletion) merges if it finds any
>> segment is about to violate the GDPR time frame.
>> >
>> > So here's my question:
>> > 1. Is there a better/existing way to do this?
>> > 2. I would like to directly contribute to Lucene about such a merge
>> policy since I think GDPR is more or less a common thing. Would like to
>> know whether people feel like it's necessary or not?
>> > 3. It's also nice if we can store the segment creation time to the
>> index directly by IndexWriter (maybe write to SegmentInfo?), I can try to
>> do that but would like to ask whether there's any objections?
>> >
>> > Best
>> > Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
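
Patrick's proposal above (record per-segment creation times, then force singleton
"merge 1 segment into 1 segment" merges on segments about to exceed the retention
window) can be sketched independently of Lucene's MergePolicy API. The class and
method names and the toy timestamps below are hypothetical, not Lucene code; a real
implementation would plug this selection logic into a delegating merge policy that
wraps something like TieredMergePolicy:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: given per-segment creation timestamps, pick the
 * segments whose deleted docs must be purged now to stay inside a
 * GDPR-style retention window. Each selected segment would then get a
 * singleton merge that only drops deleted docs.
 */
public class GdprMergeSelector {

  /** Returns indices of segments created at or before (now - maxRetentionMillis). */
  static List<Integer> segmentsOverdue(long[] creationMillis, long nowMillis, long maxRetentionMillis) {
    List<Integer> overdue = new ArrayList<>();
    for (int i = 0; i < creationMillis.length; i++) {
      if (nowMillis - creationMillis[i] >= maxRetentionMillis) {
        overdue.add(i);
      }
    }
    return overdue;
  }

  public static void main(String[] args) {
    long now = 100_000L;
    long window = 30_000L; // toy numbers: purge deletes within 30s of segment creation
    long[] created = {10_000L, 60_000L, 95_000L};
    System.out.println(segmentsOverdue(created, now, window)); // prints [0, 1]
  }
}
```

The interesting design question the thread raises is where the creation time
lives (commit user data vs. SegmentInfo); the selection step itself is trivial
once that is decided.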


Re: Lucene 9.9.0 Release

2023-11-22 Thread Michael Sokolov
+1 thanks for volunteering!

Hijacking the thread a bit, sorry, I started looking into whether this is a
good time to start looking ahead to 10? I know we had some rumblings about
releasing that so we can start requiring newer JDKs. But looking at CHANGES
it feels like we already back-ported most of the good stuff and lack a
truly compelling reason to move ahead (other than JDK requirement and fact
that it has been 2 years since 9). So maybe we wait a bit longer?

On Tue, Nov 21, 2023 at 11:04 AM Patrick Zhai  wrote:

> +1, thank you Chris!
>
> On Tue, Nov 21, 2023, 06:49 Benjamin Trent  wrote:
>
>> +1 9.9 will be a stellar release!
>>
>> Thank you Chris!
>>
>> On Tue, Nov 21, 2023 at 7:31 AM Adrien Grand  wrote:
>>
>>> +1 9.9 has plenty of great changes indeed! Thanks for volunteering as a
>>> RM, Chris.
>>>
>>> It would be good to try and fix the PKLookup regression that was
>>> introduced since 9.8:
>>> http://people.apache.org/~mikemccand/lucenebench/PKLookup.html. Is it
>>> just about getting #12699 
>>> merged?
>>>
>>> Separately, I have a PR that does a small change to the file format of
>>> postings and skip lists. It's certainly not a blocker for 9.9, but it would
>>> be convenient to get it into 9.9 since we already changed file formats for
>>> the switch from PFOR to FOR. Does someone have time to take a look?
>>> #12810 
>>>
>>> On Tue, Nov 21, 2023 at 11:16 AM Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>>
 +1

 Thank you for volunteering as RC Chris!

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Nov 21, 2023 at 4:52 AM Chris Hegarty
  wrote:

> Hi,
>
> It's been a while since the 9.8.0 release and we’ve accumulated quite
> a few changes. I’d like to propose that we release 9.9.0.
>
> If there's no objections, I volunteer to be the release manager and
> will cut the feature branch a week from now, 12:00 28th Nov UTC.
>
> -Chris.
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
>>>
>>> --
>>> Adrien
>>>
>>


Re: Test framework can't find SPI implementations from module sandbox

2023-11-21 Thread Michael Sokolov
did you add to the sandbox META-INF file? It looks like maybe sandbox is
not included in the scope of the test, but you didn't say which test it
was. Is the test also in the sandbox module?

On Mon, Nov 20, 2023 at 6:56 PM Dongyu Xu  wrote:

> Hi devs,
>
> I tried to plug in my experimental PostingsFormat implementation to all
> the existing unit tests. I've registered it under META-INF/services as well
> as in module-info.java. However, the test still fails like the
> following.
>
> java.lang.IllegalArgumentException: An SPI class of type
> org.apache.lucene.codecs.PostingsFormat with name 'Lucene99RandomAccess'
> does not exist.  You need to add the corresponding JAR file supporting this
> SPI to your classpath.  The current classpath supports the following names:
> [Lucene99, MockRandom, RAMOnly, LuceneFixedGap, LuceneVarGapFixedInterval,
> LuceneVarGapDocFreqInterval, TestBloomFilteredLucenePostings, Asserting,
> UniformSplitRot13, STUniformSplitRot13, BlockTreeOrds, BloomFilter, Direct,
> FST50, UniformSplit, SharedTermsUniformSplit, Lucene50, Lucene84, Lucene90]
>
> It fails to recognize one of the existing PostingsFormats, IDVersion
> ,
> from sandbox, too.
>
> Is this a known issue?
>
> Tony X
>
>
>
>
>
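
For reference, registering a new PostingsFormat in the sandbox module requires
both the classic services file and a module-info `provides` clause. The class
name and file paths below are hypothetical, inferred from the error message in
the thread; the actual sandbox layout may differ slightly:

```
# lucene/sandbox/src/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat
org.apache.lucene.sandbox.codecs.Lucene99RandomAccessPostingsFormat
```

```java
// in lucene/sandbox/src/java/module-info.java
provides org.apache.lucene.codecs.PostingsFormat with
    org.apache.lucene.sandbox.codecs.Lucene99RandomAccessPostingsFormat;
```

As Mike notes below, even with correct registration the SPI file is only
visible to tests whose classpath (or module graph) includes the sandbox jar,
so a test living in another module will not see the format.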


Re: Welcome Patrick Zhai to the Lucene PMC

2023-11-12 Thread Michael Sokolov
Welcome, Patrick!

On Sun, Nov 12, 2023, 2:12 AM Ignacio Vera  wrote:

> Welcome Patrick!
>
> On Sat, Nov 11, 2023 at 3:29 PM Uwe Schindler  wrote:
>
>> Welcome Patrick!
>>
>> Uwe
>>
>>
>> Am 10. November 2023 21:04:32 MEZ schrieb Michael McCandless <
>> luc...@mikemccandless.com>:
>>
>>> I'm happy to announce that Patrick Zhai has accepted an invitation to
>>> join the Lucene Project Management Committee (PMC)!
>>>
>>> Congratulations Patrick, thank you for all your hard work improving
>>> Lucene's community and source code, and welcome aboard!
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>>
>


Re: Boolean field type

2023-11-09 Thread Michael Sokolov
Can you require the user to specify missing: true or missing: false
semantics? With that you can decide what to do with the missing values.

On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev  wrote:

> Hello Michael.
> This optimization "NOT the less common value" assumes that boolean field
> is required, but how to enforce this mandatory field constraint in Lucene?
> I'm not aware of something like Solr schema or mapping.
> If saying foo:true is common, it means that the posting list goes like
> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
> compressed by codecs like
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
> ?
>
> On Thu, Nov 9, 2023 at 3:31 AM Michael Froh  wrote:
>
>> Hey,
>>
>> I've been musing about ideas for a "clever" Boolean field type on Lucene
>> for a while, and I think I might have an idea that could work. That said,
>> this popped into my head this afternoon and has not been fully-baked. It
>> may not be very clever at all.
>>
>> My experience is that Boolean fields tend to be overwhelmingly true or
>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>> field, where the only term represents the more sparse value. (For example,
>> I did a thing years ago with explicit tombstones, where versioned deletes
>> would have the field "deleted" with a value of "true", and live
>> documents didn't have the deleted field at all. Every query would add a
>> filter on "NOT deleted:true".)
>>
>> That's great when you know up-front what the sparse value is going to be.
>> Working on OpenSearch, I just created an issue suggesting that we take a
>> hint from users for which value they think is going to be more common so we
>> only index the less common one:
>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>
>> At the Lucene level, though, we could index a Boolean field type as the
>> less common term when we flush (by counting the values and figuring out
>> which is less common). Then, per segment, we can rewrite any query for the
>> more common value as NOT the less common value.
>>
>> You can compute upper/lower bounds on the value frequencies cheaply
>> during a merge, so I think you could usually write the doc IDs for the less
>> common value directly (without needing to count them first), even when
>> input segments disagree on which is the more common value.
>>
>> If your Boolean field is not overwhelmingly lopsided, you might even want
>> to split segments to be 100% true or 100% false, such that queries against
>> the Boolean field become match-all or match-none. On a retail website,
>> maybe you have some toggle for "only show me results with property X" -- if
>> all your property X products are in one segment or a handful of segments,
>> you can drop the property X clause from the matching segments and skip the
>> other segments.
>>
>> I guess one icky part of this compared to the usual Lucene field model is
>> that I'm assuming a Boolean field is never missing (or I guess missing
>> implies "false" by default?). Would that be a deal-breaker?
>>
>> Thanks,
>> Froh
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
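
Froh's scheme above can be modeled in a few lines: count the two values at
flush time, keep postings only for the rarer one, and rewrite a query for the
common value as the complement. This toy class is not Lucene code, and it
assumes the field is present on every document, which is exactly the "icky
part" the email flags:

```java
import java.util.BitSet;

/**
 * Toy model of a sparse boolean field: postings are stored only for the
 * LESS common value; a query for the more common value is rewritten as
 * "NOT lessCommonValue" over the segment's doc ID space.
 */
public class SparseBooleanField {
  final boolean storedValue; // the rarer value, the one with postings
  final BitSet postings;     // docs whose field == storedValue
  final int maxDoc;

  SparseBooleanField(boolean[] values) {
    maxDoc = values.length;
    int trues = 0;
    for (boolean v : values) if (v) trues++;
    storedValue = trues <= maxDoc - trues; // store the rarer value
    postings = new BitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (values[doc] == storedValue) postings.set(doc);
    }
  }

  /** Matching docs for field == queryValue, with the per-segment rewrite. */
  BitSet search(boolean queryValue) {
    BitSet result = (BitSet) postings.clone();
    if (queryValue != storedValue) {
      result.flip(0, maxDoc); // rewrite as NOT storedValue
    }
    return result;
  }

  public static void main(String[] args) {
    // e.g. a tombstone-style field where "deleted" is overwhelmingly false
    boolean[] deleted = {false, false, true, false, false};
    SparseBooleanField f = new SparseBooleanField(deleted);
    System.out.println(f.storedValue);   // true is rarer -> prints true
    System.out.println(f.search(false)); // prints {0, 1, 3, 4}
  }
}
```

The merge-time refinement in the email (writing doc IDs for the rarer value
directly using frequency bounds from the input segments) only changes how
`postings` is built, not this query-rewrite logic.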


Re: Bump minimum Java version requirement to 21

2023-11-06 Thread Michael Sokolov
It's not just you - we have an internal JDK11 fork at BIG COMPANY for some
folks that can't get off the stick. To be fair it's challenging because
they have to shift all their dependencies. I think Spark was the one
mentioned by one group, but there is a JDK17-based release of Spark, so
clearly not a blocker, OTOH if you have to upgrade JDK, Lucene, Spark, who
knows what else, all at the same time, it becomes challenging. Still I
agree it's no reason to lag behind; we have to keep pushing forward
together. +1 to release 10 - easy for me to say, we need a RM to volunteer
and it will happen

On Mon, Nov 6, 2023 at 8:19 AM Gus Heck  wrote:

> For perspective, I'm still seeing java 11 as the norm for clients... 17 is
> uncommon. Anything requiring 21 is likely to be difficult to sell. I am
> however a small shop, and "migrating off of solr 6" and "trying out solr
> cloud" is still a thing for some clients.
>
> Just a datapoint/anecdote, possibly skewed.
>
> On Mon, Nov 6, 2023 at 7:41 AM Chris Hegarty
>  wrote:
>
>> Hi Robert,
>>
>> > On 6 Nov 2023, at 12:24, Robert Muir  wrote:
>> >
>> >> …
>> >> The only concern I have with no.2 is that it could be considered an
>> “aggressive” adoption of Java 21 - adoption sooner than the ecosystem can
>> handle, e.g. are environments in which Lucene is deployed, and their
>> transitive dependencies, ready to run on Java 21? By the time we’re ready
>> to release 10.0.0, say March 2024, then I expect no issue with this.
>> >
>> > The problem is worse, historically jdk version X isn't adopted as a
>> > minimum until it is already EOL. And the lucene major versions take an
>> > eternity to get out there, code just sits in "main" branch for years
>> > unreleased to nobody. It is really discouraging as a contributor to
>> > contribute code that literally sits on the shelf for years, for no
>> > good reason at all.
>>
>> Agreed. I also feel discouraged by this approach too, and also wanna
>> avoid the “backport the world”, since it’s counterproductive.
>>
>> > So why delay?
>> >
>> > The argument of "moving sooner than ecosystem can handle" is also
>> > bogus in the same way. You mean versus the code sitting on the shelf
>> > and being released to nobody?
>>
>> Yes - sitting on the shelf is no good to anyone.
>>
>> Ok, what I’m hearing are good arguments for releasing 10.0.0 *now*, with
>> a Java 17 minimum - this is what is in _main_ today.
>>
>> If we do that, then we can follow up with _main_ later (after the 10.x
>> branch is created). That is, 1) bump _main_ to Java 21, and 2) decide
>> when a Lucene 11 is to be released (I would to see Lucene 11 ~1yr after
>> Lucene 10).
>>
>> This is Uwe’s proposal, earlier in this thread.
>>
>> -Chris.
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>


Re: Squash vs merge of PRs

2023-11-04 Thread Michael Sokolov
Personally for me it's about how meaningful the commit messages (and
contents) are vs whether we use merge commits or not. If it's a long series
of "fixed bug" "reformatted" "did stuff" "more stuff" "it finally works"
and so on ... that doesn't smell good to me, but you know we all have done
that from time to time too, either by accident or because we're in a rush
and didn't practice perfect hygiene. I guess the commit branching/linear
purity debate is mostly a matter of taste; we can try to have some
standards, but we should be forgiving and not try to dictate with
automation. Honestly I didn't look at whatever Robert's commits were that
started this discussion since it seems to have metastasized into a general
commit history health discussion so just throwing another opinion into the
mix here, maybe getting off topic sorry.

On Sat, Nov 4, 2023 at 11:18 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> I didn't realize the community had decided squashing (rewriting history)
> was our standard.
>
> > Comparing histories between branches with git-bisect to find bugs is
> just one example.
>
> But if the bug was introduced in one of the N local commits the developer
> had done, wouldn't that be helpful?  You could see that one commit instead
> of all N squashed, and get better context on how/why the bug was introduced?
>
> I would prefer history-preserving commits.  It can reveal/preserve
> important information -- like we tried one approach, and discovered some
> issue, tweaked it to a better approach.  This can be useful in the future
> if someone is working on that part of the code and is trying to understand
> why it was done a certain way.  It preserves the natural and healthy
> iterations we all experience when working closely together.  Why discard
> such possibly helpful history?
>
> Also, one can always wear hazy glasses in the future to "summarize" the
> full history down to a view that's more palatable to them personally, if
> you don't like seeing merge commit branching.  But we cannot do the
> reverse.  Discarding the actual development history is a one-way door.
>
> http://blog.mikemccandless.com
>
>
> On Sat, Nov 4, 2023 at 11:03 AM Gus Heck  wrote:
>
>> Also, since (as noted) this is a previously decided issue, not sure why
>> this is a list email instead of a simple direct query to Robert seeking to
>> understand the specific case? No need to make a public discussion unless
>> it's a long term pattern, actually breaking something, or we want to change
>> something?
>>
>> On Sat, Nov 4, 2023 at 9:37 AM Benjamin Trent 
>> wrote:
>>
>>> TL;DR, forcing non-committers to squash things is a good idea. Enforcing
>>> through some measure for committers is a bad idea.
>>>
>>> Since this thread is now in Robert's spam, I am guessing it won't have
>>> any impact :). I do not think Robert is actively trying hurt the project in
>>> any way. It seems to me that he doesn't think a clean git history is worth
>>> the effort.
>>>
>>> Having a clean git history makes things easier for everyone. Comparing
>>> histories between branches with git-bisect to find bugs is just one
>>> example. Another is simply reading commits to see when
>>> features/bug fixes/etc. were added.
>>>
>>> I do NOT think we should add procedures or branch protections to
>>> actively enforce this.
>>>
>>> Small personal sacrifices (like dealing with commit conflicts) are
>>> necessary for a community. Being part of a community is about buying into
>>> what the community is about and working towards a common goal. Many times
>>> we do things we don't agree with, or make things slightly more difficult
>>> for us, for the community as a whole. This thing being OSS shows that we
>>> all buy into its importance and are willing to put work into the project.
>>>
>>> Having a cultural default of "make things nice for others" is good.
>>> Enforcing this ideology on others is antithesis to its definition.
>>>
>>>
>>>
>>> On Sat, Nov 4, 2023 at 9:02 AM Robert Muir  wrote:
>>>
 This isn't a community issue, it is me avoiding useless unnecessary
 merge conflicts. Word "community" is invoked here to try to make it
 out, like you can hold a vote about what git commands i should type on
 my computer? You know that isn't gonna work. have some humility.

 thread moved to spam.

 On Sat, Nov 4, 2023 at 8:36 AM Mike Drob  wrote:
 >
 > We all agree on using Java though, and using a specific version, and
 even the style output from gradle tidy. Is that nanny state or community
 consensus?
 >
 > On Sat, Nov 4, 2023 at 7:29 AM Robert Muir  wrote:
 >>
 >> example of a nanny state IMO, trying to dictate what git commands to
 >> use, or what editor to use. Maybe this works for you in your
 corporate
 >> hellholes, but I think some folks have a bit of a power issue, are
 >> accustomed to dictacting this stuff to their employees and so on, but
 >> this is open-source. I d

Re: Welcome Guo Feng to the Lucene PMC

2023-10-25 Thread Michael Sokolov
Welcome, gf2121!

On Wed, Oct 25, 2023, 3:03 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Congratulations and welcome, Feng!
>
> On Tue, 24 Oct 2023 at 22:35, Adrien Grand  wrote:
>
>> I'm pleased to announce that Guo Feng has accepted an invitation to join
>> the Lucene PMC!
>>
>> Congratulations Feng, and welcome aboard!
>>
>> --
>> Adrien
>>
>


Re: Welcome Luca Cavanna to the Lucene PMC

2023-10-22 Thread Michael Sokolov
Congratulations and welcome, Luca!

On Sun, Oct 22, 2023 at 1:42 PM Julie Tibshirani  wrote:
>
> Congratulations Luca!!
>
> On Fri, Oct 20, 2023 at 1:45 AM Bruno Roustant  
> wrote:
>>
>> Welcome, congratulations!
>>
>> Le ven. 20 oct. 2023 à 10:02, Dawid Weiss  a écrit :
>>>
>>>
>>> Congratulations, Luca!
>>>
>>> On Fri, Oct 20, 2023 at 7:51 AM Adrien Grand  wrote:

 I'm pleased to announce that Luca Cavanna has accepted an invitation to 
 join the Lucene PMC!

 Congratulations Luca, and welcome aboard!

 --
 Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-22 Thread Michael Sokolov
Thanks, Uwe. The underlying exception in my situation was caused by
curFloatBufferViews being allocated and used before it was fully
populated. So I think it was an NPE, yes. I'll check your PR to see if
it would have hidden this?

On Sun, Oct 22, 2023 at 4:57 AM Uwe Schindler  wrote:
>
> Please read my other comments and the PR. The PR filters the cause of
> the NPE; if the NPE is caused by internals of MMapDirectory it won't be
> exposed to anybody.
>
> If you use it in multiple threads and accidentally close one of the
> IndexInputs, AlreadyClosedException is the only correct exception. Any
> cause like an internal signalling NPE is not useful and helps nothing.
> The PR explains this, so we won't add the NPE as cause. If the NPE is
> coming from outside MMapDirectory, it will be rethrown so you see it.
>
> I will merge the PR in a moment.
>
> Uwe
>
> Am 22.10.2023 um 01:37 schrieb Michael Sokolov:
> > Thanks for digging into this. I do think it will be helpful for
> > developers that blithely access the IndexInput from multiple threads
> > :)
> >
> > On Sat, Oct 21, 2023 at 3:53 PM Chris Hostetter
> >  wrote:
> >>
> >> Uwe: In your PR, you should add these details to the javadocs of
> >> ByteBufferIndexInput.alreadyClosed(), so future code spelunkers understand
> >> the choice being made here is intentional :)
> >>
> >> : please don't add the NPE here as cause (except for debugging). The NPE is only
> >> : caught to NOT add extra checks in the highly performance-sensitive code.
> >> : Actually the NPE is caught to detect the case where the bytebuffer was
> >> : already unset, in order to trigger the already-closed exception. The code uses
> >> : setting the buffers to NULL to signal close, but it does NOT add a NULL check
> >> : everywhere. This allows Hotspot to compile this code without any bounds checks
> >> : and signal the AlreadyClosedException only when an NPE happens. Adding the NPE
> >> : as cause would
> >>
> >>
> >>
> >> -Hoss
> >> http://www.lucidworks.com/
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Michael Sokolov
Thanks for digging into this. I do think it will be helpful for
developers that blithely access the IndexInput from multiple threads
:)

On Sat, Oct 21, 2023 at 3:53 PM Chris Hostetter
 wrote:
>
>
> Uwe: In your PR, you should add these details to the javadocs of
> ByteBufferIndexInput.alreadyClosed(), so future code spelunkers understand
> the choice being made here is intentional :)
>
> : please don't add the NPE here as cause (except for debugging). The NPE is only
> : caught to NOT add extra checks in the highly performance-sensitive code.
> : Actually the NPE is caught to detect the case where the bytebuffer was
> : already unset, in order to trigger the already-closed exception. The code uses
> : setting the buffers to NULL to signal close, but it does NOT add a NULL check
> : everywhere. This allows Hotspot to compile this code without any bounds checks
> : and signal the AlreadyClosedException only when an NPE happens. Adding the NPE
> : as cause would
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-17 Thread Michael Sokolov
I was messing around with something that was resulting in
AlreadyClosedException being thrown and I noticed that we weren't
tracking the exception that caused it. I found this in
ByteBufferIndexInput:

   // the unused parameter is just to silence javac about unused variables
   AlreadyClosedException alreadyClosed(RuntimeException unused) {
-return new AlreadyClosedException("Already closed: " + this);
+return new AlreadyClosedException("Already closed: " + this, unused);
   }

and added the cause there, which helped me find and fix my wicked
ways. Is there a reason we decided not to wrap the "unused"
RuntimeException there?
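[Editor's note: for readers following this thread, the idiom Uwe describes elsewhere in it (null out the buffers on close, catch the resulting NPE instead of null-checking on every read) can be sketched roughly as below. Class and field names here are illustrative, not Lucene's actual implementation.]

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for Lucene's AlreadyClosedException.
class AlreadyClosedException extends RuntimeException {
  AlreadyClosedException(String msg) { super(msg); }
}

// Minimal sketch of the "null buffers, catch NPE" close-detection idiom.
class MiniInput {
  private ByteBuffer buf; // set to null on close; no null checks on the hot path

  MiniInput(byte[] data) { this.buf = ByteBuffer.wrap(data); }

  byte readByte() {
    try {
      return buf.get(); // hot path: no explicit "am I closed?" check
    } catch (NullPointerException unused) {
      // An NPE here signals that close() already nulled the buffer.
      throw new AlreadyClosedException("Already closed: " + this);
    }
  }

  void close() { buf = null; }

  public static void main(String[] args) {
    MiniInput in = new MiniInput(new byte[] {42});
    System.out.println(in.readByte()); // prints 42
    in.close();
    try {
      in.readByte();
    } catch (AlreadyClosedException e) {
      System.out.println("closed");
    }
  }
}
```

The hot path pays nothing for the closed check; the cost is deferred to the (exceptional) use-after-close case, which is the trade-off Hotspot-friendly code in this area aims for.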

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.8.0 RC1

2023-09-25 Thread Michael Sokolov
thanks for the detailed info, folks. I'll see if I can understand why
I may have been running an old instance of that agent. The host I was
running on was created recently.

On Mon, Sep 25, 2023 at 6:09 AM Chris Hegarty
 wrote:
>
> Hi,
>
>   2>at Log4jHotPatch.asmVersion(Log4jHotPatch.java:71)
>
>
> This is coming from Amazon’s Log4Shell hot patch [1], which I believe was 
> deployed by default on many (all?) JVM’s running on Amazon instances. Well… 
> that was almost 2yrs ago, not sure why it’s still showing up in some places 
> now - it should not be needed.
>
> In fact, I do remember seeing and reporting this issue back in late 2021. The 
> hot patcher initially used the JDK’s internal ASM library, which is the root 
> cause of the security exception. The hot patcher was subsequently fixed to 
> not do this - it bundles/shades ASM itself. This fix was made in late 2021.
>
> I have no idea why the system in question is running an old version of the 
> hot patcher. @Michael, you should probably take a look at that system, maybe 
> it needs some updates or something?
>
> -Chris.
>
> [1] https://github.com/corretto/hotpatch-for-apache-log4j2/tree/main
>
> On 25 Sep 2023, at 09:22, Uwe Schindler  wrote:
>
> Hi,
>
> as Lucene does not use Log4j, it is unclear why it wants to patch anything. 
> The problem is indeed caused by SecurityManager, which is enabled for running 
> Lucene tests. Actually it detects that something tries to access some 
> internals of ASM, not sure what it exactly does. The "injected" Agent code 
> must possibly use AccessController#doPrivileged and the security context must 
> allow patching of classes.
>
> In short: SecurityManager has done everything it should do: It detected an 
> illegal access. Mission achieved! You have to report this issue and patch 
> your tool so it works correctly with SecurityManager.
>
> Uwe
>
> Am 24.09.2023 um 23:52 schrieb Michael Sokolov:
>
> I ran the smoketester and had a failure. It seems related to some
> log4j hot patch script we are required to run at work which is somehow
> conflicting with the security manager? I'm killing that and trying
> again, but I wonder if this is going to cause problems at runtime as
> well? How do we enable the security manager - is it only when running
> tests?
>
> org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
> classMethod FAILED
> java.lang.AssertionError: The test or suite printed 15378 bytes to
> stdout and stderr, even though the limit was set to 8192 bytes.
> Increase the limit with @Limit, ignore it
>  completely with @SuppressSysoutChecks or run with
> -Dtests.verbose=true
> at __randomizedtesting.SeedInfo.seed([3E554FE0FEE122B9]:0)
> at 
> org.apache.lucene.tests.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:283)
> at 
> com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterIfSuccessful(TestRuleAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:37)
> at 
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
> test suite's output saved to
> /tmp/smoke_lucene_9.8.0_d914b3722bd5b8ef31ccf7e8ddc638a87fd648db/unpack/lucene-9
> .8.0/lucene/codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat.txt,
> copied below:
>   2> java.lang.reflect.InvocationTargetException
>   2>at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>   2>at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImp

Re: [VOTE] Release Lucene 9.8.0 RC1

2023-09-24 Thread Michael Sokolov
ok, I re-ran without the pesky log4j-thingy running and

SUCCESS! [0:55:54.865250]

+1

On Sun, Sep 24, 2023 at 5:52 PM Michael Sokolov  wrote:
>
> I ran the smoketester and had a failure. It seems related to some
> log4j hot patch script we are required to run at work which is somehow
> conflicting with the security manager? I'm killing that and trying
> again, but I wonder if this is going to cause problems at runtime as
> well? How do we enable the security manager - is it only when running
> tests?
>
> org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
> classMethod FAILED
> java.lang.AssertionError: The test or suite printed 15378 bytes to
> stdout and stderr, even though the limit was set to 8192 bytes.
> Increase the limit with @Limit, ignore it
>  completely with @SuppressSysoutChecks or run with
> -Dtests.verbose=true
> at __randomizedtesting.SeedInfo.seed([3E554FE0FEE122B9]:0)
> at 
> org.apache.lucene.tests.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:283)
> at 
> com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterIfSuccessful(TestRuleAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:37)
> at 
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
> test suite's output saved to
> /tmp/smoke_lucene_9.8.0_d914b3722bd5b8ef31ccf7e8ddc638a87fd648db/unpack/lucene-9
> .8.0/lucene/codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat.txt,
> copied below:
>   2> java.lang.reflect.InvocationTargetException
>   2>at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>   2>at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   2>at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   2>at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   2>at 
> java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:513)
>   2>at 
> java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallAgentmain(InstrumentationImpl.java:535)
>   2> Caused by: java.security.AccessControlException: access denied
> ("java.lang.RuntimePermission"
> "accessClassInPackage.jdk.internal.org.objectweb.asm")
>   2>at 
> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   2>at 
> java.base/java.security.AccessController.checkPermission(AccessController.java:897)
>   2>at 
> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
>   2>at 
> java.base/java.lang.SecurityManager.checkPackageAccess(SecurityManager.java:1238)
>   2>at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:174)
>   2>at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
>   2>at Log4jHotPatch.asmVersion(Log4jHotPatch.java:71)
>   2>at Log4jHotPatch.agentmain(Log4jHotPatch.java:93)
>   2>... 6 more
>
> On Sat, Sep 23, 2023 at 12:46 PM Jan Høydahl  wrote:
> >
> > Smoke tester only
> >
> > SUCCESS! [1:22:37.441415]
> >
> > +1 (binding)
> >
> > Jan
> >
> > 22. sep. 2023 kl. 07:48 skrev Patrick Zhai :
> >
> > Please vote for release candidate 1 for Lucene 9.8.0
> >
> > The artifacts can be downloaded from:
> > https://dist.apache.org/repos/d

Re: [VOTE] Release Lucene 9.8.0 RC1

2023-09-24 Thread Michael Sokolov
I ran the smoketester and had a failure. It seems related to some
log4j hot patch script we are required to run at work which is somehow
conflicting with the security manager? I'm killing that and trying
again, but I wonder if this is going to cause problems at runtime as
well? How do we enable the security manager - is it only when running
tests?

org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
classMethod FAILED
java.lang.AssertionError: The test or suite printed 15378 bytes to
stdout and stderr, even though the limit was set to 8192 bytes.
Increase the limit with @Limit, ignore it
 completely with @SuppressSysoutChecks or run with
-Dtests.verbose=true
at __randomizedtesting.SeedInfo.seed([3E554FE0FEE122B9]:0)
at 
org.apache.lucene.tests.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:283)
at 
com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterIfSuccessful(TestRuleAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:37)
at 
org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at 
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at 
org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at 
org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at 
org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
at java.base/java.lang.Thread.run(Thread.java:829)

org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat >
test suite's output saved to
/tmp/smoke_lucene_9.8.0_d914b3722bd5b8ef31ccf7e8ddc638a87fd648db/unpack/lucene-9
.8.0/lucene/codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.codecs.simpletext.TestSimpleTextPostingsFormat.txt,
copied below:
  2> java.lang.reflect.InvocationTargetException
  2>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
  2>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  2>at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  2>at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  2>at 
java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:513)
  2>at 
java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallAgentmain(InstrumentationImpl.java:535)
  2> Caused by: java.security.AccessControlException: access denied
("java.lang.RuntimePermission"
"accessClassInPackage.jdk.internal.org.objectweb.asm")
  2>at 
java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
  2>at 
java.base/java.security.AccessController.checkPermission(AccessController.java:897)
  2>at 
java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
  2>at 
java.base/java.lang.SecurityManager.checkPackageAccess(SecurityManager.java:1238)
  2>at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:174)
  2>at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
  2>at Log4jHotPatch.asmVersion(Log4jHotPatch.java:71)
  2>at Log4jHotPatch.agentmain(Log4jHotPatch.java:93)
  2>... 6 more

On Sat, Sep 23, 2023 at 12:46 PM Jan Høydahl  wrote:
>
> Smoke tester only
>
> SUCCESS! [1:22:37.441415]
>
> +1 (binding)
>
> Jan
>
> 22. sep. 2023 kl. 07:48 skrev Patrick Zhai :
>
> Please vote for release candidate 1 for Lucene 9.8.0
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.8.0-RC1-rev-d914b3722bd5b8ef31ccf7e8ddc638a87fd648db
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.8.0-RC1-rev-d914b3722bd5b8ef31ccf7e8ddc638a87fd648db
>
> The vote will be open for at least 72 hours, as there's a weekend, the vote 
> will last until 2023-09-27 06:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1 (non-binding)
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene 9.8 Release

2023-09-18 Thread Michael Sokolov
+1 for a release soon, and thanks for volunteering, Patrick!

On Tue, Sep 12, 2023 at 2:08 AM Patrick Zhai  wrote:
>
> Hi all,
> It's been a while since the last release and we have quite a few good changes 
> including new APIs, improvements and bug fixes. Should we release the 9.8?
>
> If there's no objections I volunteer to be the release manager and will cut 
> the feature branch a week from now, which is Sep. 18th PST.
>
> Best
> Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.7.0 RC1

2023-06-22 Thread Michael Sokolov
I have /tmp symlinked to /local/tmp (to get more space) and this seems
to cause some issue:

On Thu, Jun 22, 2023 at 7:07 PM Michael Sokolov  wrote:
>
> +0
>
> I had some test failures. Maybe a problem with my setup? I'll see if I can 
> repro
>
> gradlew :lucene:replicator:test --tests
> "org.apache.lucene.replicator.nrt.TestNRTReplication.testCrashPrimary1"
> -Ptests.jvms=8 "-Ptests.jv
> margs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC
> -XX:ActiveProcessorCount=1" -Ptests.seed=789EFCADBD918CC6
> -Ptests.nightly=true -Ptests.badapples=false -Ptest
> s.gui=true -Ptests.file.encoding=UTF-8
>
> On Thu, Jun 22, 2023 at 1:06 PM Tomás Fernández Löbbe
>  wrote:
> >
> > Thanks Adrien!
> >
> > SUCCESS! [0:43:17.143555]
> > +1
> >
> > On Wed, Jun 21, 2023 at 7:37 AM Adrien Grand  wrote:
> >>
> >> Please vote for release candidate 1 for Lucene 9.7.0
> >>
> >> The artifacts can be downloaded from:
> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
> >>
> >> You can run the smoke tester directly with this command:
> >>
> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
> >>
> >> The vote will be open for at least 72 hours i.e. until 2023-06-24 15:00 
> >> UTC.
> >>
> >> [ ] +1  approve
> >> [ ] +0  no opinion
> >> [ ] -1  disapprove (and reason why)
> >>
> >> Here is my +1
> >>
> >> --
> >> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.7.0 RC1

2023-06-22 Thread Michael Sokolov
+0

I had some test failures. Maybe a problem with my setup? I'll see if I can repro

gradlew :lucene:replicator:test --tests
"org.apache.lucene.replicator.nrt.TestNRTReplication.testCrashPrimary1"
-Ptests.jvms=8 "-Ptests.jv
margs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC
-XX:ActiveProcessorCount=1" -Ptests.seed=789EFCADBD918CC6
-Ptests.nightly=true -Ptests.badapples=false -Ptest
s.gui=true -Ptests.file.encoding=UTF-8

On Thu, Jun 22, 2023 at 1:06 PM Tomás Fernández Löbbe
 wrote:
>
> Thanks Adrien!
>
> SUCCESS! [0:43:17.143555]
> +1
>
> On Wed, Jun 21, 2023 at 7:37 AM Adrien Grand  wrote:
>>
>> Please vote for release candidate 1 for Lucene 9.7.0
>>
>> The artifacts can be downloaded from:
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
>>
>> The vote will be open for at least 72 hours i.e. until 2023-06-24 15:00 UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>> --
>> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Chris Hegarty to the Lucene PMC

2023-06-19 Thread Michael Sokolov
Welcome Chris!

On Mon, Jun 19, 2023, 7:31 AM Michael McCandless 
wrote:

> Welcome aboard Chris!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Jun 19, 2023 at 7:16 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Congratulations Chris!
>>
>> On Mon, 19 Jun, 2023, 3:23 pm Adrien Grand,  wrote:
>>
>>> I'm pleased to announce that Chris Hegarty has accepted an invitation to
>>> join the Lucene PMC!
>>>
>>> Congratulations Chris, and welcome aboard!
>>>
>>> --
>>> Adrien
>>>
>>


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-17 Thread Michael Sokolov
see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:

> > easily be circumvented by a user
>
> This is a revelation to me and others, if true.  Michael, please then
> point to a test or code snippet that shows the Lucene user community what
> they want to see so they are unblocked from their explorations of vector
> search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov 
> wrote:
>
>> I think I've said before on this list we don't actually enforce the limit
>> in any way that can't easily be circumvented by a user. The codec already
>> supports any size vector - it doesn't impose any limit. The way the API is
>> written you can *already today* create an index with max-int sized vectors
>> and we are committed to supporting that going forward by our backwards
>> compatibility policy as Robert points out. This wasn't intentional, I
>> think, but it is the facts.
>>
>> Given that, I think this whole discussion is not really necessary.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit that its users need to respect,
>>> in line with whatever the admin decided to be acceptable for them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability.  Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-17 Thread Michael Sokolov
I think I've said before on this list we don't actually enforce the limit
in any way that can't easily be circumvented by a user. The codec already
supports any size vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int sized vectors
and we are committed to supporting that going forward by our backwards
compatibility policy as Robert points out. This wasn't intentional, I
think, but it is the facts.

Given that, I think this whole discussion is not really necessary.
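[Editor's note: Option 4 in the vote quoted below names `Integer.getInteger("lucene.hnsw.maxDimensions", 1024)` as the mechanism. A minimal sketch of that approach follows; the class and property names are illustrative, not Lucene's actual code.]

```java
// Sketch of a configurable max-dimension limit read from a system property,
// falling back to a hardcoded default when the property is unset.
class VectorLimits {
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  // Integer.getInteger returns the parsed system property, or the default.
  static final int MAX_DIMENSIONS =
      Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

  static void checkDimension(int dim) {
    if (dim > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension " + dim + " exceeds limit " + MAX_DIMENSIONS);
    }
  }

  public static void main(String[] args) {
    checkDimension(768); // fine under the default limit
    try {
      // Rejected unless run with e.g. -Dlucene.hnsw.maxDimensions=8192
      checkDimension(4096);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The appeal of this shape is that it changes no public API: the default behavior is identical, and only an operator who sets the property opts into a different limit.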

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti 
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit that its users need to respect,
> in line with whatever the admin decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability.  Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io 
> LinkedIn  | Twitter
>  | Youtube
>  | Github
> 
>


Re: Running 10.0 build with a custom lucene 9.5

2023-05-15 Thread Michael Sokolov
random guess - does it have something to do with modules?
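[Editor's note: for context on the services mechanism this thread keeps circling: on the classpath, the JDK's ServiceLoader resolves implementations by reading META-INF/services/&lt;interface-name&gt; files from jars; under the module system it consults `provides` directives instead, which is one way a jar with a correct services file can still appear to have none. A minimal classpath sketch, with hypothetical interface and class names:]

```java
import java.util.ServiceLoader;

// Hypothetical service interface. A provider jar would list implementation
// class names, one per line, in META-INF/services/Codec (or declare them via
// a module's `provides Codec with ...` directive).
interface Codec {
  String name();
}

class ServiceLookupDemo {
  public static void main(String[] args) {
    // With no META-INF/services/Codec entry visible on the classpath, the
    // loader finds nothing - the same symptom as a jar whose services file
    // is present but not seen by the resolving class loader.
    ServiceLoader<Codec> loader = ServiceLoader.load(Codec.class);
    int found = 0;
    for (Codec c : loader) {
      found++;
      System.out.println("found provider: " + c.name());
    }
    System.out.println("providers found: " + found); // prints 0 here
  }
}
```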

On Mon, May 15, 2023 at 11:14 AM Gus Heck  wrote:
>
> I hadn't seen that one. Thanks, I'll look at it. It already looks a bit 
> confusing though since it seems to have options for pointing to a repo, but I 
> appear to be pulling the jars successfully from .m2/repository already... 
> (except then they don't work, so successful means I see them in the classpath 
> of the relevant classloader). And if we can't deploy a valid jar to 
> mavenLocal for some reason (tweaked the solr build so it sees mavenLocal()), 
> (or solr can't consume such a jar) that seems like an issue for whichever one 
> is breaking that.
>
> Debugging: The JDK appears to be attempting to load the services file from 
> modules, but not seeing the lucene module. (just the jdk ones) Also it passes 
> through a block that says:
>
> // not in a package of a module defined to this loader
> for (URL url : findMiscResource(name)) {
>
> (but then iterates 
> jdk.internal.loader.BuiltinClassLoader#nameToModule.values() to load things 
> anyway)
>
> -Gus
>
> On Mon, May 15, 2023 at 10:54 AM Houston Putman  
> wrote:
>>
>> Gus, I haven't done this myself, but are you using the instructions provided 
>> in Solr's "gradle/lucene-dev/lucene-dev-repo-composite.gradle"?
>>
>> It looks like you need to specify the development lucene version differently 
>> than other dependencies...
>>
>> - Houston
>>
>> On Sat, May 13, 2023 at 10:14 AM Michael Sokolov  wrote:
>>>
>>> doh I actually read your email and you said you already checked that -
>>> I'm going to send out one of those "sokolov would like to retract the
>>> previous email" emails. Does GMail even pretend to do that? I don't
>>> know what's going on there! sorry
>>>
>>> On Sat, May 13, 2023 at 10:13 AM Michael Sokolov  wrote:
>>> >
>>> > sorry - META-INF not WEB-INF
>>> >
>>> > On Sat, May 13, 2023 at 10:12 AM Michael Sokolov  
>>> > wrote:
>>> > >
>>> > > You are probably missing the contents of WEB-INF in your custom jar?
>>> > > Roughly speaking the files in there define run-time-bound "services"
>>> > > that are looked up by name by the JDK's service-loader API.
>>> > >
>>> > > On Sat, May 13, 2023 at 9:33 AM Gus Heck  wrote:
>>> > > >
>>> > > > Cross posting to lucene on the possibility that folks here are more 
>>> > > > likely to add customized lucene to Solr and recognize what I'm 
>>> > > > stumbling on? (zero responses on solr list)
>>> > > >
>>> > > > Note that the specific test that I happened to copy is not the issue, 
>>> > > > all tests are doing this (or at least so many tests are failing I 
>>> > > > can't see the ones that are passing easily).
>>> > > >
>>> > > > -- Forwarded message -
>>> > > > From: Gus Heck 
>>> > > > Date: Wed, May 10, 2023 at 6:50 PM
>>> > > > Subject: Running 10.0 build with a custom lucene 9.5
>>> > > > To: 
>>> > > >
>>> > > >
>>> > > > Lucene:
>>> > > >
>>> > > > I made a tweak to lucene for something I'm investigating, gave it a 
>>> > > > new version, deployed to mavenLocal()
>>> > > > I have verified that the jars are built with correct 
>>> > > > META-INF/services files
>>> > > >
>>> > > > Solr:
>>> > > >
>>> > > > I added mavenLocal() in gradle/globals.gradle
>>> > > > I removed the license file sha1 sigs for the default lucene & creates 
>>> > > > signatures for my test version
>>> > > > I updated versions.props
>>> > > > I updated versions.lock
>>> > > >
>>> > > > Now when I run individual solr tests via my ide they seem to pass, 
>>> > > > but virtually every test run via gradle fails with something like:
>>> > > >
>>> > > > org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED
>>> > > > java.lang.ExceptionInInitializerError
>>> > > > at org.apache.lucene.codecs.Codec.getDefault(Codec.java:141)

Re: Running 10.0 build with a custom lucene 9.5

2023-05-13 Thread Michael Sokolov
doh I actually read your email and you said you already checked that -
I'm going to send out one of those "sokolov would like to retract the
previous email" emails. Does GMail even pretend to do that? I don't
know what's going on there! sorry

On Sat, May 13, 2023 at 10:13 AM Michael Sokolov  wrote:
>
> sorry - META-INF not WEB-INF
>
> On Sat, May 13, 2023 at 10:12 AM Michael Sokolov  wrote:
> >
> > You are probably missing the contents of WEB-INF in your custom jar?
> > Roughly speaking the files in there define run-time-bound "services"
> > that are looked up by name by the JDK's service-loader API.
> >
> > On Sat, May 13, 2023 at 9:33 AM Gus Heck  wrote:
> > >
> > > Cross posting to lucene on the possibility that folks here are more 
> > > likely to add customized lucene to Solr and recognize what I'm stumbling 
> > > on? (zero responses on solr list)
> > >
> > > Note that the specific test that I happened to copy is not the issue, all 
> > > tests are doing this (or at least so many tests are failing I can't see 
> > > the ones that are passing easily).
> > >
> > > -- Forwarded message -
> > > From: Gus Heck 
> > > Date: Wed, May 10, 2023 at 6:50 PM
> > > Subject: Running 10.0 build with a custom lucene 9.5
> > > To: 
> > >
> > >
> > > Lucene:
> > >
> > > I made a tweak to lucene for something I'm investigating, gave it a new 
> > > version, deployed to mavenLocal()
> > > I have verified that the jars are built with correct META-INF/services 
> > > files
> > >
> > > Solr:
> > >
> > > I added mavenLocal() in gradle/globals.gradle
> > > I removed the license file sha1 sigs for the default lucene & creates 
> > > signatures for my test version
> > > I updated versions.props
> > > I updated versions.lock
> > >
> > > Now when I run individual solr tests via my ide they seem to pass, but 
> > > virtually every test run via gradle fails with something like:
> > >
> > > org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED
> > > java.lang.ExceptionInInitializerError
> > > at org.apache.lucene.codecs.Codec.getDefault(Codec.java:141)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleSetupAndRestoreClassEnv.before(TestRuleSetupAndRestoreClassEnv.java:137)
> > > at 
> > > org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:42)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> > > at 
> > > org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> > > at 
> > > org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> > > at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> > > at 
> > > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > > at 
> > > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> > > at 
> > > com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> > > at java.base/java.la

Re: Running 10.0 build with a custom lucene 9.5

2023-05-13 Thread Michael Sokolov
sorry - META-INF not WEB-INF

On Sat, May 13, 2023 at 10:12 AM Michael Sokolov  wrote:
>
> You are probably missing the contents of WEB-INF in your custom jar?
> Roughly speaking the files in there define run-time-bound "services"
> that are looked up by name by the JDK's service-loader API.
>
> On Sat, May 13, 2023 at 9:33 AM Gus Heck  wrote:
> >
> > Cross posting to lucene on the possibility that folks here are more likely 
> > to add customized lucene to Solr and recognize what I'm stumbling on? (zero 
> > responses on solr list)
> >
> > Note that the specific test that I happened to copy is not the issue, all 
> > tests are doing this (or at least so many tests are failing I can't see the 
> > ones that are passing easily).
> >
> > -- Forwarded message -
> > From: Gus Heck 
> > Date: Wed, May 10, 2023 at 6:50 PM
> > Subject: Running 10.0 build with a custom lucene 9.5
> > To: 
> >
> >
> > Lucene:
> >
> > I made a tweak to lucene for something I'm investigating, gave it a new 
> > version, deployed to mavenLocal()
> > I have verified that the jars are built with correct META-INF/services files
> >
> > Solr:
> >
> > I added mavenLocal() in gradle/globals.gradle
> > I removed the license file sha1 sigs for the default lucene & creates 
> > signatures for my test version
> > I updated versions.props
> > I updated versions.lock
> >
> > Now when I run individual solr tests via my ide they seem to pass, but 
> > virtually every test run via gradle fails with something like:
> >
> > org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED
> > java.lang.ExceptionInInitializerError
> > at org.apache.lucene.codecs.Codec.getDefault(Codec.java:141)
> > at 
> > org.apache.lucene.tests.util.TestRuleSetupAndRestoreClassEnv.before(TestRuleSetupAndRestoreClassEnv.java:137)
> > at 
> > org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:42)
> > at 
> > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > at 
> > org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> > at 
> > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > at 
> > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> > at 
> > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > at 
> > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > at 
> > org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> > at 
> > org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> > at 
> > org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> > at 
> > org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> > at 
> > org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> > at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> > at 
> > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> > at 
> > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> > at 
> > com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> > at java.base/java.lang.Thread.run(Thread.java:829)
> >
> > Caused by:
> > java.lang.IllegalArgumentException: An SPI class of type 
> > org.apache.lucene.codecs.Codec with name 'Lucene95' does not exist.  You 
> > need to add the corresponding JAR file supporting this SPI to your 
> > classpath.  The current classpath supports the following names: []
> > at 
> > org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113)
> > at org.apache.lucene.codecs.Codec$Holder.<clinit>(Codec.java:58)
> > ... 19 more
> >
> > org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED

Re: Running 10.0 build with a custom lucene 9.5

2023-05-13 Thread Michael Sokolov
You are probably missing the contents of WEB-INF in your custom jar?
Roughly speaking the files in there define run-time-bound "services"
that are looked up by name by the JDK's service-loader API.
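Roughly what this describes: the JDK's `ServiceLoader` discovers implementations by reading provider-configuration files named `META-INF/services/<interface-FQN>` from every jar on the classpath. A self-contained sketch (the `Codec` interface here is made up for illustration; Lucene's real lookup goes through `NamedSPILoader` on `org.apache.lucene.codecs.Codec`):

```java
import java.util.ServiceLoader;

public class SpiDemo {
    // Illustrative SPI interface. A real jar would advertise an
    // implementation by shipping a file META-INF/services/SpiDemo$Codec
    // containing the implementation's fully-qualified class name.
    public interface Codec { String name(); }

    public static void main(String[] args) {
        int found = 0;
        // ServiceLoader scans every matching META-INF/services entry
        // on the classpath and instantiates the named classes.
        for (Codec c : ServiceLoader.load(Codec.class)) {
            found++;
            System.out.println(c.name());
        }
        // With no services file present (as when a custom jar omits it),
        // the loader finds nothing -- matching the error Gus sees:
        // "The current classpath supports the following names: []"
        System.out.println("providers found: " + found);
    }
}
```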

On Sat, May 13, 2023 at 9:33 AM Gus Heck  wrote:
>
> Cross posting to lucene on the possibility that folks here are more likely to 
> add customized lucene to Solr and recognize what I'm stumbling on? (zero 
> responses on solr list)
>
> Note that the specific test that I happened to copy is not the issue, all 
> tests are doing this (or at least so many tests are failing I can't see the 
> ones that are passing easily).
>
> -- Forwarded message -
> From: Gus Heck 
> Date: Wed, May 10, 2023 at 6:50 PM
> Subject: Running 10.0 build with a custom lucene 9.5
> To: 
>
>
> Lucene:
>
> I made a tweak to lucene for something I'm investigating, gave it a new 
> version, deployed to mavenLocal()
> I have verified that the jars are built with correct META-INF/services files
>
> Solr:
>
> I added mavenLocal() in gradle/globals.gradle
> I removed the license file sha1 sigs for the default lucene & creates 
> signatures for my test version
> I updated versions.props
> I updated versions.lock
>
> Now when I run individual solr tests via my ide they seem to pass, but 
> virtually every test run via gradle fails with something like:
>
> org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED
> java.lang.ExceptionInInitializerError
> at org.apache.lucene.codecs.Codec.getDefault(Codec.java:141)
> at 
> org.apache.lucene.tests.util.TestRuleSetupAndRestoreClassEnv.before(TestRuleSetupAndRestoreClassEnv.java:137)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:42)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> Caused by:
> java.lang.IllegalArgumentException: An SPI class of type 
> org.apache.lucene.codecs.Codec with name 'Lucene95' does not exist.  You need 
> to add the corresponding JAR file supporting this SPI to your classpath.  The 
> current classpath supports the following names: []
> at 
> org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113)
> at org.apache.lucene.codecs.Codec$Holder.<clinit>(Codec.java:58)
> ... 19 more
>
> org.apache.solr.embedded.TestJettySolrRunner > classMethod FAILED
> java.lang.NullPointerException
> at java.base/java.util.Objects.requireNonNull(Objects.java:221)
> at org.apache.lucene.codecs.Codec.setDefault(Codec.java:151)
> at 
> org.apache.lucene.tests.util.TestRuleSetupAndRestoreClassEnv.after(TestRuleSetupAndRestoreClassEnv.java:292)
> at 
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:49)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOn

Re: HNSW questions

2023-05-11 Thread Michael Sokolov
Yes, it's up to the application. And it is definitely a pathological
case when it happens; https://github.com/apache/lucene/issues/11626

On Tue, May 9, 2023 at 1:30 PM Jonathan Ellis  wrote:
>
> I don't see anything to make sure vectors are unique in IndexingChain down to 
> FieldWriter, is that handled somewhere else?  Or is it just up to the user to 
> make sure no documents end up with duplicate vectors?
>
> On Wed, Apr 19, 2023 at 5:07 AM Michael Sokolov  wrote:
>>
>> Oh identical vectors. Basically unsupported. If you create a large index 
>> filled with identical vectors it leads to pathological behavior. Seems to be 
>> a weakness in the algorithm. If you have any idea how to improve that, it 
>> would be welcome. But in real world scenarios, it doesn't seem to arise?
>>
>> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis  wrote:
>>>
>>> HI all, a couple questions on how HNSW works:
>>>
>>> 1. What is driving the requirement for two copies of the input vectors?  It 
>>> looks like the RAVV implementations do shallow copies, so the vector from A 
>>> is the same that would be returned by B.  What am I missing?
>>>
>>> 2. What is the intended behavior when adding identical vectors to a HNSW?  
>>> It looks like when I supply 10 identical vectors, they all get added to the 
>>> graph, but when I search for the nearest neighbors, I only get one of them 
>>> in the result set.
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



BooleanQuery score aggregation

2023-04-28 Thread Michael Sokolov
I think that in BooleanQuery and related classes we mostly aggregate
child scores by summing (although there is DisjunctionMaxScorer which
doesn't exactly take the max?). I have a use case where I want to take
the min score from a bunch of required terms. To do this I had to
write a new query and fork BlockMaxConjunctionScorer. I wonder if it
would make sense to expose the aggregator to callers, perhaps with an
enum, since we can't support arbitrary functions, but we could support
at least min, max, sum?
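A plain-Java sketch of the kind of pluggable aggregator enum being suggested here. None of these names exist in Lucene's API; this only illustrates reducing child scores with min, max, or sum.

```java
import java.util.List;

public class ScoreAggregatorDemo {
    // Illustrative stand-in for the proposed enum of score combiners.
    enum Aggregator {
        SUM { float combine(float a, float b) { return a + b; } },
        MAX { float combine(float a, float b) { return Math.max(a, b); } },
        MIN { float combine(float a, float b) { return Math.min(a, b); } };

        abstract float combine(float a, float b);

        // Fold the child scores with this aggregator's combine function.
        float reduce(List<Float> childScores) {
            float acc = childScores.get(0);
            for (int i = 1; i < childScores.size(); i++) {
                acc = combine(acc, childScores.get(i));
            }
            return acc;
        }
    }

    public static void main(String[] args) {
        // Values chosen to be exactly representable in binary floating point.
        List<Float> scores = List.of(0.25f, 0.5f, 1.0f);
        System.out.println(Aggregator.SUM.reduce(scores)); // 1.75
        System.out.println(Aggregator.MAX.reduce(scores)); // 1.0
        System.out.println(Aggregator.MIN.reduce(scores)); // 0.25
    }
}
```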

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: HNSW questions

2023-04-20 Thread Michael Sokolov
Right RAVectorValues is just fronting an array of vectors and it
doesn't have any intermediate storage or other state (like a file
pointer) so it can support many simultaneous callers. Other
implementations of the interface work differently; see
OffHeapByteVectorValues, which is representing vectors in the index
and implemented using I/O calls.

If you shared some context about your interest here, we might be able
to help you better.

On Thu, Apr 20, 2023 at 1:22 PM Jonathan Ellis  wrote:
>
> It looks like I misunderstood how the Builder works, and the RAVV provided to 
> the constructor does not need to contain any values up front.  Specifically, 
> Lucene95HnswVectorsWriter.FieldWriter adds vectors incrementally to the RAVV 
> that it gives to the builder as addValue is called.
>
> On Wed, Apr 19, 2023 at 1:37 PM Michael Sokolov  wrote:
>>
>> That class is intended for use by the Lucene index writer - it's not
>> designed as a general purpose class for re-use outside that context.
>> And IndexWriter writes documents to disk in bulk.
>>
>> On Wed, Apr 19, 2023 at 3:54 PM Jonathan Ellis  wrote:
>> >
>> > Thanks, Michael!
>> >
>> > Looking at the paper by Malkov and Yashunin, it looks like the algorithm 
>> > allows for building the hnsw graph incrementally.  Why does our 
>> > implementation require specifying all the vectors up front to 
>> > HnswGraphBuilder.create?
>> >
>> > On Wed, Apr 19, 2023 at 3:04 AM Michael Sokolov  wrote:
>> >>
>> >> These vector values have internal buffers they use to return the vectors. 
>> >> In order to compare two vectors we need to use two independent sources so 
>> >> that one doesn't overwrite this internal state when fetching the second 
>> >> vector.
>> >>
>> >> Sorry I forgot the second question and can't see it on my phone. Brb
>> >>
>> >> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis  wrote:
>> >>>
>> >>> HI all, a couple questions on how HNSW works:
>> >>>
>> >>> 1. What is driving the requirement for two copies of the input vectors?  
>> >>> It looks like the RAVV implementations do shallow copies, so the vector 
>> >>> from A is the same that would be returned by B.  What am I missing?
>> >>>
>> >>> 2. What is the intended behavior when adding identical vectors to a 
>> >>> HNSW?  It looks like when I supply 10 identical vectors, they all get 
>> >>> added to the graph, but when I search for the nearest neighbors, I only 
>> >>> get one of them in the result set.
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> co-founder, http://www.datastax.com
>> >>> @spyced
>> >
>> >
>> >
>> > --
>> > Jonathan Ellis
>> > co-founder, http://www.datastax.com
>> > @spyced
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: HNSW questions

2023-04-19 Thread Michael Sokolov
That class is intended for use by the Lucene index writer - it's not
designed as a general purpose class for re-use outside that context.
And IndexWriter writes documents to disk in bulk.

On Wed, Apr 19, 2023 at 3:54 PM Jonathan Ellis  wrote:
>
> Thanks, Michael!
>
> Looking at the paper by Malkov and Yashunin, it looks like the algorithm 
> allows for building the hnsw graph incrementally.  Why does our 
> implementation require specifying all the vectors up front to 
> HnswGraphBuilder.create?
>
> On Wed, Apr 19, 2023 at 3:04 AM Michael Sokolov  wrote:
>>
>> These vector values have internal buffers they use to return the vectors. In 
>> order to compare two vectors we need to use two independent sources so that 
>> one doesn't overwrite this internal state when fetching the second vector.
>>
>> Sorry I forgot the second question and can't see it on my phone. Brb
>>
>> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis  wrote:
>>>
>>> HI all, a couple questions on how HNSW works:
>>>
>>> 1. What is driving the requirement for two copies of the input vectors?  It 
>>> looks like the RAVV implementations do shallow copies, so the vector from A 
>>> is the same that would be returned by B.  What am I missing?
>>>
>>> 2. What is the intended behavior when adding identical vectors to a HNSW?  
>>> It looks like when I supply 10 identical vectors, they all get added to the 
>>> graph, but when I search for the nearest neighbors, I only get one of them 
>>> in the result set.
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene 9.6 release

2023-04-19 Thread Michael Sokolov
Yes, thanks Alan!

On Wed, Apr 19, 2023 at 3:41 PM Michael Wechner
 wrote:
>
> +1
>
> Thanks!
>
> Michael
>
> Am 19.04.23 um 18:09 schrieb Benjamin Trent:
>
> +1 !
>
> You rock Alan!
>
> On Wed, Apr 19, 2023, 9:54 AM Ignacio Vera  wrote:
>>
>> +1
>>
>> Thanks Alan!
>>
>> On Wed, Apr 19, 2023 at 1:27 PM Alan Woodward  wrote:
>>>
>>> Hi all,
>>>
>>> It’s been a while since our last release, and we have a number of nice 
>>> improvements and optimisations sitting in the 9x branch.  I propose that we 
>>> start the process for a 9.6 release, and I will volunteer to be the release 
>>> manager.  If there are no objections, I will cut a release branch one week 
>>> today, April 26th.
>>>
>>> - Alan
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: HNSW questions

2023-04-19 Thread Michael Sokolov
Oh identical vectors. Basically unsupported. If you create a large index
filled with identical vectors it leads to pathological behavior. Seems to
be a weakness in the algorithm. If you have any idea how to improve that,
it would be welcome. But in real world scenarios, it doesn't seem to arise?

On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis  wrote:

> HI all, a couple questions on how HNSW works:
>
> 1. What is driving the requirement for two copies of the input vectors?
> It looks like the RAVV implementations do shallow copies, so the vector
> from A is the same that would be returned by B.  What am I missing?
>
> 2. What is the intended behavior when adding identical vectors to a HNSW?
> It looks like when I supply 10 identical vectors, they all get added to the
> graph, but when I search for the nearest neighbors, I only get one of them
> in the result set.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: HNSW questions

2023-04-19 Thread Michael Sokolov
These vector values have internal buffers they use to return the vectors.
In order to compare two vectors we need to use two independent sources so
that one doesn't overwrite this internal state when fetching the second
vector.
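The hazard described above can be reproduced with a toy vector source that reuses one internal buffer (illustrative code, not the actual `VectorValues` API):

```java
public class SharedBufferDemo {
    // Toy stand-in for a vector source that returns vectors through one
    // reused internal buffer, like the readers described above. Not Lucene API.
    static class BufferedVectors {
        private final float[][] data;
        private final float[] buffer;
        BufferedVectors(float[][] data) {
            this.data = data;
            this.buffer = new float[data[0].length];
        }
        // Every call returns the SAME array, freshly overwritten.
        float[] vectorValue(int ord) {
            System.arraycopy(data[ord], 0, buffer, 0, buffer.length);
            return buffer;
        }
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 0f}, {0f, 1f} };

        // One source: fetching vector 1 clobbers the copy of vector 0.
        BufferedVectors one = new BufferedVectors(vectors);
        float[] a = one.vectorValue(0);
        float[] b = one.vectorValue(1);
        System.out.println(a == b); // true: both point at the same buffer
        System.out.println(a[0]);   // 0.0, not the 1.0 we fetched first

        // Two independent sources keep the two vectors distinct.
        BufferedVectors s1 = new BufferedVectors(vectors);
        BufferedVectors s2 = new BufferedVectors(vectors);
        float[] c = s1.vectorValue(0);
        s2.vectorValue(1); // does not disturb c
        System.out.println(c[0]); // 1.0 as expected
    }
}
```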

Sorry I forgot the second question and can't see it on my phone. Brb

On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis  wrote:

> HI all, a couple questions on how HNSW works:
>
> 1. What is driving the requirement for two copies of the input vectors?
> It looks like the RAVV implementations do shallow copies, so the vector
> from A is the same that would be returned by B.  What am I missing?
>
> 2. What is the intended behavior when adding identical vectors to a HNSW?
> It looks like when I supply 10 identical vectors, they all get added to the
> graph, but when I search for the nearest neighbors, I only get one of them
> in the result set.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-12 Thread Michael Sokolov
.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>
>> This probably represents something significant about how the ada-002 
>> embeddings are created, but I think it also means creating "realistic" 
>> values is possible.  I did not use this information when testing recall & 
>> performance on Lucene's HNSW implementation on 192m documents, as I slightly 
>> dithered the values of a "real" set on 47K docs and stored other fields in 
>> the doc that referenced the "base" document that the dithers were made from, 
>> and used different dithering magnitudes so that I could test recall with 
>> different neighbour sizes ("M"), construction-beamwidth and 
>> search-beamwidths.
>>
>> best regards
>>
>> Kent Fitch
>>
>>
>>
>>
>> On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner  
>> wrote:
>>>
>>> I understand what you mean that it seems to be artificial, but I don't
>>> understand why this matters to test performance and scalability of the
>>> indexing?
>>>
>>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>>> are only open source models generating vectors with 4 dimensions, for
>>> example
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>> -0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114
>>>
>>> and now I concatenate them to vectors with 8 dimensions
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114
>>>
>>> and normalize them to length 1.
>>>
>>> Why should this be any different to a model which is acting like a black
>>> box generating vectors with 8 dimensions?
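The construction in the quoted message (concatenate two lower-dimensional vectors, then scale the result to unit length) is mechanical; a sketch, with made-up input values of the same magnitude as the quoted ones:

```java
public class ConcatNormalizeDemo {
    // Concatenate two vectors, then scale to unit (L2) length -- the
    // construction described in the quoted message.
    static float[] concatAndNormalize(float[] a, float[] b) {
        float[] out = new float[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        double norm = 0;
        for (float v : out) norm += (double) v * v;
        double scale = 1.0 / Math.sqrt(norm);
        for (int i = 0; i < out.length; i++) out[i] *= scale;
        return out;
    }

    public static void main(String[] args) {
        float[] v = concatAndNormalize(
            new float[] {0.0215f, 0.1122f, -0.0079f, 0.0380f},
            new float[] {0.0260f, 0.0063f, 0.0205f, -0.0291f});
        double norm = 0;
        for (float x : v) norm += (double) x * x;
        System.out.println("length=" + v.length);
        System.out.println("unit=" + (Math.abs(Math.sqrt(norm) - 1.0) < 1e-4));
    }
}
```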
>>>
>>>
>>>
>>>
>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>> >> What exactly do you consider real vector data? Vector data which is 
>>> >> based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a realistic
>>> > vector model that requires as many dimensions as people seem to be
>>> > demanding. As I said above, after surveying huggingface I couldn't
>>> > find any text-based model using more than 768 dimensions. So far we
>>> > have some ideas of generating higher-dimensional data by dithering or
>>> > concatenating existing data, but it seems artificial.
>>> >
>>> > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
>>> >  wrote:
>>> >> What exactly do you consider real vector data? Vector data which is 
>>> >> based on texts written by humans?
>>> >>
>>> >> I am asking, because I recently attended the following presentation by 
>>> >> Anastassia Shaitarova (UZH Institute for Computational Linguistics, 
>>> >> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >>
>>> >> 
>>> >>
>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>> >>
>>> >> The detection of machine-generated text has become increasingly 
>>> >> important due to the prevalence of automated content generation and its 
>>> >> potential for misuse. In this talk, we will discuss the motivation for 
>>> >> automatic detection of generated text. We will present the currently 
>>> >> available methods, including feature-based classifica

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-11 Thread Michael Sokolov
l document the RAM requirements)?  Maybe merge RAM costs should be 
> accounted for in IW's RAM buffer accounting?  It is not today, and there are 
> some other things that use non-trivial RAM, e.g. the doc mapping (to compress 
> docid space when deletions are reclaimed).
>
> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing 
> time massively increased -- see annotations DH and DP here: 
> https://home.apache.org/~mikemccand/lucenebench/indexing.html.  Nightly 
> benchmarks now start at 6 PM and don't finish until ~14.5 hours later.  Of 
> course, that is using a single thread for indexing (on a box that has 128 
> cores!) so we produce a deterministic index every night ...
>
> Stepping out (meta) a bit ... this discussion is precisely one of the awesome 
> benefits of the (informed) veto.  It means risky changes to the software, as 
> determined by any single informed developer on the project, can force a 
> healthy discussion about the problem at hand.  Robert is legitimately 
> concerned about a real issue and so we should use our creative energies to 
> characterize our HNSW implementation's performance, document it clearly for 
> users, and uncover ways to improve it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti  
> wrote:
>>
>> I think Gus points are on target.
>>
>> I recommend we move this forward in this way:
>> We stop any discussion and everyone interested proposes an option with a 
>> motivation, then we aggregate the options and we create a Vote maybe?
>>
>> I am also on the same page on the fact that a veto should come with a clear 
>> and reasonable technical merit, which also in my opinion has not come yet.
>>
>> I also apologise if any of my words sounded harsh or personal attacks, never 
>> meant to do so.
>>
>> My proposed option:
>>
>> 1) remove the limit and potentially make it configurable,
>> Motivation:
>> The system administrator can enforce a limit that their users need to respect,
>> in line with whatever the admin has decided is acceptable for them.
>> Default can stay the current one.
>>
>> That's my favourite at the moment, but I agree that potentially in the 
>> future this may need to change, as we may optimise the data structures for 
>> certain dimensions. I am a big fan of Yagni (you aren't going to need it) 
>> so I am ok we'll face a different discussion if that happens in the future.
>>
>>
>>
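
[A minimal sketch of what option 1 above (a configurable limit) could look like. The class name `KnnLimits`, the system property name, and the default value are illustrative assumptions, not actual Lucene API:]

```java
public final class KnnLimits {
    // Hypothetical sketch of a configurable max-dimension check; not Lucene code.
    public static final int DEFAULT_MAX_DIMENSIONS = 1024;

    // The admin can override the default, e.g. -Dknn.maxDimensions=4096.
    private static final int MAX_DIMENSIONS =
        Integer.getInteger("knn.maxDimensions", DEFAULT_MAX_DIMENSIONS);

    private KnnLimits() {}

    /** Throws if a vector field declares more dimensions than the configured limit. */
    public static void checkDimension(int dim) {
        if (dim > MAX_DIMENSIONS) {
            throw new IllegalArgumentException(
                "vector dimension " + dim + " exceeds configured max " + MAX_DIMENSIONS);
        }
    }

    public static void main(String[] args) {
        KnnLimits.checkDimension(768); // fine under the default limit
        try {
            KnnLimits.checkDimension(2048);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the exception-raising behavior stays exactly as today by default; only operators who explicitly opt in see anything different.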
>> On Sun, 9 Apr 2023, 18:46 Gus Heck,  wrote:
>>>
>>> What I see so far:
>>>
>>> Much positive support for raising the limit
>>> Slightly less support for removing it or making it configurable
>>> A single veto which argues that a (as yet undefined) performance standard 
>>> must be met before raising the limit
>>> Hot tempers (various) making this discussion difficult
>>>
>>> As I understand it, vetoes must have technical merit. I'm not sure that 
>>> this veto rises to "technical merit" on 2 counts:
>>>
>>> No standard for the performance is given so it cannot be technically met. 
>>> Without hard criteria it's a moving target.
>>> It appears to encode a valuation of the user's time, and that valuation is 
>>> really up to the user. Some users may consider 2hours useless and not worth 
>>> it, and others might happily wait 2 hours. This is not a technical 
>>> decision, it's a business decision regarding the relative value of the time 
>>> invested vs the value of the result. If I can cure cancer by indexing for a 
>>> year, that might be worth it... (hyperbole of course).
>>>
>>> Things I would consider to have technical merit that I don't hear:
>>>
>>> Impact on the speed of **other** indexing operations. (devaluation of other 
>>> functionality)
>>> Actual scenarios that work when the limit is low and fail when the limit is 
>>> high (new failure on the same data with the limit raised).
>>>
>>> One thing that might or might not have technical merit
>>>
>>> If someone feels there is a lack of documentation of the costs/performance 
>>> implications of using large vectors, possibly including reproducible 
>>> benchmarks establishing the scaling behavior (there seems to be 
>>> disagreement on O(n) vs O(n^2)).
>>>
>>> The users *should* know what they are getting into, but if the cost is 
>>> worth it to them, they should be able to pa

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-10 Thread Michael Sokolov
I poked around on huggingface looking at various models that are being
promoted there; this is the highest-performing text model they list,
which is expected to take sentences as input; it uses so-called
"attention" to capture the context of words:
https://huggingface.co/sentence-transformers/all-mpnet-base-v2 and it
is 768-dimensional. This is a list of models designed for "asymmetric
semantic search" ie short queries and long documents:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. The
highest ranking one there also seems to be 768d
https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

I did see some other larger-dimensional models, but they all seem to
involve images+text.

On Mon, Apr 10, 2023 at 9:54 AM Michael Sokolov  wrote:
>
> I think concatenating word-embedding vectors is a reasonable thing to
> do. It captures information about the sequence of tokens which is
> being lost by the current approach (summing them). Random article I
> found in a search
> https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
> shows higher performance with a concatenative approach. So it seems to
> me we could take the 300-dim Glove vectors and produce somewhat
> meaningful (say) 1200- or 1500-dim vectors by running a sliding window
> over the tokens in a document and concatenating the token-vectors
>
> On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss  wrote:
> >
> > > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 
> > > and 300 dimensional varieties and can easily enough generate large 
> > > numbers of vector documents from the articles data. To go higher we could 
> > > concatenate vectors from that and I believe the performance numbers would 
> > > be plausible.
> >
> > Apologies - I wasn't clear - I thought of building the 1k or 2k
> > vectors that would be realistic. Perhaps using glove or perhaps using
> > some other software but something that would reflect a true 2k
> > dimensional space accurately with "real" data underneath. I am not
> > familiar enough with the field to tell whether a simple concatenation
> > is a good enough simulation - perhaps it is.
> >
> > I would really prefer to focus on doing this kind of assessment of
> > feasibility/ limitations rather than arguing back and forth. I did my
> > experiment a while ago and I can't really tell whether there have been
> > improvements in the indexing/ merging part - your email contradicts my
> > experience Mike, so I'm a bit intrigued and would like to revisit it.
> > But it'd be ideal to work with real vectors rather than a simulation.
> >
> > Dawid
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >




Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-10 Thread Michael Sokolov
I think concatenating word-embedding vectors is a reasonable thing to
do. It captures information about the sequence of tokens which is
being lost by the current approach (summing them). Random article I
found in a search
https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
shows higher performance with a concatenative approach. So it seems to
me we could take the 300-dim Glove vectors and produce somewhat
meaningful (say) 1200- or 1500-dim vectors by running a sliding window
over the tokens in a document and concatenating the token-vectors
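
[The sliding-window concatenation described above can be sketched as follows; the window size and the decision to emit one vector per window position are illustrative choices, not a fixed recipe:]

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatWindow {
    /**
     * Concatenate token vectors over a sliding window to form higher-dimensional
     * vectors that retain token-order information (unlike summing).
     * Illustrative sketch only; not Lucene code.
     */
    static List<float[]> slidingWindowConcat(float[][] tokenVectors, int window) {
        List<float[]> out = new ArrayList<>();
        int dim = tokenVectors[0].length;
        for (int start = 0; start + window <= tokenVectors.length; start++) {
            float[] concat = new float[dim * window];
            for (int w = 0; w < window; w++) {
                // copy token vector w of this window into its slot of the output
                System.arraycopy(tokenVectors[start + w], 0, concat, w * dim, dim);
            }
            out.add(concat);
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. 10 GloVe 300-d token vectors, window of 4 -> 1200-d vectors
        float[][] tokens = new float[10][300];
        List<float[]> docVecs = slidingWindowConcat(tokens, 4);
        System.out.println(docVecs.size() + " vectors of dim " + docVecs.get(0).length);
        // 7 vectors of dim 1200
    }
}
```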

On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss  wrote:
>
> > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 
> > and 300 dimensional varieties and can easily enough generate large numbers 
> > of vector documents from the articles data. To go higher we could 
> > concatenate vectors from that and I believe the performance numbers would 
> > be plausible.
>
> Apologies - I wasn't clear - I thought of building the 1k or 2k
> vectors that would be realistic. Perhaps using glove or perhaps using
> some other software but something that would reflect a true 2k
> dimensional space accurately with "real" data underneath. I am not
> familiar enough with the field to tell whether a simple concatenation
> is a good enough simulation - perhaps it is.
>
> I would really prefer to focus on doing this kind of assessment of
> feasibility/ limitations rather than arguing back and forth. I did my
> experiment a while ago and I can't really tell whether there have been
> improvements in the indexing/ merging part - your email contradicts my
> experience Mike, so I'm a bit intrigued and would like to revisit it.
> But it'd be ideal to work with real vectors rather than a simulation.
>
> Dawid
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>




Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Michael Sokolov
We do have a dataset built from Wikipedia in luceneutil. It comes in 100
and 300 dimensional varieties and can easily enough generate large numbers
of vector documents from the articles data. To go higher we could
concatenate vectors from that and I believe the performance numbers would
be plausible.

On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss  wrote:

> Can we set up a branch in which the limit is bumped to 2048, then have
> a realistic, free data set (wikipedia sample or something) that has,
> say, 5 million docs and vectors created using public data (glove
> pre-trained embeddings or the like)? We then could run indexing on the
> same hardware with 512, 1024 and 2048 and see what the numbers, limits
> and behavior actually are.
>
> I can help in writing this but not until after Easter.
>
>
> Dawid
>
> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand  wrote:
> >
> > As Dawid pointed out earlier on this thread, this is the rule for
> > Apache projects: a single -1 vote on a code change is a veto and
> > cannot be overridden. Furthermore, Robert is one of the people on this
> > project who worked the most on debugging subtle bugs, making Lucene
> > more robust and improving our test framework, so I'm listening when he
> > voices quality concerns.
> >
> > The argument against removing/raising the limit that resonates with me
> > the most is that it is a one-way door. As MikeS highlighted earlier on
> > this thread, implementations may want to take advantage of the fact
> > that there is a limit at some point too. This is why I don't want to
> > remove the limit and would prefer a slight increase, such as 2048 as
> > suggested in the original issue, which would enable most of the things
> > that users who have been asking about raising the limit would like to
> > do.
> >
> > I agree that the merge-time memory usage and slow indexing rate are
> > not great. But it's still possible to index multi-million vector
> > datasets with a 4GB heap without hitting OOMEs regardless of the
> > number of dimensions, and the feedback I'm seeing is that many users
> > are still interested in indexing multi-million vector datasets despite
> > the slow indexing rate. I wish we could do better, and vector indexing
> > is certainly more expert than text indexing, but it still is usable in
> > my opinion. I understand how giving Lucene more information about
> > vectors prior to indexing (e.g. clustering information as Jim pointed
> > out) could help make merging faster and more memory-efficient, but I
> > would really like to avoid making it a requirement for indexing
> > vectors as it also makes this feature much harder to use.
> >
> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
> >  wrote:
> > >
> > > I am very attentive to listening to opinions, but I am unconvinced here and
> I am not sure that a single person's opinion should be allowed to be
> detrimental to such an important project.
> > >
> > > The limit as far as I know is literally just raising an exception.
> > > Removing it won't alter in any way the current performance for users
> in low dimensional space.
> > > Removing it will just enable more users to use Lucene.
> > >
> > > If new users in certain situations are unhappy with the
> performance, they may contribute improvements.
> > > This is how you make progress.
> > >
> > > If it's a reputation thing, trust me that not allowing users to play
> with high dimensional space will equally damage it.
> > >
> > > To me it's really a no brainer.
> > > Removing the limit and enabling people to use high dimensional vectors
> will take minutes.
> > > Improving the hnsw implementation can take months.
> > > Pick one to begin with...
> > >
> > > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >
> > >
> > > On Sat, 8 Apr 2023, 18:57 Robert Muir,  wrote:
> > >>
> > >> I disagree with your categorization. I put in plenty of work and
> > >> experienced plenty of pain myself, writing tests and fighting these
> > >> issues, after i saw that, two releases in a row, vector indexing fell
> > >> over and hit integer overflows etc on small datasets:
> > >>
> > >> https://github.com/apache/lucene/pull/11905
> > >>
> > >> Attacking me isn't helping the situation.
> > >>
> > >> PS: when i said the "one guy who wrote the code

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Michael Sokolov
well, it's a final variable. But you could maybe extend KnnVectorField
to get around this limit? I think that's the only place it's currently
enforced
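
[A small demo of why run-time reflection (asked about below) can't raise the limit: it is a compile-time constant (`static final int`), which javac inlines into every call site, and the JDK refuses reflective writes to static final fields anyway. The constant here is a stand-in, not the actual Lucene field:]

```java
public class FinalConstantDemo {
    // Stand-in for a compile-time constant like the max-dimensions limit.
    static final int MAX_DIMENSIONS = 1024;

    public static void main(String[] args) throws Exception {
        java.lang.reflect.Field f =
            FinalConstantDemo.class.getDeclaredField("MAX_DIMENSIONS");
        f.setAccessible(true);
        try {
            // Field.set on a *static final* field always throws, even after
            // setAccessible(true), per the java.lang.reflect.Field contract.
            f.setInt(null, 4096);
        } catch (IllegalAccessException e) {
            System.out.println("reflection write refused: "
                + e.getClass().getSimpleName());
        }
        // Even if the write had succeeded, this call site was compiled with the
        // literal 1024 inlined, so it would still print the old value.
        System.out.println(MAX_DIMENSIONS);
    }
}
```

So the practical options really are a code change (recompiling Lucene) or an extension point such as the `KnnVectorField` subclassing idea above.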

On Sat, Apr 8, 2023 at 3:54 PM Ishan Chattopadhyaya
 wrote:
>
> Can the limit be raised using Java reflection at run time? Or is there more 
> to it that needs to be changed?
>
> On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti,  
> wrote:
>>
>> I am very attentive to listening to opinions, but I am unconvinced here and I am 
>> not sure that a single person's opinion should be allowed to be detrimental 
>> to such an important project.
>>
>> The limit as far as I know is literally just raising an exception.
>> Removing it won't alter in any way the current performance for users in low 
>> dimensional space.
>> Removing it will just enable more users to use Lucene.
>>
>> If new users in certain situations will be unhappy with the performance, 
>> they may contribute improvements.
>> This is how you make progress.
>>
>> If it's a reputation thing, trust me that not allowing users to play with 
>> high dimensional space will equally damage it.
>>
>> To me it's really a no brainer.
>> Removing the limit and enabling people to use high dimensional vectors will 
>> take minutes.
>> Improving the hnsw implementation can take months.
>> Pick one to begin with...
>>
>> And there's no-one paying me here, no company interest whatsoever, actually 
>> I pay people to contribute, I am just convinced it's a good idea.
>>
>>
>> On Sat, 8 Apr 2023, 18:57 Robert Muir,  wrote:
>>>
>>> I disagree with your categorization. I put in plenty of work and
>>> experienced plenty of pain myself, writing tests and fighting these
>>> issues, after i saw that, two releases in a row, vector indexing fell
>>> over and hit integer overflows etc on small datasets:
>>>
>>> https://github.com/apache/lucene/pull/11905
>>>
>>> Attacking me isn't helping the situation.
>>>
>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> any kind of demeaning fashion really. I meant to describe the current
>>> state of usability with respect to indexing a few million docs with
>>> high dimensions. You can scroll up the thread and see that at least
>>> one other committer on the project experienced similar pain as me.
>>> Then, think about users who aren't committers trying to use the
>>> functionality!
>>>
>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov  wrote:
>>> >
>>> > What you said about increasing dimensions requiring a bigger ram buffer 
>>> > on merge is wrong. That's the point I was trying to make. Your concerns 
>>> > about merge costs are not wrong, but your conclusion that we need to 
>>> > limit dimensions is not justified.
>>> >
>>> > You complain that hnsw sucks it doesn't scale, but when I show it scales 
>>> > linearly with dimension you just ignore that and complain about something 
>>> > entirely different.
>>> >
>>> > You demand that people run all kinds of tests to prove you wrong but when 
>>> > they do, you don't listen and you won't put in the work yourself or 
>>> > complain that it's too hard.
>>> >
>>> > Then you complain about people not meeting you half way. Wow
>>> >
>>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:
>>> >>
>>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>>> >>  wrote:
>>> >> >
>>> >> > What exactly do you consider reasonable?
>>> >>
>>> >> Let's begin a real discussion by being HONEST about the current
>>> >> status. Please put politically correct or your own company's wishes
>>> >> aside, we know it's not in a good state.
>>> >>
>>> >> Current status is the one guy who wrote the code can set a
>>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >> dimensions in HOURS (i didn't ask what hardware).
>>> >>
>>> >> My concerns are everyone else except the one guy, I want it to be
>>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>>> >> ram buffer and bigger heap to avoid OOM on merge.
>>> >> It is also a permanent backwards compatibility decision, we hav

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Michael Sokolov
What you said about increasing dimensions requiring a bigger ram buffer on
merge is wrong. That's the point I was trying to make. Your concerns about
merge costs are not wrong, but your conclusion that we need to limit
dimensions is not justified.

You complain that hnsw sucks it doesn't scale, but when I show it scales
linearly with dimension you just ignore that and complain about something
entirely different.

You demand that people run all kinds of tests to prove you wrong but when
they do, you don't listen and you won't put in the work yourself or
complain that it's too hard.

Then you complain about people not meeting you half way. Wow

On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:

> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>  wrote:
> >
> > What exactly do you consider reasonable?
>
> Let's begin a real discussion by being HONEST about the current
> status. Please put politically correct or your own company's wishes
> aside, we know it's not in a good state.
>
> Current status is the one guy who wrote the code can set a
> multi-gigabyte ram buffer and index a small dataset with 1024
> dimensions in HOURS (i didn't ask what hardware).
>
> My concerns are everyone else except the one guy, I want it to be
> usable. Increasing dimensions just means even bigger multi-gigabyte
> ram buffer and bigger heap to avoid OOM on merge.
> It is also a permanent backwards compatibility decision, we have to
> support it once we do this and we can't just say "oops" and flip it
> back.
>
> It is unclear to me, if the multi-gigabyte ram buffer is really to
> avoid merges because they are so slow and it would be DAYS otherwise,
> or if its to avoid merges so it doesn't hit OOM.
> Also from personal experience, it takes trial and error (means
> experiencing OOM on merge!!!) before you get those heap values correct
> for your dataset. This usually means starting over which is
> frustrating and wastes more time.
>
> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> to me like its a good idea. maybe the multigigabyte ram buffer can be
> avoided in this way and performance improved by writing bigger
> segments with lucene's defaults. But this doesn't mean we can simply
> ignore the horrors of what happens on merge. merging needs to scale so
> that indexing really scales.
>
> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> fashion when indexing.
>
>
>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
one more data point:

32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
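
[For context, a back-of-envelope sizing of a run like this; the per-link byte figure and the 2*M base-layer neighbor count are rough assumptions, not Lucene's exact accounting:]

```java
public class VectorSizing {
    // Rough sizing for an fp32 vector field plus an HNSW graph over it.
    static long rawVectorBytes(long numVectors, int dim) {
        return numVectors * dim * 4L; // 4 bytes per float32 component
    }

    static long graphLinkBytes(long numVectors, int m) {
        // Assume up to 2*M neighbors per node on the base layer, ~4 bytes per link id.
        return numVectors * 2L * m * 4L;
    }

    public static void main(String[] args) {
        long n = 32_000_000L; // the 32M-vector run above
        System.out.println("vectors: " + rawVectorBytes(n, 100) / (1L << 30) + " GiB");
        System.out.println("links:   " + graphLinkBytes(n, 16) / (1L << 30) + " GiB");
        // vectors: 11 GiB
        // links:   3 GiB
    }
}
```

Note that the raw vectors dwarf the 4 GB heap used in the run; they live in flushed segment files, which is why heap pressure comes mostly from buffered graphs and merging rather than from the vectors themselves.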

On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov  wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fields remain
> manageable; M is limited to <= 512, and maximum segment size also
> helps limit merge costs
>
> On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov  wrote:
> >
> > Thanks Kent - I tried something similar to what you did I think. Took
> > a set of 256d vectors I had and concatenated them to make bigger ones,
> > then shifted the dimensions to make more of them. Here are a few
> > single-threaded indexing test runs. I ran all tests with M=16.
> >
> >
> > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > buffer size=1994)
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > increasing the vector dimension makes things take longer (scaling
> > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > OOM while merging with a small heap and a large number of vectors, or
> > by increasing M, but none of this has anything to do with vector
> > dimensions. Also, if merge RAM usage is a problem I think we could
> > address it by adding accounting to the merge process and simply not
> > merging graphs when they exceed the buffer size (as we do with
> > flushing).
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
> > On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch  wrote:
> > >
> > > Hi,
> > > I have been testing Lucene with a custom vector similarity and loaded 
> > > 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java 
> > > memory..).
> > >
> > > As this was a performance test, the 192m vectors were derived by 
> > > dithering 47k original vectors in such a way to allow realistic ANN 
> > > evaluation of HNSW.  The original 47k vectors were generated by ada-002 
> > > on source newspaper article text.  After dithering, I used PQ to reduce 
> > > their dimensionality from 1536 floats to 512 bytes - 3 source dimensions 
> > > to a 1byte code, 512 code tables, each learnt to reduce total encoding 
> > > error using Lloyds algorithm (hence the need for the custom similarity). 
> > > BTW, HNSW retrieval was accurate and fast enough for the use case I was 
> > > investigating as long as a machine with 128gb memory was available as the 
> > > graph needs to be cached in memory for reasonable query rates.
> > >
> > > Anyway, if you want them, you are welcome to those 47k vectors of 1532 
> > > floats which can be readily dithered to generate very large and realistic 
> > > test vector sets.
> > >
> > > Best regards,
> > >
> > > Kent Fitch
> > >
> > >
> > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner,  
> > > wrote:
> > >>
> > >> you might want to use SentenceBERT to generate vectors
> > >>
> > >> https://sbert.net
> > >>
> > >> whereas for example the model "all-mpnet-base-v2" generates vectors with 
> > >> dimension 768
> > >>
> > >> We have SentenceBERT running as a web service, which we could open for 
> > >> these tests, but because of network latency it should be faster running 
> > >> locally.
> > >>
> > >> HTH
> > >>
> > >> Michael
> > >>
> > >>
> > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > >>
> > >> I've started to look on the internet, and surely someone will come, but 
> > >> the challenge I suspect is that these vectors are expensive to generate 
> > >> so people have not gone all in on generating such large vectors for 
> > >> large datasets. They certainly have not made them easy to find. Here is 
> > >> the most promising but it is too small, probably:  
> > >> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > >>
> > >>  I'm still in and out of the office at the moment, but when I return, I 
> > >> can ask my em

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable; M is limited to <= 512, and maximum segment size also
helps limit merge costs

On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov  wrote:
>
> Thanks Kent - I tried something similar to what you did I think. Took
> a set of 256d vectors I had and concatenated them to make bigger ones,
> then shifted the dimensions to make more of them. Here are a few
> single-threaded indexing test runs. I ran all tests with M=16.
>
>
> 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> buffer size=1994)
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> increasing the vector dimension makes things take longer (scaling
> *linearly*) but doesn't lead to RAM issues. I think we could get to
> OOM while merging with a small heap and a large number of vectors, or
> by increasing M, but none of this has anything to do with vector
> dimensions. Also, if merge RAM usage is a problem I think we could
> address it by adding accounting to the merge process and simply not
> merging graphs when they exceed the buffer size (as we do with
> flushing).
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>
> On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch  wrote:
> >
> > Hi,
> > I have been testing Lucene with a custom vector similarity and loaded 192m 
> > vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
> >
> > As this was a performance test, the 192m vectors were derived by dithering 
> > 47k original vectors in such a way to allow realistic ANN evaluation of 
> > HNSW.  The original 47k vectors were generated by ada-002 on source 
> > newspaper article text.  After dithering, I used PQ to reduce their 
> > dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 
> > 1byte code, 512 code tables, each learnt to reduce total encoding error 
> > using Lloyds algorithm (hence the need for the custom similarity). BTW, 
> > HNSW retrieval was accurate and fast enough for the use case I was 
> > investigating as long as a machine with 128gb memory was available as the 
> > graph needs to be cached in memory for reasonable query rates.
> >
> > Anyway, if you want them, you are welcome to those 47k vectors of 1532 
> > floats which can be readily dithered to generate very large and realistic 
> > test vector sets.
> >
> > Best regards,
> >
> > Kent Fitch
> >
> >
> > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner,  
> > wrote:
> >>
> >> you might want to use SentenceBERT to generate vectors
> >>
> >> https://sbert.net
> >>
> >> whereas for example the model "all-mpnet-base-v2" generates vectors with 
> >> dimension 768
> >>
> >> We have SentenceBERT running as a web service, which we could open for 
> >> these tests, but because of network latency it should be faster running 
> >> locally.
> >>
> >> HTH
> >>
> >> Michael
> >>
> >>
> >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> >>
> >> I've started to look on the internet, and surely someone will come, but 
> >> the challenge I suspect is that these vectors are expensive to generate so 
> >> people have not gone all in on generating such large vectors for large 
> >> datasets. They certainly have not made them easy to find. Here is the most 
> >> promising but it is too small, probably:  
> >> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> >>
> >>  I'm still in and out of the office at the moment, but when I return, I 
> >> can ask my employer if they will sponsor a 10 million document collection 
> >> so that you can test with that. Or, maybe someone from work will see and 
> >> ask them on my behalf.
> >>
> >> Alternatively, next week, I may get some time to set up a server with an 
> >> open source LLM to generate the vectors. It still won't be free, but it 
> >> would be 99% cheaper than paying the LLM companies if we can be slow.
> >>
> >>
> >>
> >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner  
> >> wrote:
> >>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
Thanks Kent - I tried something similar to what you did I think. Took
a set of 256d vectors I had and concatenated them to make bigger ones,
then shifted the dimensions to make more of them. Here are a few
single-threaded indexing test runs. I ran all tests with M=16.


8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
buffer size=1994)
8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)

increasing the vector dimension makes things take longer (scaling
*linearly*) but doesn't lead to RAM issues. I think we could get to
OOM while merging with a small heap and a large number of vectors, or
by increasing M, but none of this has anything to do with vector
dimensions. Also, if merge RAM usage is a problem I think we could
address it by adding accounting to the merge process and simply not
merging graphs when they exceed the buffer size (as we do with
flushing).

Robert, since you're the only on-the-record veto here, does this
change your thinking at all, or if not could you share some test
results that didn't go the way you expected? Maybe we can find some
mitigation if we focus on a specific issue.
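
[The "add accounting to the merge process and skip over-budget merges" idea could look roughly like this; it is entirely hypothetical and does not reflect Lucene's actual merge-policy API:]

```java
public class MergeBudget {
    // Hypothetical sketch of budgeted merging: reserve estimated graph RAM before
    // starting a merge, and defer merges that would exceed the buffer (analogous
    // to how flushing respects the RAM buffer).
    private final long budgetBytes;
    private long usedBytes;

    public MergeBudget(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    /** Returns true and reserves the RAM if the merge fits; false means "defer". */
    public synchronized boolean tryReserve(long estimatedGraphBytes) {
        if (usedBytes + estimatedGraphBytes > budgetBytes) {
            return false;
        }
        usedBytes += estimatedGraphBytes;
        return true;
    }

    /** Called when a merge finishes and its graph has been written out. */
    public synchronized void release(long estimatedGraphBytes) {
        usedBytes -= estimatedGraphBytes;
    }

    public static void main(String[] args) {
        MergeBudget budget = new MergeBudget(1L << 30); // 1 GiB merge budget
        System.out.println(budget.tryReserve(600L << 20)); // true: 600 MiB fits
        System.out.println(budget.tryReserve(600L << 20)); // false: would exceed 1 GiB
        budget.release(600L << 20);
        System.out.println(budget.tryReserve(600L << 20)); // true again after release
    }
}
```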

On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch  wrote:
>
> Hi,
> I have been testing Lucene with a custom vector similarity and loaded 192m 
> vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
>
> As this was a performance test, the 192m vectors were derived by dithering 
> 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. 
>  The original 47k vectors were generated by ada-002 on source newspaper 
> article text.  After dithering, I used PQ to reduce their dimensionality from 
> 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code 
> tables, each learnt to reduce total encoding error using Lloyds algorithm 
> (hence the need for the custom similarity). BTW, HNSW retrieval was accurate 
> and fast enough for the use case I was investigating as long as a machine 
> with 128gb memory was available as the graph needs to be cached in memory for 
> reasonable query rates.
>
> Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats 
> which can be readily dithered to generate very large and realistic test 
> vector sets.
>
> Best regards,
>
> Kent Fitch
>
>
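
[A toy version of the product-quantization scheme Kent describes above: split each vector into 3-dim sub-vectors, learn a per-subspace codebook with Lloyd's k-means, and encode each sub-vector as a one-byte centroid id. Dimensions, centroid counts, and data here are shrunk stand-ins for illustration:]

```java
import java.util.Random;

public class TinyPQ {
    // Product-quantization sketch: one codebook per subspace, learned with
    // Lloyd's algorithm, each sub-vector encoded as its nearest centroid's id.
    final int subDim;
    final float[][][] codebooks; // [subspace][centroid][subDim]

    TinyPQ(float[][] train, int subDim, int k, int iters, long seed) {
        this.subDim = subDim;
        int numSub = train[0].length / subDim;
        codebooks = new float[numSub][][];
        Random rnd = new Random(seed);
        for (int s = 0; s < numSub; s++) {
            float[][] cents = new float[k][subDim];
            for (int c = 0; c < k; c++) { // init centroids from random training chunks
                System.arraycopy(train[rnd.nextInt(train.length)], s * subDim, cents[c], 0, subDim);
            }
            for (int it = 0; it < iters; it++) { // Lloyd's iterations: assign, then re-average
                float[][] sum = new float[k][subDim];
                int[] count = new int[k];
                for (float[] v : train) {
                    int best = nearest(cents, v, s);
                    count[best]++;
                    for (int d = 0; d < subDim; d++) sum[best][d] += v[s * subDim + d];
                }
                for (int c = 0; c < k; c++)
                    if (count[c] > 0)
                        for (int d = 0; d < subDim; d++) cents[c][d] = sum[c][d] / count[c];
            }
            codebooks[s] = cents;
        }
    }

    int nearest(float[][] cents, float[] v, int s) {
        int best = 0;
        float bestD = Float.MAX_VALUE;
        for (int c = 0; c < cents.length; c++) {
            float d2 = 0;
            for (int d = 0; d < subDim; d++) {
                float diff = v[s * subDim + d] - cents[c][d];
                d2 += diff * diff;
            }
            if (d2 < bestD) { bestD = d2; best = c; }
        }
        return best;
    }

    byte[] encode(float[] v) {
        byte[] codes = new byte[codebooks.length];
        for (int s = 0; s < codebooks.length; s++) codes[s] = (byte) nearest(codebooks[s], v, s);
        return codes;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        float[][] train = new float[200][12]; // toy: 12-d vectors -> 4 subspaces of 3 dims
        for (float[] v : train) for (int d = 0; d < v.length; d++) v[d] = rnd.nextFloat();
        TinyPQ pq = new TinyPQ(train, 3, 16, 5, 42);
        System.out.println(pq.encode(train[0]).length + " byte codes"); // 4 byte codes
    }
}
```

Scaled up to Kent's numbers (1536 floats, 512 subspaces of 3 dims, 256 centroids each), this is a 12x compression, which is what makes caching a 192m-vector graph in 128 GB feasible, at the cost of needing a custom similarity over the codes.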
> On Fri, 7 Apr 2023, 6:53 pm Michael Wechner,  
> wrote:
>>
>> you might want to use SentenceBERT to generate vectors
>>
>> https://sbert.net
>>
>> whereas for example the model "all-mpnet-base-v2" generates vectors with 
>> dimension 768
>>
>> We have SentenceBERT running as a web service, which we could open for these 
>> tests, but because of network latency it should be faster running locally.
>>
>> HTH
>>
>> Michael
>>
>>
>> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>>
>> I've started to look on the internet, and surely someone will come, but the 
>> challenge I suspect is that these vectors are expensive to generate so 
>> people have not gone all in on generating such large vectors for large 
>> datasets. They certainly have not made them easy to find. Here is the most 
>> promising but it is too small, probably:  
>> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>>
>>  I'm still in and out of the office at the moment, but when I return, I can 
>> ask my employer if they will sponsor a 10 million document collection so 
>> that you can test with that. Or, maybe someone from work will see and ask 
>> them on my behalf.
>>
>> Alternatively, next week, I may get some time to set up a server with an 
>> open source LLM to generate the vectors. It still won't be free, but it 
>> would be 99% cheaper than paying the LLM companies if we can be slow.
>>
>>
>>
>> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner  
>> wrote:
>>>
>>> Great, thank you!
>>>
>>> How much RAM; etc. did you run this test on?
>>>
>>> Do the vectors really have to be based on real data for testing the
>>> indexing?
>>> I understand, if you want to test the quality of the search results it
>>> does matter, but for testing the scalability itself it should not matter
>>> actually, right?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>>> > minutes with a single thread. I have some 256K vectors, but only about
>>> > 2M 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to lead to meaningless results

On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
 wrote:
>
>
>
> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > Well, I'm asking ppl to actually try to test using such high dimensions.
> > Based on my own experience, I consider it unusable. It seems other
> > folks may have run into trouble too. If the project committers can't
> > even really use vectors with such high dimension counts, then its not
> > in an OK state for users, and we shouldn't bump the limit.
> >
> > I'm happy to discuss/compromise etc, but simply bumping the limit
> > without addressing the underlying usability/scalability is a real
> > no-go,
>
> I agree that this needs to be addressed
>
>
>
> >   it is not really solving anything, nor is it giving users any
> > freedom or allowing them to do something they couldn't do before.
> > Because if it still doesn't work, it still doesn't work.
>
> I disagree, because it *does work* with "smaller" document sets.
>
> Currently we have to compile Lucene ourselves to not get the exception
> when using a model with vector dimension greater than 1024,
> which is of course possible, but not really convenient.
>
> As I wrote before, to resolve this discussion, I think we should test
> and address possible issues.
>
> I will try to stop discussing now :-) and instead try to understand
> better the actual issues. Would be great if others could join on this!
>
> Thanks
>
> Michael
>
>
>
> >
> > We all need to be on the same page, grounded in reality, not fantasy,
> > where if we set a limit of 1024 or 2048, that you can actually index
> > vectors with that many dimensions and it actually works and scales.
> >
> > On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> >  wrote:
> >> As I said earlier, a max limit limits usability.
> >> It's not forcing users with small vectors to pay the performance penalty 
> >> of big vectors, it's literally preventing some users from using 
> >> Lucene/Solr/Elasticsearch at all.
> >> As far as I know, the max limit is used to raise an exception, it's not 
> >> used to initialise or optimise data structures (please correct me if I'm 
> >> wrong).
> >>
> >> Improving the algorithm performance is a separate discussion.
> >> I don't see a correlation with the fact that indexing billions of whatever 
> >> dimensioned vector is slow with a usability parameter.
> >>
> >> What about potential users that need few high dimensional vectors?
> >>
> >> As I said before, I am a big +1 for NOT just raise it blindly, but I 
> >> believe we need to remove the limit or size it in a way it's not a problem 
> >> for both users and internal data structure optimizations, if any.
> >>
> >>
> >> On Wed, 5 Apr 2023, 18:54 Robert Muir,  wrote:
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> a few million vectors with 756 or 1024, which is allowed today.
> >>>
> >>> IMO based on how painful it is, it seems the limit is already too
> >>> high, I realize that will sound controversial but please at least try
> >>> it out!
> >>>
> >>> voting +1 without at least doing this is really the
> >>> "weak/unscientifically minded" approach.
> >>>
> >>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> >>>  wrote:
>  Thanks for your feedback!
> 
>  I agree, that it should not crash.
> 
>  So far we did not experience crashes ourselves, but we did not index
>  millions of vectors.
> 
>  I will try to reproduce the crash, maybe this will help us to move 
>  forward.
> 
>  Thanks
> 
>  Michael
> 
>  Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >> Can you describe your crash in more detail?
> > I can't. That experiment was a while ago and a quick test to see if I
> > could index rather large-ish USPTO (patent office) data as vectors.
> > Couldn't do it then.
> >
> >> How much RAM?
> > My indexing jobs run with rather smallish heaps to give space for I/O
> > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > I recall segment merging grew slower and slower and then simply
> > crashed. Lucene should work with low heap requirements, even if it
> > slows down. Throwing ram at the indexing/ segment merging problem
> > is... I don't know - not elegant?
> >
> > Anyway. My main point was to remind folks about how Apache works -
> > code is merged in when there are no vetoes. If Rob (or anybody else)
> > remains unconvinced, he or she can block the change. (I didn't invent
> > those rules).
> >
> > D.
> >
> > -
> > To unsubscribe, e-mail: dev-unsub

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them it will take longer, because it has
to build the graphs for the little segments, and then for the big one
when merging, and it will eventually use the same amount of RAM to
build the big graph, although I don't believe it will have to load the
vectors en masse into RAM while merging.
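A toy cost model can illustrate the point above. This is an assumption for illustration only, not Lucene's actual cost function: suppose building an HNSW graph over n vectors costs roughly n * log2(n) distance computations. Then indexing as many small segments plus a force merge pays for each small graph and then again for the merged one:

```java
// Toy cost model — an illustration only, NOT Lucene's actual cost function.
// Assume building an HNSW graph over n vectors costs roughly n * log2(n)
// distance computations.
public class MergeCostSketch {
    static double buildCost(long n) {
        return n * (Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        long total = 1_000_000L;
        int segments = 10;
        // One big flush: build the graph once.
        double oneShot = buildCost(total);
        // Ten small flushes plus a force merge: build ten small graphs,
        // then build the big merged graph anyway.
        double mergePath = segments * buildCost(total / segments) + buildCost(total);
        System.out.printf("one segment: %.0f, merge path: %.0f (%.2fx)%n",
                oneShot, mergePath, mergePath / oneShot);
    }
}
```

Under this toy model the merge path does not quite double the work; the real ratio depends on M, beam width, and how merges cascade, but the qualitative conclusion matches the email: one big flush is cheaper than small flushes plus a force merge.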

On Thu, Apr 6, 2023 at 10:20 AM Michael Wechner
 wrote:
>
> thanks very much for these insights!
>
> Does it make a difference re RAM when I do a batch import, for example
> import 1000 documents and close the IndexWriter and do a forceMerge or
> import 1Mio documents at once?
>
> I would expect so, or do I misunderstand this?
>
> Thanks
>
> Michael
>
>
>
> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> > re: how does this HNSW stuff scale - I think people are calling out
> > indexing memory usage here, so let's discuss some facts. During
> > initial indexing we hold in RAM all the vector data and the graph
> > constructed from the new documents, but this is accounted for and
> > limited by the size of IndexWriter's buffer; the document vectors and
> > their graph will be flushed to disk when this fills up, and at search
> > time, they are not read in wholesale to RAM. There is potentially
> > unbounded RAM usage during merging though, because the entire merged
> > graph will be built in RAM. I lost track of how we handle the vector
> > data now, but at least in theory it should be fairly straightforward
> > to write the merged vector data in chunks using only limited RAM. So
> > how much RAM does the graph use? It uses numdocs*fanout VInts.
> > Actually it doesn't really scale with the vector dimension at all -
> > rather it scales with the graph fanout (M) parameter and with the
> > total number of documents. So I think this focus on limiting the
> > vector dimension is not helping to address the concern about RAM usage
> > while merging.
> >
> > The vector dimension does have a strong role in the search, and
> > indexing time, but the impact is linear in the dimension and won't
> > exhaust any limited resource.
> >
> > On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
> >  wrote:
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway 
> >>> right?
> >> In fact we must accept all vetos by any committer as a veto, for a change 
> >> to Lucene's source code, regardless of that committer's reasoning.  This 
> >> is the power of Apache's model.
> >>
> >> Of course we all can and will work together to convince one another (this 
> >> is where the scientifically motivated part comes in) to change our votes, 
> >> one way or another.
> >>
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index a 
> >>> few million vectors with 756 or 1024, which is allowed today.
> >> +1, if the current implementation really does not scale / needs more and 
> >> more RAM for merging, let's understand what's going on here, first, before 
> >> increasing limits.  I rescind my hasty +1 for now!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti 
> >>  wrote:
> >>> Ok, so what should we do then?
> >>> This space is moving fast, and in my opinion we should act fast to 
> >>> release and guarantee we attract as many users as possible.
> >>>
> >>> At the same time I am not saying we should proceed blind, if there's 
> >>> concrete evidence for setting a limit rather than another, or that a 
> >>> certain limit is detrimental to the project, I think that veto should be 
> >>> valid.
> >>>
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway 
> >>> right?
> >>>
> >>> The problem I see is that more than voting we should first decide this 
> >>> limit and I don't know how we can operate.
> >>> I am imagining something like a poll where each entry is a limit + motivation, and 
> >>> PMCs maybe vote/add entries?
> >>>
> >>> Did anything similar happen in the past? How was the current limit added?
> >>>
> >>>
> >>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss,  wrote:

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.

The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
 wrote:
>
> > We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> In fact we must accept all vetos by any committer as a veto, for a change to 
> Lucene's source code, regardless of that committer's reasoning.  This is the 
> power of Apache's model.
>
> Of course we all can and will work together to convince one another (this is 
> where the scientifically motivated part comes in) to change our votes, one 
> way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to index a few 
> > million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more and more 
> RAM for merging, let's understand what's going on here, first, before 
> increasing limits.  I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti  
> wrote:
>>
>> Ok, so what should we do then?
>> This space is moving fast, and in my opinion we should act fast to release 
>> and guarantee we attract as many users as possible.
>>
>> At the same time I am not saying we should proceed blind, if there's 
>> concrete evidence for setting a limit rather than another, or that a certain 
>> limit is detrimental to the project, I think that veto should be valid.
>>
>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>
>> The problem I see is that more than voting we should first decide this limit 
>> and I don't know how we can operate.
>> I am imagining something like a poll where each entry is a limit + motivation, and 
>> PMCs maybe vote/add entries?
>>
>> Did anything similar happen in the past? How was the current limit added?
>>
>>
>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss,  wrote:
>>>
>>>

 Should create a VOTE thread, where we propose some values with a 
 justification and we vote?
>>>
>>>
>>> Technically, a vote thread won't help much if there's no full consensus - a 
>>> single veto will make the patch unacceptable for merging.
>>> https://www.apache.org/foundation/voting.html#Veto
>>>
>>> Dawid
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: question about impacts use case

2023-04-01 Thread Michael Sokolov
Well, digging a little deeper I can see that skipping behavior is
going to depend heavily on the distribution of documents in the index,
and how many skip levels there are and so on, and I may be getting
hung up on a particular test case that doesn't generalize.  In this
case all the high-scoring documents come early in the docid order (due
to our static index sort), so there are lots of possibilities for
skipping that may be unusual? One thing that occurred to me was that
when the Query writer knows that a child Query will always lead the
disjunction, they could possibly indicate that somehow - we could have
a UNION query or so that would process its child Queries in series and
then merge their results? Which would be a bad strategy in general,
but good when there is one high-scoring lead query that has few
results. But I am of course hoping this would just fall out of
WANDScorer as it is already dividing up head/tail queries ...

One thing that seems odd (I think, unless I'm confused! - very
possible) is that TermScorer reports its max score as the global max
once its iterator has been exhausted, when it seems it ought to report
0. I added a check for docID() == NO_MORE_DOCS in my wrapping Query to
assert this, and I can see it has some effect.

Anyway I am seeing *some* skipping, which is tantalizing.
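The effect of the exhausted scorer's reported max score can be sketched with plain arithmetic. This is a toy model of max-score pruning, not Lucene's WANDScorer: a block of documents can be skipped only when the sum of the clauses' reported max scores falls at or below the minimum competitive score, so an exhausted clause that keeps reporting its global max keeps every block "competitive":

```java
// Toy model of max-score pruning, NOT Lucene's WANDScorer. A block can be
// skipped only when the sum of per-clause max scores is <= the minimum
// competitive score.
public class MaxScoreSkipDemo {
    static float upperBound(float[] clauseMaxScores) {
        float sum = 0;
        for (float s : clauseMaxScores) sum += s;
        return sum;
    }

    public static void main(String[] args) {
        float minCompetitive = 50f;
        // Exhausted title clause still reporting its global max (100):
        float wrong = upperBound(new float[] {100f, 10f});
        // Exhausted title clause reporting 0 instead:
        float right = upperBound(new float[] {0f, 10f});
        System.out.println(wrong > minCompetitive);  // true  -> cannot skip
        System.out.println(right > minCompetitive);  // false -> block can be skipped
    }
}
```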

On Sat, Apr 1, 2023 at 10:00 AM Michael Sokolov  wrote:
>
> Hi, I've been working on seeing whether we can make use of impacts in
> Amazon search and I have some questions. To date, we haven't used
> Lucene's scoring APIs at all; all of our queries are constant score,
> we early terminate based on a sorted index rank and then re-rank using
> custom non-Lucene ranking models. There is now an opportunity (some
> early ranking models have gotten simplified) for us to move some of
> the ranking workload into Lucene where we should be able to benefit
> from skipping hits via impacts.
>
> I'm struggling with a typical query (not our actual setup, but
> illustrates the functional gap) that is an OR-query something like:
>
> title:Harry_Potter_and_the_sorcerers_stone^100 (+fulltext:harry
> +fulltext:potter +sorcerer +stone)
>
> Suppose there is only one document with that title, but a few dozen
> match all the individual terms. The one-word terms occur frequently in
> the fulltext field, but the title only once, yet it is a "high impact"
> term from the point of view of the query score. We don't index impacts
> for a term when docFreq < 128. This means we will never be able to
> skip low-scoring documents for this query, assuming that the score of
> the fulltext clause will always be much less than the score from the
> exact title match (which is by design - we always want exact title
> matches to rank highly). Even when min-competitive-score is for a
> document that has each word twice, we still can't skip documents where
> they only occur once, because the maximum score for the title scorer
> is the maximum *over the whole index* -- basically the scorer is
> thinking there might be another exact title match somewhere deeper in
> the index *even though its postings have already been exhausted*.
>
> I have only just started to look at the impacts code and don't have
> any clear idea whether this is difficult to fix, or whether I may have
> misconfigured something, but thought I would ask here to see if anyone
> has any idea. Things I did check:
>
> - the query is running in TOP_SCORES mode
> - the collector is calling Scorer.setMinimumScore with a low score,
> and subsequently collecting all matching hits even though their scores
> are all lower than the min
> - the title impacts is represented by SlowImpactsEnum
>
> One thing that may be relevant is that I am using a custom
> Query/Weight/Scorer wrapping the two clauses in order to modify their
> scores, because I am trying to mimic a pre-existing scoring function.
> These apply a linear function with an offset, scale and a maximum
> ceiling (so can't be done just with boosts as shown above). This
> Scorer implements score/getMaxScore by applying its modifications to
> the underlying scores, setMinCompetitiveScore basically inverts that,
> and advanceShallow delegates to the inner Scorer. I didn't implement
> anything around BulkScorer - maybe that's a gap?
>
> any pointers appreciated!




question about impacts use case

2023-04-01 Thread Michael Sokolov
Hi, I've been working on seeing whether we can make use of impacts in
Amazon search and I have some questions. To date, we haven't used
Lucene's scoring APIs at all; all of our queries are constant score,
we early terminate based on a sorted index rank and then re-rank using
custom non-Lucene ranking models. There is now an opportunity (some
early ranking models have gotten simplified) for us to move some of
the ranking workload into Lucene where we should be able to benefit
from skipping hits via impacts.

I'm struggling with a typical query (not our actual setup, but
illustrates the functional gap) that is an OR-query something like:

title:Harry_Potter_and_the_sorcerers_stone^100 (+fulltext:harry
+fulltext:potter +sorcerer +stone)

Suppose there is only one document with that title, but a few dozen
match all the individual terms. The one-word terms occur frequently in
the fulltext field, but the title only once, yet it is a "high impact"
term from the point of view of the query score. We don't index impacts
for a term when docFreq < 128. This means we will never be able to
skip low-scoring documents for this query, assuming that the score of
the fulltext clause will always be much less than the score from the
exact title match (which is by design - we always want exact title
matches to rank highly). Even when min-competitive-score is for a
document that has each word twice, we still can't skip documents where
they only occur once, because the maximum score for the title scorer
is the maximum *over the whole index* -- basically the scorer is
thinking there might be another exact title match somewhere deeper in
the index *even though its postings have already been exhausted*.

I have only just started to look at the impacts code and don't have
any clear idea whether this is difficult to fix, or whether I may have
misconfigured something, but thought I would ask here to see if anyone
has any idea. Things I did check:

- the query is running in TOP_SCORES mode
- the collector is calling Scorer.setMinimumScore with a low score,
and subsequently collecting all matching hits even though their scores
are all lower than the min
- the title impacts is represented by SlowImpactsEnum

One thing that may be relevant is that I am using a custom
Query/Weight/Scorer wrapping the two clauses in order to modify their
scores, because I am trying to mimic a pre-existing scoring function.
These apply a linear function with an offset, scale and a maximum
ceiling (so can't be done just with boosts as shown above). This
Scorer implements score/getMaxScore by applying its modifications to
the underlying scores, setMinCompetitiveScore basically inverts that,
and advanceShallow delegates to the inner Scorer. I didn't implement
anything around BulkScorer - maybe that's a gap?

any pointers appreciated!
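For illustration, the score transform described above might look like the following. The form and the offset/scale/ceiling values are assumptions, not the actual production scoring function; note the ceiling makes the transform non-invertible at the top, which is exactly where a setMinCompetitiveScore inversion gets tricky:

```java
// Assumed form of the transform described above (offset/scale/ceiling values
// are hypothetical): transformed = min(ceiling, offset + scale * raw).
public class ScoreTransform {
    static float transform(float raw, float offset, float scale, float ceiling) {
        return Math.min(ceiling, offset + scale * raw);
    }

    // Inverse, as setMinCompetitiveScore would need; only valid below the
    // ceiling, since every raw score >= (ceiling - offset) / scale maps to it.
    static float invert(float transformed, float offset, float scale) {
        return (transformed - offset) / scale;
    }

    public static void main(String[] args) {
        float offset = 5f, scale = 2f, ceiling = 100f;
        System.out.println(transform(30f, offset, scale, ceiling)); // 65.0
        System.out.println(invert(65f, offset, scale));             // 30.0
        System.out.println(transform(60f, offset, scale, ceiling)); // clipped to 100.0
    }
}
```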




Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-01 Thread Michael Sokolov
I'm also in favor of raising this limit. We do see some datasets with
higher than 1024 dims. I also think we need to keep a limit. For example we
currently need to keep all the vectors in RAM while indexing and we want to
be able to support reasonable numbers of vectors in an index segment. Also
we don't know what innovations might come down the road. Maybe someday we
want to do product quantization and enforce that (k, m) both fit in a byte
-- we wouldn't be able to do that if a vector's dimension were to exceed
32K.

On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti 
wrote:

> I am also curious what would be the worst-case scenario if we remove the
> constant at all (so automatically the limit becomes the Java
> Integer.MAX_VALUE).
> i.e.
> right now if you exceed the limit you get:
>
> >> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
> >>   throw new IllegalArgumentException(
> >>       "cannot index vectors with dimension greater than "
> >>           + ByteVectorValues.MAX_DIMENSIONS);
> >> }
>
>
> in relation to:
>
>> These limits allow us to
>> better tune our data structures, prevent overflows, help ensure we
>> have good test coverage, etc.
>
>
> I agree 100% especially for typing stuff properly and avoiding resource
> waste here and there, but I am not entirely sure this is the case for the
> current implementation i.e. do we have optimizations in place that assume
> the max dimension to be 1024?
> If I missed that (and I likely have), I of course suggest the contribution
> should not just blindly remove the limit, but do it appropriately.
> I am not in favor of just doubling it as suggested by some people, I would
> ideally prefer a solution that remains there to a decent extent, rather
> than having to modifying it anytime someone requires a higher limit.
>
> Cheers
>
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io 
> LinkedIn  | Twitter
>  | Youtube
>  | Github
> 
>
>
> On Fri, 31 Mar 2023 at 16:12, Michael Wechner 
> wrote:
>
>> OpenAI reduced their size to 1536 dimensions
>>
>> https://openai.com/blog/new-and-improved-embedding-model
>>
>> so 2048 would work :-)
>>
>> but other services do provide also higher dimensions with sometimes
>> slightly better accuracy
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>> > I'm supportive of bumping the limit on the maximum dimension for
>> > vectors to something that is above what the majority of users need,
>> > but I'd like to keep a limit. We have limits for other things like the
>> > max number of docs per index, the max term length, the max number of
>> > dimensions of points, etc. and there are a few things that we don't
>> > have limits on that I wish we had limits on. These limits allow us to
>> > better tune our data structures, prevent overflows, help ensure we
>> > have good test coverage, etc.
>> >
>> > That said, these other limits we have in place are quite high. E.g.
>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>> > Likewise for the max of 8 dimensions for points: a segment cannot
>> > possibly have 2 splits per dimension on average if it doesn't have
>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>> > than 8 would likely defeat the point of indexing. In contrast, our
>> > limit on the number of dimensions of vectors seems to be under what
>> > some users would like, and while I understand the performance argument
>> > against bumping the limit, it doesn't feel to me like something that
>> > would be so bad that we need to prevent users from using numbers of
>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>> > look at a very small subset of the full dataset.
>> >
>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>> > Mayya on the issue that you linked.
>> >
>> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
>> >  wrote:
>> >> Thanks Alessandro for summarizing the discussion below!
>> >>
>> >> I understand that there is no clear reasoning re what is the best
>> embedding size, whereas I think heuristic approaches like described by the
>> following link can be helpful
>> >>
>> >>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>> >>
>> >> Having said this, we see various embedding services providing higher
>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>> >>
>> >> And it would be great if we could run benchmarks without having to
>> recompile Lucene ourselves.
>> >>
>> >> Therefore I would suggest to either increase the limit or e

Re: [GitHub] [lucene] david-sitsky commented on issue #12185: Using DirectIODirectory results in BufferOverflowException

2023-03-22 Thread Michael Sokolov
Using DirectIO with NFS makes no sense at all to me; I think that is the
problem in a nutshell. DirectIO tries to bypass the operating system's
buffers, but that's not going to play nicely with NFS.

On Wed, Mar 22, 2023, 4:38 PM david-sitsky (via GitHub) 
wrote:

>
> david-sitsky commented on issue #12185:
> URL: https://github.com/apache/lucene/issues/12185#issuecomment-1480390389
>
>As an aside, in some standard benchmark tests I run with our product, I
> have found the final optimisation of Lucene indexes after all the data has
> been indexed took 36 seconds with NIO, but 148 seconds with NIO+DirectIO
> enabled.  For mmap, optimisation took 30 seconds but 100 seconds with
> DirectIO was enabled.  So it is odd the use-case DirectIO was meant to
> speed up actually seemed to be slower..
>
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
>
> To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
>
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
>
>
> -
> To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
> For additional commands, e-mail: issues-h...@lucene.apache.org
>
>


Re: Welcome Ben Trent as Lucene committer

2023-01-27 Thread Michael Sokolov
Welcome, Ben! Congratulations

On Fri, Jan 27, 2023 at 4:52 PM Anshum Gupta  wrote:
>
> Congratulations and welcome, Ben!
>
> On Fri, Jan 27, 2023 at 7:18 AM Adrien Grand  wrote:
>>
>> I'm pleased to announce that Ben Trent has accepted the PMC's
>> invitation to become a committer.
>>
>> Ben, the tradition is that new committers introduce themselves with a
>> brief bio.
>>
>> Congratulations and welcome!
>>
>> --
>> Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
> --
> Anshum Gupta




Re: Is there a way to customize segment names?

2022-12-16 Thread Michael Sokolov
+1. Trying to coordinate multiple writers running independently will
not work. My 2c for availability: you can have a single primary active
writer with a backup one waiting, receiving all the segments from the
primary. Then if the primary goes down, the secondary one has the most
recent commit replicated from the primary (identical commit, same
segments etc) and can pick up from there. You would need a mechanism
to replay the writes the primary never had a chance to commit.
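The handoff described above can be sketched as follows. This is a toy model; the op log and commit-point bookkeeping are hypothetical, not lucene/replicator APIs:

```java
import java.util.List;

// Toy sketch of the primary/standby handoff above — the op log and
// commit-point bookkeeping are hypothetical, not lucene/replicator APIs.
public class FailoverSketch {
    // The standby replays only the operations after the last commit that was
    // replicated from the primary.
    static List<String> replayFrom(List<String> opLog, int lastCommittedOp) {
        return opLog.subList(lastCommittedOp, opLog.size());
    }

    public static void main(String[] args) {
        List<String> opLog = List.of("add:1", "add:2", "add:3", "add:4");
        int lastCommittedOp = 2; // ops 1..2 are in the replicated commit
        System.out.println(replayFrom(opLog, lastCommittedOp)); // [add:3, add:4]
    }
}
```

The key property is the one the email calls out: because the standby's last commit is byte-identical to the primary's, recovery reduces to replaying only the uncommitted tail.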

On Fri, Dec 16, 2022 at 5:41 AM Robert Muir  wrote:
>
> You are still talking "Multiple writers". Like i said, going down this
> path (playing tricks with filenames) isn't going to work out well.
>
> On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
> >
> > Hi Robert,
> >
> > Maybe I didn't explain it clearly but we're not going to constantly switch
> > between writers or share effort between writers, it's purely for
> > availability: the second writer only kicks in when the first writer is not
> > available for some reason.
> > And as far as I know the replicator/nrt module has not provided a solution
> > on when the primary node (main indexer) is down, how would we recover with
> > a back up indexer?
> >
> > Thanks
> > Patrick
> >
> >
> > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
> >
> > > This multiple-writer isn't going to work and customizing names won't
> > > allow it anyway. Each file also contains a unique identifier tied to
> > > its commit so that we know everything is intact.
> > >
> > > I would look at the segment replication in lucene/replicator and not
> > > try to play games with files and mixing multiple writers.
> > >
> > > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
> > > >
> > > > Hi Folks,
> > > >
> > > > We're trying to build a search architecture using segment replication
> > > (indexer and searcher are separated and indexer shipping new segments to
> > > searchers) right now and one of the problems we're facing is: for
> > > availability reason we need to have multiple indexers running, and when 
> > > the
> > > searcher is switching from consuming one indexer to another, there are
> > > chances where the segment names collide with each other (because segment
> > > names are count based) and the searcher have to reload the whole index.
> > > > To avoid that we're looking for a way to name the segments so that
> > > Lucene is able to tell the difference and load only the difference (by
> > > calling `openIfChanged`). I've checked the IndexWriter and the
> > > DocumentsWriter and it seems it is controlled by a private final method
> > > `newSegmentName()` so likely not possible there. So I wonder whether
> > > there's any other ways people are aware of that can help control the
> > > segment names?
> > > >
> > > > A example of the situation described above:
> > > > Searcher previously consuming from indexer 1, and have following
> > > segments: _1, _2, _3, _4
> > > > Indexer 2 previously sync'd from indexer 1, sharing the first 3
> > > segments, and produced its own 4th segment (denoted as _4', but it 
> > > shares
> > > the same "_4" name): _1, _2, _3, _4'
> > > > Suddenly Indexer 1 dies and searcher switched from Indexer 1 to Indexer
> > > 2, then when it finished downloading the segments and trying to refresh 
> > > the
> > > reader, it will likely hit the exception here, and seems all we can do
> > > right now is to reload the whole index and that could be potentially a 
> > > high
> > > cost.
> > > >
> > > > Sorry for the long email and thank you in advance for any replies!
> > > >
> > > > Best
> > > > Patrick
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> > >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
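The collision scenario in the thread above can be sketched in a few lines of purely illustrative Java. The data model here (segment name mapped to a stand-in content id) is made up for the example and is not Lucene's actual internals; it only shows why count-based names make an incremental `openIfChanged`-style reopen unsafe once two indexers diverge:

```java
import java.util.HashMap;
import java.util.Map;

// Two indexers that diverge after sharing a prefix of segments will both
// generate the next segment name from the same counter, so "_4" from
// indexer 1 and "_4" from indexer 2 are different data under the same name.
public class SegmentNameCollision {

    // Each commit point is modeled as (segment name -> content id), where the
    // content id stands in for the actual bytes on disk.
    static boolean needsFullReload(Map<String, String> current, Map<String, String> incoming) {
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            String existing = current.get(e.getKey());
            if (existing != null && !existing.equals(e.getValue())) {
                return true; // same name, different data: incremental reopen unsafe
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> searcher = new HashMap<>(Map.of(
            "_1", "a", "_2", "b", "_3", "c", "_4", "d"));   // segments from indexer 1
        Map<String, String> indexer2 = Map.of(
            "_1", "a", "_2", "b", "_3", "c", "_4", "d2");   // _4' reuses the name "_4"
        System.out.println(needsFullReload(searcher, indexer2)); // prints "true"
    }
}
```

If the two commits shared only unique, non-colliding names past the common prefix, the same check would return false and the shared segments could be reused.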



Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-18 Thread Michael Sokolov
(I don't really believe the .asc files are broken; probably a local
gpg problem I don't understand)

SUCCESS! [0:44:08.338731]
+1 from me

On Fri, Nov 18, 2022 at 10:18 AM Uwe Schindler  wrote:
>
> Hi,
>
> the second build succeeded. I really think it was another job running at the 
> same time that also tried to communicate with GPG and used another home dir.
>
> Log: https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console
>
> SUCCESS! [1:43:46.817984]
> Finished: SUCCESS
>
> After Jenkins finished the job, it killed all child processes and all agents 
> are gone.
>
> In the meantime I also did some manual checks: Running Luke from Windows with 
> whitespace in the directory worked and I was able to open my test index. I also 
> started with Java 19 and --enable-preview, and the Luke log showed that it uses 
> the new MMapDirectory impl.
>
> I correct my previous vote: ++1 to release. 😁
>
> Uwe
>
> Am 18.11.2022 um 16:06 schrieb Uwe Schindler:
>
> I had also seen this message. My guess: another build was running in Jenkins 
> that also spawned an agent with a different home dir! I think Robert already 
> talked about this. We should kill the agents before/after we have used them.
>
> Uwe
>
> Am 18.11.2022 um 15:47 schrieb Adrien Grand:
>
> Reading Uwe's error message more carefully, I had first assumed that the GPG 
> failure was due to the lack of an ultimately trusted signature, but it seems 
> like it's due to "can't connect to the agent: IPC connect call failed" 
> actually, which suggests an issue with the GPG agent?
>
> On Fri, Nov 18, 2022 at 3:00 PM Michael Sokolov  wrote:
>>
>> I got this message when initially downloading the artifacts:
>>
>> Downloading 
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
>> File: 
>> /tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucene-9.4.2-src.tgz.gpg.verify.log
>> verify trust
>>   GPG: gpg: WARNING: This key is not certified with a trusted signature!
>>
>> is it related?
>>
>> On Fri, Nov 18, 2022 at 8:43 AM Uwe Schindler  wrote:
>> >
>> > The problem is: it has been working like this for years - the 9.4.1 release 
>> > worked fine. No change!
>> >
>> > And I can't configure this because GPG uses its own home directory setup 
>> > by smoke tester (see paths below). So it should not look anywhere else? In 
>> > addition "gpg: no ultimately trusted keys found" is just a warning, it 
>> > should not cause gpg to exit.
>> >
>> > Also, why does it only happen at the Maven step? It checks signatures 
>> > before, too. This is why I restarted the build: 
>> > https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console (still 
>> > running)
>> >
>> > Uwe
>> >
>> > Am 18.11.2022 um 14:21 schrieb Adrien Grand:
>> >
>> > Uwe, the error message suggests that Policeman Jenkins is not ultimately 
>> > trusting any of the keys. Does it work if you configure it to ultimately 
>> > trust your "Uwe Schindler (CODE SIGNING KEY) " key 
>> > (which I assume you would be ok with)?
>> >
>> > On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler  wrote:
>> >>
>> >> I am restarting the build, maybe it was some hiccup. Interestingly it 
>> >> only failed for the Maven dependencies. P.S.: Why does it import the key 
>> >> file over and over? It would be enough to do this once at the beginning of 
>> >> the smoker.
>> >>
>> >> Uwe
>> >>
>> >> Am 18.11.2022 um 14:12 schrieb Uwe Schindler:
>> >>
>> >> Hi,
>> >>
>> >> I get a failure because your key is somehow rejected by GPG (Ubuntu 
>> >> 22.04):
>> >>
>> >> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/24/console
>> >>
>> >> verify maven artifact sigs command "gpg --homedir 
>> >> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg 
>> >> --import /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/KEYS" 
>> >> failed: gpg: keybox 
>> >> '/home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/pubring.kbx'
>> >>  created gpg: 
>> >> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/trustdb.gpg:
>> >>  trustdb created gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley 
>> >> " impo

Re: [GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

2022-11-18 Thread Michael Sokolov
What I have in mind would be to implement entirely in the
KnnVectorQuery. Since results are sorted by score, they can easily be
post-filtered there: no need to implement anything at the codec layer
I think.
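The post-filtering idea can be sketched as follows. This is illustrative Java, not an actual Lucene API: since k-NN results come back sorted by score descending, applying a similarity threshold is just a prefix cut over the top-k results, with no codec-level change:

```java
import java.util.Arrays;

// Post-filter k-NN results by a similarity threshold. Because the scores
// arrive sorted descending, everything at or above the threshold forms a
// contiguous prefix of the array.
public class ScoreThresholdFilter {

    // Keep only the leading results whose score is >= minScore.
    static float[] filterByThreshold(float[] sortedScoresDesc, float minScore) {
        int keep = 0;
        while (keep < sortedScoresDesc.length && sortedScoresDesc[keep] >= minScore) {
            keep++;
        }
        return Arrays.copyOf(sortedScoresDesc, keep);
    }

    public static void main(String[] args) {
        float[] topK = {0.95f, 0.81f, 0.64f, 0.40f, 0.12f}; // scores from a k-NN search
        System.out.println(Arrays.toString(filterByThreshold(topK, 0.5f)));
        // prints "[0.95, 0.81, 0.64]"
    }
}
```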

On Thu, Nov 17, 2022 at 10:10 AM GitBox  wrote:
>
>
> rmuir commented on PR #11946:
> URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318777402
>
>i'm also concerned about committing to providing this API for the future. 
> eventually, we'll move away from HNSW to something that actually scales, and 
> it may not support this thresholding?
>
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
>
> To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
>
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
>
>
> -
> To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
> For additional commands, e-mail: issues-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-18 Thread Michael Sokolov
I got this message when initially downloading the artifacts:

Downloading 
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
File: 
/tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucene-9.4.2-src.tgz.gpg.verify.log
verify trust
  GPG: gpg: WARNING: This key is not certified with a trusted signature!

is it related?

On Fri, Nov 18, 2022 at 8:43 AM Uwe Schindler  wrote:
>
> The problem is: it has been working like this for years - the 9.4.1 release 
> worked fine. No change!
>
> And I can't configure this because GPG uses its own home directory setup by 
> smoke tester (see paths below). So it should not look anywhere else? In 
> addition "gpg: no ultimately trusted keys found" is just a warning, it should 
> not cause gpg to exit.
>
> Also, why does it only happen at the Maven step? It checks signatures 
> before, too. This is why I restarted the build: 
> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console (still 
> running)
>
> Uwe
>
> Am 18.11.2022 um 14:21 schrieb Adrien Grand:
>
> Uwe, the error message suggests that Policeman Jenkins is not ultimately 
> trusting any of the keys. Does it work if you configure it to ultimately 
> trust your "Uwe Schindler (CODE SIGNING KEY) " key 
> (which I assume you would be ok with)?
>
> On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler  wrote:
>>
>> I am restarting the build, maybe it was some hiccup. Interestingly it only 
>> failed for the Maven dependencies. P.S.: Why does it import the key file 
>> over and over? It would be enough to do this once at the beginning of the smoker.
>>
>> Uwe
>>
>> Am 18.11.2022 um 14:12 schrieb Uwe Schindler:
>>
>> Hi,
>>
>> I get a failure because your key is somehow rejected by GPG (Ubuntu 22.04):
>>
>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/24/console
>>
>> verify maven artifact sigs command "gpg --homedir 
>> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg --import 
>> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/KEYS" failed: gpg: 
>> keybox 
>> '/home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/pubring.kbx'
>>  created gpg: 
>> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/trustdb.gpg:
>>  trustdb created gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley 
>> " imported gpg: can't connect to the agent: IPC connect 
>> call failed gpg: key E48025ED13E57FFC: public key "Upayavira 
>> " imported [...] gpg: key 051A0FAF76BC6507: public key 
>> "Adrien Grand (CODE SIGNING KEY) " imported [...] gpg: 
>> key 32423B0E264B5CBA: public key "Julie Tibshirani (New code signing key) 
>> " imported gpg: Total number processed: 62 gpg: 
>> imported: 62 gpg: no ultimately trusted keys found
>> It looks like for others it succeeds? No idea why. Maybe Ubuntu 22.04 has a 
>> too-new GPG or it needs to use gpg2?
>>
>> -1 to release until this is sorted out.
>>
>> Uwe
>>
>> Am 17.11.2022 um 15:18 schrieb Adrien Grand:
>>
>> Please vote for release candidate 1 for Lucene 9.4.2
>>
>> The artifacts can be downloaded from:
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>>
>> The vote will be open for at least 72 hours i.e. until 2022-11-20 15:00 UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1.
>>
>> --
>> Adrien
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>
>
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: HNSW search with threshold

2022-11-11 Thread Michael Sokolov
I think it's fine to warn about this, but in general large values of K
will increase cost, with or without thresholding, so this is not a new
thing to warn about.

On Thu, Nov 10, 2022 at 5:50 AM Adrien Grand  wrote:
>
> That would work for me, though this is something that I would like to be 
> documented as not recommended.
>
> On Thu, Nov 10, 2022 at 2:33 PM Alexey Gorlenko  wrote:
>>
>> I think we can support both parameters: k and threshold. And if we need to 
>> get all docs by the threshold, we will just set k == Integer.MAX_VALUE.
>>
>> чт, 10 нояб. 2022 г. в 12:43, Adrien Grand :
>>>
>>> I wonder if it would actually be a good idea to support filtering _only_ 
>>> based on distance. In the worst case scenario, this may require traversing 
>>> the whole HNSW graph and would run in linear time with the number of 
>>> vectors, with a high constant factor since we'd need to compute a distance 
>>> for every vector? I imagine that this would only make sense for low values 
>>> of the radius, so that few vectors would match, but this looks to me like 
>>> it would be hard to predict whether a given radius would actually match a 
>>> small set of vectors. Should the query still require a `k` value in 
>>> addition to the radius to make sure it doesn't go wild?
>>>
>>> On Tue, Nov 8, 2022 at 7:26 AM Alexey Gorlenko  wrote:
>>>>
>>>> Thanks, Michael!
>>>> Yes, I will try.
>>>>
>>>> вт, 8 нояб. 2022 г. в 03:31, Michael Sokolov :
>>>>>
>>>>> +1 to adding a scoring threshold. I think it could be another
>>>>> parameter to KnnVectorQuery. Do you want to have a try at adding this?
>>>>> If so, please feel free to open a PR and I will be happy to guide you.
>>>>>
>>>>> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko  
>>>>> wrote:
>>>>> >
>>>>> > Hi!
>>>>> >
>>>>> > There are some use cases where we need to find vectors with the 
>>>>> > distance (by some metric) to the given vector V less than the given 
>>>>> > threshold T. That task is very similar to the knn problem, but in this 
>>>>> > case we don't have a quantity of the nearest neighbours k.
>>>>> >
>>>>> > As I see, the current implementation of knn doesn't provide such 
>>>>> > functionality. But at the first glance it is not very difficult to 
>>>>> > modify the method search of HnswGraph to implement that feature (do not 
>>>>> > limit result size and get rid of candidates which exceed threshold).
>>>>> >
>>>>> > But maybe that idea has some not obvious problems which I haven't 
>>>>> > noticed, and in reality an implementation of that idea would have 
>>>>> > fundamental difficulties?
>>>>> >
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>
>>>
>>>
>>> --
>>> Adrien
>
>
>
> --
> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
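Adrien's worst-case concern above can be illustrated with a toy model. The graph, distances, and names here are made up for the example and are not Lucene's HNSW implementation: a pure radius search keeps expanding any neighbor within the radius, so a generous radius visits the entire graph (one distance computation per vector), while a k cap bounds the work:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy radius search over a proximity graph: expand from the entry point,
// collecting every reachable node within the radius, stopping early once
// k hits have been gathered.
public class RadiusSearch {

    static List<Integer> search(Map<Integer, List<Integer>> graph, float[] dist,
                                int entry, float radius, int k) {
        List<Integer> hits = new ArrayList<>();
        Set<Integer> visited = new HashSet<>();
        Deque<Integer> frontier = new ArrayDeque<>(List.of(entry));
        visited.add(entry);
        while (!frontier.isEmpty() && hits.size() < k) {
            int node = frontier.poll();
            if (dist[node] <= radius) {        // one distance check per visited node
                hits.add(node);
                for (int nb : graph.getOrDefault(node, List.of())) {
                    if (visited.add(nb)) {
                        frontier.add(nb);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Chain 0-1-2-3-4; dist[i] is a precomputed distance to the query.
        Map<Integer, List<Integer>> graph = Map.of(
            0, List.of(1), 1, List.of(0, 2), 2, List.of(1, 3),
            3, List.of(2, 4), 4, List.of(3));
        float[] dist = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
        System.out.println(search(graph, dist, 0, 1.0f, Integer.MAX_VALUE)); // whole graph
        System.out.println(search(graph, dist, 0, 1.0f, 2));                 // bounded by k
    }
}
```

With radius 1.0 every node matches and the unbounded call touches all five vectors; with k = 2 the traversal stops after two hits, which is exactly the bounding role a mandatory `k` would play alongside a radius.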



Re: Release Lucene 9.4.2

2022-11-11 Thread Michael Sokolov
+1 makes sense. I do think given this is the second similar-flavored
bug we've found that we should be thorough and try to get them all
rather than having a 9.4.3 ...

On Wed, Nov 9, 2022 at 10:25 AM Julie Tibshirani  wrote:
>
> +1 from me for a bugfix release once we've solidified testing. Thanks to 
> everyone working on improving tests and static analysis -- this now is our 
> second time encountering a bad arithmetic bug and it's important to get ahead 
> of these issues!
>
> Julie
>
> On Wed, Nov 9, 2022 at 8:26 AM Robert Muir  wrote:
>>
>> Thank you Adrien!
>>
>> I created an issue for the static analysis piece, but I'm not
>> currently working on it yet. This could be a fun one, if anyone is
>> interested, to flush a bunch of these bugs out at once:
>> https://github.com/apache/lucene/issues/11910
>>
>> On Wed, Nov 9, 2022 at 10:48 AM Adrien Grand  wrote:
>> >
>> > Totally Robert, I was not trying to add any time pressure, next week is 
>> > totally fine. I mostly wanted to get the discussion started because folks 
>> > sometimes have one or two bug fixes they'd like to fold into a bugfix 
>> > release so I wanted to give them time to plan. Friday is also a public 
>> > holiday here, celebrating the end of World War 1. :)
>> >
>> > On Wed, Nov 9, 2022 at 4:41 PM Robert Muir  wrote:
>> >>
>> >> Can we please have a few days to improve the test situation? I think
>> >> we need to beef up checkindex to exercise seek() on the vectors, also
>> >> we need to look at static analysis to try to find other similar bugs.
>> >> This would help prevent "whack-a-mole" and improve correctness going 
>> >> forwards.
>> >>
>> >> I want to help more but it's difficult timing-wise, lots of stuff
>> >> going on this week, and in my country friday is Veteran's Day holiday.
>> >>
>> >> On Wed, Nov 9, 2022 at 10:39 AM Adrien Grand  wrote:
>> >> >
>> >> > Hello all,
>> >> >
>> >> > A bad integer overflow has been discovered in the KNN vectors format, 
>> >> > which affects segments that have more than ~16M vectors. I'd like to do 
>> >> > a bugfix release when the bug is fixed and we have a test for such 
>> >> > large datasets of KNN vectors. I volunteer to be the RM for this 
>> >> > release.
>> >> >
>> >> > --
>> >> > Adrien
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> >
>> >
>> > --
>> > Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
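The general shape of the arithmetic bug discussed above (a sketch under assumptions; this is not the actual Lucene code, just the generic int-times-int overflow pattern that bites once a per-vector size multiplied by a vector ordinal exceeds Integer.MAX_VALUE):

```java
// Computing a file offset as int * int silently wraps before the result is
// widened to long; casting one operand to long first fixes it.
public class OffsetOverflow {

    static long brokenOffset(int ord, int bytesPerVector) {
        return ord * bytesPerVector;            // int math wraps, then widens
    }

    static long fixedOffset(int ord, int bytesPerVector) {
        return (long) ord * bytesPerVector;     // widen first, then multiply
    }

    public static void main(String[] args) {
        int ord = 17_000_000;                   // a large segment of vectors
        int bytesPerVector = 128 * Float.BYTES; // e.g. 128 dims * 4 bytes = 512
        System.out.println(brokenOffset(ord, bytesPerVector)); // wrapped, wrong value
        System.out.println(fixedOffset(ord, bytesPerVector));  // prints 8704000000
    }
}
```

`Math.multiplyExact((long) ord, bytesPerVector)` is another option when an explicit failure is preferable to a silent wrap; static analysis of exactly this pattern is what https://github.com/apache/lucene/issues/11910 proposes.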


