Re: Query about the GitHub statistics for Lucene

2024-03-05 Thread Robert Muir
On Tue, Mar 5, 2024 at 4:50 AM Chris Hegarty
 wrote:
> It appears that there is no GH activity for 2024! Clearly this is incorrect. 
> I’ve yet to track down what’s going on with this. Familiar to anyone here?
>

Last time I looked at this, it appeared to be looking at the incorrect
GitHub repositories, for example https://github.com/apache/lucene-solr
and not https://github.com/apache/lucene

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.10.0 RC1

2024-02-15 Thread Robert Muir
On Thu, Feb 15, 2024 at 9:54 AM Uwe Schindler  wrote:
>
> Hi,
>
> My Python knowledge is too limited to fix the build script to allow testing 
> the smoker with arbitrary JAVA_HOME directories next to the baseline (Java 
> 11). With lots of copy-paste I can make it run on Java 21 in addition to 17, 
> but that looks too inflexible.
>
> Mike McCandless: If you could help me to make it more flexible, it would be 
> good. I can open an issue, but if you have an easy solution. I think of the 
> following:
>
> JAVA_HOME must be Java 11 (in 9.x)
> At the moment you can pass "--test-java17 <path>", but this one is also 
> checked to really be Java 17 (by parsing strings from its version output). 
> I'd like to pass "--test-alternative-java <path>" multiple times and it 
> would just run all those as part of smoking; maybe the version number can 
> be extracted to be printed out.
>
> To me this is a hopeless task with Python.
>
> Uwe
>
> On 15.02.2024 at 12:50, Uwe Schindler wrote:
>

I opened https://github.com/apache/lucene/issues/13107 as I have
struggled with the smoketester Java 21 support too. Java is moving
faster these days; we should make it easier to maintain the script.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: The need for a Lucene 9.9.1 release

2023-12-09 Thread Robert Muir
I don't understand the use of the word corruption, isn't it just a bug in
intersect() that only affects wildcards etc.? E.g. it's not gonna merge
into new segments or impact written data in any way.

And I don't think we should rush out some bugfix release without any
test for this?

On Sat, Dec 9, 2023 at 5:30 AM Luca Cavanna  wrote:
>
> Based on the discussions in https://github.com/apache/lucene/issues/12895 , 
> it seems like reverting the change that caused the corruption on read is the 
> quickest fix, so that we can speed up releasing 9.9.1. I opened a PR for 
> that: https://github.com/apache/lucene/pull/12899. Is there additional 
> testing that needs to be done to ensure that this is enough to address the 
> corruption?
>
> Regarding a fix for the JVM SIGSEGV crash, how far are we from a fix that 
> protects Lucene from it? Should we wait for that to be included in 9.9.1? 
> Asking because the corruption above looks like it needs to be addressed 
> rather quickly. It would be great to include both, but I don't know how long 
> that delays 9.9.1.
>
> Cheers
> Luca
>
>
>
> On Sat, Dec 9, 2023 at 11:13 AM Chris Hegarty 
>  wrote:
>>
>> Oh, and I’m happy to be Release Manager for 9.9.1 (given my recent 
>> experience on 9.9.0)
>>
>> -Chris.
>>
>> > On 9 Dec 2023, at 09:09, Chris Hegarty  
>> > wrote:
>> >
>> > Hi,
>> >
>> > We’ve encountered two very serious issues with the recent Lucene 9.9.0 
>> > release, both of which (even if taken by themselves) would warrant a 
>> > 9.9.1. The issues are:
>> >
>> > 1. https://github.com/apache/lucene/issues/12895 - Corruption read on term 
>> > dictionaries in Lucene 9.9
>> >
>> > 2. https://github.com/apache/lucene/issues/12898 - JVM SIGSEGV crash when 
>> > compiling computeCommonPrefixLengthAndBuildHistogram Lucene 9.9.0
>> >
>> > There is still a little investigation and work left to bring these issues 
>> > to a point where we’re comfortable with proposing a solution. I would be 
>> > hopeful that we’ll get there by early next week. If so, then a Lucene 
>> > 9.9.1 release can be proposed.
>> >
>> > Thanks,
>> > -Chris.
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GDPR compliance

2023-11-28 Thread Robert Muir
And if you delete those segments, will that data ever actually be
removed from the underlying physical storage? Equally uncertain.

Deleting a file from the filesystem is similar to what Lucene is
doing: it doesn't really delete anything from the disk, just allows it
to be overwritten by future writes.

So I don't think we should provide any "GDPRMergePolicy" to satisfy an
extreme (and short-sighted) legal interpretation. It wouldn't solve
the problem anyway.

On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg  wrote:
>
> Are larger and older segments even certain to ever be merged in practice? I 
> was assuming that if there is not a lot of new indexed content and not a lot 
> of older documents being deleted, large older segments might never have to be 
> merged.
>
>
> On Tue 28 Nov 2023 at 20:53, Robert Muir  wrote:
>>
>> I don't think there's any problem with GDPR, and I don't think users
>> should be running unnecessary "optimize". GDPR just says data should
>> be erased without "undue" delay. waiting for a merge to nuke the
>> deleted docs isn't "undue", there is a good reason for it.
>>
>> On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>> >
>> > Hi Folks,
>> > In LinkedIn we need to comply with GDPR for a large part of our data, and 
>> > an important part of it is that we need to be sure we have completely 
>> > deleted the data the user requested to delete within a certain period of 
>> > time.
>> > The way we have come up with so far is to:
>> > 1. Record the segment creation time somewhere (not decided yet, maybe 
>> > index commit userinfo, maybe some other place outside of lucene)
>> > 2. Create a new merge policy which delegates most operations to a normal 
>> > MP, like TieredMergePolicy, and then add extra single-segment (merge from 
>> > 1 segment to 1 segment, basically only do deletion) merges if it finds any 
>> > segment is about to violate the GDPR time frame.
>> >
>> > So here's my question:
>> > 1. Is there a better/existing way to do this?
>> > 2. I would like to contribute such a merge policy directly to Lucene, 
>> > since I think GDPR is more or less a common thing. Would like to know 
>> > whether people feel it's necessary or not.
>> > 3. It's also nice if we can store the segment creation time in the index 
>> > directly via IndexWriter (maybe write to SegmentInfo?). I can try to do 
>> > that but would like to ask whether there are any objections.
>> >
>> > Best
>> > Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GDPR compliance

2023-11-28 Thread Robert Muir
I don't think there's any problem with GDPR, and I don't think users
should be running unnecessary "optimize". GDPR just says data should
be erased without "undue" delay. waiting for a merge to nuke the
deleted docs isn't "undue", there is a good reason for it.

On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai  wrote:
>
> Hi Folks,
> In LinkedIn we need to comply with GDPR for a large part of our data, and an 
> important part of it is that we need to be sure we have completely deleted 
> the data the user requested to delete within a certain period of time.
> The way we have come up with so far is to:
> 1. Record the segment creation time somewhere (not decided yet, maybe index 
> commit userinfo, maybe some other place outside of lucene)
> 2. Create a new merge policy which delegates most operations to a normal MP, 
> like TieredMergePolicy, and then add extra single-segment (merge from 1 
> segment to 1 segment, basically only do deletion) merges if it finds any 
> segment is about to violate the GDPR time frame.
>
> So here's my question:
> 1. Is there a better/existing way to do this?
> 2. I would like to contribute such a merge policy directly to Lucene, 
> since I think GDPR is more or less a common thing. Would like to know whether 
> people feel it's necessary or not.
> 3. It's also nice if we can store the segment creation time in the index 
> directly via IndexWriter (maybe write to SegmentInfo?). I can try to do that 
> but would like to ask whether there are any objections.
>
> Best
> Patrick
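
A minimal sketch of the wrapper merge policy described above, assuming
Lucene's FilterMergePolicy as the base class. The creationTimeOf() helper is
hypothetical, since where the timestamp lives (commit userinfo, SegmentInfo
diagnostics, somewhere outside Lucene) is exactly the open question in points
1 and 3; a real policy would also need to skip segments already registered
for a pending merge.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

public class GdprAwareMergePolicy extends FilterMergePolicy {
  private final long maxAgeMillis;

  public GdprAwareMergePolicy(MergePolicy in, long maxAgeMillis) {
    super(in);
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public MergeSpecification findMerges(
      MergeTrigger trigger, SegmentInfos infos, MergeContext ctx) throws IOException {
    // delegate to the normal merge policy first (e.g. TieredMergePolicy)
    MergeSpecification spec = super.findMerges(trigger, infos, ctx);
    long now = System.currentTimeMillis();
    for (SegmentCommitInfo sci : infos) {
      // only rewrite segments that still carry deletes and are about to
      // violate the GDPR time frame
      if (sci.getDelCount() > 0 && now - creationTimeOf(sci) > maxAgeMillis) {
        if (spec == null) {
          spec = new MergeSpecification();
        }
        // singleton merge: 1 segment in, 1 segment out, deletes dropped
        spec.add(new OneMerge(List.of(sci)));
      }
    }
    return spec;
  }

  // hypothetical helper: assumes the creation time was recorded in the
  // segment diagnostics at flush time
  private static long creationTimeOf(SegmentCommitInfo sci) {
    String t = sci.info.getDiagnostics().get("creationTime");
    return t == null ? Long.MAX_VALUE : Long.parseLong(t);
  }
}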

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Ascii folding

2023-11-10 Thread Robert Muir
Sorry, I meant to provide the demo link too, in case you want to play:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=paypal&r=None

It illustrates how the problem of "visually confusing" characters is
really its own beast, e.g. confusion of 'L' vs '1' with some fonts.

On Fri, Nov 10, 2023 at 1:13 PM Robert Muir  wrote:
>
> For visually confusing characters we have the option to expose specific
> processing for that, e.g.
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/SpoofChecker.html#getSkeleton-java.lang.CharSequence-
>
> Maybe there are use-cases for a search engine, e.g. find me documents
> with words that "could be confused visually" with 'beer' (or whatever
> the query is). Usually this processing is geared around security
> use-cases.
>
> On Fri, Nov 10, 2023 at 1:03 PM Dawid Weiss  wrote:
> >
> >
> > Hi Steve, Chris,
> >
> > Ok, makes sense. Thanks for the pointers. I agree the justification for the 
> > use of character-level normalization filters is highly context-dependent 
> > (for example, unsuitable when mixed languages are present on input).
> >
> > Dawid
> >
> > On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter  
> > wrote:
> >>
> >>
> >> : Here's the unicode letter after "th":
> >> : https://www.fileformat.info/info/unicode/char/0435/index.htm
> >> :
> >> : To my surprise, I couldn't find it in the ascii folding filter:
> >> :
> >> : 
> >> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> >> :
> >> : Anybody remembers whether the omission of Cyrillic characters was
> >> : intentional (there is quite a few of them that are nearly identical in
> >> : appearance to Latin letters).
> >>
> >> From the javadocs, I'm going to guess it's because the filter focuses
> >> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
> >> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
> >> of the other characters that are considered to have a direct mapping to
> >> the "ASCII" / latin characters.
> >>
> >> If you look back at when it was added...
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-1390
> >>
> >> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
> >> replacing it with "a more comprehensive version of this code that included
> >> not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
> >> Extended A unicode blocks."  (The originally proposed name was
> >> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
> >> Latin blocks.
> >>
> >> There was a related issue at the time which initially aimed to add a
> >> more general "UnicodeNormalizationFilter" that ultimately resulted in
> >> adding the "ICU" analysis classes...
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-1343
> >>
> >> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but I haven't
> >> tested that)
> >>
> >>
> >>
> >> -Hoss
> >> http://www.lucidworks.com/
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Ascii folding

2023-11-10 Thread Robert Muir
For visually confusing characters we have the option to expose specific
processing for that, e.g.
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/SpoofChecker.html#getSkeleton-java.lang.CharSequence-

Maybe there are use-cases for a search engine, e.g. find me documents
with words that "could be confused visually" with 'beer' (or whatever
the query is). Usually this processing is geared around security
use-cases.
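
For the curious, a tiny sketch of what that getSkeleton() call does, assuming
ICU4J on the classpath: visually confusable strings collapse to the same
"skeleton", which is what powers the paypal example in the demo link.

import com.ibm.icu.text.SpoofChecker;

public class ConfusableSkeletonDemo {
  public static void main(String[] args) {
    SpoofChecker checker = new SpoofChecker.Builder().build();
    String latin = "paypal";           // all Latin letters
    String mixed = "p\u0430yp\u0430l"; // U+0430 CYRILLIC SMALL LETTER A
    // both strings map to the same skeleton, so they are confusable
    System.out.println(checker.getSkeleton(latin));
    System.out.println(checker.getSkeleton(mixed));
    System.out.println(checker.getSkeleton(latin).equals(checker.getSkeleton(mixed)));
  }
}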

On Fri, Nov 10, 2023 at 1:03 PM Dawid Weiss  wrote:
>
>
> Hi Steve, Chris,
>
> Ok, makes sense. Thanks for the pointers. I agree the justification for the 
> use of character-level normalization filters is highly context-dependent (for 
> example, unsuitable when mixed languages are present on input).
>
> Dawid
>
> On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter  
> wrote:
>>
>>
>> : Here's the unicode letter after "th":
>> : https://www.fileformat.info/info/unicode/char/0435/index.htm
>> :
>> : To my surprise, I couldn't find it in the ascii folding filter:
>> :
>> : 
>> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
>> :
>> : Anybody remembers whether the omission of Cyrillic characters was
>> : intentional (there is quite a few of them that are nearly identical in
>> : appearance to Latin letters).
>>
>> From the javadocs, I'm going to guess it's because the filter focuses
>> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
>> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
>> of the other characters that are considered to have a direct mapping to
>> the "ASCII" / latin characters.
>>
>> If you look back at when it was added...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1390
>>
>> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
>> replacing it with "a more comprehensive version of this code that included
>> not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
>> Extended A unicode blocks."  (The originally proposed name was
>> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
>> Latin blocks.
>>
>> There was a related issue at the time which initially aimed to add a
>> more general "UnicodeNormalizationFilter" that ultimately resulted in
>> adding the "ICU" analysis classes...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1343
>>
>> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but I haven't
>> tested that)
>>
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Bump minimum Java version requirement to 21

2023-11-06 Thread Robert Muir
> > The only concern I have with no.2 is that it could be considered an 
> > “aggressive” adoption of Java 21 - adoption sooner than the ecosystem can 
> > handle, e.g. are environments in which Lucene is deployed, and their 
> > transitive dependencies, ready to run on Java 21? By the time we’re ready 
> > to release 10.0.0, say March 2024, then I expect no issue with this.
>

As an open source library from the Apache Software Foundation, with no
warranty, it is impossible to release too aggressively. Someone
doesn't like that we released version 10 because the minimum JDK
version won't run on their 486? They just keep using version 9, we
didn't hurt them by releasing 10. We can't force them to upgrade to 10
anyway.

But on the other hand, it gave a lot of other people a choice. They
get the choice to use newer code instead of no choice at all (that
code sitting on the shelf for years). Run "git blame
lucene/CHANGES.txt" if you think I am crazy. Here's a change I made
nearly two years ago, it just sits on the shelf.

84e4b85b094c lucene/CHANGES.txt (Robert Muir 2021-12-07 21:39:13 -0500 14) * LUCENE-10010: AutomatonQuery, CompiledAutomaton, RunAutomaton, RegExp
b2e866b70366 lucene/CHANGES.txt (Robert Muir 2021-12-03 19:48:33 -0500 15)   classes no longer determinize NFAs. Instead it is the responsibility
b2e866b70366 lucene/CHANGES.txt (Robert Muir 2021-12-03 19:48:33 -0500 16)   of the caller to determinize. (Robert Muir)

I didn't backport that change, not because I am lazy, but because it
is the kind of change that deserves to be in a major release (hard to
wrap-your-head-around-type-of-change). But I didn't intend for it to
sit on the shelf for two years either.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Bump minimum Java version requirement to 21

2023-11-06 Thread Robert Muir
On Mon, Nov 6, 2023 at 4:22 AM Chris Hegarty
 wrote:
>
> Hi,
>
> Great discussion, I agree with all that you have said. And that we will have 
> to deal with the intricacies of the MR-JAR regardless of the outcome here, 
> which is doable.
>
> I would very much like to avoid supporting Java 17 (released in Sep 2021) in 
> 2025. So far we have two possible approaches:
>
> 1. Release Lucene 10.0.0 now with Java 17 minimum. Bump _main_ to Java 21.
>
> 2. Release Lucene 9.9.0 soon(ish) with Java 11 minimum. Bump _main_ to Java 
> 21, and release 10.0.0 in the first quarter of 2024.
>
> Have I captured this correctly? Are there other alternatives that should be 
> considered?
>
> My issue with no.1 is that the 10.x train will likely live on for ~2yrs? In 
> which case we’ll be supporting Java 17 until some time in late 2025, when 
> Java 25 is released. This could be mitigated by releasing Lucene 11.0.0 
> earlier than 2yrs, say 1yr after 10.0.0.
>
> The only concern I have with no.2 is that it could be considered an 
> “aggressive” adoption of Java 21 - adoption sooner than the ecosystem can 
> handle, e.g. are environments in which Lucene is deployed, and their 
> transitive dependencies, ready to run on Java 21? By the time we’re ready to 
> release 10.0.0, say March 2024, then I expect no issue with this.

The problem is worse: historically, JDK version X isn't adopted as a
minimum until it is already EOL. And the Lucene major versions take an
eternity to get out there, code just sits in "main" branch for years
unreleased to nobody. It is really discouraging as a contributor to
contribute code that literally sits on the shelf for years, for no
good reason at all. So why delay?

The argument of "moving sooner than ecosystem can handle" is also
bogus in the same way. You mean versus the code sitting on the shelf
and being released to nobody?

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Squash vs merge of PRs

2023-11-04 Thread Robert Muir
This isn't a community issue, it is me avoiding useless, unnecessary
merge conflicts. The word "community" is invoked here to try to make it
out like you can hold a vote about what git commands I should type on
my computer. You know that isn't gonna work. Have some humility.

thread moved to spam.

On Sat, Nov 4, 2023 at 8:36 AM Mike Drob  wrote:
>
> We all agree on using Java though, and using a specific version, and even the 
> style output from gradle tidy. Is that nanny state or community consensus?
>
> On Sat, Nov 4, 2023 at 7:29 AM Robert Muir  wrote:
>>
>> example of a nanny state IMO, trying to dictate what git commands to
>> use, or what editor to use. Maybe this works for you in your corporate
>> hellholes, but I think some folks have a bit of a power issue, are
>> accustomed to dictating this stuff to their employees and so on, but
>> this is open-source. I don't report to you, I don't use the editor you
>> tell me, or the git commands you tell me.
>>
>> On Sat, Nov 4, 2023 at 8:21 AM Uwe Schindler  wrote:
>> >
>> > Hi,
>> >
>> > I just wanted to give your attention to the following discussion:
>> > https://github.com/apache/lucene/pull/12737#issuecomment-1793426911
>> >
>> >  From my knowledge the Lucene (and Solr) community decided a while back
> to disable merging and only allow squashing of PRs. Robert always did 
>> > this, but because of a one-time problem with two branches he was working
>> > on in parallel, he suddenly changed his mind and did merges on his own,
> not squashing the branch and pushing to ASF Git.
>> >
>> > I am also not a fan of removing all history, but especially for heavy
>> > committing branches like the given PR, I think we should invite our
>> > committers to also adhere to community standards everyone else
>> > practices. I would agree with merging those branches if all commit
>> > messages in the branch would be well-formed with issue ID or PR number,
>> > but in the above case you get a history of random commits which is no
> longer linear and not easily readable.
>> >
>> > What do others think?
>> >
>> > Uwe
>> >
>> > --
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > https://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Squash vs merge of PRs

2023-11-04 Thread Robert Muir
example of a nanny state IMO, trying to dictate what git commands to
use, or what editor to use. Maybe this works for you in your corporate
hellholes, but I think some folks have a bit of a power issue, are
> accustomed to dictating this stuff to their employees and so on, but 
> this is open-source. I don't report to you, I don't use the editor you 
tell me, or the git commands you tell me.

On Sat, Nov 4, 2023 at 8:21 AM Uwe Schindler  wrote:
>
> Hi,
>
> I just wanted to give your attention to the following discussion:
> https://github.com/apache/lucene/pull/12737#issuecomment-1793426911
>
>  From my knowledge the Lucene (and Solr) community decided a while back
> to disable merging and only allow squashing of PRs. Robert always did 
> this, but because of a one-time problem with two branches he was working
> on in parallel, he suddenly changed his mind and did merges on his own,
> not squashing the branch and pushing to ASF Git.
>
> I am also not a fan of removing all history, but especially for heavy
> committing branches like the given PR, I think we should invite our
> committers to also adhere to community standards everyone else
> practices. I would agree with merging those branches if all commit
> messages in the branch would be well-formed with issue ID or PR number,
> but in the above case you get a history of random commits which is no
> longer linear and not easily readable.
>
> What do others think?
>
> Uwe
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-24 Thread Robert Muir
>
> Ooh, thank you Dawid!  And it's now merged, so we now have a decent timeout 
> protection, so if a bad actor tries to crypto mine or run some distributed 
> LLM or whatever, at least the wasted resources are bounded by how long a 
> "typical" legitimate run takes, plus generous buffer.  So given this 
> protection, why require the added manual approval step :)
>
> Net/net I don't think we have to do anything more here ... for now I'll try 
> to make a periodic effort myself to approve & run these blocked jobs.  Maybe 
> that's enough to create a smoother first-contributor experience.

We can write a bot to do this. Why do it manually?
https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#approve-a-workflow-run-for-a-fork-pull-request
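
A rough sketch of such a bot, using the endpoint documented at that link
(POST /repos/{owner}/{repo}/actions/runs/{run_id}/approve). The token and run
id are placeholders, and deciding which pending runs are actually safe to
approve is left out.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApproveWorkflowRun {
  public static void main(String[] args) throws Exception {
    String token = System.getenv("GITHUB_TOKEN"); // needs actions write permission
    long runId = Long.parseLong(args[0]);         // the blocked workflow run
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(
            "https://api.github.com/repos/apache/lucene/actions/runs/" + runId + "/approve"))
        .header("Accept", "application/vnd.github+json")
        .header("Authorization", "Bearer " + token)
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<String> resp =
        HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.statusCode()); // 201 on success
  }
}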

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Could we allow an IndexInput to read from a still writing IndexOutput?

2023-10-19 Thread Robert Muir
What will happen on Windows?

Sorry, could not resist.

On Thu, Oct 19, 2023 at 9:48 AM Michael McCandless
 wrote:
>
> Hi Team,
>
> Today, Lucene's Directory abstraction does not allow opening an IndexInput on 
> a file until the file is fully written and closed via IndexOutput.  We 
> enforce this in tests, and some of our core Directory implementations demand 
> this (e.g. caching the file's length on opening an IndexInput).
>
> Yet, most filesystems will easily allow simultaneous read/append of a single 
> file.  We just don't expose this IO semantics to Lucene, but could we allow 
> random-access reads with append-only writes on one file?  Is there a strong 
> reason that we don't allow this?
>
> Quick TL/DR context: we are trying to enable FST compilation to write 
> off-heap (directly to disk), enabling creating arbitrarily large FSTs with 
> bounded heap, matching how FSTs can now be read off-heap, and it would be 
> much much more RAM efficient if we could read/append the same file at once.
>
> Full gory details context: inspired by how Tantivy (awesome and fast Rust 
> search engine!) writes its FSTs, over in this issue and PR, we (thank you 
> Dzung Bui / @dungba88!) are trying to fix Lucene's FST building to 
> immediately stream the FST to disk, instead of buffering the whole thing in 
> RAM and then writing to disk.
>
> This would allow building arbitrarily large FSTs without using up heap, and 
> symmetrically matches how we can now read FSTs off-heap, plus FST building is 
> already (mostly) append-only. This would also allow removing some of the 
> crazy abstractions we have for writing FST bytes into RAM (FSTStore, 
> BytesStore).  It would enable interesting things like a Codec whose term 
> dictionary is stored entirely in an FST (also inspired by Tantivy).
>
> The wrinkle is that, while the FST is building, it sometimes looks back and 
> reads previously written bytes, to share suffixes and create a minimal (or 
> near minimal) FST.  So if IndexInput could read those bytes, even as the FST 
> is still appending to IndexOutput, it would "just work".
>
> Failing that, our plan B is to wastefully duplicate the byte[] slices from 
> the already written bytes into our own private (heap resident, boo) copy, 
> which would use quite a bit more RAM while building the FST, and make less 
> minimal FSTs for a given RAM budget.  I haven't measured the added wasted RAM 
> if we have to go this route but I fear it is sizable in practice, i.e. it 
> strongly negates the whole idea of writing an FST off-heap since its 
> effectively storing a possibly large portion of the FST in many duplicated 
> byte[] fragments (in the NodeHash).
>
> So ... could we somehow relax Lucene's Directory semantics to allow opening 
> an IndexInput on a still appending IndexOutput, since most filesystems are 
> fine with this?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
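
As a quick sanity check of the premise that most filesystems happily allow
simultaneous read/append, here is a tiny program, plain NIO and no Lucene
involved, where one channel keeps appending while another does a
random-access read of the already-written bytes; this is exactly the
semantics the Directory contract currently forbids.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadWhileAppending {
  public static void main(String[] args) throws Exception {
    Path p = Files.createTempFile("fst", ".bin");
    try (FileChannel appender =
             FileChannel.open(p, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
         FileChannel reader = FileChannel.open(p, StandardOpenOption.READ)) {
      appender.write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));
      // random-access read of previously written bytes while the file is
      // still open for appending
      ByteBuffer buf = ByteBuffer.allocate(2);
      reader.read(buf, 0);
      appender.write(ByteBuffer.wrap(new byte[] {5, 6})); // keep appending
      System.out.println(buf.get(0) + " " + buf.get(1));  // prints: 1 2
    }
  }
}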

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Robert Muir
I think running the builds with a timeout is a good thing to do
anyway, for any CI build. I'm sure GitHub Actions has some fancy YAML
for that, but you can just do "timeout -k 1m 1h ./gradlew..." instead
of "./gradlew" too.

On Mon, Oct 16, 2023 at 9:58 AM Michael McCandless
 wrote:
>
> When a non-committer (I think?) opens a PR, one of the committers must notice 
> it and click Approve & Run so the contributor can find out if something broke 
> in our automated tests/precommit/linting.
>
> This seems like a waste, and a friction in the worst possible place for our 
> community: new contributor onboarding experience.
>
> I think we have it to prevent e.g. a crypto mining bot of a PR sneaking in 
> and taking tons of resources to mine dogecoin or so?
>
> But 1) that doesn't seem to be happening so far, 2) when I hit "Approve & 
> Run" I never look closely to see if there is in fact a hidden crypto miner in 
> there, and 3) can't we just put some reasonable timeout on the GitHub actions 
> to block such abuse?
>
> Is this some sort of requirement by GitHub, or did we choose to turn on this 
> silly step?
>
> Mike McCandless
>
> http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene » Lucene-NightlyTests-9.x - Build # 665 - Unstable!

2023-08-31 Thread Robert Muir
I looked at this LockVerifyServer and would say it's probably just the
craziness of this code.

It sets a 30-second socket timeout and intentionally calls accept() when
there is nothing yet to accept... well, no wonder we see this issue.

P.S. why does it set SO_REUSEADDR? There is no reason for this leniency
when binding to port 0. Nuke it.
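
For reference, a stripped-down sketch of the pattern in question: with a
30-second SO_TIMEOUT set, accept() throws exactly the "Accept timed out"
SocketTimeoutException from the nightly failure if no client connects in time.

import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptTimeoutDemo {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0)) { // bind to an ephemeral port
      server.setSoTimeout(30_000); // 30-second accept timeout
      try (Socket client = server.accept()) { // blocks until a client connects...
        System.out.println("connected: " + client);
      } catch (SocketTimeoutException e) {
        // ...or throws if nobody connects within 30s, e.g. on an extremely
        // slow or overloaded VM
        System.out.println(e.getMessage());
      }
    }
  }
}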

On Thu, Aug 31, 2023 at 8:46 AM Robert Muir  wrote:
>
> Probably a bug in some JVM sockets code that called accept() in its
> default blocking mode, when there wasn't any connection to accept? In
> that case the accept() call will just block and wait for someone to make a
> new connection.
>
> On Thu, Aug 31, 2023 at 8:16 AM Dawid Weiss  wrote:
> >
> >
> > https://ge.apache.org/s/orksynljk2yp6/tests/task/:lucene:core:test/details/org.apache.lucene.store.TestStressLockFactories/testSimpleFSLockFactory?top-execution=1
> >
> > This test took 31 seconds... An extremely slow vm, perhaps? I don't know 
> > what the default connection timeouts are... it does look weird though.
> >
> > Dawid
> >
> > On Thu, Aug 31, 2023 at 1:08 PM Michael McCandless 
> >  wrote:
> >>
> >> Good grief -- why are we getting SocketTimeoutException in our 
> >> LockVerifyServer's attempt to accept an incoming connection!?  These are 
> >> all processes running on the same host ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Aug 29, 2023 at 11:17 PM Apache Jenkins Server 
> >>  wrote:
> >>>
> >>> Build: 
> >>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/665/
> >>>
> >>> 2 tests failed.
> >>> FAILED:  
> >>> org.apache.lucene.store.TestStressLockFactories.testSimpleFSLockFactory
> >>>
> >>> Error Message:
> >>> java.net.SocketTimeoutException: Accept timed out
> >>>
> >>> Stack Trace:
> >>> java.net.SocketTimeoutException: Accept timed out
> >>> at 
> >>> __randomizedtesting.SeedInfo.seed([E1AD0D2AD68BA993:F325FE2A6E367AC7]:0)
> >>> at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
> >>> at 
> >>> java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:474)
> >>> at 
> >>> java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
> >>> at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
> >>> at 
> >>> org.apache.lucene.store.LockVerifyServer.run(LockVerifyServer.java:62)
> >>> at 
> >>> org.apache.lucene.store.TestStressLockFactories.runImpl(TestStressLockFactories.java:53)
> >>> at 
> >>> org.apache.lucene.store.TestStressLockFactories.testSimpleFSLockFactory(TestStressLockFactories.java:104)
> >>> at 
> >>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> >>> Method)
> >>> at 
> >>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >>> at 
> >>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> >>> at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> >>> at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> >>> at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> >>> at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> >>> at 
> >>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> >>> at 
> >>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> >>> at 
> >>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> >>> at 
> >>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> >>> at 
> >>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> >>> at

Re: [JENKINS] Lucene » Lucene-NightlyTests-9.x - Build # 665 - Unstable!

2023-08-31 Thread Robert Muir
Probably a bug in some JVM sockets code that called accept() in its
default blocking mode, when there wasn't any connection to accept? In
that case the accept() call will just block and wait for someone to make a
new connection.

On Thu, Aug 31, 2023 at 8:16 AM Dawid Weiss  wrote:
>
>
> https://ge.apache.org/s/orksynljk2yp6/tests/task/:lucene:core:test/details/org.apache.lucene.store.TestStressLockFactories/testSimpleFSLockFactory?top-execution=1
>
> This test took 31 seconds... An extremely slow vm, perhaps? I don't know what 
> the default connection timeouts are... it does look weird though.
>
> Dawid
>
> On Thu, Aug 31, 2023 at 1:08 PM Michael McCandless 
>  wrote:
>>
>> Good grief -- why are we getting SocketTimeoutException in our 
>> LockVerifyServer's attempt to accept an incoming connection!?  These are all 
>> processes running on the same host ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Aug 29, 2023 at 11:17 PM Apache Jenkins Server 
>>  wrote:
>>>
>>> Build: 
>>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/665/
>>>
>>> 2 tests failed.
>>> FAILED:  
>>> org.apache.lucene.store.TestStressLockFactories.testSimpleFSLockFactory
>>>
>>> Error Message:
>>> java.net.SocketTimeoutException: Accept timed out
>>>
>>> Stack Trace:
>>> java.net.SocketTimeoutException: Accept timed out
>>> at 
>>> __randomizedtesting.SeedInfo.seed([E1AD0D2AD68BA993:F325FE2A6E367AC7]:0)
>>> at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
>>> at 
>>> java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:474)
>>> at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
>>> at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
>>> at 
>>> org.apache.lucene.store.LockVerifyServer.run(LockVerifyServer.java:62)
>>> at 
>>> org.apache.lucene.store.TestStressLockFactories.runImpl(TestStressLockFactories.java:53)
>>> at 
>>> org.apache.lucene.store.TestStressLockFactories.testSimpleFSLockFactory(TestStressLockFactories.java:104)
>>> at 
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>>> Method)
>>> at 
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at 
>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>>> at 
>>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>>> at 
>>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>>> at 
>>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>>> at 
>>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>>> at 
>>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>>> at 
>>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>> at 
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>>> at 
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>>> at 
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>>> at 
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>>> at 
>>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>>> at 
>>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>> at 
>>> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>>> at 
>>> 

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Robert Muir
Chart is wrong; average word length for English is like 5.

On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
 wrote:
>
> Hey all,
>
> another data point: There's a diagram with the relevant distributions of word 
> lengths in various languages here:
>
> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>
> While English is close to the 8-byte limit, average word length in German is 
> 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll gather 
> some averages over the various Wikipedia indices.
>
> Cheers,
> Thomas
>
> On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien  
> wrote:
>>
>> Hey there,
>>
>> reviving this thread. To clarify: In order to show this patch is worth 
>> doing, I should index a bunch of natural-language documents (whichever 
>> language that is) and show that the patch brings a performance benefit?
>>
>> (Just clarifying, because at least inside ElasticSearch for the logs 
>> use-case, it turns out that it does provide a performance benefit -- but I 
>> want to make sure I understand what the Lucene community wishes to see as 
>> "evidence" this is worth pursuing :-)
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood  
>> wrote:
>>>
>>> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, 
>>> Hebrew) will have longer byte strings because of UTF8. German has large 
>>> compound words.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien 
>>>  wrote:
>>>
>>> Hey all,
>>>
>>> ok, attached is a second patch that adds some unit tests; I am happy to add 
>>> more.
>>>
>>> This brings me back to my original question: I'd like to run some pretty 
>>> thorough benchmarking on Lucene, both for this change and for possible 
>>> other future changes, largely focused on indexing performance. What are 
>>> good command lines to do so? What are good corpora?
>>>
>>> Cheers,
>>> Thomas
>>>
>>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien  
>>> wrote:
>>>>
>>>> Hey,
>>>>
>>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish 
>>>> official test vectors, see the following URLs:
>>>> https://github.com/aappleby/smhasher/issues/6
>>>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>>>> There is a link to a pastebin entry in the first issue, which leads to 
>>>> https://pastebin.com/kkggV9Vx
>>>>
>>>> Now, the test vectors in that pastebin do not match either the output of 
>>>> pre-change Lucene's murmur3, or the output of the Python mmh3 package. 
>>>> That said, the pre-change Lucene and the mmh3 package agree, just not with 
>>>> the published list.
>>>>
>>>> There *are* test vectors in the source code for the mmh3 python package, 
>>>> which I could use, or cook up a set of bespoke ones, or both (I share the 
>>>> concern about 8-byte boundaries and signedness).
>>>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>>>
>>>> Cheers,
>>>> Thomas
>>>>
>>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir  wrote:
>>>>>
>>>>> I don't think we need a ton of random strings. But if you want to
>>>>> optimize for strings of length 8, at a minimum there should be very
>>>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>>>> string of length exactly 8). i would also strongly recommend testing
>>>>> non-ascii since java is a language with signed integer types so it may
>>>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>>>>
>>>>> IMO this could be 2 simple unit tests.
>>>>>
>>>>> usually at least with these kinds of algorithms you can also find
>>>>> published "test vectors" that intend to seek out the corner cases. if
>>>>> these exist for murmurhash, we should fold them in too.
>>>>>
>>>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>>>>  wrote:
>>>>> >
>>
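
To make those boundary conditions concrete, a small sketch of inputs such
tests might exercise, assuming Lucene's StringHelper.murmurhash3_x86_32 as
the implementation under test; expected values would be captured from the
unpatched implementation rather than invented here.

import java.nio.charset.StandardCharsets;
import org.apache.lucene.util.StringHelper;

public class MurmurBoundaryInputs {
  public static void main(String[] args) {
    // 7, 8 and 9 bytes: straddle the length-8 boundary the patch optimizes
    byte[] seven = "abcdefg".getBytes(StandardCharsets.UTF_8);
    byte[] eight = "abcdefgh".getBytes(StandardCharsets.UTF_8);
    byte[] nine = "abcdefghi".getBytes(StandardCharsets.UTF_8);
    // non-ASCII: every UTF-8 byte of "ééééé" (0xC3 0xA9 pairs) has the
    // high bit set, i.e. is negative as a signed Java byte
    byte[] signed = "\u00e9\u00e9\u00e9\u00e9\u00e9".getBytes(StandardCharsets.UTF_8);
    for (byte[] b : new byte[][] {seven, eight, nine, signed}) {
      System.out.println(b.length + " -> " + StringHelper.murmurhash3_x86_32(b, 0, b.length, 0));
    }
  }
}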

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-17 Thread Robert Muir
As a reminder this isn't the Disney Plus channel and I'll use strong
language if I fucking want to.



On Wed, May 17, 2023, 4:45 AM Alessandro Benedetti 
wrote:

> Robert,
> A gentle reminder of the
> https://www.apache.org/foundation/policies/conduct.html.
> I've read many e-mails about this topic that ended up in a tone that is
> not up to the standard of a healthy community.
> To be specific and pragmatic: how you addressed Gus here, how you addressed
> the rest of our community (mocking us as a sort of "ChatGPT minions"), and the
> usage of bad words in English (the f* word) do not make sense and are not
> acceptable here.
> Even if you feel heated, I recommend separating such emotions from what
> you write and always being respectful of other people with different ideas.
> You are an intelligent person, don't ruin your time (and others' time) on
> a wonderful project such as Lucene, blinded by excessive emotion.
> Please remember that the vast majority of us participate in this community
> purely on a volunteering basis.
> So when I spend time on this, I like to see respect,
> thoughtful discussions, and intellectual challenges; the time we spend
> together must be peaceful and positive.
>
> The community comes first and here we are collecting what the community
> would like for a feature.
> Your vote and opinion are extremely valuable, but at this stage, we are
> here to listen to the community rather than imposing a personal idea.
> Once we observe the dominant need, we'll proceed with a contribution.
> If you disagree with such a contribution and bring technical evidence that
> supports a convincing veto, we (the Lucene community) will listen and
> improve/change the contribution.
> If you disagree with such a contribution and bring an unconvincing veto,
> we (the Lucene community) will proceed with steps that are appropriate for
> the situation.
> Let's also remember that the project and the community come first, Lucene
> is an Apache project, not mine or yours for that matters.
>
> Cheers
>
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 17 May 2023 at 01:54, Robert Muir  wrote:
>
>> Gus, I think I explained myself multiple times on issues and in this
>> thread. The performance is unacceptable, everyone knows it, but nobody is
>> talking about it.
>> I don't need to explain myself time and time again here.
>> You don't seem to understand the technical issues (at least you sure as
>> fuck don't know how service loading works or you wouldn't have opened
>> https://github.com/apache/lucene/issues/12300 )
>>
>> I'm just the only one here completely unconstrained by any of silicon
>> valley's influences to speak my true mind, without any repercussions, so I
>> do it. Don't give any fucks about ChatGPT.
>>
>> I'm standing by my technical veto. If you bypass it, I'll revert the
>> offending commit.
>>
>> As far as fixing the technical performance, I just opened an issue with
>> some ideas to at least improve cpu usage by a factor of N. It does not help
>> with the crazy heap memory usage or other issues of KNN implementation
>> causing shit like OOM on merge. But it is one step:
>> https://github.com/apache/lucene/issues/12302
>>
>>
>>
>> On Tue, May 16, 2023 at 7:45 AM Gus Heck  wrote:
>>
>>> Robert,
>>>
>>> Can you explain in clear technical terms the standard that must be met
>>> for performance? A benchmark that must run in X time on Y hardware for
>>> example (and why that test is suitable)? Or some other reproducible
>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>> that's not a technical criteria, others may have a different concept of
>>> what is usable to them.
>>>
>>> Forgive me if I misunderstand, but the essence of your argument has
>>> seemed to be
>>>
>>> "Performance isn't good enough, therefore we should force anyone who
>>> wants to experiment with something bigger to fork the code base to do it"
>>>
>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>> can verify for "good enough". A clear standard would also focus efforts at
>>> improvement.

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Robert Muir
My problem is that it impacts the default codec which is supported by our
backwards compatibility policy for many years. We can't just let the user
determine backwards compatibility with a sysprop. How will checkindex work?
We have to have bounds and also allow for more performant implementations
that might have different limitations. And I'm pretty sure we want a faster
implementation than what we have in the future, and it will probably have
different limits.

For other codecs, it is fine to have a different limit as I already said,
as it is implementation dependent. And honestly the stuff in lucene/codecs
can be more "Fast and loose" because it doesn't require the extensive index
back compat guarantee.

Again, the paramount concern is that index back compat guarantee. When it
comes to limits, the proper way is not to just keep bumping them without
technical reasons, instead the correct approach is to fix the technical
problems and make them irrelevant. Great example here (merged this
morning):
https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645


On Tue, May 16, 2023 at 10:49 PM David Smiley  wrote:

> Robert, I have not heard from you (or anyone) an argument against System
> property based configurability (as I described in Option 4 via a System
> property).  Uwe notes wisely some care must be taken to ensure it actually
> works.  Sure, of course.  What concerns do you have with this?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, May 16, 2023 at 9:50 PM Robert Muir  wrote:
>
>> By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the
>> HNSW-specific code.
>>
>> This way, someone can write alternative codec with vectors using some
>> other completely different approach that incorporates a different more
>> appropriate limit (maybe lower, maybe higher) depending upon their
>> tradeoffs. We should encourage this as I think it is the "only true fix" to
>> the scalability issues: use a scalable algorithm! Also, alternative codecs
>> don't force the project into many years of index backwards compatibility,
>> which is really my paramount concern. We can lock ourselves into a truly
>> bad place and become irrelevant (especially with scalar code implementing
>> all this vector stuff, it is really senseless). In the meantime I suggest
>> we try to reduce pain for the default codec with the current implementation
>> if possible. If it is not possible, we need a new codec that performs.
>>
>> On Tue, May 16, 2023 at 8:53 PM Robert Muir  wrote:
>>
>>> Gus, I think I explained myself multiple times on issues and in this
>>> thread. The performance is unacceptable, everyone knows it, but nobody is
>>> talking about it.
>>> I don't need to explain myself time and time again here.
>>> You don't seem to understand the technical issues (at least you sure as
>>> fuck don't know how service loading works or you wouldn't have opened
>>> https://github.com/apache/lucene/issues/12300 )
>>>
>>> I'm just the only one here completely unconstrained by any of silicon
>>> valley's influences to speak my true mind, without any repercussions, so I
>>> do it. Don't give any fucks about ChatGPT.
>>>
>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>> offending commit.
>>>
>>> As far as fixing the technical performance, I just opened an issue with
>>> some ideas to at least improve cpu usage by a factor of N. It does not help
>>> with the crazy heap memory usage or other issues of KNN implementation
>>> causing shit like OOM on merge. But it is one step:
>>> https://github.com/apache/lucene/issues/12302
>>>
>>>
>>>
>>> On Tue, May 16, 2023 at 7:45 AM Gus Heck  wrote:
>>>
>>>> Robert,
>>>>
>>>> Can you explain in clear technical terms the standard that must be met
>>>> for performance? A benchmark that must run in X time on Y hardware for
>>>> example (and why that test is suitable)? Or some other reproducible
>>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>>> that's not a technical criteria, others may have a different concept of
>>>> what is usable to them.
>>>>
>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>> seemed to be
>>>>
>>>> "Performance isn't good enough, therefore we should force anyone who
>>>> wants to experiment with something bigger to fork the code base to do it"

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Robert Muir
By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the
HNSW-specific code.

This way, someone can write alternative codec with vectors using some other
completely different approach that incorporates a different more
appropriate limit (maybe lower, maybe higher) depending upon their
tradeoffs. We should encourage this as I think it is the "only true fix" to
the scalability issues: use a scalable algorithm! Also, alternative codecs
don't force the project into many years of index backwards compatibility,
which is really my paramount concern. We can lock ourselves into a truly
bad place and become irrelevant (especially with scalar code implementing
all this vector stuff, it is really senseless). In the meantime I suggest
we try to reduce pain for the default codec with the current implementation
if possible. If it is not possible, we need a new codec that performs.

On Tue, May 16, 2023 at 8:53 PM Robert Muir  wrote:

> Gus, I think I explained myself multiple times on issues and in this
> thread. The performance is unacceptable, everyone knows it, but nobody is
> talking about it.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you sure as
> fuck don't know how service loading works or you wouldn't have opened
> https://github.com/apache/lucene/issues/12300 )
>
> I'm just the only one here completely unconstrained by any of silicon
> valley's influences to speak my true mind, without any repercussions, so I
> do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert the
> offending commit.
>
> As far as fixing the technical performance, I just opened an issue with
> some ideas to at least improve cpu usage by a factor of N. It does not help
> with the crazy heap memory usage or other issues of KNN implementation
> causing shit like OOM on merge. But it is one step:
> https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45 AM Gus Heck  wrote:
>
>> Robert,
>>
>> Can you explain in clear technical terms the standard that must be met
>> for performance? A benchmark that must run in X time on Y hardware for
>> example (and why that test is suitable)? Or some other reproducible
>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>> that's not a technical criteria, others may have a different concept of
>> what is usable to them.
>>
>> Forgive me if I misunderstand, but the essence of your argument has
>> seemed to be
>>
>> "Performance isn't good enough, therefore we should force anyone who
>> wants to experiment with something bigger to fork the code base to do it"
>>
>> Thus, it is necessary to have a clear unambiguous standard that anyone
>> can verify for "good enough". A clear standard would also focus efforts at
>> improvement.
>>
>> Where are the goal posts?
>>
>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>> is fundamentally counterproductive in an open source setting, as it will
>> lead to *fewer people* pushing the limits. Extremely few people are
>> going to get into the nitty-gritty of optimizing things unless they are
>> staring at code that they can prove does something interesting, but doesn't
>> run fast enough for their purposes. If people hit a hard limit, more of
>> them give up and never develop the code that will motivate them to look for
>> optimizations.
>>
>> -Gus
>>
>> On Tue, May 16, 2023 at 6:04 AM Robert Muir  wrote:
>>
>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>> does not change the technical facts or make the veto go away.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Robert Muir
Gus, I think I explained myself multiple times on issues and in this
thread. The performance is unacceptable, everyone knows it, but nobody is
talking about it.
I don't need to explain myself time and time again here.
You don't seem to understand the technical issues (at least you sure as
fuck don't know how service loading works or you wouldn't have opened
https://github.com/apache/lucene/issues/12300 )

I'm just the only one here completely unconstrained by any of silicon
valley's influences to speak my true mind, without any repercussions, so I
do it. Don't give any fucks about ChatGPT.

I'm standing by my technical veto. If you bypass it, I'll revert the
offending commit.

As far as fixing the technical performance, I just opened an issue with
some ideas to at least improve cpu usage by a factor of N. It does not help
with the crazy heap memory usage or other issues of KNN implementation
causing shit like OOM on merge. But it is one step:
https://github.com/apache/lucene/issues/12302



On Tue, May 16, 2023 at 7:45 AM Gus Heck  wrote:

> Robert,
>
> Can you explain in clear technical terms the standard that must be met for
> performance? A benchmark that must run in X time on Y hardware for example
> (and why that test is suitable)? Or some other reproducible criteria? So
> far I've heard you give an *opinion* that it's unusable, but that's not a
> technical criteria, others may have a different concept of what is usable
> to them.
>
> Forgive me if I misunderstand, but the essence of your argument has seemed
> to be
>
> "Performance isn't good enough, therefore we should force anyone who wants
> to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard that anyone can
> verify for "good enough". A clear standard would also focus efforts at
> improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit is
> fundamentally counterproductive in an open source setting, as it will lead
> to *fewer people* pushing the limits. Extremely few people are going to
> get into the nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting, but doesn't run fast
> enough for their purposes. If people hit a hard limit, more of them give up
> and never develop the code that will motivate them to look for
> optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04 AM Robert Muir  wrote:
>
>> i still feel -1 (veto) on increasing this limit. sending more emails does
>> not change the technical facts or make the veto go away.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Robert Muir
i still feel -1 (veto) on increasing this limit. sending more emails does
not change the technical facts or make the veto go away.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti 
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested perfecting what the _default_ limit should be, but I've
> not seen an argument _against_ configurability.  Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io
> LinkedIn | Twitter | Youtube | Github
> 
>
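
For concreteness, a minimal sketch of what Option 4 could look like. Only
the Integer.getInteger call is taken from the proposal above; the class and
method names here are hypothetical illustrations:

    // Hypothetical sketch of Option 4. The default stays at 1024; an admin
    // raises it at JVM startup with e.g. -Dlucene.hnsw.maxDimensions=2048.
    public final class HnswLimits {

      // Read once at class-load time from a system property.
      public static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

      private HnswLimits() {}

      // Callers validate vector dimensions against the configured limit.
      public static void checkDimension(int dimension) {
        if (dimension > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension " + dimension + " > limit " + MAX_DIMENSIONS);
        }
      }
    }

Because the property is read once at class load, the toggle binds no Lucene
API: callers see a plain static int, exactly as with the hardcoded limit.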


Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
I remember the benefits from Terms.intersect being pretty huge. Rather
than simple ping-pong, the whole monster gets handed off directly to
the codec's term dictionary implementation. For the default terms
dictionary using blocktree, this saves time seeking to terms you don't
care about (because the postingsformat is aware of the blocktree
structure). It is probably worth just prototyping on its own enough,
to see if we can get the benefits. It may turn out, you dont need a
bloom anymore after that.

On Tue, May 9, 2023 at 3:24 PM Greg Miller  wrote:
>
> Thanks for the feedback Robert. This approach sounds like a better path to 
> follow. I'll explore it. I agree that we should provide default behavior that 
> is overall best for our users, and not for one specific use-case such as 
> Amazon search :).
>
> Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We haven't 
> used intersect... yet.
>
> Thanks again for the feedback.
>
> Cheers,
> -Greg
>
> On Tue, May 9, 2023 at 11:09 AM Michael McCandless 
>  wrote:
>>
>> Besides not being able to use the bloom filter, seekCeil is also just more 
>> costly than seekExact since it is essentially both .seekExact and .next in a 
>> single operation.
>>
>> Are either of the two approaches using the intersect method of TermsEnum?  
>> It might be faster if the number of terms is over some threshold.
>>
>> It would require building an Automaton out of the set of terms, which is 
>> fast with DaciukMihovAutomatonBuilder.  Hmm, I think we should rename this 
>> class maybe.  I'll open an issue.  Naming is the hardest part!
>>
>> The Codec can implement this quite efficiently since it can do the ping-pong 
>> skipping Patrick is referring to on a byte-by-byte basis in each of the 
>> sources of Term iteration.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, May 5, 2023 at 9:34 PM Patrick Zhai  wrote:
>>>
>>> Hi Greg
>>> IMO I still think the seekCeil is a better solution for the default posting 
>>> format, as it could potentially save time on traversing the FST by doing 
>>> the ping-pong skipping.
>>> I can see that in the case of using bloom filter the seekExact might be 
>>> better but I'm not sure whether there is a better way than overriding the 
>>> `getTermsEnum`...
>>>
>>> Patrick
>>>
>>> On Fri, May 5, 2023 at 4:45 PM Greg Miller  wrote:

 Hi folks-

 Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote 
 TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery 
 can now leverage the various "rewrite methods" available to 
 MultiTermQuery, allowing users to customize the query evaluation strategy 
 (e.g., postings vs. doc values, etc.), which was a nice win. In the 
 benchmarks we ran, we didn't see any performance issues.

 In anticipation of 9.6 releasing, I've pulled this change into the Lucene 
 snapshot we use for Amazon product search, and started running some 
 additional benchmarks, which have surfaced an interesting issue. One 
 use-case we have for TermInSetQuery creates a term disjunction over a 
 field that's using bloom filtering (i.e., BloomFilterPostingsFormat). 
 Because bloom filtering can only help with seekExact and not seekCeil, 
 we're seeing a performance regression (primarily in red-line QPS).

 One way I can think to address this is to move back to a seekExact 
 approach when creating the filtered TermsEnum used by MultiTermQuery (for 
 the TermInSetQuery implementation). Because TermInSetQuery can provide all 
 of its terms up-front, we can have a simpler term intersection 
 implementation that relies on seekExact over seekCeil. Here's a quick take 
 on what I'm thinking: 
 https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
  I've tested this internally and confirmed it solves our QPS regression 
 problem.

 I'm curious if anyone has an objection to moving back to a seekExact term 
 intersection approach for TermInSetQuery, or has alternative ideas. I 
 wonder if I'm overlooking some important factors and focusing too much on 
 this specific case where the bloom filter interaction is hurting 
 performance? It seems like seekCeil could provide benefits in some cases 
 over seekExact by skipping over multiple query terms at a time, so that's 
 a possible consideration. If we solve for the most common cases by 
 default, I suppose advanced users could always override 
 TermInSetQuery#getTermsEnum as necessary (we could take this approach 
 internally for example to work with our bloom filtering if the best 
 default is to leverage seekCeil). I can easily turn my quick solution into 
 a PR, but before I do, I wanted to poll this group for thoughts on the 
 approach or other alternatives I might be overlooking. Thanks in advance!

 Cheers,
 -Greg
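
To make the Terms.intersect alternative discussed above concrete, here is a
rough sketch against the 9.x automaton APIs (Automata.makeBinaryStringUnion,
CompiledAutomaton, Terms.intersect). Exact constructor arguments vary
between versions, and collecting matches is reduced to a comment:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.automaton.Automata;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.CompiledAutomaton;
    import org.apache.lucene.util.automaton.Operations;

    // Sketch: hand the whole term set to the codec's terms dictionary at
    // once, instead of ping-ponging seek calls from the query side.
    static void collectMatches(Terms terms, List<BytesRef> sortedTerms) throws IOException {
      Automaton a = Automata.makeBinaryStringUnion(sortedTerms); // terms in BytesRef order
      CompiledAutomaton compiled =
          new CompiledAutomaton(a, true, true, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, true);
      TermsEnum te = terms.intersect(compiled, null); // null = start before the first term
      PostingsEnum postings = null;
      for (BytesRef term = te.next(); term != null; term = te.next()) {
        // only terms present in both the index and the query set show up here
        postings = te.postings(postings, PostingsEnum.NONE);
        for (int doc = postings.nextDoc();
            doc != DocIdSetIterator.NO_MORE_DOCS;
            doc = postings.nextDoc()) {
          // ... mark doc as a match ...
        }
      }
    }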


Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
The better solution is to use Terms.intersect. Then the postings
format can do the right thing. But this query doesn't use
Terms.intersect today, instead doing ping-ponging itself.

That's the problem.

We must *not* tune our algorithms for Amazon's search, but instead for
what is best for users (the default postings format).

On Fri, May 5, 2023 at 9:34 PM Patrick Zhai  wrote:
>
> Hi Greg
> IMO I still think the seekCeil is a better solution for the default posting 
> format, as it could potentially save time on traversing the FST by doing the 
> ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be 
> better but I'm not sure whether there is a better way than overriding the 
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45 PM Greg Miller  wrote:
>>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote 
>> TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery 
>> can now leverage the various "rewrite methods" available to MultiTermQuery, 
>> allowing users to customize the query evaluation strategy (e.g., postings 
>> vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we 
>> didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene 
>> snapshot we use for Amazon product search, and started running some 
>> additional benchmarks, which have surfaced an interesting issue. One 
>> use-case we have for TermInSetQuery creates a term disjunction over a field 
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because 
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing 
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact approach 
>> when creating the filtered TermsEnum used by MultiTermQuery (for the 
>> TermInSetQuery implementation). Because TermInSetQuery can provide all of 
>> its terms up-front, we can have a simpler term intersection implementation 
>> that relies on seekExact over seekCeil. Here's a quick take on what I'm 
>> thinking: 
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>>  I've tested this internally and confirmed it solves our QPS regression 
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term 
>> intersection approach for TermInSetQuery, or has alternative ideas. I wonder 
>> if I'm overlooking some important factors and focusing too much on this 
>> specific case where the bloom filter interaction is hurting performance? It 
>> seems like seekCeil could provide benefits in some cases over seekExact by 
>> skipping over multiple query terms at a time, so that's a possible 
>> consideration. If we solve for the most common cases by default, I suppose 
>> advanced users could always override TermInSetQuery#getTermsEnum as 
>> necessary (we could take this approach internally for example to work with 
>> our bloom filtering if the best default is to leverage seekCeil). I can 
>> easily turn my quick solution into a PR, but before I do, I wanted to poll 
>> this group for thoughts on the approach or other alternatives I might be 
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
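
For readers comparing the two strategies debated in this thread, a
stripped-down sketch. Here queryTerms is assumed to be a TreeSet<BytesRef>
of the query's terms, te a TermsEnum on the field, and collect() a
hypothetical stand-in for whatever the query does with a matched term:

    // Strategy A: seekExact probes each query term directly. Every probe
    // is an exact-match lookup, which is what lets a bloom filter answer
    // "absent" without touching the terms dictionary at all.
    for (BytesRef term : queryTerms) {
      if (te.seekExact(term)) {
        collect(te);
      }
    }

    // Strategy B: seekCeil "ping-pong". Each seek returns the smallest
    // index term >= target, so the index can skip whole runs of query
    // terms -- but it is not an exact probe, so bloom filters cannot help.
    BytesRef target = queryTerms.first();
    while (target != null) {
      TermsEnum.SeekStatus status = te.seekCeil(target);
      if (status == TermsEnum.SeekStatus.END) {
        break;
      }
      BytesRef indexTerm = BytesRef.deepCopyOf(te.term());
      if (status == TermsEnum.SeekStatus.FOUND) {
        collect(te);
        target = queryTerms.higher(indexTerm);  // next query term after the hit
      } else {
        target = queryTerms.ceiling(indexTerm); // skip query terms the index lacks
      }
    }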



Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Robert Muir
On Tue, May 2, 2023 at 3:24 PM Michael Froh  wrote:
>
> > This seems ok if it isn't invasive. I still feel like something is
> > "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> > you have way too many segments?
>
> From what I saw, it's 1KB per "leaf query" to create the BM25Scorer instance 
> (at the Weight level), but then that BM25Scorer is shared across all scorer 
> (DISI) instances for all segments. So it doesn't scale with segment count. It 
> looks like the old logic used to allocate a SimScorer per segment, so this is 
> a big improvement in that regard (for scoring clauses, since the non-scoring 
> clauses had a super-lightweight SimScorer).
>
> In this particular case, they're running these gnarly machine-generated 
> BoolenQuery trees with at least 512 non-scoring TermQuery clauses (across a 
> bunch of different fields, so TermInSetQuery isn't an option). From what I 
> can see, each of those TermQueries produces a TermWeight that holds a 
> BM25Scorer that holds yet another instance of this float[256] array, for 
> 512KB+ of these caches per running query. It's definitely only going to be an 
> issue for folks who are flying close to the max clause count.
>

Yeah, but the same situation could be said for buffers like this one:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java#L311-L312
So I'm actually still confused why this float[256] stands out in your
measurements vs two long[128]'s. Maybe it's a profiler ghost?

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Robert Muir
On Tue, May 2, 2023 at 2:34 PM Robert Muir  wrote:
>
> On Tue, May 2, 2023 at 12:49 PM Michael Froh  wrote:
> >
> > Hi all,
> >
> > I was looking into a customer issue where they noticed some increased GC 
> > time after upgrading from Lucene 7.x to 9.x. After taking some heap dumps 
> > from both systems, the big difference was tracked down to the float[256] 
> > allocated (as a norms cache) when creating a BM25Scorer (in 
> > BM25Similarity.scorer()).
> >
> > The change seems to have come in with 
> > https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1,
> >  which removed some of the special-case logic around the "non-scoring 
> > similarity" embedded in IndexSearcher (returned in the false case from the 
> > old IndexSearcher#scorer(boolean needsScores)).
> >
> > While I really like that we no longer have that special-case logic in 
> > IndexSearcher, we now have the issue that every time we create a new 
> > TermWeight (or other Weight) it allocates a float[256], even if the 
> > TermWeight doesn't need scores. Also, I think it's the exact same 
> > float[256] for all non-scoring weights, since it's being computed using the 
> > same "all 1s" CollectionStatistics and TermStatistics.
> >
> > (For the record, yes, the queries in question have an obscene number of 
> > TermQueries, so 1024 bytes times lots of TermWeights, times multiple 
> > queries running concurrently makes lots of heap allocation.)
> >
> > I'd like to submit a patch to fix this, but I'm wondering what approach to 
> > take. One option I'm considering is precomputing a singleton float[256] for 
> > the non-scoring case (where CollectionStatistics and TermStatistics are all 
> > 1s). That would have the least functional impact, but would let all 
> > non-scoring clauses share the same array. Is there a better way to tackle 
> > this?
> >
>
> This seems ok if it isn't invasive. I still feel like something is
> "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> you have way too many segments?
>
> Originally (for various similar reasons) there was a place in the API
> to do this, so it would only happen per-Weight instead of per-Scorer,
> which was the SimWeight that got eliminated by the commit you point
> to. But I'd love if we could steer clear of that complexity:
> simplifying the API here was definitely the right move. Its been more
> than 5 years since this change was made, and this is the first
> complaint i've heard about the 1KB, which is why i asked about your
> setup.

One last thought: we should re-check if the cache is still needed :) I
think decoding norms used to be more expensive in the past. This cache
is now only precomputing part of the bm25 formula to save some
add/multiply/divide.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Robert Muir
On Tue, May 2, 2023 at 12:49 PM Michael Froh  wrote:
>
> Hi all,
>
> I was looking into a customer issue where they noticed some increased GC time 
> after upgrading from Lucene 7.x to 9.x. After taking some heap dumps from 
> both systems, the big difference was tracked down to the float[256] allocated 
> (as a norms cache) when creating a BM25Scorer (in BM25Similarity.scorer()).
>
> The change seems to have come in with 
> https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1,
>  which removed some of the special-case logic around the "non-scoring 
> similarity" embedded in IndexSearcher (returned in the false case from the 
> old IndexSearcher#scorer(boolean needsScores)).
>
> While I really like that we no longer have that special-case logic in 
> IndexSearcher, we now have the issue that every time we create a new 
> TermWeight (or other Weight) it allocates a float[256], even if the 
> TermWeight doesn't need scores. Also, I think it's the exact same float[256] 
> for all non-scoring weights, since it's being computed using the same "all 
> 1s" CollectionStatistics and TermStatistics.
>
> (For the record, yes, the queries in question have an obscene number of 
> TermQueries, so 1024 bytes times lots of TermWeights, times multiple queries 
> running concurrently makes lots of heap allocation.)
>
> I'd like to submit a patch to fix this, but I'm wondering what approach to 
> take. One option I'm considering is precomputing a singleton float[256] for 
> the non-scoring case (where CollectionStatistics and TermStatistics are all 
> 1s). That would have the least functional impact, but would let all 
> non-scoring clauses share the same array. Is there a better way to tackle 
> this?
>

This seems ok if it isn't invasive. I still feel like something is
"off" if you are seeing GC time from 1KB-per-segment allocation. Do
you have way too many segments?

Originally (for various similar reasons) there was a place in the API
to do this, so it would only happen per-Weight instead of per-Scorer,
which was the SimWeight that got eliminated by the commit you point
to. But I'd love if we could steer clear of that complexity:
simplifying the API here was definitely the right move. Its been more
than 5 years since this change was made, and this is the first
complaint i've heard about the 1KB, which is why i asked about your
setup.
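
A minimal sketch of the singleton idea floated above. The helper names are
hypothetical and the actual BM25 cache formula is elided; the point is only
that all-1s statistics always produce the same float[256], so it can be
shared:

    // Hypothetical sketch: every non-scoring clause builds its norm cache
    // from identical "all 1s" collection/term statistics, so the resulting
    // float[256] is identical too and can be computed once and shared.
    private static final float[] NON_SCORING_CACHE = computeCache(ALL_ONES_STATS);

    @Override
    public SimScorer scorer(
        float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
      float[] cache =
          isAllOnes(collectionStats, termStats)
              ? NON_SCORING_CACHE                         // shared, no per-Weight allocation
              : computeCache(collectionStats, termStats); // scoring path, as today
      return new BM25Scorer(boost, cache /* , ... */);
    }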

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
I don't think we need a ton of random strings. But if you want to
optimize for strings of length 8, at a minimum there should be very
simple tests ensuring correctness for some boundary conditions (e.g. a
string of length exactly 8). I would also strongly recommend testing
non-ASCII input, since Java is a language with signed integer types, so it
may be susceptible to bugs where the input bytes have the "sign bit" set.

IMO this could be 2 simple unit tests.

Usually, at least with these kinds of algorithms, you can also find
published "test vectors" that intend to seek out the corner cases. If
these exist for murmurhash, we should fold them in too.
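
A sketch of the kind of boundary tests asked for here, against the existing
StringHelper.murmurhash3_x86_32(byte[], int, int, int) entry point and
JUnit-style assertions. It only asserts consistency across offsets at
block-boundary lengths; hard-coded murmur3 reference values should be
copied from a published source rather than invented:

    // Covers lengths around the 4- and 8-byte block boundaries, with bytes
    // that have the sign bit set, hashing the same window at two offsets.
    public void testBlockBoundariesAndNonAscii() {
      Random random = new Random(42);
      for (int len : new int[] {0, 1, 3, 4, 7, 8, 9, 15, 16, 17}) {
        byte[] data = new byte[len + 16];
        random.nextBytes(data);
        if (len > 0) {
          data[3] = (byte) 0xE4; // a "negative" byte at the start of the window
        }
        int atOffset = StringHelper.murmurhash3_x86_32(data, 3, len, 0);
        byte[] copy = ArrayUtil.copyOfSubArray(data, 3, 3 + len);
        int atZero = StringHelper.murmurhash3_x86_32(copy, 0, len, 0);
        assertEquals(atOffset, atZero); // the offset must not leak into the hash
      }
    }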

On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
 wrote:
>
> Hey,
>
> I offered to run a large number of random-string-hashes to ensure that the 
> output is the same pre- and post-change. I can add an arbitrary number of 
> such tests to TestStringHelper.java, just specify the number you wish.
>
> If your worry is that my change breaches the inlining bytecode limit: Did you 
> check whether the old version was inlineable or not? The new version is 263 
> bytecode instructions, the old version was 110. The default inlining limit 
> appears to be 35 bytecode instructions on cursory checking (I may be wrong on 
> this, though), so I don't think it was ever inlineable in default configs.
>
> On your statement "we haven't seen performance gains" -- the starting point 
> of this thread was a friendly request to please point me to instructions for 
> running a broad range of Lucene indexing benchmarks, so I can gather data for 
> further discussion; from my perspective, we haven't even gathered any data, 
> so obviously we haven't seen any gains.
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 4:27 PM Robert Muir  wrote:
>>
>> There is literally one string, all-ascii. This won't fail if all the
>> shifts and masks are wrong.
>>
>> About the inlining, i'm not talking about cpu stuff, i'm talking about
>> java. There are limits to the size of methods that get inlined (e.g.
>> -XX:MaxInlineSize). If we make this method enormous like this, it may
>> have performance consequences.
>>
>> We still haven't seen any performance gain from this. Elasticsearch
>> putting huge unique IDs into indexed terms doesn't count.
>>
>> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>>  wrote:
>> >
>> > Hey,
>> >
>> > so there are unit tests in TestStringHelper.java that test strings of 
>> > length greater than 8, and my change passes them. Could you explain what 
>> > you want tested?
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir  wrote:
>> >>
>> >> sure, but "if length > 8 return 1" might pass these same tests too,
>> >> yet cause a ton of hash collisions.
>> >>
>> >> I just think if you want to optimize for super-long strings, there
>> >> should be a unit test.
>> >>
>> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>> >>  wrote:
>> >> >
>> >> > Hey,
>> >> >
>> >> > I am pretty confident about correctness. The change passes both Lucene 
>> >> > and ES regression tests and my careful reading of the code is pretty 
>> >> > certain that the output is the same. If you want me to randomly test 
>> >> > the result for a few hundred million random strings, I'm happy to do 
>> >> > that, too, if you have other suggestions for correctness testing, let 
>> >> > me know.
>> >> >
>> >> > The change does increase the method size and may impact inlining - but 
>> >> > so does literally any code change, particularly in a JIT'ed environment 
>> >> > where placement of code (and hence things like instruction cache 
>> >> > conflicts) depend on the precise history of execution. The way I 
>> >> > understand it, one deals with this by benchmarking and measuring.
>> >> >
>> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable 
>> >> > improvement in indexing speed - this is why I was asking about a broad 
>> >> > range of Lucene benchmarks; to verify that this is indeed the case for 
>> >> > Lucene-only, too.
>> >> >
>> >> > Let me know what data you'd like to see to decide whether this patch is 
>> >> > a good idea, and if there is consensus among the Lucene committers that 
>> >> > t

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
There is literally one string, all-ASCII. This won't fail if all the
shifts and masks are wrong.

About the inlining, I'm not talking about CPU stuff, I'm talking about
Java. There are limits to the size of methods that get inlined (e.g.
-XX:MaxInlineSize). If we make this method enormous like this, it may
have performance consequences.

We still haven't seen any performance gain from this. Elasticsearch
putting huge unique IDs into indexed terms doesn't count.
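
The inlining thresholds in question are observable rather than a matter of
opinion: HotSpot prints its actual decisions (including reasons such as a
callee being too large) under diagnostic flags. For example, with a
placeholder benchmark jar:

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar indexing-benchmark.jar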

On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
 wrote:
>
> Hey,
>
> so there are unit tests in TestStringHelper.java that test strings of length 
> greater than 8, and my change passes them. Could you explain what you want 
> tested?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 4:21 PM Robert Muir  wrote:
>>
>> sure, but "if length > 8 return 1" might pass these same tests too,
>> yet cause a ton of hash collisions.
>>
>> I just think if you want to optimize for super-long strings, there
>> should be a unit test.
>>
>> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>>  wrote:
>> >
>> > Hey,
>> >
>> > I am pretty confident about correctness. The change passes both Lucene and 
>> > ES regression tests and my careful reading of the code is pretty certain 
>> > that the output is the same. If you want me to randomly test the result 
>> > for a few hundred million random strings, I'm happy to do that, too, if 
>> > you have other suggestions for correctness testing, let me know.
>> >
>> > The change does increase the method size and may impact inlining - but so 
>> > does literally any code change, particularly in a JIT'ed environment where 
>> > placement of code (and hence things like instruction cache conflicts) 
>> > depend on the precise history of execution. The way I understand it, one 
>> > deals with this by benchmarking and measuring.
>> >
>> > FWIW, several indexing-heavy ES benchmarks show a noticeable improvement 
>> > in indexing speed - this is why I was asking about a broad range of Lucene 
>> > benchmarks; to verify that this is indeed the case for Lucene-only, too.
>> >
>> > Let me know what data you'd like to see to decide whether this patch is a 
>> > good idea, and if there is consensus among the Lucene committers that 
>> > those are reasonable criteria, I'll work on producing that data.
>> >
>> > Cheers,
>> > Thomas
>> >
>> >
>> >
>> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir  wrote:
>> >>
>> >> well there is some cost, as it must add additional checks to see if
>> >> its longer than 8. in your patch, additional loops. it increases the
>> >> method size and may impact inlining and other things. also we can't
>> >> forget about correctness, if the hash function does the wrong thing it
>> >> could slow everything to a crawl.
>> >>
>> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>> >>  wrote:
>> >> >
>> >> > Ah, I see what you mean.
>> >> >
>> >> > You are correct -- the change will not speed up a 5-byte word, but it 
>> >> > *will* speed up all 8+-byte words, at no cost to the shorter words.
>> >> >
>> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir  wrote:
>> >> >>
>> >> >> if a word is of length 5, processing 8 bytes at a time isn't going to
>> >> >> speed anything up. there aren't 8 bytes to process.
>> >> >>
>> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>> >> >>  wrote:
>> >> >> >
>> >> >> > Is average word length <= 4 realistic though? I mean, even the 
>> >> >> > english wiki corpus has ~5, which would require two calls to the 
>> >> >> > lucene layer instead of one; e.g. multiple layers of virtual 
>> >> >> > dispatch that are unnecessary?
>> >> >> >
>> >> >> > You're not going to pay any cycles for reading 8 bytes instead of 4 
>> >> >> > bytes, so the cost of doing so will be the same - while speeding up 
>> >> >> > in cases where 4 isn't quite enough?
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Thomas
>> >> >> >
>> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir  wrote:
>> >> >> >>
>> >> >> >

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
sure, but "if length > 8 return 1" might pass these same tests too,
yet cause a ton of hash collisions.

I just think if you want to optimize for super-long strings, there
should be a unit test.

On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
 wrote:
>
> Hey,
>
> I am pretty confident about correctness. The change passes both Lucene and ES 
> regression tests and my careful reading of the code is pretty certain that 
> the output is the same. If you want me to randomly test the result for a few 
> hundred million random strings, I'm happy to do that, too, if you have other 
> suggestions for correctness testing, let me know.
>
> The change does increase the method size and may impact inlining - but so 
> does literally any code change, particularly in a JIT'ed environment where 
> placement of code (and hence things like instruction cache conflicts) depend 
> on the precise history of execution. The way I understand it, one deals with 
> this by benchmarking and measuring.
>
> FWIW, several indexing-heavy ES benchmarks show a noticeable improvement in 
> indexing speed - this is why I was asking about a broad range of Lucene 
> benchmarks; to verify that this is indeed the case for Lucene-only, too.
>
> Let me know what data you'd like to see to decide whether this patch is a 
> good idea, and if there is consensus among the Lucene committers that those 
> are reasonable criteria, I'll work on producing that data.
>
> Cheers,
> Thomas
>
>
>
> On Tue, Apr 25, 2023 at 4:02 PM Robert Muir  wrote:
>>
>> well there is some cost, as it must add additional checks to see if
>> its longer than 8. in your patch, additional loops. it increases the
>> method size and may impact inlining and other things. also we can't
>> forget about correctness, if the hash function does the wrong thing it
>> could slow everything to a crawl.
>>
>> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>>  wrote:
>> >
>> > Ah, I see what you mean.
>> >
>> > You are correct -- the change will not speed up a 5-byte word, but it 
>> > *will* speed up all 8+-byte words, at no cost to the shorter words.
>> >
>> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir  wrote:
>> >>
>> >> if a word is of length 5, processing 8 bytes at a time isn't going to
>> >> speed anything up. there aren't 8 bytes to process.
>> >>
>> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>> >>  wrote:
>> >> >
>> >> > Is average word length <= 4 realistic though? I mean, even the english 
>> >> > wiki corpus has ~5, which would require two calls to the lucene layer 
>> >> > instead of one; e.g. multiple layers of virtual dispatch that are 
>> >> > unnecessary?
>> >> >
>> >> > You're not going to pay any cycles for reading 8 bytes instead of 4 
>> >> > bytes, so the cost of doing so will be the same - while speeding up in 
>> >> > cases where 4 isn't quite enough?
>> >> >
>> >> > Cheers,
>> >> > Thomas
>> >> >
>> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir  wrote:
>> >> >>
>> >> >> i think from my perspective it has nothing to do with cpus being
>> >> >> 32-bit or 64-bit and more to do with the average length of terms in
>> >> >> most languages being smaller than 8. for the languages with longer
>> >> >> word length, its usually because of complex morphology that most users
>> >> >> would stem away. so doing 4 bytes at a time seems optimal IMO.
>> >> >> languages from nature don't care about your cpu.
>> >> >>
>> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> >> >>  wrote:
>> >> >> >
>> >> >> > For a truly "pure" indexing test I usually use a single thread for 
>> >> >> > indexing, and SerialMergeScheduler (using that single thread to also 
>> >> >> > do single-threaded merging).  It makes the indexing take forever lol 
>> >> >> > but it produces "comparable" results.
>> >> >> >
>> >> >> > But ... this sounds like a great change anyway?  Do we really need 
>> >> >> > to gate it on benchmark results?  Do we think there could be a 
>> >> >> > downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>> >> >> >
>> >>

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
well there is some cost, as it must add additional checks to see if
its longer than 8. in your patch, additional loops. it increases the
method size and may impact inlining and other things. also we can't
forget about correctness, if the hash function does the wrong thing it
could slow everything to a crawl.

On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
 wrote:
>
> Ah, I see what you mean.
>
> You are correct -- the change will not speed up a 5-byte word, but it *will* 
> speed up all 8+-byte words, at no cost to the shorter words.
>
> On Tue, Apr 25, 2023 at 3:20 PM Robert Muir  wrote:
>>
>> if a word is of length 5, processing 8 bytes at a time isn't going to
>> speed anything up. there aren't 8 bytes to process.
>>
>> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>  wrote:
>> >
>> > Is average word length <= 4 realistic though? I mean, even the english 
>> > wiki corpus has ~5, which would require two calls to the lucene layer 
>> > instead of one; e.g. multiple layers of virtual dispatch that are 
>> > unnecessary?
>> >
>> > You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, 
>> > so the cost of doing so will be the same - while speeding up in cases 
>> > where 4 isn't quite enough?
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir  wrote:
>> >>
>> >> i think from my perspective it has nothing to do with cpus being
>> >> 32-bit or 64-bit and more to do with the average length of terms in
>> >> most languages being smaller than 8. for the languages with longer
>> >> word length, its usually because of complex morphology that most users
>> >> would stem away. so doing 4 bytes at a time seems optimal IMO.
>> >> languages from nature don't care about your cpu.
>> >>
>> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> >>  wrote:
>> >> >
>> >> > For a truly "pure" indexing test I usually use a single thread for 
>> >> > indexing, and SerialMergeScheduler (using that single thread to also do 
>> >> > single-threaded merging).  It makes the indexing take forever lol but 
>> >> > it produces "comparable" results.
>> >> >
>> >> > But ... this sounds like a great change anyway?  Do we really need to 
>> >> > gate it on benchmark results?  Do we think there could be a downside 
>> >> > e.g. slower indexing on (the dwindling) 32 bit CPUs?
>> >> >
>> >> > Mike McCandless
>> >> >
>> >> > http://blog.mikemccandless.com
>> >> >
>> >> >
>> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir  wrote:
>> >> >>
>> >> >> I think the results of the benchmark will depend on the properties of
>> >> >> the indexed terms. For english wikipedia (luceneutil) the average word
>> >> >> length is around 5 bytes so this optimization may not do much.
>> >> >>
>> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai  
>> >> >> wrote:
>> >> >> >
>> >> >> > I did a quick run with your patch, but since I turned on the CMS as 
>> >> >> > well as TieredMergePolicy I'm not sure how fair the comparison is. 
>> >> >> > Here's the result:
>> >> >> > Candidate:
>> >> >> > Indexer: indexing done (890209 msec); total 2620 docs
>> >> >> > Indexer: waitForMerges done (71622 msec)
>> >> >> > Indexer: finished (961877 msec)
>> >> >> > Baseline:
>> >> >> > Indexer: indexing done (909706 msec); total 2620 docs
>> >> >> > Indexer: waitForMerges done (54775 msec)
>> >> >> > Indexer: finished (964528 msec)
>> >> >> >
>> >> >> > For more accurate comparison I guess it's better to use 
>> >> >> > LogxxMergePolicy and turn off CMS? If you want to run it yourself 
>> >> >> > you can find the lines I quoted from the log file.
>> >> >> >
>> >> >> > Patrick
>> >> >> >
>> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien 
>> >> >> >  wrote:
>> >> >> >>
>>

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
if a word is of length 5, processing 8 bytes at a time isn't going to
speed anything up. there aren't 8 bytes to process.

On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
 wrote:
>
> Is average word length <= 4 realistic though? I mean, even the english wiki 
> corpus has ~5, which would require two calls to the lucene layer instead of 
> one; e.g. multiple layers of virtual dispatch that are unnecessary?
>
> You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, so 
> the cost of doing so will be the same - while speeding up in cases where 4 
> isn't quite enough?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 3:07 PM Robert Muir  wrote:
>>
>> i think from my perspective it has nothing to do with cpus being
>> 32-bit or 64-bit and more to do with the average length of terms in
>> most languages being smaller than 8. for the languages with longer
>> word length, its usually because of complex morphology that most users
>> would stem away. so doing 4 bytes at a time seems optimal IMO.
>> languages from nature don't care about your cpu.
>>
>> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>  wrote:
>> >
>> > For a truly "pure" indexing test I usually use a single thread for 
>> > indexing, and SerialMergeScheduler (using that single thread to also do 
>> > single-threaded merging).  It makes the indexing take forever lol but it 
>> > produces "comparable" results.
>> >
>> > But ... this sounds like a great change anyway?  Do we really need to gate 
>> > it on benchmark results?  Do we think there could be a downside e.g. 
>> > slower indexing on (the dwindling) 32 bit CPUs?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir  wrote:
>> >>
>> >> I think the results of the benchmark will depend on the properties of
>> >> the indexed terms. For english wikipedia (luceneutil) the average word
>> >> length is around 5 bytes so this optimization may not do much.
>> >>
>> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai  wrote:
>> >> >
>> >> > I did a quick run with your patch, but since I turned on the CMS as 
>> >> > well as TieredMergePolicy I'm not sure how fair the comparison is. 
>> >> > Here's the result:
>> >> > Candidate:
>> >> > Indexer: indexing done (890209 msec); total 2620 docs
>> >> > Indexer: waitForMerges done (71622 msec)
>> >> > Indexer: finished (961877 msec)
>> >> > Baseline:
>> >> > Indexer: indexing done (909706 msec); total 2620 docs
>> >> > Indexer: waitForMerges done (54775 msec)
>> >> > Indexer: finished (964528 msec)
>> >> >
>> >> > For more accurate comparison I guess it's better to use 
>> >> > LogxxMergePolicy and turn off CMS? If you want to run it yourself you 
>> >> > can find the lines I quoted from the log file.
>> >> >
>> >> > Patrick
>> >> >
>> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien 
>> >> >  wrote:
>> >> >>
>> >> >> Hey all,
>> >> >>
>> >> >> I've been experimenting with fixing some low-hanging performance fruit 
>> >> >> in the ElasticSearch codebase, and came across the fact that the 
>> >> >> MurmurHash implementation that is used by BytesRef.hashCode() is 
>> >> >> reading 4 bytes per loop iteration (which is likely an artifact from 
>> >> >> 32-bit architectures, which are ever-less-important). I made a small 
>> >> >> fix to change the implementation to read 8 bytes per loop iteration; I 
>> >> >> expected a very small impact (2-3% CPU or so over an indexing run in 
>> >> >> ElasticSearch), but got a pretty nontrivial throughput improvement 
>> >> >> over a few indexing benchmarks.
>> >> >>
>> >> >> I tried running Lucene-only benchmarks, and succeeded in running the 
>> >> >> example from https://github.com/mikemccand/luceneutil - but I couldn't 
>> >> >> figure out how to run indexing benchmarks and how to interpret the 
>> >> >> results.
>> >> >>
>> >> >> Could someone help me in running the benchmarks for the attached patch?
>> >> >>
>> >> >> Cheers,
>> >> >> Thomas
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
I think from my perspective it has nothing to do with CPUs being
32-bit or 64-bit and more to do with the average length of terms in
most languages being smaller than 8. For the languages with longer
word length, it's usually because of complex morphology that most users
would stem away. So doing 4 bytes at a time seems optimal IMO.
Languages from nature don't care about your CPU.

On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
 wrote:
>
> For a truly "pure" indexing test I usually use a single thread for indexing, 
> and SerialMergeScheduler (using that single thread to also do single-threaded 
> merging).  It makes the indexing take forever lol but it produces 
> "comparable" results.
>
> But ... this sounds like a great change anyway?  Do we really need to gate it 
> on benchmark results?  Do we think there could be a downside e.g. slower 
> indexing on (the dwindling) 32 bit CPUs?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 25, 2023 at 7:39 AM Robert Muir  wrote:
>>
>> I think the results of the benchmark will depend on the properties of
>> the indexed terms. For english wikipedia (luceneutil) the average word
>> length is around 5 bytes so this optimization may not do much.
>>
>> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai  wrote:
>> >
>> > I did a quick run with your patch, but since I turned on the CMS as well 
>> > as TieredMergePolicy I'm not sure how fair the comparison is. Here's the 
>> > result:
>> > Candidate:
>> > Indexer: indexing done (890209 msec); total 2620 docs
>> > Indexer: waitForMerges done (71622 msec)
>> > Indexer: finished (961877 msec)
>> > Baseline:
>> > Indexer: indexing done (909706 msec); total 2620 docs
>> > Indexer: waitForMerges done (54775 msec)
>> > Indexer: finished (964528 msec)
>> >
>> > For more accurate comparison I guess it's better to use LogxxMergePolicy 
>> > and turn off CMS? If you want to run it yourself you can find the lines I 
>> > quoted from the log file.
>> >
>> > Patrick
>> >
>> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien 
>> >  wrote:
>> >>
>> >> Hey all,
>> >>
>> >> I've been experimenting with fixing some low-hanging performance fruit in 
>> >> the ElasticSearch codebase, and came across the fact that the MurmurHash 
>> >> implementation that is used by BytesRef.hashCode() is reading 4 bytes per 
>> >> loop iteration (which is likely an artifact from 32-bit architectures, 
>> >> which are ever-less-important). I made a small fix to change the 
>> >> implementation to read 8 bytes per loop iteration; I expected a very 
>> >> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but 
>> >> got a pretty nontrivial throughput improvement over a few indexing 
>> >> benchmarks.
>> >>
>> >> I tried running Lucene-only benchmarks, and succeeded in running the 
>> >> example from https://github.com/mikemccand/luceneutil - but I couldn't 
>> >> figure out how to run indexing benchmarks and how to interpret the 
>> >> results.
>> >>
>> >> Could someone help me in running the benchmarks for the attached patch?
>> >>
>> >> Cheers,
>> >> Thomas
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
I think the results of the benchmark will depend on the properties of
the indexed terms. For english wikipedia (luceneutil) the average word
length is around 5 bytes so this optimization may not do much.

On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai  wrote:
>
> I did a quick run with your patch, but since I turned on the CMS as well as 
> TieredMergePolicy I'm not sure how fair the comparison is. Here's the result:
> Candidate:
> Indexer: indexing done (890209 msec); total 2620 docs
> Indexer: waitForMerges done (71622 msec)
> Indexer: finished (961877 msec)
> Baseline:
> Indexer: indexing done (909706 msec); total 2620 docs
> Indexer: waitForMerges done (54775 msec)
> Indexer: finished (964528 msec)
>
> For more accurate comparison I guess it's better to use LogxxMergePolicy and 
> turn off CMS? If you want to run it yourself you can find the lines I quoted 
> from the log file.
>
> Patrick
>
> On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien 
>  wrote:
>>
>> Hey all,
>>
>> I've been experimenting with fixing some low-hanging performance fruit in 
>> the ElasticSearch codebase, and came across the fact that the MurmurHash 
>> implementation that is used by BytesRef.hashCode() is reading 4 bytes per 
>> loop iteration (which is likely an artifact from 32-bit architectures, which 
>> are ever-less-important). I made a small fix to change the implementation to 
>> read 8 bytes per loop iteration; I expected a very small impact (2-3% CPU or 
>> so over an indexing run in ElasticSearch), but got a pretty nontrivial 
>> throughput improvement over a few indexing benchmarks.
>>
>> I tried running Lucene-only benchmarks, and succeeded in running the example 
>> from https://github.com/mikemccand/luceneutil - but I couldn't figure out 
>> how to run indexing benchmarks and how to interpret the results.
>>
>> Could someone help me in running the benchmarks for the attached patch?
>>
>> Cheers,
>> Thomas
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
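
For readers who want to see the shape of the change under discussion: the
standard murmur3_x86_32 body mixes one little-endian 4-byte block per
round, and reading a long and feeding its two 32-bit halves through the
same rounds keeps the output bit-for-bit identical while halving the number
of loads. A rough sketch; the VarHandle read assumes Lucene 9.x's
BitUtil.VH_LE_LONG accessor, and tail handling stays as in the existing
code:

    // Body loop only; anything past the last 8-byte block still goes
    // through the old 4-byte block + byte-tail logic, so results match.
    int h1 = seed;
    int i = offset;
    final int end8 = offset + (len & ~7); // round len down to a multiple of 8
    for (; i < end8; i += 8) {
      long v = (long) BitUtil.VH_LE_LONG.get(data, i); // one 8-byte load
      // low 32 bits: exactly the standard murmur3 round
      int k1 = (int) v;
      k1 *= 0xcc9e2d51; k1 = Integer.rotateLeft(k1, 15); k1 *= 0x1b873593;
      h1 ^= k1; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
      // high 32 bits: the same round again
      int k2 = (int) (v >>> 32);
      k2 *= 0xcc9e2d51; k2 = Integer.rotateLeft(k2, 15); k2 *= 0x1b873593;
      h1 ^= k2; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
    }
    // ... then: at most one 4-byte block, the 0-3 byte tail, and the
    // finalization mix, unchanged from the existing implementation ...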



Re: Should IndexWriter.flush return seqNo?

2023-04-23 Thread Robert Muir
>
> Yes that's true, I just have to add: You can still open an NRT reader
> directly from IndexWriter. But you don't need a sequence number there as
> it's hidden completely. So flushing is fine to allow users to get a new
> NRT reader with the state up to that point, but it does not need to
> return anything.
>

Uwe, sorry, I must correct you: flushing doesn't do that. It doesn't
allow you to get an NRT reader or any other type of reader. It is the
same as if you filled up the RAM buffer with documents, that is all. If
you want an NRT reader you should be calling openIfChanged (and calling
flush yourself is irrelevant/unnecessary). The two methods are
completely separate, to me unrelated. That's why flush makes no sense
in the API.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should IndexWriter.flush return seqNo?

2023-04-21 Thread Robert Muir
This is not true: if I call IndexWriter.commit, then I can open an
IndexReader and see the documents.

IndexWriter.flush doesn't do anything at all, really, just moves stuff
from RAM to disk, but not in a way that an IndexReader can see it or
anything, right?

It doesn't make much sense that this method is public in the API, and
adding a sequence number definitely makes no sense since nothing was
committed here.

On Thu, Apr 20, 2023 at 1:28 AM Patrick Zhai  wrote:
>
> Hi folks,
> I just realized that while "commit" returns the sequence number which 
> represents the latest event that committed in the index, "flush" still 
> returns nothing. Since they're essentially the same except fsync I wonder 
> whether there's any specific reason to not do so?
>
> Best
> Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
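
To ground the distinction drawn in this thread, a minimal sketch of the
three operations against the public 9.x API (the path and field are
placeholders):

    Directory dir = FSDirectory.open(Path.of("/tmp/idx"));
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig());

    Document doc = new Document();
    doc.add(new StringField("id", "1", Field.Store.NO));
    long seqNo = writer.addDocument(doc); // every change returns a sequence number

    writer.flush(); // returns void: RAM buffers become segments on disk, but
                    // no reader observes anything new because of this call alone

    DirectoryReader nrt = DirectoryReader.open(writer); // NRT reader: sees the
                                                        // doc, flushing as needed

    long commitSeqNo = writer.commit(); // durable: from now on a fresh
                                        // DirectoryReader.open(dir) sees the doc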



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Robert Muir
I don't care. You guys personally attacked me first. And then it turns
out, you were being dishonest the entire time and hiding your true
intent, which was not search at all but instead some chatgpt pyramid
scheme or similar.

I'm done with this thread.

On Sun, Apr 9, 2023 at 7:37 AM Alessandro Benedetti
 wrote:
>
> I don't think this tone and language is appropriate for a community of 
> volunteers and men of science.
>
> I personally find it offensive to generalise Lucene people here as "crazy 
> people hyped about chatGPT".
>
> I personally don't give a damn about chatGPT except the fact it is a very 
> interesting technology.
>
> As usual I see very little motivation and a lot of "convince me".
> We're discussing here about a limit that raises an exception.
>
> Improving performance is absolutely important and no-one here is saying we 
> won't address it, it's just a separate discussion.
>
>
> On Sun, 9 Apr 2023, 12:59 Robert Muir,  wrote:
>>
>> Also, please let's only discuss SEARCH. Lucene is a SEARCH ENGINE
>> LIBRARY, not a vector database or whatever trash is being proposed
>> here.
>>
>> i think we should table this and revisit it after chatgpt hype has 
>> dissipated.
>>
>> this hype is causing ppl to behave irrationally, it is why i can't
>> converse with basically anyone on this thread because they are all
>> stating crazy things that don't make sense.
>>
>> On Sun, Apr 9, 2023 at 6:25 AM Robert Muir  wrote:
>> >
>> > Yes, it's very clear that folks on this thread are ignoring reason
>> > entirely and completely swooned by chatgpt-hype.
>> > And what happens when they make chatgpt-8 that uses even more dimensions?
>> > backwards compatibility decisions can't be made by garbage hype such
>> > as cryptocurrency or chatgpt.
>> > Trying to convince me we should bump it because of chatgpt, well, i
>> > think it has the opposite effect.
>> >
>> > Please, lemme see real technical arguments why this limit needs to be
>> > bumped. not including trash like chatgpt.
>> >
>> > On Sat, Apr 8, 2023 at 7:50 PM Marcus Eagan  wrote:
>> > >
>> > > Given the massive amounts of funding going into the development and 
>> > > investigation of the project, I think it would be good to at least have 
>> > > Lucene be a part of the conversation. Simply because academics typically 
>> > > focus on vectors <= 784 dimensions does not mean all users will. A large 
>> > > swathe of very important users of the Lucene project never exceed 500k 
>> > > documents, though they are shifting to other search engines to try out 
>> > > very popular embeddings.
>> > >
>> > > I think giving our users the opportunity to build chat bots or LLM 
>> > > memory machines using Lucene is a positive development, even if some 
>> > > datasets won't be able to work well. We don't limit the number of fields 
>> > > someone can add in most cases, though we did just undeprecate that API 
>> > > to better support multi-tenancy. But people still add so many fields and 
>> > > can crash their clusters with mapping explosions when unlimited. The 
>> > > limit to vectors feels similar.  I expect more people to dig into Lucene 
>> > > due to its openness and robustness as they run into problems. Today, 
>> > > they are forced to consider other engines that are more permissive.
>> > >
>> > > Not every important or valuable Lucene workload is in the millions of 
>> > > documents. Many of them only have lots of queries or computationally 
>> > > expensive access patterns for B-trees.  We can document that it is very 
>> > > ill-advised to make a deployment with vectors too large. What others 
>> > > will do with it is on them.
>> > >
>> > >
>> > > On Sat, Apr 8, 2023 at 2:29 PM Adrien Grand  wrote:
>> > >>
>> > >> As Dawid pointed out earlier on this thread, this is the rule for
>> > >> Apache projects: a single -1 vote on a code change is a veto and
>> > >> cannot be overridden. Furthermore, Robert is one of the people on this
>> > >> project who worked the most on debugging subtle bugs, making Lucene
>> > >> more robust and improving our test framework, so I'm listening when he
>> > >> voices quality concerns.
>> > >>
>> > >> The argument against removing/raising the limit that resonates with me
>> > >> the most is

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Robert Muir
Also, please let's only discuss SEARCH. Lucene is a SEARCH ENGINE
LIBRARY, not a vector database or whatever trash is being proposed
here.

I think we should table this and revisit it after the chatgpt hype has dissipated.

This hype is causing people to behave irrationally; it is why I can't
converse with basically anyone on this thread, because they are all
stating crazy things that don't make sense.

On Sun, Apr 9, 2023 at 6:25 AM Robert Muir  wrote:
>
> Yes, it's very clear that folks on this thread are ignoring reason
> entirely and completely swooned by chatgpt-hype.
> And what happens when they make chatgpt-8 that uses even more dimensions?
> backwards compatibility decisions can't be made by garbage hype such
> as cryptocurrency or chatgpt.
> Trying to convince me we should bump it because of chatgpt, well, i
> think it has the opposite effect.
>
> Please, lemme see real technical arguments why this limit needs to be
> bumped. not including trash like chatgpt.
>
> On Sat, Apr 8, 2023 at 7:50 PM Marcus Eagan  wrote:
> >
> > Given the massive amounts of funding going into the development and 
> > investigation of the project, I think it would be good to at least have 
> > Lucene be a part of the conversation. Simply because academics typically 
> > focus on vectors <= 784 dimensions does not mean all users will. A large 
> > swathe of very important users of the Lucene project never exceed 500k 
> > documents, though they are shifting to other search engines to try out very 
> > popular embeddings.
> >
> > I think giving our users the opportunity to build chat bots or LLM memory 
> > machines using Lucene is a positive development, even if some datasets 
> > won't be able to work well. We don't limit the number of fields someone can 
> > add in most cases, though we did just undeprecate that API to better 
> > support multi-tenancy. But people still add so many fields and can crash 
> > their clusters with mapping explosions when unlimited. The limit to vectors 
> > feels similar.  I expect more people to dig into Lucene due to its openness 
> > and robustness as they run into problems. Today, they are forced to 
> > consider other engines that are more permissive.
> >
> > Not every important or valuable Lucene workload is in the millions of 
> > documents. Many of them only have lots of queries or computationally 
> > expensive access patterns for B-trees.  We can document that it is very 
> > ill-advised to make a deployment with vectors too large. What others will 
> > do with it is on them.
> >
> >
> > On Sat, Apr 8, 2023 at 2:29 PM Adrien Grand  wrote:
> >>
> >> As Dawid pointed out earlier on this thread, this is the rule for
> >> Apache projects: a single -1 vote on a code change is a veto and
> >> cannot be overridden. Furthermore, Robert is one of the people on this
> >> project who worked the most on debugging subtle bugs, making Lucene
> >> more robust and improving our test framework, so I'm listening when he
> >> voices quality concerns.
> >>
> >> The argument against removing/raising the limit that resonates with me
> >> the most is that it is a one-way door. As MikeS highlighted earlier on
> >> this thread, implementations may want to take advantage of the fact
> >> that there is a limit at some point too. This is why I don't want to
> >> remove the limit and would prefer a slight increase, such as 2048 as
> >> suggested in the original issue, which would enable most of the things
> >> that users who have been asking about raising the limit would like to
> >> do.
> >>
> >> I agree that the merge-time memory usage and slow indexing rate are
> >> not great. But it's still possible to index multi-million vector
> >> datasets with a 4GB heap without hitting OOMEs regardless of the
> >> number of dimensions, and the feedback I'm seeing is that many users
> >> are still interested in indexing multi-million vector datasets despite
> >> the slow indexing rate. I wish we could do better, and vector indexing
> >> is certainly more expert than text indexing, but it still is usable in
> >> my opinion. I understand how giving Lucene more information about
> >> vectors prior to indexing (e.g. clustering information as Jim pointed
> >> out) could help make merging faster and more memory-efficient, but I
> >> would really like to avoid making it a requirement for indexing
> >> vectors as it also makes this feature much harder to use.
> >>
> >> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
> >>  wrote:
> >

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Robert Muir
 
>> > high dimensional space will equally damage it.
>> >
>> > To me it's really a no brainer.
>> > Removing the limit and enable people to use high dimensional vectors will 
>> > take minutes.
>> > Improving the hnsw implementation can take months.
>> > Pick one to begin with...
>> >
>> > And there's no-one paying me here, no company interest whatsoever, 
>> > actually I pay people to contribute, I am just convinced it's a good idea.
>> >
>> >
>> > On Sat, 8 Apr 2023, 18:57 Robert Muir,  wrote:
>> >>
>> >> I disagree with your categorization. I put in plenty of work and
>> >> experienced plenty of pain myself, writing tests and fighting these
>> >> issues, after i saw that, two releases in a row, vector indexing fell
>> >> over and hit integer overflows etc on small datasets:
>> >>
>> >> https://github.com/apache/lucene/pull/11905
>> >>
>> >> Attacking me isn't helping the situation.
>> >>
>> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> >> any kind of demeaning fashion really. I meant to describe the current
>> >> state of usability with respect to indexing a few million docs with
>> >> high dimensions. You can scroll up the thread and see that at least
>> >> one other committer on the project experienced similar pain as me.
>> >> Then, think about users who aren't committers trying to use the
>> >> functionality!
>> >>
>> >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov  
>> >> wrote:
>> >> >
>> >> > What you said about increasing dimensions requiring a bigger ram buffer 
>> >> > on merge is wrong. That's the point I was trying to make. Your concerns 
>> >> > about merge costs are not wrong, but your conclusion that we need to 
>> >> > limit dimensions is not justified.
>> >> >
>> >> > You complain that hnsw sucks and doesn't scale, but when I show it 
>> >> > scales linearly with dimension you just ignore that and complain about 
>> >> > something entirely different.
>> >> >
>> >> > You demand that people run all kinds of tests to prove you wrong, but 
>> >> > when they do, you don't listen; you won't put in the work yourself, 
>> >> > or you complain that it's too hard.
>> >> >
>> >> > Then you complain about people not meeting you half way. Wow
>> >> >
>> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:
>> >> >>
>> >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>> >> >>  wrote:
>> >> >> >
>> >> >> > What exactly do you consider reasonable?
>> >> >>
>> >> >> Let's begin a real discussion by being HONEST about the current
>> >> >> status. Please put political correctness or your own company's wishes
>> >> >> aside; we know it's not in a good state.
>> >> >>
>> >> >> Current status is the one guy who wrote the code can set a
>> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> >> dimensions in HOURS (i didn't ask what hardware).
>> >> >>
>> >> >> My concerns are about everyone else except the one guy; I want it to be
>> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> >> It is also a permanent backwards compatibility decision, we have to
>> >> >> support it once we do this and we can't just say "oops" and flip it
>> >> >> back.
>> >> >>
>> >> >> It is unclear to me if the multi-gigabyte ram buffer is really to
>> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> >> or if its to avoid merges so it doesn't hit OOM.
>> >> >> Also from personal experience, it takes trial and error (means
>> >> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> >> for your dataset. This usually means starting over which is
>> >> >> frustrating and wastes more time.
>> >> >>
>> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> >> avoided in this way and performance improved by writing bigger
>> >> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> >> that indexing really scales.
>> >> >>
>> >> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM,
>> >> >> and definitely it shouldn't burn hours and hours of CPU in O(n^2)
>> >> >> fashion when indexing.
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>>
>> --
>> Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
> --
> Marcus Eagan
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Robert Muir
I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after i saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc on small datasets:

https://github.com/apache/lucene/pull/11905

Attacking me isn't helping the situation.

PS: when i said the "one guy who wrote the code" I didn't mean it in
any kind of demeaning fashion really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!

On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov  wrote:
>
> What you said about increasing dimensions requiring a bigger ram buffer on 
> merge is wrong. That's the point I was trying to make. Your concerns about 
> merge costs are not wrong, but your conclusion that we need to limit 
> dimensions is not justified.
>
> You complain that hnsw sucks and doesn't scale, but when I show it scales 
> linearly with dimension you just ignore that and complain about something 
> entirely different.
>
> You demand that people run all kinds of tests to prove you wrong, but when 
> they do, you don't listen; you won't put in the work yourself, or you complain 
> that it's too hard.
>
> Then you complain about people not meeting you half way. Wow
>
> On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:
>>
>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>>  wrote:
>> >
>> > What exactly do you consider reasonable?
>>
>> Let's begin a real discussion by being HONEST about the current
>> status. Please put political correctness or your own company's wishes
>> aside; we know it's not in a good state.
>>
>> Current status is the one guy who wrote the code can set a
>> multi-gigabyte ram buffer and index a small dataset with 1024
>> dimensions in HOURS (i didn't ask what hardware).
>>
>> My concerns are about everyone else except the one guy; I want it to be
>> usable. Increasing dimensions just means even bigger multi-gigabyte
>> ram buffer and bigger heap to avoid OOM on merge.
>> It is also a permanent backwards compatibility decision, we have to
>> support it once we do this and we can't just say "oops" and flip it
>> back.
>>
>> It is unclear to me if the multi-gigabyte ram buffer is really to
>> avoid merges because they are so slow and it would be DAYS otherwise,
>> or if its to avoid merges so it doesn't hit OOM.
>> Also from personal experience, it takes trial and error (means
>> experiencing OOM on merge!!!) before you get those heap values correct
>> for your dataset. This usually means starting over which is
>> frustrating and wastes more time.
>>
>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> avoided in this way and performance improved by writing bigger
>> segments with lucene's defaults. But this doesn't mean we can simply
>> ignore the horrors of what happens on merge. merging needs to scale so
>> that indexing really scales.
>>
>> At least it shouldn't spike RAM on trivial data amounts and cause OOM,
>> and definitely it shouldn't burn hours and hours of CPU in O(n^2)
>> fashion when indexing.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Robert Muir
On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
 wrote:
>
> What exactly do you consider reasonable?

Let's begin a real discussion by being HONEST about the current
status. Please put political correctness or your own company's wishes
aside; we know it's not in a good state.

Current status is the one guy who wrote the code can set a
multi-gigabyte ram buffer and index a small dataset with 1024
dimensions in HOURS (i didn't ask what hardware).

My concerns are about everyone else except the one guy; I want it to be
usable. Increasing dimensions just means even bigger multi-gigabyte
ram buffer and bigger heap to avoid OOM on merge.
It is also a permanent backwards compatibility decision, we have to
support it once we do this and we can't just say "oops" and flip it
back.

It is unclear to me if the multi-gigabyte ram buffer is really to
avoid merges because they are so slow and it would be DAYS otherwise,
or if its to avoid merges so it doesn't hit OOM.
Also from personal experience, it takes trial and error (means
experiencing OOM on merge!!!) before you get those heap values correct
for your dataset. This usually means starting over which is
frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter, seems
to me like its a good idea. maybe the multigigabyte ram buffer can be
avoided in this way and performance improved by writing bigger
segments with lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge. merging needs to scale so
that indexing really scales.

At least it shouldn't spike RAM on trivial data amounts and cause OOM,
and definitely it shouldn't burn hours and hours of CPU in O(n^2)
fashion when indexing.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
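
For concreteness, the knob at the center of this exchange is
IndexWriterConfig.setRAMBufferSizeMB(): a bigger buffer defers flushes, which
means fewer segments and fewer graph-rebuilding merges, at the price of heap
(1M raw vectors at 1024 float dimensions is already ~4 GB before any graph
overhead). A minimal sketch of such an indexing run, assuming Lucene 9.5+ on
the classpath (KnnFloatVectorField; earlier 9.x releases spell it
KnnVectorField) and an invented index path:

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class IndexVectors {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig();
        // Runs quoted in this thread used ~1994 MB; the default is 16 MB.
        iwc.setRAMBufferSizeMB(1994);
        try (IndexWriter writer =
            new IndexWriter(FSDirectory.open(Paths.get("vector-index")), iwc)) {
          for (int i = 0; i < 1_000_000; i++) {
            Document doc = new Document();
            // randomVector() stands in for real embeddings.
            doc.add(new KnnFloatVectorField(
                "vec", randomVector(1024), VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
          }
        }
      }

      private static float[] randomVector(int dim) {
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) {
          v[i] = (float) Math.random();
        }
        return v;
      }
    }

Whether the big buffer mainly avoids OOM or mainly avoids slow merges is
exactly the open question above; the sketch only shows where the knob lives.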



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Robert Muir
Great way to try to meet me in the middle and win me over: basically
just dismissing my concerns. This is not going to achieve what you want.

On Sat, Apr 8, 2023 at 5:56 AM Alessandro Benedetti
 wrote:
>
> Yes, that was explicitly mentioned in the original mail: improving the 
> vector-based search of Lucene is an interesting area, but off topic here.
>
> Let's summarise:
> - We want to at least increase the limit (or remove it)
> - We proved that performance is ok to do it (and we can improve it more in 
> the future); no harm is done to users who intend to stick to low 
> dimensional vectors
>
> What are the next steps?
> What apache community tool can we use to agree on a new limit/no explicit 
> limit (max integer)?
> I think we need some sort of place where each of us proposes a limit with a 
> motivation and we vote on the best option.
> Any idea on how to do it?
>
> Cheers
>
> On Sat, 8 Apr 2023, 03:57 Michael Wechner,  wrote:
>>
>> sorry to interrupt, but I think we are getting side-tracked from the original 
>> discussion about increasing the vector dimension limit.
>>
>> I think improving the vector indexing performance is one thing and making 
>> sure Lucene does not crash when increasing the vector dimension limit is 
>> another.
>>
>> I think it is great to find better ways to index vectors, but I think this 
>> should not prevent people from being able to use models with higher vector 
>> dimensions than 1024.
>>
>> The following comparison might not be perfect, but imagine we have invented 
>> a combustion engine that is strong enough to move a car on flat ground, 
>> but when applied to a truck moving things over mountains it will fail, 
>> because it is not strong enough. Would you prevent people from using the 
>> combustion engine for a car on flat ground?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> On 08.04.23 at 00:15, jim ferenczi wrote:
>>
>> > Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> Yep I agree. Personally I don't see how we can solve this without prior 
>> knowledge of the vectors. Faiss has a nice implementation that fits 
>> naturally with Lucene called IVF (
>> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
>> but if we want to avoid running kmeans on every merge we'd have to 
>> provide the clusters for the entire index before indexing the first vector.
>> It's a complex issue…
>>
>> On Fri, 7 Apr 2023 at 22:58, Robert Muir  wrote:
>>>
>>> Personally i'd have to re-read the paper, but in general the merging
>>> issue has to be addressed somehow to fix the overall indexing time
>>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>>> here.
>>> Keep in mind, there may be other ways to do it. In general if merging
>>> something is going to be "heavyweight", we should think about it to
>>> prevent things from going really bad overall.
>>>
>>> As an example, I'm most familiar with adding DEFLATE compression to
>>> stored fields. Previously, we'd basically decompress and recompress
>>> the stored fields on merge, and LZ4 is so fast that it wasn't
>>> obviously a problem. But with DEFLATE it got slower/heavier (more
>>> intense compression algorithm), something had to be done or indexing
>>> would be unacceptably slow. Hence if you look at storedfields writer,
>>> there is "dirtiness" logic etc so that recompression is amortized over
>>> time and doesn't happen on every merge.
>>>
>>> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi  wrote:
>>> >
>>> > I am also not sure that diskann would solve the merging issue. The idea 
>>> > described in the paper is to run kmeans first to create multiple graphs, 
>>> > one per cluster. In our case the vectors in each segment could belong to 
>>> > different clusters, so I don’t see how we could merge them efficiently.
>>> >
>>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:
>>> >>
>>> >> The inference time (and cost) to generate these big vectors must be 
>>> >> quite large too ;).
>>> >> Regarding the ram buffer, we could drastically reduce the size by 
>>> >> writing the vectors on disk instead of keeping them in the heap. With 1k 
>> >> dimensions the ram buffer is filled with these vectors quite rapidly.

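Jim's IVF point quoted above can be made concrete: if every segment assigns
its vectors against one fixed, pre-trained codebook, per-centroid lists from
different segments can be merged by concatenation, but that codebook has to
exist before the first vector is indexed. A hedged sketch of just the
assignment step (a plain nearest-centroid scan in illustrative Java; this is
neither Faiss nor Lucene code):

    // Assign a vector to its nearest centroid in a fixed, shared codebook.
    // With a shared codebook, merging per-centroid lists is concatenation;
    // without one, a merge has to re-cluster (i.e. rebuild).
    public class IvfAssign {
      static int assign(float[] vector, float[][] centroids) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
          float d = 0;
          for (int i = 0; i < vector.length; i++) {
            float diff = vector[i] - centroids[c][i];
            d += diff * diff; // squared Euclidean distance
          }
          if (d < bestDist) {
            bestDist = d;
            best = c;
          }
        }
        return best;
      }
    }
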
Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Robert Muir
I don't think we have. The performance needs to be reasonable in order
to bump this limit. Otherwise bumping this limit makes the worst-case
2x worse than it already is!

Moreover, its clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the
way of that. Because of backwards compatibility, it's a one-way,
permanent, irreversible change.

I'm not sold by any means in any way yet. My vote remains the same.

On Fri, Apr 7, 2023 at 10:57 PM Michael Wechner
 wrote:
>
> sorry to interrupt, but I think we are getting side-tracked from the original 
> discussion about increasing the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making 
> sure Lucene does not crash when increasing the vector dimension limit is 
> another.
>
> I think it is great to find better ways to index vectors, but I think this 
> should not prevent people from being able to use models with higher vector 
> dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have invented a 
> combustion engine that is strong enough to move a car on flat ground, but 
> when applied to a truck moving things over mountains it will fail, 
> because it is not strong enough. Would you prevent people from using the 
> combustion engine for a car on flat ground?
>
> Thanks
>
> Michael
>
>
>
> On 08.04.23 at 00:15, jim ferenczi wrote:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don't see how we can solve this without prior 
> knowledge of the vectors. Faiss has a nice implementation that fits naturally 
> with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we'd have to provide 
> the clusters for the entire index before indexing the first vector.
> It's a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir  wrote:
>>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi  wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea 
>> > described in the paper is to run kmeans first to create multiple graphs, 
>> > one per cluster. In our case the vectors in each segment could belong to 
>> > different clusters, so I don’t see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be quite 
>> >> large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by writing 
>> >> the vectors on disk instead of keeping them in the heap. With 1k 
>> >> dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  
>> >>> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer 
>> >>> > size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
Personally i'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing time
problem. It seems it gets "dodged" with huge rambuffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at storedfields writer,
there is "dirtiness" logic etc so that recompression is amortized over
time and doesn't happen on every merge.

On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi  wrote:
>
> I am also not sure that diskann would solve the merging issue. The idea 
> described in the paper is to run kmeans first to create multiple graphs, one 
> per cluster. In our case the vectors in each segment could belong to 
> different clusters, so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:
>>
>> The inference time (and cost) to generate these big vectors must be quite 
>> large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing 
>> the vectors on disk instead of keeping them in the heap. With 1k dimensions 
>> the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
>>>
>>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing; it's nothing.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
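
The amortization trick is simple to state in code. A hedged sketch of the idea
(class name and threshold invented here; this is not Lucene's actual
StoredFieldsWriter): chunks whose compressed bytes are still valid get copied
through on merge, and the expensive decompress+recompress path only runs once
enough "dirty" chunks accumulate:

    // Sketch only. Compressed chunks are copied to the merged segment
    // verbatim ("bulk copy"), even when slightly dirty; only when enough
    // dirty chunks accumulate do we pay for a full decompress+recompress,
    // so the heavy DEFLATE work is amortized across many merges.
    class AmortizedChunkMerger {
      private static final int MAX_DIRTY_CHUNKS = 32; // invented threshold

      private int pendingDirtyChunks;

      /** Returns true when accumulated dirtiness justifies recompressing. */
      boolean mustRecompress(boolean chunkIsDirty) {
        if (chunkIsDirty) {
          pendingDirtyChunks++;
        }
        if (pendingDirtyChunks >= MAX_DIRTY_CHUNKS) {
          pendingDirtyChunks = 0; // pay the DEFLATE cost now, clearing the debt
          return true;
        }
        return false; // cheap path: raw copy of already-compressed bytes
      }
    }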



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
On Fri, Apr 7, 2023 at 5:13 PM Benjamin Trent  wrote:
>
> From all I have seen when hooking up JFR while indexing a medium number of 
> vectors (1M+), almost all the time is spent simply comparing the vectors 
> (e.g. dot_product).
>
> This indicates to me that another algorithm won't really help index build 
> time tremendously. Unless others do dramatically fewer vector comparisons 
> (from what I can tell, this is at least not true for DiskAnn, unless some 
> fancy footwork is done when building the PQ codebook).
>
> I would also say comparing vector index build time to indexing terms is 
> apples and oranges. Yeah, they both live in Lucene, but the number of 
> calculations required (no matter the data structure used) will be orders of 
> magnitude greater.
>

I'm not sure; i think this slowness due to the massive number of
comparisons is just another side effect of the unscalable algorithm.
It is designed to build an in-memory datastructure and "merge" means
"rebuild". And since we fully rebuild a new one when merging, you get
something like O(n^2) total indexing when you take merges into
account.
Some of the other algorithms... in fact support merging. The DiskANN
paper has like a "chapter" on this.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
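
The two observations fit together: total comparison count is (number of graph
insertions) x (comparisons per insertion), and rebuild-on-merge multiplies the
insertion count, since every merge re-inserts all vectors of the merged
segments into a fresh graph. A toy back-of-the-envelope model (all constants
invented for illustration; real merge policies and HNSW beam widths differ):

    // Toy model of rebuild-on-merge cost, not a benchmark.
    public class MergeCostModel {
      public static void main(String[] args) {
        long n = 8_000_000; // documents
        long f = 100_000;   // documents per flushed segment
        int m = 10;         // merge factor
        long c = 10_000;    // comparisons per inserted vector (guessed)

        long insertions = n; // initial inserts at flush time
        long segmentSize = f;
        // Roughly one full re-insert pass per level of merging.
        while (segmentSize * m <= n) {
          insertions += n;
          segmentSize *= m;
        }
        System.out.printf("~%,d insertions, ~%,d comparisons%n",
            insertions, insertions * c);
      }
    }

Deeper merge hierarchies (smaller flushed segments, more data) add more full
re-insert passes, which is the superlinear blowup described above as
"something like O(n^2)".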



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  wrote:
>
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>

My scale concerns are both space and time. What does the execution
time look like if you don't set insanely large IW rambuffer? The
default is 16MB. Just concerned we're shoving some problems under the
rug :)

Even with the yuge RAMbuffer, we're still talking about almost 2 hours
to index 4M documents with these 2k vectors. Whereas you'd measure
this in seconds with typical lucene indexing; it's nothing.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Robert Muir
Well, I'm asking ppl to actually try to test using such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then it's not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc, but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go, it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldn't do before.
Because if it still doesn't work, it still doesn't work.

We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales.

On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
 wrote:
>
> As I said earlier, a max limit limits usability.
> It's not forcing users with small vectors to pay the performance penalty of 
> big vectors; it's literally preventing some users from using 
> Lucene/Solr/Elasticsearch at all.
> As far as I know, the max limit is used to raise an exception, it's not used 
> to initialise or optimise data structures (please correct me if I'm wrong).
>
> Improving the algorithm performance is a separate discussion.
> I don't see how a usability parameter correlates with the fact that indexing 
> billions of vectors of whatever dimension is slow.
>
> What about potential users that need a few high-dimensional vectors?
>
> As I said before, I am a big +1 for NOT just raising it blindly, but I believe 
> we need to remove the limit or size it in a way that it's not a problem for 
> either users or internal data structure optimizations, if any.
>
>
> On Wed, 5 Apr 2023, 18:54 Robert Muir,  wrote:
>>
>> I'd ask anyone voting +1 to raise this limit to at least try to index
>> a few million vectors with 756 or 1024, which is allowed today.
>>
>> IMO based on how painful it is, it seems the limit is already too
>> high. I realize that will sound controversial, but please at least try
>> it out!
>>
>> voting +1 without at least doing this is really the
>> "weak/unscientifically minded" approach.
>>
>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>>  wrote:
>> >
>> > Thanks for your feedback!
>> >
>> > I agree that it should not crash.
>> >
>> > So far we did not experience crashes ourselves, but we did not index
>> > millions of vectors.
>> >
>> > I will try to reproduce the crash, maybe this will help us to move forward.
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > On 05.04.23 at 18:30, Dawid Weiss wrote:
>> > >> Can you describe your crash in more detail?
>> > > I can't. That experiment was a while ago and a quick test to see if I
>> > > could index rather large-ish USPTO (patent office) data as vectors.
>> > > Couldn't do it then.
>> > >
>> > >> How much RAM?
>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> > > I recall segment merging grew slower and slower and then simply
>> > > crashed. Lucene should work with low heap requirements, even if it
>> > > slows down. Throwing ram at the indexing/ segment merging problem
>> > > is... I don't know - not elegant?
>> > >
>> > > Anyway. My main point was to remind folks about how Apache works -
>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> > > remains unconvinced, he or she can block the change. (I didn't invent
>> > > those rules).
>> > >
>> > > D.
>> > >
>> > > -
>> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > > For additional commands, e-mail: dev-h...@lucene.apache.org
>> > >
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
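
For reference, the check being debated is mechanically tiny; a paraphrased
sketch of its shape (the constant and names here are illustrative, not the
exact Lucene source):

    // Paraphrased shape of the dimension limit check; not the actual source.
    final class VectorDimensionCheck {
      static final int MAX_DIMENSIONS = 1024; // the limit under discussion

      static void checkDimension(int dimension) {
        if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension must be in [1, " + MAX_DIMENSIONS
                  + "]; got " + dimension);
        }
      }
    }

Which is the crux of the disagreement: the code change is trivial either way;
the contested part is the permanent backwards-compatibility promise that comes
with accepting larger vectors.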



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-05 Thread Robert Muir
I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024, which is allowed today.

IMO based on how painful it is, it seems the limit is already too
high. I realize that will sound controversial, but please at least try
it out!

voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
 wrote:
>
> Thanks for your feedback!
>
> I agree that it should not crash.
>
> So far we did not experience crashes ourselves, but we did not index
> millions of vectors.
>
> I will try to reproduce the crash, maybe this will help us to move forward.
>
> Thanks
>
> Michael
>
> On 05.04.23 at 18:30, Dawid Weiss wrote:
> >> Can you describe your crash in more detail?
> > I can't. That experiment was a while ago and a quick test to see if I
> > could index rather large-ish USPTO (patent office) data as vectors.
> > Couldn't do it then.
> >
> >> How much RAM?
> > My indexing jobs run with rather smallish heaps to give space for I/O
> > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > I recall segment merging grew slower and slower and then simply
> > crashed. Lucene should work with low heap requirements, even if it
> > slows down. Throwing ram at the indexing/ segment merging problem
> > is... I don't know - not elegant?
> >
> > Anyway. My main point was to remind folks about how Apache works -
> > code is merged in when there are no vetoes. If Rob (or anybody else)
> > remains unconvinced, he or she can block the change. (I didn't invent
> > those rules).
> >
> > D.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene 9.5.0 release

2023-01-17 Thread Robert Muir
+1 to release, thank you for volunteering to be RM!

I went thru the 9.5 section of CHANGES.txt and tagged all the GH issues in
there with the milestone too, if they didn't already have it. It looks
even bigger now.

On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna  wrote:
>
> Hi all,
> I'd like to propose that we release Lucene 9.5.0. There is a decent amount of 
> changes that would go into it looking at the github milestone: 
> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the 
> release manager. There is one PR open listed for the 9.5 milestone: 
> https://github.com/apache/lucene/pull/11873 . Is this something that we do 
> want to address before we release? Is anybody aware of outstanding work that 
> we would like to include or known blocker issues that are not listed in the 
> 9.5 milestone?
>
> Cheers
> Luca
>
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Is there a way to customize segment names?

2022-12-17 Thread Robert Muir
No, you can't control them. And we must not open up anything to try to
support this.

On Fri, Dec 16, 2022 at 7:28 PM Patrick Zhai  wrote:
>
> Hi Mike, Robert
>
> Thanks for replying, the system is almost like what Mike has described: one 
> writer is primary,
> and the other is trying to catch up and wait, but in our internal discussion 
> we found there might
> be small chances where the secondary mistakenly thinks itself primary (due 
> to errors of another component)
> while the primary is still alive, and thus goes into the situation I described.
> And because we want to tolerate the error in case we can't prevent it from 
> happening, we're looking at customizing
> filenames.
>
> Thanks again for discussing this with me. I've learnt that playing with 
> filenames can become quite
> troublesome, but still, if only out of my own curiosity, I want to understand 
> whether we're able to control
> the segment names in some way.
>
> Best
> Patrick
>
>
> On Fri, Dec 16, 2022 at 6:36 AM Michael Sokolov  wrote:
>>
>> +1 trying to coordinate multiple writers running independently will
>> not work. My 2c for availability: you can have a single primary active
>> writer with a backup one waiting, receiving all the segments from the
>> primary. Then if the primary goes down, the secondary one has the most
>> recent commit replicated from the primary (identical commit, same
>> segments etc) and can pick up from there. You would need a mechanism
>> to replay the writes the primary never had a chance to commit.
>>
>> On Fri, Dec 16, 2022 at 5:41 AM Robert Muir  wrote:
>> >
>> > You are still talking "Multiple writers". Like i said, going down this
>> > path (playing tricks with filenames) isn't going to work out well.
>> >
>> > On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
>> > >
>> > > Hi Robert,
>> > >
>> > > Maybe I didn't explain it clearly but we're not going to constantly 
>> > > switch
>> > > between writers or share effort between writers, it's purely for
>> > > availability: the second writer only kicks in when the first writer is 
>> > > not
>> > > available for some reason.
>> > > And as far as I know the replicator/nrt module has not provided a 
>> > > solution
>> > > for when the primary node (main indexer) is down: how would we recover 
>> > > with
>> > > a backup indexer?
>> > >
>> > > Thanks
>> > > Patrick
>> > >
>> > >
>> > > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
>> > >
>> > > > This multiple-writer isn't going to work and customizing names won't
>> > > > allow it anyway. Each file also contains a unique identifier tied to
>> > > > its commit so that we know everything is intact.
>> > > >
>> > > > I would look at the segment replication in lucene/replicator and not
>> > > > try to play games with files and mixing multiple writers.
>> > > >
>> > > > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  
>> > > > wrote:
>> > > > >
>> > > > > Hi Folks,
>> > > > >
>> > > > > We're trying to build a search architecture using segment replication
>> > > > (indexer and searcher are separated and the indexer ships new segments 
>> > > > to
>> > > > searchers) right now and one of the problems we're facing is: for
>> > > > availability reasons we need to have multiple indexers running, and 
>> > > > when the
>> > > > searcher is switching from consuming one indexer to another, there are
>> > > > chances where the segment names collide with each other (because 
>> > > > segment
>> > > > names are count-based) and the searcher has to reload the whole index.
>> > > > > To avoid that we're looking for a way to name the segments so that
>> > > > Lucene is able to tell the difference and load only the difference (by
>> > > > calling `openIfChanged`). I've checked the IndexWriter and the
>> > > > DocumentsWriter and it seems it is controlled by a private final method
>> > > > `newSegmentName()` so likely not possible there. So I wonder whether
>> > > > there are any other ways people are aware of that can help control the
>> > > > segment names?
>> > 

Re: Is there a way to customize segment names?

2022-12-16 Thread Robert Muir
You are still talking "Multiple writers". Like i said, going down this
path (playing tricks with filenames) isn't going to work out well.

On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
>
> Hi Robert,
>
> Maybe I didn't explain it clearly but we're not going to constantly switch
> between writers or share effort between writers, it's purely for
> availability: the second writer only kicks in when the first writer is not
> available for some reason.
> And as far as I know the replicator/nrt module has not provided a solution
> for when the primary node (main indexer) is down: how would we recover with
> a backup indexer?
>
> Thanks
> Patrick
>
>
> On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
>
> > This multiple-writer isn't going to work and customizing names won't
> > allow it anyway. Each file also contains a unique identifier tied to
> > its commit so that we know everything is intact.
> >
> > I would look at the segment replication in lucene/replicator and not
> > try to play games with files and mixing multiple writers.
> >
> > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
> > >
> > > Hi Folks,
> > >
> > > We're trying to build a search architecture using segment replication
> > (indexer and searcher are separated and the indexer ships new segments to
> > searchers) right now and one of the problems we're facing is: for
> > availability reasons we need to have multiple indexers running, and when the
> > searcher is switching from consuming one indexer to another, there are
> > chances where the segment names collide with each other (because segment
> > names are count-based) and the searcher has to reload the whole index.
> > > To avoid that we're looking for a way to name the segments so that
> > Lucene is able to tell the difference and load only the difference (by
> > calling `openIfChanged`). I've checked the IndexWriter and the
> > DocumentsWriter and it seems it is controlled by a private final method
> > `newSegmentName()` so likely not possible there. So I wonder whether
> > there are any other ways people are aware of that can help control the
> > segment names?
> > >
> > > An example of the situation described above:
> > > Searcher previously consuming from indexer 1, and has the following
> > segments: _1, _2, _3, _4
> > > Indexer 2 previously sync'd from indexer 1, sharing the first 3
> > segments, and produced its own 4th segment (notated as _4'; it shares
> > the same "_4" name): _1, _2, _3, _4'
> > > Suddenly Indexer 1 dies and the searcher switches from Indexer 1 to Indexer
> > 2; then, when it finishes downloading the segments and tries to refresh the
> > reader, it will likely hit the exception here, and it seems all we can do
> > right now is reload the whole index, which could potentially be a high
> > cost.
> > >
> > > Sorry for the long email and thank you in advance for any replies!
> > >
> > > Best
> > > Patrick
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Is there a way to customize segment names?

2022-12-15 Thread Robert Muir
This multiple-writer isn't going to work and customizing names won't
allow it anyway. Each file also contains a unique identifier tied to
its commit so that we know everything is intact.

I would look at the segment replication in lucene/replicator and not
try to play games with files and mixing multiple writers.

On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
>
> Hi Folks,
>
> We're trying to build a search architecture using segment replication 
> (indexer and searcher are separated and the indexer ships new segments to 
> searchers) right now and one of the problems we're facing is: for 
> availability reasons we need to have multiple indexers running, and when the 
> searcher is switching from consuming one indexer to another, there are 
> chances where the segment names collide with each other (because segment 
> names are count-based) and the searcher has to reload the whole index.
> To avoid that we're looking for a way to name the segments so that Lucene is 
> able to tell the difference and load only the difference (by calling 
> `openIfChanged`). I've checked the IndexWriter and the DocumentsWriter and it 
> seems it is controlled by a private final method `newSegmentName()` so likely 
> not possible there. So I wonder whether there are any other ways people are 
> aware of that can help control the segment names?
>
> An example of the situation described above:
> Searcher previously consuming from indexer 1, and has the following segments: 
> _1, _2, _3, _4
> Indexer 2 previously sync'd from indexer 1, sharing the first 3 segments, and 
> produced its own 4th segment (notated as _4'; it shares the same "_4" 
> name): _1, _2, _3, _4'
> Suddenly Indexer 1 dies and the searcher switches from Indexer 1 to Indexer 2; 
> then, when it finishes downloading the segments and tries to refresh the 
> reader, it will likely hit the exception here, and it seems all we can do right 
> now is reload the whole index, which could potentially be a high cost.
>
> Sorry for the long email and thank you in advance for any replies!
>
> Best
> Patrick
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
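
The reload-vs-refresh distinction above hinges on DirectoryReader.openIfChanged,
which reuses already-open segments and only opens the new ones; a minimal
searcher-side sketch (index path and polling interval are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class RefreshLoop {
      public static void main(String[] args) throws Exception {
        DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        while (true) {
          // Cheap when segment names/ids line up: unchanged segments are
          // shared with the old reader. After switching to a different
          // indexer, colliding names force the full reopen described above.
          DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
          if (newReader != null) {
            reader.close();
            reader = newReader;
          }
          Thread.sleep(1_000);
        }
      }
    }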



Re: Fw: Need a Jira account in order to create a ticket for lucene-facet

2022-12-07 Thread Robert Muir
Hi Gennadiy,

The lucene project has migrated from JIRA to Github Issues for issue tracking.

Please create an issue here: https://github.com/apache/lucene/issues

On Wed, Dec 7, 2022 at 11:23 AM Gennadiy Vaysman
 wrote:
>
> Hello, Lucene developers,
>
> My email below to iss...@lucene.apache.org could not be delivered. The error 
> I got was "Must be sent from an @apache.org address or an address in LDAP".
>
> Anyway, if this reaches you, can you grant me access to create tickets in 
> Jira, please?
>
> Thanks,
>
> Gennadiy Vaysman
>
> - Forwarded by Gennadiy Vaysman/AbInitio on 12/07/2022 11:18 AM -
>
> From:Gennadiy Vaysman/AbInitio
> To:iss...@lucene.apache.org
> Date:12/07/2022 11:00 AM
> Subject:Need a Jira account in order to create a ticket for 
> lucene-facet
> 
>
>
> Hello,
>
> According to https://infra.apache.org/jira-guidelines.html#who, I have to 
> request credentials for Jira before filing a ticket. Would you be able to 
> provide me credentials, or would you prefer that, before filing a ticket, I post 
> it on the user groups? (Well, I can do it either way, but I am pretty sure there 
> is no fix or a reasonable work-around.)
>
> Thank you,
>
> Gennadiy Vaysman

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [lucene] 02/03: Fix longstanding bug in path bounds calculation, and hook up efficient isWithin() and distance logic

2022-11-19 Thread Robert Muir
Multiple spatial tests are failing in jenkins... bisected them to this commit.

Can you please look into it? https://github.com/apache/lucene/issues/11956

On Sat, Nov 19, 2022 at 8:22 PM  wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> kwright pushed a commit to branch main
> in repository https://gitbox.apache.org/repos/asf/lucene.git
>
> commit 9bca7a70e10db81b39a5afb4498aab1006402031
> Author: Karl David Wright 
> AuthorDate: Sat Nov 19 17:35:30 2022 -0500
>
> Fix longstanding bug in path bounds calculation, and hook up efficient 
> isWithin() and distance logic
> ---
>  .../geom/{GeoBaseShape.java => GeoBaseBounds.java} |   6 +-
>  .../apache/lucene/spatial3d/geom/GeoBaseShape.java |  24 +-
>  .../apache/lucene/spatial3d/geom/GeoBounds.java|  27 ++
>  .../org/apache/lucene/spatial3d/geom/GeoShape.java |   2 +-
>  .../lucene/spatial3d/geom/GeoStandardPath.java | 277 -
>  5 files changed, 140 insertions(+), 196 deletions(-)
>
> diff --git 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
>  
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseBounds.java
> similarity index 90%
> copy from 
> lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
> copy to 
> lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseBounds.java
> index a5992392563..52030b333d3 100644
> --- 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
> +++ 
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseBounds.java
> @@ -17,18 +17,18 @@
>  package org.apache.lucene.spatial3d.geom;
>
>  /**
> - * Base extended shape object.
> + * Base object that supports bounds operations.
>   *
>   * @lucene.internal
>   */
> -public abstract class GeoBaseShape extends BasePlanetObject implements 
> GeoShape {
> +public abstract class GeoBaseBounds extends BasePlanetObject implements 
> GeoBounds {
>
>/**
> * Constructor.
> *
> * @param planetModel is the planet model to use.
> */
> -  public GeoBaseShape(final PlanetModel planetModel) {
> +  public GeoBaseBounds(final PlanetModel planetModel) {
>  super(planetModel);
>}
>
> diff --git 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
>  
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
> index a5992392563..a4b5cd18a62 100644
> --- 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
> +++ 
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBaseShape.java
> @@ -21,7 +21,7 @@ package org.apache.lucene.spatial3d.geom;
>   *
>   * @lucene.internal
>   */
> -public abstract class GeoBaseShape extends BasePlanetObject implements 
> GeoShape {
> +public abstract class GeoBaseShape extends GeoBaseBounds implements GeoShape 
> {
>
>/**
> * Constructor.
> @@ -31,26 +31,4 @@ public abstract class GeoBaseShape extends 
> BasePlanetObject implements GeoShape
>public GeoBaseShape(final PlanetModel planetModel) {
>  super(planetModel);
>}
> -
> -  @Override
> -  public void getBounds(Bounds bounds) {
> -if (isWithin(planetModel.NORTH_POLE)) {
> -  
> bounds.noTopLatitudeBound().noLongitudeBound().addPoint(planetModel.NORTH_POLE);
> -}
> -if (isWithin(planetModel.SOUTH_POLE)) {
> -  
> bounds.noBottomLatitudeBound().noLongitudeBound().addPoint(planetModel.SOUTH_POLE);
> -}
> -if (isWithin(planetModel.MIN_X_POLE)) {
> -  bounds.addPoint(planetModel.MIN_X_POLE);
> -}
> -if (isWithin(planetModel.MAX_X_POLE)) {
> -  bounds.addPoint(planetModel.MAX_X_POLE);
> -}
> -if (isWithin(planetModel.MIN_Y_POLE)) {
> -  bounds.addPoint(planetModel.MIN_Y_POLE);
> -}
> -if (isWithin(planetModel.MAX_Y_POLE)) {
> -  bounds.addPoint(planetModel.MAX_Y_POLE);
> -}
> -  }
>  }
> diff --git 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBounds.java 
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBounds.java
> new file mode 100644
> index 000..935366c5a08
> --- /dev/null
> +++ 
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoBounds.java
> @@ -0,0 +1,27 @@
> +/*
> + * Licensed to the Apache Software Foundation (ASF) under one or more
> + * contributor license agreements.  See the NOTICE file distributed with
> + * this work for additional information regarding copyright ownership.
> + * The ASF licenses this file to You under the Apache License, Version 2.0
> + * (the "License"); you may not use this file except in compliance with
> + * the License.  You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 

Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-18 Thread Robert Muir
I think he is running this from a jenkins job. I suspect agents have
"stacked up" over time; take a look with "ps". Every time i run the
smoketester, it "leaks" at least an agent or two.

On Fri, Nov 18, 2022 at 9:48 AM Adrien Grand  wrote:
>
> Reading Uwe's error message more carefully, I had first assumed that the GPG 
> failure was due to the lack of an ultimately trusted signature, but it seems 
> like it's due to "can't connect to the agent: IPC connect call failed" 
> actually, which suggests an issue with the GPG agent?
>
> On Fri, Nov 18, 2022 at 3:00 PM Michael Sokolov  wrote:
>>
>> I got this message when initially downloading the artifacts:
>>
>> Downloading 
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
>> File: 
>> /tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucene-9.4.2-src.tgz.gpg.verify.log
>> verify trust
>>   GPG: gpg: WARNING: This key is not certified with a trusted signature!
>>
>> is it related?
>>
>> On Fri, Nov 18, 2022 at 8:43 AM Uwe Schindler  wrote:
>> >
>> > The problem is: it has been working like this for years - the 9.4.1 release 
>> > worked fine. No change!
>> >
>> > And I can't configure this because GPG uses its own home directory set up 
>> > by the smoke tester (see paths below). So it should not look anywhere else? In 
>> > addition, "gpg: no ultimately trusted keys found" is just a warning; it 
>> > should not cause gpg to exit.
>> >
>> > Also, why does it only happen at the Maven step? It checks signatures 
>> > before, too. This is why I restarted the build: 
>> > https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console (still 
>> > running)
>> >
>> > Uwe
>> >
>> > On 18.11.2022 at 14:21, Adrien Grand wrote:
>> >
>> > Uwe, the error message suggests that Policeman Jenkins is not ultimately 
>> > trusting any of the keys. Does it work if you configure it to ultimately 
>> > trust your "Uwe Schindler (CODE SIGNING KEY) " key 
>> > (which I assume you would be ok with)?
>> >
>> > On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler  wrote:
>> >>
>> >> I am restarting the build, maybe it was some hiccup. Interestingly it 
>> >> only failed for the Maven dependencies. P.S.: Why does it import the key 
>> >> file over and over? It would be enough to do this once at the beginning of 
>> >> the smoker.
>> >>
>> >> Uwe
>> >>
>> >> On 18.11.2022 at 14:12, Uwe Schindler wrote:
>> >>
>> >> Hi,
>> >>
>> >> I get a failure because your key is somehow rejected by GPG (Ubuntu 
>> >> 22.04):
>> >>
>> >> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/24/console
>> >>
>> >> verify maven artifact sigs command "gpg --homedir 
>> >> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg 
>> >> --import /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/KEYS" failed:
>> >>   gpg: keybox '/home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/pubring.kbx' created
>> >>   gpg: /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/trustdb.gpg: trustdb created
>> >>   gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley " imported
>> >>   gpg: can't connect to the agent: IPC connect call failed
>> >>   gpg: key E48025ED13E57FFC: public key "Upayavira " imported
>> >>   [...]
>> >>   gpg: key 051A0FAF76BC6507: public key "Adrien Grand (CODE SIGNING KEY) " imported
>> >>   [...]
>> >>   gpg: key 32423B0E264B5CBA: public key "Julie Tibshirani (New code signing key) " imported
>> >>   gpg: Total number processed: 62
>> >>   gpg: imported: 62
>> >>   gpg: no ultimately trusted keys found
>> >>
>> >> It looks like it succeeds for others? No idea why. Maybe Ubuntu 22.04 has 
>> >> a too-new GPG, or it needs to use gpg2?
>> >>
>> >> -1 to release until this is sorted out.
>> >>
>> >> Uwe
>> >>
>> >> On 17.11.2022 at 15:18, Adrien Grand wrote:
>> >>
>> >> Please vote for release candidate 1 for Lucene 9.4.2
>> >>
>> >> The artifacts can be downloaded from:
>> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>> >>
>> >> You can run the smoke tester directly with this command:
>> >>
>> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>> >>
>> >> The vote will be open for at least 72 hours i.e. until 2022-11-20 15:00 
>> >> UTC.
>> >>
>> >> [ ] +1  approve
>> >> [ ] +0  no opinion
>> >> [ ] -1  disapprove (and reason why)
>> >>
>> >> Here is my +1.
>> >>
>> >> --
>> >> Adrien
>> >>
>> >> --
>> >> Uwe Schindler
>> >> Achterdiek 19, D-28357 Bremen
>> >> https://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >> --
>> >> Uwe Schindler
>> >> Achterdiek 19, D-28357 Bremen
>> >> https://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >
>> >
>> >
>> > --
>> > Adrien
>> >
>> > --
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > https://www.thetaphi.de
>> > eMail: 

Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-17 Thread Robert Muir
+1

SUCCESS! [1:16:29.706409]

On Thu, Nov 17, 2022 at 9:18 AM Adrien Grand  wrote:
>
> Please vote for release candidate 1 for Lucene 9.4.2
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>
> The vote will be open for at least 72 hours i.e. until 2022-11-20 15:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1.
>
> --
> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [lucene] branch main updated: Prevent NPEs while still handling the polar case for horizontal planes right off the pole

2022-11-17 Thread Robert Muir
if your machine really has 12 cores and 64GB RAM but is that slow, then
uninstall that windows shit immediately, that's horrible.

On Thu, Nov 17, 2022 at 5:46 AM Karl Wright  wrote:
>
> Thanks - the target I was using was the complete "build" target on the whole 
> project.  This will be a valuable improvement. ;-)
>
> I have slow network here so it is possible that the entire build was slow for 
> that reason.  The machine is a new Dell laptop, 12 cores, 64GB memory, but I 
> am running under Windows Subsystem for Linux which is a bit slower than 
> native Ubuntu.  Still, the gradlew command you gave takes many minutes (of 
> which a sizable amount is spent in :gitStatus - more than 5 minutes there 
> alone).  Anything less than 10 minutes I deem acceptable, which this doesn't 
> quite manage, but I'll live.
>
> Karl
>
>
> On Thu, Nov 17, 2022 at 5:06 AM Dawid Weiss  wrote:
>>
>>
>>> Thank you for the comment.
>>
>>
>> Sorry if it came out the wrong way - I certainly didn't mean it to be unkind.
>>
>>>
>>> It took me several days just to get things set up so I was able to commit 
>>> again, and I did this through command-line not github.
>>
>>
>> These things are not mutually exclusive - I work with command line as well. 
>> You just push to your own repository (or a branch, if you don't care to have 
>> your own fork on github) and then file a PR from there. If you're on a 
>> slower machine - this is even better since precommit checks run for you 
>> there.
>>
>>>
>>> The full gradlew script takes over 2 hours to run now so if there's a 
>>> faster target I can use to determine these things in advance I'd love to 
>>> know what it is.
>>
>>
>> Well, this is crazy long so I wonder what's happening. I'd love to help but 
>> it'd be good to know what machine this is (disk, cpu, memory?) and what the 
>> build command was. Without knowing these, I'd say - run the tests and checks 
>> for the module you've changed only, not for everything. How long does this 
>> take?
>>
>> ./gradlew check -p lucene/spatial3d
>>
>> It takes roughly 1 minute for me, including startup (after the daemon is 
>> running in the background, it's much faster).
>>
>> There are some workflow examples/ hints I left here:
>> https://github.com/apache/lucene/blob/main/help/workflow.txt#L6-L22
>>
>> Hope it helps,
>> Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Release Lucene 9.4.2

2022-11-16 Thread Robert Muir
+1, thanks for the patience. I feel we at least made the effort to
root out any more of these and hopefully prevent a 9.4.3 with another
overflow bug.

On Wed, Nov 16, 2022 at 10:55 AM Adrien Grand  wrote:
>
> It looks like we're good with the changes we wanted to get in for 9.4.2.
>
> I plan on starting the release process tomorrow if there are no objections.
>
> On Fri, Nov 11, 2022 at 4:22 PM Robert Muir  wrote:
>>
>> These are the 9.4.2 completed issues:
>>
>> https://github.com/apache/lucene/pull/11905 <-- bug and associated monster 
>> test
>> https://github.com/apache/lucene/pull/11916 <-- checkindex improvement
>> https://github.com/apache/lucene/pull/11919 <-- checkindex improvement
>>
>> These are the remaining issues:
>>
>> https://github.com/apache/lucene/pull/11918 <-- better error messages,
>> looks close to being merged
>> https://github.com/apache/lucene/issues/11910 <-- static analysis:
>> after discussion on the issue, let's consider just doing a "one-time"
>> pass to look for more problems?
>>
>> On Fri, Nov 11, 2022 at 9:52 AM Michael Sokolov  wrote:
>> >
>> > +1 makes sense. I do think given this is the second similar-flavored
>> > bug we've found that we should be thorough and try to get them all
>> > rather than having a 9.4.3 ...
>> >
>> > On Wed, Nov 9, 2022 at 10:25 AM Julie Tibshirani  
>> > wrote:
>> > >
>> > > +1 from me for a bugfix release once we've solidified testing. Thanks to 
>> > > everyone working on improving tests and static analysis -- this now is 
>> > > our second time encountering a bad arithmetic bug and it's important to 
>> > > get ahead of these issues!
>> > >
>> > > Julie
>> > >
>> > > On Wed, Nov 9, 2022 at 8:26 AM Robert Muir  wrote:
>> > >>
>> > >> Thank you Adrien!
>> > >>
>> > >> I created an issue for the static analysis piece, but I'm not
>> > >> currently working on it yet. This could be a fun one, if anyone is
>> > >> interested, to flush a bunch of these bugs out at once:
>> > >> https://github.com/apache/lucene/issues/11910
>> > >>
>> > >> On Wed, Nov 9, 2022 at 10:48 AM Adrien Grand  wrote:
>> > >> >
>> > >> > Totally Robert, I was not trying to add any time pressure, next week 
>> > >> > is totally fine. I mostly wanted to get the discussion started 
>> > >> > because folks sometimes have one or two bug fixes they'd like to fold 
>> > >> > into a bugfix release so I wanted to give them time to plan. Friday 
>> > >> > is also a public holiday here, celebrating the end of World War 1. :)
>> > >> >
>> > >> > On Wed, Nov 9, 2022 at 4:41 PM Robert Muir  wrote:
>> > >> >>
>> > >> >> Can we please have a few days to improve the test situation? I think
>> > >> >> we need to beef up checkindex to exercise seek() on the vectors, also
>> > >> >> we need to look at static analysis to try to find other similar bugs.
>> > >> >> This would help prevent "whack-a-mole" and improve correctness going 
>> > >> >> forwards.
>> > >> >>
>> > >> >> I want to help more but it's difficult timing-wise, lots of stuff
>> > >> >> going on this week, and in my country friday is Veteran's Day 
>> > >> >> holiday.
>> > >> >>
>> > >> >> On Wed, Nov 9, 2022 at 10:39 AM Adrien Grand  
>> > >> >> wrote:
>> > >> >> >
>> > >> >> > Hello all,
>> > >> >> >
>> > >> >> > A bad integer overflow has been discovered in the KNN vectors 
>> > >> >> > format, which affects segments that have more than ~16M vectors. 
>> > >> >> > I'd like to do a bugfix release when the bug is fixed and we have 
>> > >> >> > a test for such large datasets of KNN vectors. I volunteer to be 
>> > >> >> > the RM for this release.
>> > >> >> >
>> > >> >> > --
>> > >> >> > Adrien
>> > >> >>
>> > >> >> -
>> > >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> > >> >>
>> > >> >
>> > >> >
>> > >> > --
>> > >> > Adrien
>> > >>
>> > >> -
>> > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> > >>
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
> --
> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 1386 - Failure!

2022-11-16 Thread Robert Muir
Thanks again for cleaning this hack up Dawid. I was cursing gradle all
night, could not believe that sometimes it uses java.exe (with a bunch
of internal api violations) and other times uses javac.exe.

On Wed, Nov 16, 2022 at 4:12 AM Dawid Weiss  wrote:
>
>
> I've committed a fix on main and checked that it works with error prone, in 
> process compilation and alt javac. But double checking would be probably 
> good. :)
>
> Dawid
>
> On Wed, Nov 16, 2022 at 12:18 AM Robert Muir  wrote:
>>
>> It is my fault. I will revert my changes and test with "alternate
>> toolchain". I think we have to hold things a bit differently in that
>> case. Sorry for all the noise.
>>
>> On Tue, Nov 15, 2022 at 6:07 PM Policeman Jenkins Server
>>  wrote:
>> >
>> > Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/1386/
>> > Java: 64bit/jdk-18 -XX:-UseCompressedOops -XX:+UseSerialGC
>> >
>> > No tests ran.
>>
>> -
>> To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: builds-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Release Lucene 9.4.2

2022-11-11 Thread Robert Muir
These are the 9.4.2 completed issues:

https://github.com/apache/lucene/pull/11905 <-- bug and associated monster test
https://github.com/apache/lucene/pull/11916 <-- checkindex improvement
https://github.com/apache/lucene/pull/11919 <-- checkindex improvement

These are the remaining issues:

https://github.com/apache/lucene/pull/11918 <-- better error messages,
looks close to being merged
https://github.com/apache/lucene/issues/11910 <-- static analysis:
after discussion on the issue, let's consider just doing a "one-time"
pass to look for more problems?

On Fri, Nov 11, 2022 at 9:52 AM Michael Sokolov  wrote:
>
> +1 makes sense. I do think given this is the second similar-flavored
> bug we've found that we should be thorough and try to get them all
> rather than having a 9.4.3 ...
>
> On Wed, Nov 9, 2022 at 10:25 AM Julie Tibshirani  wrote:
> >
> > +1 from me for a bugfix release once we've solidified testing. Thanks to 
> > everyone working on improving tests and static analysis -- this now is our 
> > second time encountering a bad arithmetic bug and it's important to get 
> > ahead of these issues!
> >
> > Julie
> >
> > On Wed, Nov 9, 2022 at 8:26 AM Robert Muir  wrote:
> >>
> >> Thank you Adrien!
> >>
> >> I created an issue for the static analysis piece, but I'm not
> >> currently working on it yet. This could be a fun one, if anyone is
> >> interested, to flush a bunch of these bugs out at once:
> >> https://github.com/apache/lucene/issues/11910
> >>
> >> On Wed, Nov 9, 2022 at 10:48 AM Adrien Grand  wrote:
> >> >
> >> > Totally Robert, I was not trying to add any time pressure, next week is 
> >> > totally fine. I mostly wanted to get the discussion started because 
> >> > folks sometimes have one or two bug fixes they'd like to fold into a 
> >> > bugfix release so I wanted to give them time to plan. Friday is also a 
> >> > public holiday here, celebrating the end of World War 1. :)
> >> >
> >> > On Wed, Nov 9, 2022 at 4:41 PM Robert Muir  wrote:
> >> >>
> >> >> Can we please have a few days to improve the test situation? I think
> >> >> we need to beef up checkindex to exercise seek() on the vectors, also
> >> >> we need to look at static analysis to try to find other similar bugs.
> >> >> This would help prevent "whack-a-mole" and improve correctness going 
> >> >> forwards.
> >> >>
> >> >> I want to help more but it's difficult timing-wise, lots of stuff
> >> >> going on this week, and in my country friday is Veteran's Day holiday.
> >> >>
> >> >> On Wed, Nov 9, 2022 at 10:39 AM Adrien Grand  wrote:
> >> >> >
> >> >> > Hello all,
> >> >> >
> >> >> > A bad integer overflow has been discovered in the KNN vectors format, 
> >> >> > which affects segments that have more than ~16M vectors. I'd like to 
> >> >> > do a bugfix release when the bug is fixed and we have a test for such 
> >> >> > large datasets of KNN vectors. I volunteer to be the RM for this 
> >> >> > release.
> >> >> >
> >> >> > --
> >> >> > Adrien
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
> >> >
> >> >
> >> > --
> >> > Adrien
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Release Lucene 9.4.2

2022-11-09 Thread Robert Muir
Thank you Adrien!

I created an issue for the static analysis piece, but I'm not
currently working on it yet. This could be a fun one, if anyone is
interested, to flush a bunch of these bugs out at once:
https://github.com/apache/lucene/issues/11910

On Wed, Nov 9, 2022 at 10:48 AM Adrien Grand  wrote:
>
> Totally Robert, I was not trying to add any time pressure, next week is 
> totally fine. I mostly wanted to get the discussion started because folks 
> sometimes have one or two bug fixes they'd like to fold into a bugfix release 
> so I wanted to give them time to plan. Friday is also a public holiday here, 
> celebrating the end of World War 1. :)
>
> On Wed, Nov 9, 2022 at 4:41 PM Robert Muir  wrote:
>>
>> Can we please have a few days to improve the test situation? I think
>> we need to beef up checkindex to exercise seek() on the vectors, also
>> we need to look at static analysis to try to find other similar bugs.
>> This would help prevent "whack-a-mole" and improve correctness going 
>> forwards.
>>
>> I want to help more but it's difficult timing-wise, lots of stuff
>> going on this week, and in my country friday is Veteran's Day holiday.
>>
>> On Wed, Nov 9, 2022 at 10:39 AM Adrien Grand  wrote:
>> >
>> > Hello all,
>> >
>> > A bad integer overflow has been discovered in the KNN vectors format, 
>> > which affects segments that have more than ~16M vectors. I'd like to do a 
>> > bugfix release when the bug is fixed and we have a test for such large 
>> > datasets of KNN vectors. I volunteer to be the RM for this release.
>> >
>> > --
>> > Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>
> --
> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Release Lucene 9.4.2

2022-11-09 Thread Robert Muir
Can we please have a few days to improve the test situation? I think
we need to beef up checkindex to exercise seek() on the vectors, also
we need to look at static analysis to try to find other similar bugs.
This would help prevent "whack-a-mole" and improve correctness going forwards.
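
For reference, the usual way to exercise an existing index with the tool
is the invocation below (the jar name is a placeholder); today that is
exactly what does not hit the vectors seek() path:

  java -cp lucene-core-9.4.1.jar org.apache.lucene.index.CheckIndex /path/to/index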

I want to help more but it's difficult timing-wise, lots of stuff
going on this week, and in my country friday is Veteran's Day holiday.

On Wed, Nov 9, 2022 at 10:39 AM Adrien Grand  wrote:
>
> Hello all,
>
> A bad integer overflow has been discovered in the KNN vectors format, which 
> affects segments that have more than ~16M vectors. I'd like to do a bugfix 
> release when the bug is fixed and we have a test for such large datasets of 
> KNN vectors. I volunteer to be the RM for this release.
>
> --
> Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Expressions greedy advanceExact implementation

2022-10-26 Thread Robert Muir
I think deferring the advance call like this is fine and harmless,
only because this DoubleValues "caches" the result for the current
doc, so it's idempotent anyway.

Yes, as for "advancing all the operands": as I mentioned, expressions
has no clue about this. If you wanted to change it, you'd have to push
both advancing AND caching down lower into the actual compiled
expression code.

I think this would add way too much complexity, especially when it
would only improve the ternary "if" feature in such cases.
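
To make the deferral concrete, here is a minimal sketch of a lazy
wrapper around org.apache.lucene.search.DoubleValues (the helper name
and the 0-for-missing fallback are illustrative assumptions, not the
committed change):

  static DoubleValues lazy(DoubleValues in) {
    return new DoubleValues() {
      private int doc = -1;

      @Override
      public boolean advanceExact(int target) {
        doc = target; // just remember the doc, defer the real advance
        return true;  // expressions treat every doc as present
      }

      @Override
      public double doubleValue() throws IOException {
        // advance the delegate only when a value is actually needed
        return in.advanceExact(doc) ? in.doubleValue() : 0d;
      }
    };
  }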

On Wed, Oct 26, 2022 at 10:23 AM Michael Sokolov  wrote:
>
> see https://github.com/apache/lucene/pull/11878 ... it doesn't do what
> I initially asked for (still advances all of the operands), but it
> delays until doubleValue() is called, which is safe and could have
> some impact
>
> On Wed, Oct 26, 2022 at 9:58 AM Michael Sokolov  wrote:
> >
> > Hi, yes, makes sense Mikhail, that will address most of the problem.
> > But I also think, given the way Expressions work (they always return
> > true from advanceExact) there is no reason for them to advance their
> > operands. This shifts the burden/concern from the developer who no
> > longer has to think as hard about this :)  - let me post a PR that
> > shows
> >
> > On Wed, Oct 26, 2022 at 3:52 AM Mikhail Khludnev  wrote:
> > >
> > > Hello, Michael.
> > > I suppose you can bind f2 to a custom lazy implementation of
> > > DoubleValuesSource, which defers advanceExact() by storing the doc num and
> > > returning true always, and actually advancing on doubleValue() only.
> > >
> > > On Tue, Oct 25, 2022 at 8:13 PM Michael Sokolov  
> > > wrote:
> > >>
> > >> ExpressionFunctionValueSource lazily evaluates in doubleValues: an
> > >> expression like
> > >>
> > >>condition ? f1 : f2
> > >>
> > >> will only evaluate one of f1 or f2.
> > >>
> > >> At the same time, the advanceExact() call is greedy -- when you
> > >> advance that expression it will also advance both f1 and f2. But
> > >> here's the thing: it always returns true, regardless of whether f1 and
> > >> f2 advance. Which makes sense from the point of view of the lazy
> > >> evaluation -- if condition is true we don't care whether f2 advances
> > >> or not.
> > >>
> > >> My question is whether we could defer these child advanceExact calls
> > >> until ExpressionFunctionValues.doubleValue()?
> > >>
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: dev-h...@lucene.apache.org
> > >>
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Expressions greedy advanceExact implementation

2022-10-25 Thread Robert Muir
IIRC the expressions module acts like a simple scripting engine: it just
compiles bytecode for your expression, and you are able to bind variables
that you pass to the method... I don't know of an easy way to do this.
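
For context, a rough sketch of how the module is typically driven,
assuming the stock JS compiler (the field and binding names are made up
for illustration):

  Expression expr = JavascriptCompiler.compile("cond > 0 ? f1 : f2");
  SimpleBindings bindings = new SimpleBindings();
  bindings.add("cond", DoubleValuesSource.fromDoubleField("cond"));
  bindings.add("f1", DoubleValuesSource.fromDoubleField("f1"));
  bindings.add("f2", DoubleValuesSource.fromDoubleField("f2"));
  // the compiled bytecode reads its variables from whatever is bound here
  DoubleValuesSource source = expr.getDoubleValuesSource(bindings);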


On Tue, Oct 25, 2022, 1:13 PM Michael Sokolov  wrote:

> ExpressionFunctionValueSource lazily evaluates in doubleValues: an
> expression like
>
>condition ? f1 : f2
>
> will only evaluate one of f1 or f2.
>
> At the same time, the advanceExact() call is greedy -- when you
> advance that expression it will also advance both f1 and f2. But
> here's the thing: it always returns true, regardless of whether f1 and
> f2 advance. Which makes sense from the point of view of the lazy
> evaluation -- if condition is true we don't care whether f2 advances
> or not.
>
> My question is whether we could defer these child advanceExact calls
> until ExpressionFunctionValues.doubleValue()?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Release Lucene 9.4.1 RC1

2022-10-21 Thread Robert Muir
I change my vote to +1 based on Julie's test. It fails for me with
9.4.0 and passes for me with 9.4.1.

:lucene:core:test (SUCCESS): 1 test(s)

> Task :lucene:core:wipeTaskTemp
The slowest tests (exceeding 500 ms) during this run:
  8511.54s TestManyKnnVectors.testLargeSegment (:lucene:core)
The slowest suites (exceeding 1s) during this run:
  8512.27s TestManyKnnVectors (:lucene:core)

BUILD SUCCESSFUL in 2h 22m 55s
19 actionable tasks: 13 executed, 6 up-to-date

On Thu, Oct 20, 2022 at 3:57 PM Robert Muir  wrote:
>
> Thank you Julie for the draft test! I will try to reproduce/test with it.
>
> On Thu, Oct 20, 2022 at 3:45 PM Julie Tibshirani  wrote:
> >
> > Thank you Ignacio for taking over as release manager! I ran into some 
> > issues with my signing key and Ignacio saved the day.
> >
> > Robert, I understand your perspective. I uploaded a draft PR with a monster 
> > test you could try out: https://github.com/apache/lucene/pull/11867. It 
> > requires downloading an external dataset based on some StackOverflow data 
> > where I found the bug. Let me know if you run into problems with the test 
> > -- maybe we can discuss on the draft PR.
> >
> > Julie
> >
> > On Thu, Oct 20, 2022 at 10:17 AM Uwe Schindler  wrote:
> >>
> >> Hi,
> >>
> >> Policeman Jenkins ran the smoke checks for me with Java 11 and Java 17, 
> >> still no Java 19 :(
> >>
> >> SUCCESS! [1:26:17.692239]
> >> Finished: SUCCESS
> >>
> >> I did not do any special checks beyond smoke tester. To me the Bugfix 
> >> looks fine.
> >>
> >> +1 to release!
> >>
> >> Uwe
> >>
> >> Am 20. Oktober 2022 16:24:23 MESZ schrieb Ignacio Vera :
> >>>
> >>> Please vote for release candidate 1 for Lucene 9.4.1
> >>>
> >>>
> >>> The artifacts can be downloaded from:
> >>>
> >>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
> >>>
> >>>
> >>> You can run the smoke tester directly with this command:
> >>>
> >>>
> >>> python3 -u dev-tools/scripts/smokeTestRelease.py \
> >>>
> >>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
> >>>
> >>>
> >>> The vote will be open for at least 72 hours i.e. until 2022-10-23 15:00 
> >>> UTC.
> >>>
> >>>
> >>> [ ] +1  approve
> >>>
> >>> [ ] +0  no opinion
> >>>
> >>> [ ] -1  disapprove (and reason why)
> >>>
> >>>
> >>> Here is my +1
> >>>
> >>> 
> >>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, 28357 Bremen
> >> https://www.thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.4.1 RC1

2022-10-20 Thread Robert Muir
Thank you Julie for the draft test! I will try to reproduce/test with it.

On Thu, Oct 20, 2022 at 3:45 PM Julie Tibshirani  wrote:
>
> Thank you Ignacio for taking over as release manager! I ran into some issues 
> with my signing key and Ignacio saved the day.
>
> Robert, I understand your perspective. I uploaded a draft PR with a monster 
> test you could try out: https://github.com/apache/lucene/pull/11867. It 
> requires downloading an external dataset based on some StackOverflow data 
> where I found the bug. Let me know if you run into problems with the test -- 
> maybe we can discuss on the draft PR.
>
> Julie
>
> On Thu, Oct 20, 2022 at 10:17 AM Uwe Schindler  wrote:
>>
>> Hi,
>>
>> Policeman Jenkins ran the smoke checks for me with Java 11 and Java 17, 
>> still no Java 19 :(
>>
>> SUCCESS! [1:26:17.692239]
>> Finished: SUCCESS
>>
>> I did not do any special checks beyond smoke tester. To me the Bugfix looks 
>> fine.
>>
>> +1 to release!
>>
>> Uwe
>>
>> Am 20. Oktober 2022 16:24:23 MESZ schrieb Ignacio Vera :
>>>
>>> Please vote for release candidate 1 for Lucene 9.4.1
>>>
>>>
>>> The artifacts can be downloaded from:
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
>>>
>>>
>>> You can run the smoke tester directly with this command:
>>>
>>>
>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
>>>
>>>
>>> The vote will be open for at least 72 hours i.e. until 2022-10-23 15:00 UTC.
>>>
>>>
>>> [ ] +1  approve
>>>
>>> [ ] +0  no opinion
>>>
>>> [ ] -1  disapprove (and reason why)
>>>
>>>
>>> Here is my +1
>>>
>>> 
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.4.1 RC1

2022-10-20 Thread Robert Muir
+0 SUCCESS! [0:39:31.979476]

I say +0 instead of +1, because I am still worried that we are releasing
a bugfix without any test.

I am happy to change my vote to a +1 if we even have a hacky test in a
draft PR. The release artifacts don't need to contain such a test or
anything like that. I just want to run it to verify that big vector
segments work.

On Thu, Oct 20, 2022 at 10:24 AM Ignacio Vera  wrote:
>
> Please vote for release candidate 1 for Lucene 9.4.1
>
>
> The artifacts can be downloaded from:
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
>
>
> You can run the smoke tester directly with this command:
>
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.1-RC1-rev-810817993e0956e63e29906e78a245949501b77d
>
>
> The vote will be open for at least 72 hours i.e. until 2022-10-23 15:00 UTC.
>
>
> [ ] +1  approve
>
> [ ] +0  no opinion
>
> [ ] -1  disapprove (and reason why)
>
>
> Here is my +1
>
> 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Code coverage check for PRs

2022-10-05 Thread Robert Muir
I would recommend a search for "github actions jacoco" to review
what's common out there.

If we change 'gradle test' to 'gradle coverage' in our existing
PR-test action, the next step is to just not throw away the reports,
but make them available. See
https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts
for some documentation on this.
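
Roughly, the extra workflow step would look something like this (treat
the action version and the report path as assumptions, not tested
config):

  - name: Upload jacoco reports
    uses: actions/upload-artifact@v3
    with:
      name: jacoco-reports
      path: '**/build/reports/jacoco/**'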

Seems common for PR workflows to have the action "comment on the PR"
with coverage information. Not sure if we want that as it could result
in a ton of comments.

Finally, the current "gradle coverage" builds a separate coverage
report for each lucene module, I think. So we may want to think about
adding support to "merge" the jacoco data across all the modules and
build one monster report for all of lucene, too. This would just be
some work with the gradle build, but I think it would make the
information a lot easier to digest. This is already happening with the
"jenkins coverage build" which presents one monster report, but I
think it may be something on the jenkins side doing it?
https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/lastBuild/jacoco/


On Wed, Oct 5, 2022 at 8:58 AM Patrick Zhai  wrote:
>
> Makes sense to me, I'll try to look into it!
>
> On Tue, Oct 4, 2022, 16:50 Robert Muir  wrote:
>>
>> We already have code coverage integrated into the build. See the
>> documentation on how to generate the reports:
>> https://github.com/apache/lucene/blob/main/help/tests.txt
>>
>> I think we should stick with jacoco and not some commercial stuff for
>> measuring coverage. Jacoco works great. We just have to put the
>> reports or stats somewhere useful.
>>
>> On Tue, Oct 4, 2022 at 5:45 PM Patrick Zhai  wrote:
>> >
>> > Hi Robert, thank you for commenting, yeah the functionality I want to add 
>> > is actually the line-by-line code coverage stats for the new/changed lines 
>> > that are in the patch so that we don't need to wonder about "whether that 
>> > line is covered by the test?". But I'm against using the code coverage as 
>> > any kind of hard criteria, like coverage must be kept at a certain % or 
>> > all the new lines must be covered, that will drive people crazy. I think 
>> > that should be just treated as a helpful thing to check when 
>> > reviewing/creating the PR.
>> >
>> > I searched a little on google and found this: https://about.codecov.io/, 
>> > it's free for open source and seems to have the functionality we need. Let 
>> > me know if anyone has ideas about this, or otherwise I can try it a little 
>> > bit with my own repo first and then try to add it to lucene.
>> >
>> > Best
>> > Patrick
>> >
>> >
>> >
>> > On Tue, Oct 4, 2022, 06:36 Robert Muir  wrote:
>> >>
>> >> btw, you can look at the current reports created by jenkins here:
>> >> https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/lastBuild/jacoco/
>> >>
>> >> On Tue, Oct 4, 2022 at 6:51 AM Robert Muir  wrote:
>> >> >
>> >> > we can run the tests with coverage option and produce coverage graph
>> >> > from the github actions, but need to look at the docs to see where to
>> >> > put it so it will be available.
>> >> >
>> >> > I want us to be careful about the word "check" as I'm adamantly
>> >> > against any such automated check (e.g. coverage > N%) in the logic.
>> >> > Coverage report is just a tool to help us and the moment we do stupid
>> >> > shit like that, is the moment people start gaming it just to make the
>> >> > build pass.
>> >> >
>> >> > On Mon, Oct 3, 2022 at 10:57 PM Patrick Zhai  wrote:
>> >> > >
>> >> > > Hi folks,
>> >> > > I'm not sure whether people have already discussed this but I'm 
>> >> > > wondering whether we want to add a workflow that pulls out the code 
>> >> > > coverage whenever a PR was created? It should be easier for both the 
>> >> > > reviewers and the contributors to figure out what can be improved, or 
>> >> > > at least figure out a part that is probably not covered by the tests?
>> >> > >
>> >> > > Best
>> >> > > Patrick
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Code coverage check for PRs

2022-10-04 Thread Robert Muir
We already have code coverage integrated into the build. See the
documentation on how to generate the reports:
https://github.com/apache/lucene/blob/main/help/tests.txt

I think we should stick with jacoco and not some commercial stuff for
measuring coverage. Jacoco works great. We just have to put the
reports or stats somewhere useful.
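
For a single module locally it's just e.g. (the report path here is from
memory, treat it as an assumption):

  ./gradlew -p lucene/core coverage
  # then open lucene/core/build/reports/jacoco/test/html/index.html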

On Tue, Oct 4, 2022 at 5:45 PM Patrick Zhai  wrote:
>
> Hi Robert, thank you for commenting, yeah the functionality I want to add is 
> actually the line-by-line code coverage stats for the new/changed lines that 
> are in the patch so that we don't need to wonder about "whether that line is 
> covered by the test?". But I'm against using the code coverage as any kind of 
> hard criteria, like coverage must be kept at a certain % or all the new lines 
> must be covered, that will drive people crazy. I think that should be just 
> treated as a helpful thing to check when reviewing/creating the PR.
>
> I searched a little on google and found this: https://about.codecov.io/, it's 
> free for open source and seems to have the functionality we need. Let me know 
> if anyone has ideas about this, or otherwise I can try it a little bit with 
> my own repo first and then try to add it to lucene.
>
> Best
> Patrick
>
>
>
> On Tue, Oct 4, 2022, 06:36 Robert Muir  wrote:
>>
>> btw, you can look at the current reports created by jenkins here:
>> https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/lastBuild/jacoco/
>>
>> On Tue, Oct 4, 2022 at 6:51 AM Robert Muir  wrote:
>> >
>> > we can run the tests with coverage option and produce coverage graph
>> > from the github actions, but need to look at the docs to see where to
>> > put it so it will be available.
>> >
>> > I want us to be careful about the word "check" as I'm adamantly
>> > against any such automated check (e.g. coverage > N%) in the logic.
>> > Coverage report is just a tool to help us and the moment we do stupid
>> > shit like that, is the moment people start gaming it just to make the
>> > build pass.
>> >
>> > On Mon, Oct 3, 2022 at 10:57 PM Patrick Zhai  wrote:
>> > >
>> > > Hi folks,
>> > > I'm not sure whether people have already discussed this but I'm 
>> > > wondering whether we want to add a workflow that pulls out the code 
>> > > coverage whenever a PR was created? It should be easier for both the 
>> > > reviewers and the contributors to figure out what can be improved, or at 
>> > > least figure out a part that is probably not covered by the tests?
>> > >
>> > > Best
>> > > Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Code coverage check for PRs

2022-10-04 Thread Robert Muir
btw, you can look at the current reports created by jenkins here:
https://ci-builds.apache.org/job/Lucene/job/Lucene-Coverage-main/lastBuild/jacoco/

On Tue, Oct 4, 2022 at 6:51 AM Robert Muir  wrote:
>
> we can run the tests with coverage option and produce coverage graph
> from the github actions, but need to look at the docs to see where to
> put it so it will be available.
>
> I want us to be careful about the word "check" as I'm adamantly
> against any such automated check (e.g. coverage > N%) in the logic.
> Coverage report is just a tool to help us and the moment we do stupid
> shit like that, is the moment people start gaming it just to make the
> build pass.
>
> On Mon, Oct 3, 2022 at 10:57 PM Patrick Zhai  wrote:
> >
> > Hi folks,
> > I'm not sure whether people have already discussed this but I'm wondering 
> > whether we want to add a workflow that pulls out the code coverage whenever 
> > a PR was created? It should be easier for both the reviewers and the 
> > contributors to figure out what can be improved, or at least figure out a 
> > part that is probably not covered by the tests?
> >
> > Best
> > Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Code coverage check for PRs

2022-10-04 Thread Robert Muir
We can run the tests with the coverage option and produce a coverage
graph from the github actions, but we need to look at the docs to see
where to put it so it will be available.

I want us to be careful about the word "check" as I'm adamantly
against any such automated check (e.g. coverage > N%) in the logic.
The coverage report is just a tool to help us, and the moment we do stupid
shit like that is the moment people start gaming it just to make the
build pass.

On Mon, Oct 3, 2022 at 10:57 PM Patrick Zhai  wrote:
>
> Hi folks,
> I'm not sure whether people have already discussed this but I'm wondering 
> whether we want to add a workflow that pulls out the code coverage whenever a 
> PR was created? It should be easier for both the reviewers and the 
> contributors to figure out what can be improved, or at least figure out a 
> part that is probably not covered by the tests?
>
> Best
> Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-9.x-Linux (64bit/jdk-18) - Build # 5866 - Unstable!

2022-09-30 Thread Robert Muir
As for the issue, I've been meaning to look into it... just have
not found the time yet. I think it will work to run the reproducing
seed with -Ptests.profile=true -Ptests.profile.mode=heap, to debug the
heap allocations with JFR and see where it's blowing up with respect
to the multiplier setting. I don't have the energy in me to inspect a
heap dump :)
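
Something along these lines should do it (seed taken from the failure
below; the exact property spelling is from memory, see help/tests.txt):

  ./gradlew -p lucene/core test --tests TestBoolean2 \
    -Ptests.seed=1ACF9F67D361176D -Ptests.profile=true -Ptests.profile.mode=heap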

On Fri, Sep 30, 2022 at 4:01 AM Uwe Schindler  wrote:
>
> Thanks Robert for the link to the issue. I have to get used to Github issues
> :-). I was looking in JIRA :-)
>
> We should have a big warning on JIRA that it is not only read-only but
> also not up to date. Spring framework redirects all JIRA issues and the
> JIRA homepage to the corresponding Github issues. This can be done by a
> RewriteMap in the proxy server before JIRA. The map could easily be
> generated from our collected metadata (JIRA issue number -> Github number).
>
> Uwe
>
> Am 30.09.2022 um 09:51 schrieb Robert Muir:
> > I've seen this failure before here:
> > https://github.com/apache/lucene/issues/11754
> >
> > From what I remember, it seems something blows up with the test
> > multiplier, which causes the heap usage.
> >
> > On Fri, Sep 30, 2022 at 3:17 AM Uwe Schindler  wrote:
> >> Hi,
> >>
> >> I have never seen this before. It looks like something in this test ran
> >> out of memory (too many clauses, too large top-n?).
> >>
> >> Uwe
> >>
> >> Am 30.09.2022 um 07:48 schrieb Policeman Jenkins Server:
> >>> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/5866/
> >>> Java: 64bit/jdk-18 -XX:-UseCompressedOops -XX:+UseParallelGC
> >>>
> >>> 1 tests failed.
> >>> FAILED:  org.apache.lucene.search.TestBoolean2.testRandomQueries
> >>>
> >>> Error Message:
> >>> java.lang.OutOfMemoryError: Java heap space
> >>>
> >>> Stack Trace:
> >>> java.lang.OutOfMemoryError: Java heap space
> >>>at 
> >>> __randomizedtesting.SeedInfo.seed([1ACF9F67D361176D:44E42F8BEBBB3AF3]:0)
> >>>at 
> >>> org.apache.lucene.util.PriorityQueue.(PriorityQueue.java:96)
> >>>at 
> >>> org.apache.lucene.util.PriorityQueue.(PriorityQueue.java:43)
> >>>at 
> >>> org.apache.lucene.search.FieldValueHitQueue.(FieldValueHitQueue.java:123)
> >>>at 
> >>> org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.(FieldValueHitQueue.java:59)
> >>>at 
> >>> org.apache.lucene.search.FieldValueHitQueue.create(FieldValueHitQueue.java:159)
> >>>at 
> >>> org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:454)
> >>>at 
> >>> org.apache.lucene.search.TopFieldCollector$1.newCollector(TopFieldCollector.java:501)
> >>>at 
> >>> org.apache.lucene.search.TopFieldCollector$1.newCollector(TopFieldCollector.java:493)
> >>>at 
> >>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:669)
> >>>at 
> >>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
> >>>at 
> >>> org.apache.lucene.search.TestBoolean2.testRandomQueries(TestBoolean2.java:406)
> >>>at 
> >>> java.base/java.lang.invoke.LambdaForm$DMH/0x000800d6.invokeVirtual(LambdaForm$DMH)
> >>>at 
> >>> java.base/java.lang.invoke.LambdaForm$MH/0x000800ca4800.invoke(LambdaForm$MH)
> >>>at 
> >>> java.base/java.lang.invoke.Invokers$Holder.invokeExact_MT(Invokers$Holder)
> >>>at 
> >>> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(DirectMethodHandleAccessor.java:154)
> >>>at 
> >>> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> >>>at java.base/java.lang.reflect.Method.invoke(Method.java:577)
> >>>at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> >>>at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> >>>at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> >>>at 
> >>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> >>>at 
> >>> org.apache.lucene.tests.util

Re: [JENKINS] Lucene-9.x-Linux (64bit/jdk-18) - Build # 5866 - Unstable!

2022-09-30 Thread Robert Muir
I've seen this failure before here:
https://github.com/apache/lucene/issues/11754

From what I remember, it seems something blows up with the test
multiplier, which causes the heap usage.

On Fri, Sep 30, 2022 at 3:17 AM Uwe Schindler  wrote:
>
> Hi,
>
> I have never seen this before. It looks like something in this test ran
> out of memory (too many clauses, too large top-n?).
>
> Uwe
>
> Am 30.09.2022 um 07:48 schrieb Policeman Jenkins Server:
> > Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/5866/
> > Java: 64bit/jdk-18 -XX:-UseCompressedOops -XX:+UseParallelGC
> >
> > 1 tests failed.
> > FAILED:  org.apache.lucene.search.TestBoolean2.testRandomQueries
> >
> > Error Message:
> > java.lang.OutOfMemoryError: Java heap space
> >
> > Stack Trace:
> > java.lang.OutOfMemoryError: Java heap space
> >   at 
> > __randomizedtesting.SeedInfo.seed([1ACF9F67D361176D:44E42F8BEBBB3AF3]:0)
> >   at org.apache.lucene.util.PriorityQueue.(PriorityQueue.java:96)
> >   at org.apache.lucene.util.PriorityQueue.(PriorityQueue.java:43)
> >   at 
> > org.apache.lucene.search.FieldValueHitQueue.(FieldValueHitQueue.java:123)
> >   at 
> > org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.(FieldValueHitQueue.java:59)
> >   at 
> > org.apache.lucene.search.FieldValueHitQueue.create(FieldValueHitQueue.java:159)
> >   at 
> > org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:454)
> >   at 
> > org.apache.lucene.search.TopFieldCollector$1.newCollector(TopFieldCollector.java:501)
> >   at 
> > org.apache.lucene.search.TopFieldCollector$1.newCollector(TopFieldCollector.java:493)
> >   at 
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:669)
> >   at 
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
> >   at 
> > org.apache.lucene.search.TestBoolean2.testRandomQueries(TestBoolean2.java:406)
> >   at 
> > java.base/java.lang.invoke.LambdaForm$DMH/0x000800d6.invokeVirtual(LambdaForm$DMH)
> >   at 
> > java.base/java.lang.invoke.LambdaForm$MH/0x000800ca4800.invoke(LambdaForm$MH)
> >   at 
> > java.base/java.lang.invoke.Invokers$Holder.invokeExact_MT(Invokers$Holder)
> >   at 
> > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(DirectMethodHandleAccessor.java:154)
> >   at 
> > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> >   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
> >   at 
> > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> >   at 
> > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> >   at 
> > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> >   at 
> > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> >   at 
> > org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> >   at 
> > org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> >   at 
> > org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> >   at 
> > org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> >   at 
> > org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> >   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> >   at 
> > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> >   at 
> > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> >   at 
> > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> >   at 
> > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> >   at 
> > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> >
> > -
> > To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: builds-h...@lucene.apache.org
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IMPORTANT: Please update your gradle.properties file in your Lucene checkout!

2022-09-27 Thread Robert Muir
The 'gradlew -q javaToolChains' command is useful to see which JVMs
Gradle knows about.
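
Related knobs in gradle.properties, if your JDK lives in a non-standard
spot (the path is just an example; the second property is standard
Gradle toolchain config):

  org.gradle.java.installations.auto-download=true
  org.gradle.java.installations.paths=/path/to/jdk19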

On Tue, Sep 27, 2022 at 3:34 PM Uwe Schindler  wrote:
>
> You just need to recreate Gradle properties, e.g. by deleting the old file.
>
> If you do not change anything Gradle will just work. On first build it will
> auto-provision JDK 19 into the Gradle cache like any other dependency (just
> like Maven artifacts) and use it to compile the Java 19-specific classes. The
> java home should still point to Java 17 or Java 11.
>
> The environment var is just needed if you have JDK 19 at a non-standard
> location AND you don't want Gradle to download it automatically (Robert did
> not want that).
>
> Uwe
>
> P.S.: It will also check, before downloading, if you have a version of 19
> installed in the OS-dependent standard locations (Ubuntu, ...) or from the
> Windows registry or macOS installer.
>
> Am 27. September 2022 19:48:05 MESZ schrieb David Smiley :
>>
>> > If you do not want Gradle to auto-provision the Java 19 for compilation of 
>> > those Preview classes, pass environment variable 
>> > JAVA19_HOME=/path/to/jdk19 to your build!
>>
>> That seems inverted; maybe I misunderstand?  If say we're working locally 
>> without Java 19 and don't want to bother it during dev, we should still have 
>> an env variable pointing to it?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Sep 26, 2022 at 9:57 AM Uwe Schindler  wrote:
>>>
>>> Hi,
>>>
>>> with deleting the file, I meant the "gradle.properties" in the lucene 
>>> checkout.
>>>
>>> Uwe
>>>
>>> Am 26.09.2022 um 15:44 schrieb Uwe Schindler:
>>>
>>> Hey,
>>>
>>> after merge of Java 19 support to main, 9.x and to-be-released 9.4, there 
>>> is a small change needed in your gradle.properties file. In earlier version 
>>> we disabled auto-provisioning of JDK releases for compilation, but now it 
>>> is required.
>>>
>>> If your build hangs at :lucene:core:compileMain19Java saying that there's no 
>>> release of Java 19 available, please change your gradle.properties in your 
>>> home folder to enable this feature:
>>>
>>> org.gradle.java.installations.auto-download=true
>>>
>>> If you delete the file and let the build system regenerate it, all will 
>>> work out of box. So you have the choice: Delete the file to regenerate 
>>> defaults or modify above property!
>>>
>>> Please also note that, depending on your build system, the classes in 
>>> lucene/core/src/java19 may not compile (e.g. in Eclipse). I will work on 
>>> this in the following weeks. For now just ignore the compilation unit or 
>>> delete it from your IDE config. I may do something automatically using our 
>>> IDE autoconfiguration.
>>>
>>> If you do not want Gradle to auto-provision the Java 19 for compilation of 
>>> those Preview classes, pass environment variable JAVA19_HOME=/path/to/jdk19 
>>> to your build!
>>>
>>> To actually test the new code: Build the Lucene JAR and run the test suite 
>>> with RUNTIME_JAVA_HOME=/path/to/jdk19; alternatively compile your 
>>> application and pass "--enable-preview" to the Java command line!
>>>
>>> Thanks,
>>>
>>> Uwe
>>>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.4.0 RC3

2022-09-27 Thread Robert Muir
+1

Smoketester works for me again without hassles, thanks Uwe.

I tested both java 11 and java 17.

SUCCESS! [2:49:13.336252]

P.S. It would be a nice option in the future to be able to test other
versions that we have MR-jar'd code for (e.g. 19 in this case).

On Tue, Sep 27, 2022 at 9:15 AM Michael Sokolov  wrote:
>
> Please vote for release candidate 3 for Lucene 9.4.0
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.0-RC3-rev-d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.0-RC3-rev-d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956
>
> The vote will be open for at least 72 hours i.e. until 2022-09-30 14:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.4.0 RC1

2022-09-21 Thread Robert Muir
+1

Ran the smoketester with both java 11 and 17:

SUCCESS! [2:41:19.024193]

On Tue, Sep 20, 2022 at 10:10 PM Michael Sokolov  wrote:
>
> Please vote for release candidate 1 for Lucene 9.4.0
>
> The artifacts can be downloaded from:
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.0-RC1-rev-f5d0646daa5651f2192282ac85551bca667e34f9
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.0-RC1-rev-f5d0646daa5651f2192282ac85551bca667e34f9
>
> The vote will be open for at least 72 hours i.e. until 2022-09-24 02:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 759 - Failure!

2022-09-13 Thread Robert Muir
We can also potentially avoid them and reduce the amount of back-and-forth
by pulling from the final URL directly instead of bouncing through redirects:
https://raw.githubusercontent.com/gradle/gradle/v7.3.3/gradle/wrapper/gradle-wrapper.jar
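
e.g. a retry-friendly fetch could look like this (curl flags from
memory; --retry-all-errors needs a reasonably recent curl):

  curl -fL --retry 5 --retry-all-errors \
    -o gradle/wrapper/gradle-wrapper.jar \
    https://raw.githubusercontent.com/gradle/gradle/v7.3.3/gradle/wrapper/gradle-wrapper.jar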

On Tue, Sep 13, 2022 at 3:20 AM Dawid Weiss  wrote:
>
> These 500/503s are getting annoying. Let's see if it's something that
> can be fixed with a simple retry mechanism.
>
> https://github.com/apache/lucene/pull/11766
>
> Dawid
>
> On Tue, Sep 13, 2022 at 7:59 AM Apache Jenkins Server
>  wrote:
> >
> > Build: 
> > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/759/
> >
> > No tests ran.
> >
> > Build Log:
> > [...truncated 33 lines...]
> > ERROR: Could not download gradle-wrapper.jar (Server returned HTTP response 
> > code: 500 for URL: 
> > https://github.com/gradle/gradle/raw/v7.3.3/gradle/wrapper/gradle-wrapper.jar).
> > Build step 'Invoke Gradle script' changed build result to FAILURE
> > Build step 'Invoke Gradle script' marked build as failure
> > Archiving artifacts
> > Recording test results
> > ERROR: Step ‘Publish JUnit test result report’ failed: No test report files 
> > were found. Configuration error?
> > Email was triggered for: Failure - Any
> > Sending email for trigger: Failure - Any
> >
> > -
> > To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: builds-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: release notes question

2022-09-02 Thread Robert Muir
Take a look here for the older ones:

https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes

On one hand you have to deal with Confluence, but using the wiki has
the advantage that other people can edit it. So you can basically
copy-paste from a previous one as a template and enlist help from
others summarizing features and stuff.


On Fri, Sep 2, 2022 at 3:46 PM Michael Sokolov  wrote:
>
> Hi Lucene devs, I'm going through the release manager script, and
> coming to the point where it talks about writing release notes. It
> suggests starting from a previous release note on the confluence wiki,
> but it seems we haven't been using that for 9.x releases. Can previous
> release managers give some guidance on where to start here? I see the
> previous release notes on the web site (eg
> https://lucene.apache.org/core/corenews.html#apache-lucenetm-930-available),
> but do we have a standard place where we keep these in archived
> editable form?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [lucene] branch main updated: SimpleText knn vectors; fix searchExhaustively and suppress a byte format test case (#11725)

2022-08-31 Thread Robert Muir
thanks for fixing!

On Wed, Aug 31, 2022 at 2:43 PM Michael Sokolov  wrote:
>
> Oh -- sorry, I guess I forgot to backport. Thanks for tracking it down
> - I'll push to branch_9x shortly
>
> On Wed, Aug 31, 2022 at 10:25 AM Robert Muir  wrote:
> >
> > can we backport to 9.x if you get a chance? I'm still seeing this test
> > trip in 9.x jenkins builds.
> >
> >
> > On Mon, Aug 29, 2022 at 11:50 AM  wrote:
> > >
> > > This is an automated email from the ASF dual-hosted git repository.
> > >
> > > sokolov pushed a commit to branch main
> > > in repository https://gitbox.apache.org/repos/asf/lucene.git
> > >
> > >
> > > The following commit(s) were added to refs/heads/main by this push:
> > >  new 61ef031f7fa SimpleText knn vectors; fix searchExhaustively and 
> > > suppress a byte format test case (#11725)
> > > 61ef031f7fa is described below
> > >
> > > commit 61ef031f7fa3abdd7c8c2f36db71ad2289b66131
> > > Author: Michael Sokolov 
> > > AuthorDate: Mon Aug 29 11:49:52 2022 -0400
> > >
> > > SimpleText knn vectors; fix searchExhaustively and suppress a byte 
> > > format test case (#11725)
> > > ---
> > >  .../apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java | 6 
> > > +++---
> > >  lucene/core/src/test/org/apache/lucene/document/TestField.java  | 4 
> > > 
> > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git 
> > > a/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> > >  
> > > b/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> > > index e4b0ceb5916..10700f5de6f 100644
> > > --- 
> > > a/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> > > +++ 
> > > b/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> > > @@ -41,7 +41,6 @@ import 
> > > org.apache.lucene.store.BufferedChecksumIndexInput;
> > >  import org.apache.lucene.store.ChecksumIndexInput;
> > >  import org.apache.lucene.store.IOContext;
> > >  import org.apache.lucene.store.IndexInput;
> > > -import org.apache.lucene.util.BitSet;
> > >  import org.apache.lucene.util.Bits;
> > >  import org.apache.lucene.util.BytesRef;
> > >  import org.apache.lucene.util.BytesRefBuilder;
> > > @@ -187,8 +186,9 @@ public class SimpleTextKnnVectorsReader extends 
> > > KnnVectorsReader {
> > >@Override
> > >public TopDocs searchExhaustively(
> > >String field, float[] target, int k, DocIdSetIterator acceptDocs) 
> > > throws IOException {
> > > -int numDocs = (int) acceptDocs.cost();
> > > -return search(field, target, k, BitSet.of(acceptDocs, numDocs), 
> > > Integer.MAX_VALUE);
> > > +FieldInfo info = readState.fieldInfos.fieldInfo(field);
> > > +VectorSimilarityFunction vectorSimilarity = 
> > > info.getVectorSimilarityFunction();
> > > +return exhaustiveSearch(getVectorValues(field), acceptDocs, 
> > > vectorSimilarity, target, k);
> > >}
> > >
> > >@Override
> > > diff --git 
> > > a/lucene/core/src/test/org/apache/lucene/document/TestField.java 
> > > b/lucene/core/src/test/org/apache/lucene/document/TestField.java
> > > index 781f2b613c6..6aa5518f33b 100644
> > > --- a/lucene/core/src/test/org/apache/lucene/document/TestField.java
> > > +++ b/lucene/core/src/test/org/apache/lucene/document/TestField.java
> > > @@ -20,6 +20,7 @@ import static 
> > > org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
> > >
> > >  import java.io.StringReader;
> > >  import java.nio.charset.StandardCharsets;
> > > +import org.apache.lucene.codecs.Codec;
> > >  import org.apache.lucene.index.DirectoryReader;
> > >  import org.apache.lucene.index.IndexReader;
> > >  import org.apache.lucene.index.IndexWriter;
> > > @@ -513,6 +514,9 @@ public class TestField extends LuceneTestCase {
> > >}
> > >
> > >public void testKnnVectorField() throws Exception {
> > > +if (Codec.getDefault().getName().equals("SimpleText")) {
> > > +  return;
> > > +}
> > >  try (Directory dir = newDirectory();
> > >  IndexWriter w = new IndexWriter(dir, newIndexWriterConfig())) {
> > >Document doc = new Document();
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene » Lucene-Check-main - Build # 6584 - Failure!

2022-08-31 Thread Robert Muir
Maybe the OOMKiller kicked in.

On Wed, Aug 31, 2022 at 3:06 PM Dawid Weiss  wrote:
>
>
> I think Lucene tests killed the job runner. :)
>
> > Task :lucene:analysis:nori:spotlessJavaCheck
> > Task :lucene:analysis:nori:spotlessCheck
> FATAL: command execution failed
> java.io.IOException: Backing channel 'lucene-solr-1' is disconnected.
> [...]
>
> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>
>
>
> On Wed, Aug 31, 2022 at 8:15 PM Apache Jenkins Server 
>  wrote:
>>
>> Build: https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/6584/
>>
>> No tests ran.
>>
>> Build Log:
>> [...truncated 959 lines...]
>> FATAL: command execution failed
>> java.io.IOException: Backing channel 'lucene-solr-1' is disconnected.
>> at 
>> hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
>> at 
>> hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
>> at com.sun.proxy.$Proxy179.isAlive(Unknown Source)
>> at 
>> hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1215)
>> at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1207)
>> at hudson.Launcher$ProcStarter.join(Launcher.java:524)
>> at hudson.plugins.gradle.Gradle.perform(Gradle.java:317)
>> at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>> at 
>> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:814)
>> at hudson.model.Build$BuildExecution.build(Build.java:199)
>> at hudson.model.Build$BuildExecution.doRun(Build.java:164)
>> at 
>> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:522)
>> at hudson.model.Run.execute(Run.java:1896)
>> at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
>> at 
>> hudson.model.ResourceController.execute(ResourceController.java:101)
>> at hudson.model.Executor.run(Executor.java:442)
>> Caused by: java.io.IOException: Pipe closed after 0 cycles
>> at 
>> org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:126)
>> at 
>> org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:105)
>> at 
>> hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:94)
>> at 
>> hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:75)
>> at 
>> hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:105)
>> at 
>> hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
>> at 
>> hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
>> at 
>> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
>> Build step 'Invoke Gradle script' changed build result to FAILURE
>> Build step 'Invoke Gradle script' marked build as failure
>> ERROR: Step ‘Archive the artifacts’ failed: no workspace for 
>> Lucene/Lucene-Check-main #6584
>> ERROR: Step ‘Publish JUnit test result report’ failed: no workspace for 
>> Lucene/Lucene-Check-main #6584
>> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>> Email was triggered for: Failure - Any
>> Sending email for trigger: Failure - Any
>> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>> ERROR: lucene-solr-1 is offline; cannot locate jdk_17_latest
>>
>> -
>> To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: builds-h...@lucene.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [lucene] branch main updated: SimpleText knn vectors; fix searchExhaustively and suppress a byte format test case (#11725)

2022-08-31 Thread Robert Muir
Can we backport to 9.x if you get a chance? I'm still seeing this test
trip in 9.x jenkins builds.


On Mon, Aug 29, 2022 at 11:50 AM  wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> sokolov pushed a commit to branch main
> in repository https://gitbox.apache.org/repos/asf/lucene.git
>
>
> The following commit(s) were added to refs/heads/main by this push:
>  new 61ef031f7fa SimpleText knn vectors; fix searchExhaustively and 
> suppress a byte format test case (#11725)
> 61ef031f7fa is described below
>
> commit 61ef031f7fa3abdd7c8c2f36db71ad2289b66131
> Author: Michael Sokolov 
> AuthorDate: Mon Aug 29 11:49:52 2022 -0400
>
> SimpleText knn vectors; fix searchExhaustively and suppress a byte format 
> test case (#11725)
> ---
>  .../apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java | 6 
> +++---
>  lucene/core/src/test/org/apache/lucene/document/TestField.java  | 4 ++++
>  2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
>  
> b/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> index e4b0ceb5916..10700f5de6f 100644
> --- 
> a/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> +++ 
> b/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
> @@ -41,7 +41,6 @@ import org.apache.lucene.store.BufferedChecksumIndexInput;
>  import org.apache.lucene.store.ChecksumIndexInput;
>  import org.apache.lucene.store.IOContext;
>  import org.apache.lucene.store.IndexInput;
> -import org.apache.lucene.util.BitSet;
>  import org.apache.lucene.util.Bits;
>  import org.apache.lucene.util.BytesRef;
>  import org.apache.lucene.util.BytesRefBuilder;
> @@ -187,8 +186,9 @@ public class SimpleTextKnnVectorsReader extends 
> KnnVectorsReader {
>@Override
>public TopDocs searchExhaustively(
>String field, float[] target, int k, DocIdSetIterator acceptDocs) 
> throws IOException {
> -int numDocs = (int) acceptDocs.cost();
> -return search(field, target, k, BitSet.of(acceptDocs, numDocs), 
> Integer.MAX_VALUE);
> +FieldInfo info = readState.fieldInfos.fieldInfo(field);
> +VectorSimilarityFunction vectorSimilarity = 
> info.getVectorSimilarityFunction();
> +return exhaustiveSearch(getVectorValues(field), acceptDocs, 
> vectorSimilarity, target, k);
>}
>
>@Override
> diff --git a/lucene/core/src/test/org/apache/lucene/document/TestField.java 
> b/lucene/core/src/test/org/apache/lucene/document/TestField.java
> index 781f2b613c6..6aa5518f33b 100644
> --- a/lucene/core/src/test/org/apache/lucene/document/TestField.java
> +++ b/lucene/core/src/test/org/apache/lucene/document/TestField.java
> @@ -20,6 +20,7 @@ import static 
> org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
>
>  import java.io.StringReader;
>  import java.nio.charset.StandardCharsets;
> +import org.apache.lucene.codecs.Codec;
>  import org.apache.lucene.index.DirectoryReader;
>  import org.apache.lucene.index.IndexReader;
>  import org.apache.lucene.index.IndexWriter;
> @@ -513,6 +514,9 @@ public class TestField extends LuceneTestCase {
>}
>
>public void testKnnVectorField() throws Exception {
> +if (Codec.getDefault().getName().equals("SimpleText")) {
> +  return;
> +}
>  try (Directory dir = newDirectory();
>  IndexWriter w = new IndexWriter(dir, newIndexWriterConfig())) {
>Document doc = new Document();
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Label vs. Milestone for version management?

2022-08-25 Thread Robert Muir
On Thu, Aug 25, 2022 at 9:47 AM Michael Sokolov  wrote:
>
> I agree; I've always used CHANGES for a quick historical view. What
> about the release manager use case? I haven't done a release, but I
> think we generally want to know if people are targeting changes for an
> upcoming release, especially if they are blockers. We could just use
> email to find out about these, but I think it's better if we can look
> them up in the issue db.

Mark them as priority blocker with a tag? That's all you could do with
JIRA, too.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Label vs. Milestone for version management?

2022-08-25 Thread Robert Muir
On Thu, Aug 25, 2022 at 6:11 AM Michael Sokolov  wrote:
>
> The milestone looks appealing since it is prominent and relatively easy to 
> use. The only drawback I have heard is that it is single valued. It still 
> seems we could use it to document the first version in which something is 
> released, although it wouldn't be possible to record other releases into 
> which a fix or feature is back ported.

The fix-version stuff seems like a JIRA relic to me. There are at
least two other places to get the information. If someone wants to
know this, they can see all the commits to the branches, can check
CHANGES.txt, etc?

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable!

2022-08-24 Thread Robert Muir
On Wed, Aug 24, 2022 at 11:40 AM Uwe Schindler  wrote:
>
> Hi,
>
> this is the MacOS virtualbox. This one often has time shifts caused by
> Virtualbox, and the NTP daemon of OSX is bullshit (no chrony).
>
> Actually, earlier versions of MacOS had a bug in their OS libc that
> made apps crash on backwards jumps of wall time; it was fixed a few
> years ago. Now it looks like sometimes only Gradle/Java hangs because
> of this. MacOS and backwards-jumping time do not fit well! Maybe a
> reason why Apple does not like their OS virtualized :-) Their bullshit
> kernel only works on 100% Intel CPUs with all hardware keeping time
> exactly in order.
>

Honestly, some of it is the virtualbox, too. Once you eliminate or
work around wall-clock time and just deal with monotonic time, there
can still be annoying issues with just monotonic time. With a linux
guest, you'll see strange stuff, such as kernel's softlockup detector
trip a lot when this happens. There are corresponding errors printed
in the vbox logging too. I set VBOX_RELEASE_LOG_DEST to allow
archiving the virtualbox VM log for jenkins pickup along with other
logs: it helps with debugging shit like this. For linux guest, I
basically exhausted all possible kernel clock sources, and found that the
virtualized kvm-clock one, which is the default, is the best by far.
I'm guessing MacOS may not support this, which probably makes things
worse there. I found in my environment for linux guests, remaining
timer issues can be greatly improved with a 'vboxmanage setextradata
 VBoxInternal/TM/TSCModeSwitchAllowed 0'. Don't ask me what it
does :)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable!

2022-08-24 Thread Robert Muir
If we look at the 7687 issue, there are definitely some that can be
explained by unruly tests randomly behaving badly. But a few of those
(such as simple stemmer tests) look suspicious to me.
I've fought the issue with my own tests (non-java) and it's amazing how
much stuff can break if it relies on wall-clock time and the clock
gets stepped. I'm talking about basic 20-year-old mature C code too :)
It is also surprising how large these clock corrections can be with
virtual machines.

To really confirm it, we'd need "system logs" as well to correlate the
NTP activity with the failure. With virtualbox jenkins builds, I do
this by enabling a serial console to file, and configuring syslog to log
to /dev/console. And this "system log file" is just another artifact
that jenkins saves away for debugging. That's how i found the problem
in my own tests.

On Wed, Aug 24, 2022 at 9:08 AM Dawid Weiss  wrote:
>
> Damn. I know about it but never had it happen to me. You're right in
> that it could be a reason and it's definitely one of the aspects I can
> take off the checklist. It looks strange because those timeouts are
> fairly high - the time correction would indeed have to be significant
> for this to fail (and in the middle of the process?!). Anyway, I'll
> look into this - thanks for the pointer!
>
> Dawid
>
> On Wed, Aug 24, 2022 at 1:39 PM Robert Muir  wrote:
> >
> > Hi Dawid, I looked at this and also 
> > https://github.com/apache/lucene/issues/7687
> >
> > If you look at the instances and how sporadic they are, the problem
> > could be caused by TimeoutSuite using wall-clock time in
> > com.carrotsearch.randomizedtesting? Especially in virtual machines,
> > wall-clock time can be extremely inaccurate when you spin them up,
> > then there's a big correction (via NTP or VM agent).
> >
> > I have no proof this is what is happening, except to say, I think it
> > would be better if randomizedtesting used monotonic time (nanoTime)
> > rather than wall-clock time (currentTimeMillis). It would make it more
> > robust.
> >
> >
> > On Wed, Aug 24, 2022 at 4:48 AM Dawid Weiss  wrote:
> > >
> > > A test timed out. I've beasted with the same settings but can't
> > > reproduce. Either JVM bug somewhere or cosmic interference...
> > >
> > > Dawid
> > >
> > > On Wed, Aug 24, 2022 at 3:32 AM Policeman Jenkins Server
> > >  wrote:
> > > >
> > > > Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/978/
> > > > Java: 64bit/jdk-18 -XX:+UseCompressedOops -XX:+UseSerialGC
> > > >
> > > > 2 tests failed.
> > > > FAILED:  
> > > > org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.testRandomData
> > > >
> > > > Error Message:
> > > > java.lang.Exception: Test abandoned because suite timeout was reached.
> > > >
> > > > Stack Trace:
> > > > java.lang.Exception: Test abandoned because suite timeout was reached.
> > > > at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
> > > >
> > > >
> > > > FAILED:  
> > > > org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.classMethod
> > > >
> > > > Error Message:
> > > > java.lang.Exception: Suite timeout exceeded (>= 720 msec).
> > > >
> > > > Stack Trace:
> > > > java.lang.Exception: Suite timeout exceeded (>= 720 msec).
> > > > at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
> > > >
> > > > -
> > > > To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: builds-h...@lucene.apache.org
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable!

2022-08-24 Thread Robert Muir
Hi Dawid, I looked at this and also https://github.com/apache/lucene/issues/7687

If you look at the instances and how sporadic they are, the problem
could be caused by TimeoutSuite using wall-clock time in
com.carrotsearch.randomizedtesting? Especially in virtual machines,
wall-clock time can be extremely inaccurate when you spin them up,
then there's a big correction (via NTP or VM agent).

I have no proof this is what is happening, except to say, I think it
would be better if randomizedtesting used monotonic time (nanoTime)
rather than wall-clock time (currentTimeMillis). It would make it more
robust.
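
To illustrate (a totally made-up sketch, not randomizedtesting's actual
code; class and variable names are mine): a deadline built on nanoTime
survives clock steps, while one built on currentTimeMillis does not:

import java.util.concurrent.TimeUnit;

public class DeadlineSketch {
  public static void main(String[] args) throws Exception {
    long timeoutMillis = 1000;
    // Wall-clock deadline: wrong by however much NTP steps the clock.
    long wallDeadline = System.currentTimeMillis() + timeoutMillis;
    // Monotonic deadline: immune to wall-clock corrections.
    long monoDeadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (System.nanoTime() - monoDeadline < 0) { // overflow-safe comparison
      Thread.sleep(10);
    }
    System.out.println("overshoot per wall clock (nonsense after a clock step): "
        + (System.currentTimeMillis() - wallDeadline) + " ms");
  }
}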


On Wed, Aug 24, 2022 at 4:48 AM Dawid Weiss  wrote:
>
> A test timed out. I've beasted with the same settings but can't
> reproduce. Either JVM bug somewhere or cosmic interference...
>
> Dawid
>
> On Wed, Aug 24, 2022 at 3:32 AM Policeman Jenkins Server
>  wrote:
> >
> > Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/978/
> > Java: 64bit/jdk-18 -XX:+UseCompressedOops -XX:+UseSerialGC
> >
> > 2 tests failed.
> > FAILED:  
> > org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.testRandomData
> >
> > Error Message:
> > java.lang.Exception: Test abandoned because suite timeout was reached.
> >
> > Stack Trace:
> > java.lang.Exception: Test abandoned because suite timeout was reached.
> > at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
> >
> >
> > FAILED:  
> > org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.classMethod
> >
> > Error Message:
> > java.lang.Exception: Suite timeout exceeded (>= 720 msec).
> >
> > Stack Trace:
> > java.lang.Exception: Suite timeout exceeded (>= 720 msec).
> > at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
> >
> > -
> > To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: builds-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Boolean query regression after migrating from Lucene 8.5 to 9.2

2022-08-19 Thread Robert Muir
On Thu, Aug 18, 2022 at 1:47 PM Alexander Lukyanchikov
 wrote:

>
> Currently we are trying to avoid switching to MMAP because there is another
> process running on the same host that extensively utilizes the FS cache.
>

This makes no sense: NIOFSDirectory uses the FS cache the exact same
way as mmap; it just uses the read() interface instead.

A self-created problem!
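
To make it concrete (sketch only; the path is made up), the two are
opened the same way and both read through the OS page cache:

import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class DirectorySketch {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of("/path/to/index"); // made-up path
    // Both go through the page cache; they differ only in the syscall
    // used to access it (mmap vs. read).
    try (Directory mmap = new MMapDirectory(indexPath);
        Directory nio = new NIOFSDirectory(indexPath)) {
      // open IndexReaders etc. on either one
    }
  }
}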

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [RESULT] [VOTE] Migration to GitHub issue from Jira

2022-06-18 Thread Robert Muir
On Sat, Jun 18, 2022, 7:42 AM Tomoko Uchida 
wrote:

> User id mapping is an important consideration for me.
>
> Can we find a mapping from Jira user id to GitHub account anywhere?
>

I think we would have to create it. But my hope would be that maybe 50-100
names would cover the large majority of issues.

> Don't we have to gain the consent of each individual to map both accounts?
>

No, we don't have to ask permission to mention someone with an @username


> 2022年6月18日(土) 18:52 Robert Muir :
> >
> > I looked at some related projects on github:
> > https://github.com/Skraeda/jira-2-github
> > Does the barebones basics but helps you think of the inputs: "username
> > mapping", "release -> milestone mapping", etc. Of course for a
> > username mapping, maybe its best to just handle the top 99% or so and
> > let the long-tail just come across as "full name". I also find plenty
> > of projects that convert "special jira language" to markdown, e.g.
> > https://github.com/catcombo/jira2markdown
> > I'm not convinced conversion would be degraded; with a little bit of
> > thought put into the conversion, I think it could actually be *better*.
> > github issues can do everything jira can, just without the fussy UI.
> > e.g. issues can have attachments (for all the patch files), and
> > attachment names can have duplicates. Issues can link to other issues,
> > commits, or PRs easily.
> >
> > It just depends on how much we want to invest into it. If we want to
> > really go whole-hog, then when we do the initial JIRA->issue
> > conversion, we should *save that mapping* as a .CSV file or similar.
> > Because later we could then use it to find/replace URLs in
> > Changes.txt, source code, benchmark annotations, etc etc. Let's at
> > least leave the possibility open to do that work as followup.
> >
> > I find the idea that we're stuck looking at JIRA forever ridiculous.
> >
> > On Sat, Jun 18, 2022 at 3:19 AM Dawid Weiss 
> wrote:
> > >
> > >
> > > I honestly don't know what can be done and what has to be sacrificed.
> I'm pretty sure it'll be more difficult than svn->git conversion because
> more factors are involved. One tough thing to somehow preserve may be user
> names (reporters, etc.). I'm not sure how other projects dealt with that.
> > >
> > > Perhaps a way to do it incrementally would be to create a json/xml
> (structured) dump of jira content and then write a converter into a similar
> json/xml dump for importing into github. I remember it took many iterations
> and trial and error for svn->git conversion to eventually reach the final
> shape and it was simpler  and faster to do it locally.
> > >
> > > Dawid
> > >
> > > On Sat, Jun 18, 2022 at 8:59 AM Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> I'll give it a try though, I'm really skeptical that it can be done
> > >> with a satisfactory level of quality (we want to "preserve" issue
> > >> history, not just to have shallow/degraded copies, right?), and the
> > >> migration will be significantly delayed while we figure out how to
> > >> properly move all issues to GitHub.
> > >> If there is another way to bypass this challenge - please let me know.
> > >>
> > >> Tomoko
> > >>
> > >> 2022年6月18日(土) 15:44 Dawid Weiss :
> > >>
> > >> >
> > >> >
> > >> > Hi Tomoko,
> > >> >
> > >> > I've added a few bullet points that script could/should handle
> under LUCENE-10557, hope you don't mind. If you place these script(s) in
> the open then perhaps indeed we could try to collaborate and see what can
> be done.
> > >> >
> > >> > Dawid
> > >> >
> > >> > On Sat, Jun 18, 2022 at 5:33 AM Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> > >> >>
> > >> >> Replying to myself - Jira issues can be read via REST API without
> any
> > >> >> access token and we can iterate all issues by issue number.
> > >> >> curl -s
> https://issues.apache.org/jira/rest/api/latest/issue/LUCENE-10557
> > >> >>
> > >> >> Would you please hold the discussion for a while - it's a waste of
> our
> > >> >> time without a working prototype to me. I will be back here with a
> > >> >> sandbox github repo where part of existing jira issues are migrated
> > >> >> (with

Re: [RESULT] [VOTE] Migration to GitHub issue from Jira

2022-06-18 Thread Robert Muir
>> > >
>> >> > > Tomoko
>> >> > >
>> >> > > 2022年6月18日(土) 9:26 Tomoko Uchida :
>> >> > > >
>> >> > > > I don't intend to neglect histories in Jira... it's an important,
>> >> > > > valuable asset for all of us and possible contributors in the 
>> >> > > > future.
>> >> > > >
>> >> > > > It's important, *therefore*, I don't want to have the degraded 
>> >> > > > copies
>> >> > > > of them on GitHub.
>> >> > > > We cannot preserve all of history - again, there should be tons of
>> >> > > > unignorable information losses (timestamp, reporter, assignee,
>> >> > > > markdown, metadata that cannot be ported to GitHub) if we attempt to
>> >> > > > migrate the whole Jira history into Github. Rather than trying to 
>> >> > > > have
>> >> > > > such incomplete copies, I would preserve Jira issues in the 
>> >> > > > perfectly
>> >> > > > archived status, then simply refer to them.
>> >> > > >
>> >> > > > Tomoko
>> >> > > >
>> >> > > > 2022年6月18日(土) 7:47 Gus Heck :
>> >> > > > >
>> >> > > > > I hope you count me as someone who sees history as important. 
>> >> > > > > It's important in more ways than one however. You gave the 
>> >> > > > > example of trying to understand something, and looking at the 
>> >> > > > > issue history directly. I also give weight to the scenario where 
>> >> > > > > someone has written a blog post about the topic and linked the 
>> >> > > > > issue "For the latest see LUCENE-" for example... Or someone 
>> >> > > > > planning upgrades has a spreadsheet of things to track down... 
>> >> > > > > The existing links should point to a *complete* history of the 
>> >> > > > > issue.
>> >> > > > >
>> >> > > > > I don't see the migration of everything to github as being as 
>> >> > > > > critical as you do but I'm not at all against migrating things 
>> >> > > > > that are closed if someone wants to do that work, and perhaps 
>> >> > > > > even copying over existing open issues periodically as they 
>> >> > > > > become closed (and accelerating the close rate by aggressive 
>> >> > > > > closing of silent issues). No new issues in Jira sounds fine, 
>> >> > > > > even better if enforced by Jira. Proceed from here in Github 
>> >> > > > > since that's where the community wants to go. Links to the 
>> >> > > > > migrated version automatically added to Jira and/or backlinks to 
>> >> > > > > Jira would be just fine too since readers might (hopefully 
>> >> > > > > needlessly) worry that something didn't get migrated, we should 
>> >> > > > > make it easy to check.
>> >> > > > >
>> >> > > > > What I don't want is for someone to land on an issue via link or 
>> >> > > > > via google search (or via search in jira because they are using 
>> >> > > > > Jira already for some other apache project), read through it and 
>> >> > > > > think A) it never got resolved when it did or B) miss the fact 
>> >> > > > > that it got reopened and further changes were made and only have 
>> >> > > > > half the story... or any other scenario where they are looking at 
>> >> > > > > an incomplete record of the issue. (thus obfuscating/splitting 
>> >> > > > > the very important rich history across systems).
>> >> > > > >
>> >> > > > > So that's why I feel issues should be completely tracked in the 
>> >> > > > > system where they were created. Syncing old closed stuff into a 
>> >> > > > > new system probably is fine so long as there are periodic sweeps 
>> >> > > > > to pull in reopens or newly completed issues. We could even sync 
>> >> > > > > open things so long as they are clearly marked in the title as 
>> >> > > > 

Re: [RESULT] [VOTE] Migration to GitHub issue from Jira

2022-06-17 Thread Robert Muir
On Fri, Jun 17, 2022 at 3:27 PM Dawid Weiss  wrote:
>
> I'd be more afraid of what happens to github issues in two years (or longer). 
> Will it look the same? Will it be different? Will it be gone (and how do we 
> get a backup of the isse history then)? Contrary to the apache-hosted Jira, 
> github is very much an independent entity. If Elon Musk decides to buy and 
> close it tomorrow... then what? :)
>

We already have a ton of github "issues" (pull requests, since PRs are issues).
If you want to "back them up", it's easy: you can paginate thru them
100 at a time, e.g. run this command, incrementing 'page' until it
returns an empty list:

  curl -H "Accept: application/vnd.github.v3+json"
"https://api.github.com/repos/apache/lucene/issues?per_page=100=1=asc=all;
> file1.json

Yeah of course if you want to back up the comments and stuff, you'll
need to do more.
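
For instance, a rough sketch of the same pagination loop in Java
(unauthenticated so rate-limited; the output file names are made up):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class BackupIssues {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    for (int page = 1; ; page++) {
      HttpRequest req = HttpRequest.newBuilder(URI.create(
              "https://api.github.com/repos/apache/lucene/issues"
                  + "?state=all&per_page=100&page=" + page))
          .header("Accept", "application/vnd.github.v3+json")
          .build();
      String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
      if (body.equals("[]")) {
        break; // empty page: we've paged past the last issue
      }
      Files.writeString(Path.of("issues-" + page + ".json"), body);
    }
  }
}
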
But it is already the case today that a ton of this "history" is
already in github issues, as PRs. Most recent JIRAs are just useless
placeholders.
Also the same risks apply to JIRA, except there they are not theoretical
but real concerns, no? I thought Atlassian had deprecated "onsite" JIRA
to try to sucker you into their "Atlassian Cloud":
https://www.theregister.com/2020/10/19/atlassian_server_licenses/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [RESULT] [VOTE] Migration to GitHub issue from Jira

2022-06-17 Thread Robert Muir
On Fri, Jun 17, 2022 at 12:08 PM Michael McCandless
 wrote:
>
> I agree the embedded links are tricky.  Not sure whether we could do a big 
> rewrite of those links or not ... seems a chicken/egg situation.  We could 1) 
> append a forwarding link comment on the Jira issue to its GitHub version, and 
> 2) make Jira read-only so the risk of a user adding a Jira comment on an old 
> issue that then goes into /dev/null, is gone.
>

Couldn't we solve this with 2 passes?

First pass: create GH issue corresponding to each JIRA issue:
LUCENE-1000 -> issue #564
Second pass: "correct" references in the texts: LUCENE-1000 -> #564

You could also handle special "links" in JIRA issues with the same
stuff, e.g. at the beginning of the GH issue, it could just have some
markdown like:
Links:
  * Relates to #476

Could also be a way to transfer over subtasks with some sanity, e.g.
add "Subtask of #200" to the text somewhere. Then these would be
"linked" in GH.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: exposing per-field storage usage

2022-06-14 Thread Robert Muir
On Tue, Jun 14, 2022 at 10:37 AM Michael Sokolov  wrote:
>
> Oh, yes that's a clever idea. It seems it would take quite a while
> (tens of minutes?) for a larger index though? Much faster than the
> force-merge solution for sure. I guess to get faster we would have to
> instrument each format. I mean they generally do know how much space
> each field is occupying, but perhaps it's too much API change to
> expose that.

Why tens of minutes? That simple first doc/last doc trick works for the
term vectors and docvalues too. For the postings, Terms.java has methods
getMin() and getMax(), so it is possible to seek to the first and last
term for the field.
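
e.g. something like this (a sketch, no error handling):

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.util.BytesRef;

public class TermBounds {
  // Peek at the first/last terms of a field without enumerating
  // the whole term dictionary.
  static void printTermBounds(LeafReader reader, String field) throws Exception {
    Terms terms = reader.terms(field);
    if (terms != null) {
      BytesRef min = terms.getMin(); // may be null for an empty field
      BytesRef max = terms.getMax();
      System.out.println(field + ": [" + min + " .. " + max + "]");
    }
  }
}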

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: exposing per-field storage usage

2022-06-13 Thread Robert Muir
On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
 wrote:
>
> Hi Michael,
>
> We developed a similar functionality in Elasticsearch. The DiskUsage API 
> estimates the storage of each field by iterating its structures (i.e., 
> inverted index, doc-values, stored fields, etc.) and tracking the number of 
> read-bytes. The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>

I like an approach such as this: enumerate the index, using something
like FilterDirectory to track the bytes. It doesn't require you to
force-merge all the data through addIndexes, and at the same time it
doesn't invade the codec apis.
The user can always force-merge the data themselves for situations
such as benchmarks/tracking space over time, otherwise the
fluctuations from merges could create too much noise.
Personally, I would suggest a separate api/tool from CheckIndex; perhaps
this tracking could mask bugs? No reason to mix the two concerns.
Also, the tool can be much more efficient than checkindex, e.g. for
stored fields and vectors it can just retrieve the first and last
documents, whereas checkindex should verify all of the documents
slowly.
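
Rough sketch of the byte-tracking wrapper idea (this is not the
Elasticsearch code, just the shape of it; class names are made up):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Wrap a Directory so every byte read through it is counted. Open a
// reader over this, iterate one field's postings/docvalues/etc., and
// the counter delta approximates that field's on-disk footprint.
final class CountingDirectory extends FilterDirectory {
  final AtomicLong bytesRead = new AtomicLong();

  CountingDirectory(Directory in) {
    super(in);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    return new CountingIndexInput(name, super.openInput(name, context), bytesRead);
  }
}

final class CountingIndexInput extends IndexInput {
  private final IndexInput in;
  private final AtomicLong counter;

  CountingIndexInput(String desc, IndexInput in, AtomicLong counter) {
    super(desc);
    this.in = in;
    this.counter = counter;
  }

  @Override
  public byte readByte() throws IOException {
    counter.incrementAndGet();
    return in.readByte();
  }

  @Override
  public void readBytes(byte[] b, int offset, int len) throws IOException {
    counter.addAndGet(len);
    in.readBytes(b, offset, len);
  }

  @Override
  public IndexInput slice(String desc, long offset, long length) throws IOException {
    return new CountingIndexInput(desc, in.slice(desc, offset, length), counter);
  }

  @Override
  public IndexInput clone() {
    return new CountingIndexInput(toString(), in.clone(), counter);
  }

  @Override public void close() throws IOException { in.close(); }
  @Override public long getFilePointer() { return in.getFilePointer(); }
  @Override public void seek(long pos) throws IOException { in.seek(pos); }
  @Override public long length() { return in.length(); }
}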

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Migration to GitHub issue from Jira (LUCENE-10557)

2022-06-07 Thread Robert Muir
+1

On Mon, May 30, 2022 at 11:40 AM Tomoko Uchida
 wrote:
>
> Hi everyone!
>
> As we had a previous discussion thread [1], I propose migration to GitHub issue 
> from Jira.
> It'd be technically possible (see [2] for details) and I think it'd be good 
> for the project - not only for welcoming new developers who are not familiar 
> with Jira, but also for improving the experiences of long-term 
> committers/contributors by consolidating the conversation platform.
>
> You can see a short summary of the discussion, some stats on current Jira 
> issues, and a draft migration plan in [2].
> Please review [2] if you haven't seen it and vote for this proposal.
>
> The vote will be open until 2022-06-06 16:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
>
> *IMPORTANT NOTE*
> I set a local protocol for this vote.
> There are 95 committers on this project [3] - the vote will be effective if 
> it successfully gains more than 15% of voters (>= 15) from committers 
> (including PMC members). This means that although only PMC member votes are 
> counted for the final result, the votes from all committers are important to 
> make the vote result effective.
>
> If there are less than 15 votes at 2022-06-06 16:00 UTC, I will expand the 
> term to 2022-06-13 16:00 UTC. If this fails to get sufficient voters after 
> the expanded time limit, I'll cancel this vote regardless of the result.
> But why do I set such an extra bar? My fear is that if such things are 
> decided by the opinions of a few members, the result may not yield a good 
> outcome for the future. It isn't my goal to just pass the vote [4].
>
> [1] https://lists.apache.org/thread/78wj0vll73sct065m5jjm4z8gqb5yffk
> [2] https://issues.apache.org/jira/browse/LUCENE-10557
> [3] https://projects.apache.org/committee.html?lucene
> [4] I'm sorry for being overly cautious, but I have never met in person or 
> virtually any of the committers (with a very few exceptions), therefore 
> cannot assess if the vote result is reliable or not unless there is certain 
> explicit feedback.
>
> Tomoko

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Adding a new PointDocValuesField

2022-05-26 Thread Robert Muir
On Thu, May 26, 2022 at 11:49 AM Greg Miller  wrote:
>
> I agree that technically it's just as good. I also think it's less
> clear for a user. The concept of "points" is something we've
> established in Lucene, so I think it makes sense for users to think
> about indexing points as a doc value as opposed to having to manage
> multiple fields for all their dimensions in this sort of unsorted
> field. But that's just my opinion as a user. Maybe that's a bit
> philosophical at this point, and I think we can "agree to disagree" for
> now because...

Users don't deal with low level docvalues codec APIs, so I see this
"as a user" as irrelevant, sorry. Higher-level classes (e.g. Field
class) could impl it this way as an implementation detail.

>
> ... just to be clear, I'm _not_ suggesting we add a new doc value type
> at this time. I'm not even necessarily advocating that we ever add it.
> I think it's perfectly reasonable to define a new Field class that
> builds on top of BDV (as Marc has done in his PR) that allows users to
> add "point" fields to their documents that get indexed as doc values
> (using BDV). This is very similar to LatLonDocValuesField,
> LongRangeDocValuesField, etc. Is that an acceptable approach to you,
> or are you advocating that we shouldn't do that and should instead
> create these new "unsorted" numeric fields now? I'm even fine if we
> put this in the sandbox module for now while we "kick the tires." In
> fact, I think I'd advocate for that.

+1 to build a field class in sandbox, using BDV behind the scenes. I
don't want to add any new DV types, trust me. I am just especially
opinionated against multidimensional stuff pushed down to docvalues
level, when it makes no sense from a DV perspective (column stride
fields). If you have 3 dimensions of numbers, at a low level it would
just make 3 columns at the end of the day anyway: IMO it would only
make codec code more complicated with no benefit. So that's why I was
listing out other alternatives.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Adding a new PointDocValuesField

2022-05-25 Thread Robert Muir
On Wed, May 25, 2022 at 2:08 PM Greg Miller  wrote:
>
>
> I guess with an “unsorted” numeric DV type we could get there with aligned 
> indices, as you describe, but that seems less appealing than supporting 
> multi-dim points directly.
>

Name one technical reason why?
Unsorted would be exactly as good, except also more general
purpose. The number of docvalues types should be kept to a strict
minimum, and should be generally useful to a variety of common
use-cases. Each type has a huge maintenance cost, and never goes away.
Every codec must implement every type.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Adding a new PointDocValuesField

2022-05-25 Thread Robert Muir
On Wed, May 25, 2022 at 12:17 AM Greg Miller  wrote:
>
>  A "two separate field approach" would
> consist of indexing year and make separately, and you'd lose the
> information that only certain combinations are valid. Am I overlooking
> something with your suggestion? Maybe there's something we can do with
> Lucene already that solves for this case and I'm just not aware of it?
> That's entirely possible and I'd love to learn more if there is!

This makes no sense to me. If there are two dimensions, there's no
difference between faceting code calling fieldA.value and fieldB.value,
and calling field.valueA and field.valueB.

In other words, doesn't make any sense to needlessly "pack dimensions
together" at docvalues level, especially for what should be a
column-stride field. There's really no difference from the app
perspective. Any issues you have here seem to be issues around facet
module and not docvalues...

>
> As for MultiRangeQuery and the mention of sandbox modules, I think
> that's a bit of a different use-case. MultiRangeQuery lets you filter
> by a disjunction of ranges. The "multi" part doesn't relate to
> "multiple values in a doc" (but it does support that, as do the
> "standard" range queries).
>
> Where I see a gap right now, beyond just faceting, is that we can
> represent N-dim points in the points index and filter on them (using
> the points index), but we have no doc values equivalent. This means,
> 1) we can't facet, and 2) we can't create a "slow" query that does
> post-filtering instead of using the points index (which could be a
> very real advantage in cases with a sparse match set but a dense
> points index). So I like the idea of creating that concept and being
> able to facet and filter on it. Whether-or-not this is a "formal" doc
> values type or sits on top of BDV, I have less of a strong opinion.

We shouldn't add new docvalues types because of "slow queries", I'm
really against that. The root problem is that points impl can't filter
well (like the inverted index can), and as a hack, docvalues "picks up
the slack". If its becoming a major issue, address this with points
directly?

>
> And finally... it really should be multi-valued. The points index
> supports multiple points-per-field within a single document. Seems
> like a big gap that we wouldn't support that with a doc value field.
> Because BDV is inherently single-valued, I propose we come up with an
> encoding scheme that encodes multiple points on top of that "single"
> BDV entry. This is where building on BDV started to feel a little icky
> to me and it seemed like it might be a good use-case for actually
> formalizing a format/encoding, but again, no strong preference. We
> could certainly do something more quickly on top of BDV and formalize
> an encoding later if/as necessary.

Doesn't matter that points index supports it. Do the use-cases make
sense? It's especially stupid that e.g. LatLonDocValuesField supports
multi-values. Really? What kind of quantum documents are in multiple
locations at the same time?

The sortedset/sortednumeric exist to support use-cases on String and
int, where the user wants to "sort on a multivalued field", which is
really crazy if you think about it. So they both sort the numbers at
index-time, so that you can pick a "representative" value
(min/max/median) in constant time. I think a lot of this existing
stuff is just brain-damage from the no-sql fads, alternatively we
could remove this multivalued nonsense and the crazy servers that want
to follow no-sql fads could index just the "representative value"
(min/max/median) in a single-valued field.

Sorry, I'm just not seeing a lot of strong use-cases here to justify
creating a new DV field, which we should really avoid, as it's a hugely
expensive cost. I would recommend prototyping stuff with
BinaryDocValues, using the sandbox, etc. See if the features get
popular and people use them.

If they really "catch on", and we think its more efficient, then we
can think about how the stuff could be best encoded/compressed/etc.
But adding a new type should be the last resort. Adding some
specialized multi-dimensional type is IMO out of the question. It
would be a lot less horrible to just use separate DV fields, one for
each dimension. If there is *strong* compelling use-cases for
multi-valued stuff, then in the worst case we could think about
something like a UnsortedNumericDV, which would allow fieldA[0] to
align with fieldB[0] and fieldA[1] to align with fieldB[1], which
would solve the issue for faceting. Just don't allow sorting. And
probably not any "slow" query stuff too.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Adding a new PointDocValuesField

2022-05-25 Thread Robert Muir
On Wed, May 25, 2022 at 8:04 AM Michael Sokolov  wrote:
>
> Also, there should be examples from other fields. Suppose you are
> indexing map data and want to support a UI that shows "hot spots" on
> the map where there is a lot of let's say ... activity of some sort.
> You'd like to facet on 2-d areas.

then use LatLonDocValuesField
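
e.g. (a sketch; the field name and coordinates are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.search.Query;

public class LatLonDvSketch {
  static Document docAt(double lat, double lon) {
    Document doc = new Document();
    doc.add(new LatLonDocValuesField("location", lat, lon));
    return doc;
  }

  // "hot spots" can then be counted cell-by-cell with slow box queries:
  static Query cell(double minLat, double maxLat, double minLon, double maxLon) {
    return LatLonDocValuesField.newSlowBoxQuery("location", minLat, maxLat, minLon, maxLon);
  }
}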

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Adding a new PointDocValuesField

2022-05-24 Thread Robert Muir
This seems like a really exotic feature to add a dedicated docvalues field for.

We should let BINARY be the catchall for stuff like this.
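
e.g. a sandbox field class can pack the dimensions itself (a sketch; the
encoding and names are made up, not a proposal for an index format):

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

public class PackedPointField {
  // Pack an n-dim double point into a single BINARY doc value.
  static BinaryDocValuesField pack(String name, double... point) {
    byte[] packed = new byte[point.length * Long.BYTES];
    for (int dim = 0; dim < point.length; dim++) {
      NumericUtils.longToSortableBytes(
          NumericUtils.doubleToSortableLong(point[dim]), packed, dim * Long.BYTES);
    }
    return new BinaryDocValuesField(name, new BytesRef(packed));
  }
}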

On Mon, May 23, 2022 at 10:17 PM Marc D'Mello  wrote:
>
> Hi,
>
> Some background: I've been working on this PR to add hyper rectangle faceting 
> capabilities to Lucene facets and I needed to create a new doc values field 
> to support this feature. Initially, I had a field that just extended 
> BinaryDocValues, but then a discussion came up about whether to add a 
> completely new DocValues field, maybe something like PointDocValuesField (and 
> SortedPointDocValuesField as the multivalued version) to add first class 
> support for this new field. Here is the link to the discussion. I think there 
> are a few benefits to this:
>
> Formalize how we would store points as doc values rather than just packing 
> points into a BinaryDocValues field in a format that could change at any time
> NumericDocValues enables us to create a SortedNumericDocValuesRange query 
> which can be used with IndexOrDocValuesQuery to make some range queries more 
> efficient. Adding this new doc values field would let us do the same thing 
> with higher dimensional ranges
>
> I'm sure I could be missing some benefits, and I also am not super 
> experienced with Lucene so there could be drawbacks I am missing as well :). 
> From what I understand though, Lucene doesn't have a lot of DocValues fields 
> and there should be some thought put into adding new ones, so I was wondering 
> if I could get some feedback about the idea. Thanks!

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene 9.2.0 RC1

2022-05-18 Thread Robert Muir
I opened an issue about this. It shouldn't block the release, but it is
pretty crazy and something to improve.

https://issues.apache.org/jira/browse/LUCENE-10579

On Wed, May 18, 2022 at 3:10 PM Robert Muir  wrote:
>
> It seems strange the way that
> confirmAllReleasesAreTestedForBackCompat() parses the output of the
> test's results, especially with test.verbose enabled.
> It uses regex to "look" for certain prints from the test in order to
> "see" that each release is tested.
> Maybe regex failed because something randomly got printed to stdout at
> that time (it's passing tests.verbose, could be anything)?
> Would it be better to just inspect the backward_index folder and look
> at the .zip filenames?
>
> On Wed, May 18, 2022 at 2:54 PM Robert Muir  wrote:
> >
> > The smoketester failed for me. Strange that it didn't fail for anyone
> > else. My command line:
> >
> > $ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk/
> > $ python3 -u dev-tools/scripts/smokeTestRelease.py --test-java17
> > /home/rmuir/Downloads/jdk
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-9.2.0-RC1-rev-978eef5459c7683038ddcca4ec56e4baa63715d0
> >
> > (after many hours)...
> >
> >   confirm all releases have coverage in TestBackwardsCompatibility
> > find all past Lucene releases...
> > run TestBackwardsCompatibility..
> > Releases that don't seem to be tested:
> >   8.6.1
> > Traceback (most recent call last):
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 1188, in 
> > main()
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 1122, in main
> > smokeTest(c.java, c.url, c.revision, c.version, c.tmp_dir,
> > c.is_signed, c.local_keys, ' '.join(c.test_args),
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 1176, in smokeTest
> > unpackAndVerify(java, tmpDir, 'lucene-%s-src.tgz' % version,
> > gitRevision, version, testArgs)
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 524, in unpackAndVerify
> > verifyUnpacked(java, artifact, unpackPath, gitRevision, version, 
> > testArgs)
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 629, in verifyUnpacked
> > confirmAllReleasesAreTestedForBackCompat(version, unpackPath)
> >   File "/home/rmuir/workspace/lucene/dev-tools/scripts/smokeTestRelease.py",
> > line 1108, in confirmAllReleasesAreTestedForBackCompat
> > raise RuntimeError('some releases are not tested by
> > TestBackwardsCompatibility?')
> > RuntimeError: some releases are not tested by TestBackwardsCompatibility?
> >
> > On Wed, May 18, 2022 at 10:54 AM Michael McCandless
> >  wrote:
> > >
> > > +1
> > >
> > > SUCCESS! [0:35:17.914586]
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Wed, May 18, 2022 at 8:59 AM Alan Woodward  
> > > wrote:
> > >>
> > >> Please vote for release candidate 1 for Lucene 9.2.0
> > >>
> > >> The artifacts can be downloaded from:
> > >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.2.0-RC1-rev-978eef5459c7683038ddcca4ec56e4baa63715d0
> > >>
> > >> You can run the smoke tester directly with this command:
> > >>
> > >> python3 -u dev-tools/scripts/smokeTestRelease.py \
> > >> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.2.0-RC1-rev-978eef5459c7683038ddcca4ec56e4baa63715d0
> > >>
> > >> Given that we have a weekend coming up, the vote will be open for at 
> > >> least 5 days i.e. until 2022-05-23 13:00 UTC.
> > >>
> > >> [ ] +1  approve
> > >> [ ] +0  no opinion
> > >> [ ] -1  disapprove (and reason why)
> > >>
> > >> Here is my +1
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: dev-h...@lucene.apache.org
> > >>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


