Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Michael Sokolov
Thanks for digging into this. I do think it will be helpful for
developers that blithely access the IndexInput from multiple threads
:)

On Sat, Oct 21, 2023 at 3:53 PM Chris Hostetter
 wrote:
>
>
> Uwe: In your PR, you should add these details to the javadocs of
> ByteBufferIndexInput.alreadyClosed(), so future code spelunkers understand
> the choice being made here is intentional :)
>
> : please don't add the NPE here as cause (except for debugging). The NPE is 
> only
> : catched to NOT add extra checks in the highly performance sensitive code.
> : Actually the NPE is catched to detect the case where the bytebuffer was
> : already unset to trigger the already closed. The code uses setting the 
> buffers
> : to NULL to signal cause, but it does NOT add a NULL check everywhere. This
> : allows Hotspot to compile this code without any bounds checks and signal the
> : AlreadyClosedException only when a NPE happens. Adding the NPE as cause 
> would
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Chris Hostetter


Uwe: In your PR, you should add these details to the javadocs of 
ByteBufferIndexInput.alreadyClosed(), so future code spelunkers understand 
the choice being made here is intentional :)

: please don't add the NPE here as cause (except for debugging). The NPE is only
: catched to NOT add extra checks in the highly performance sensitive code.
: Actually the NPE is catched to detect the case where the bytebuffer was
: already unset to trigger the already closed. The code uses setting the buffers
: to NULL to signal cause, but it does NOT add a NULL check everywhere. This
: allows Hotspot to compile this code without any bounds checks and signal the
: AlreadyClosedException only when a NPE happens. Adding the NPE as cause would



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Uwe Schindler

Hi Mike,

here is the PR improving the situation while not making the signalling 
NPE visible to code: https://github.com/apache/lucene/pull/12705


In MemorySegmentIndexInput theres also InvalidStateException which 
cannot be throws on wrong parameters. This one is also internal and 
always mapped to AlreadyClosedException (also without cause).


Uwe

Am 21.10.2023 um 10:23 schrieb Uwe Schindler:


Hi,

I have a good idea: We should only wrap as AlreadyClosedException if 
and only if the bytebuffers/memorysegemnts are null (see . In all 
other cases rethrow the NPE:


   AlreadyClosedException alreadyClosed(RuntimeException npe) {
     if (npe == null || this.buffers == null) { // buffers == null if input 
closed!
   return new AlreadyClosedException("Already closed: " + this);
 }
 throw npe;
   }

(this would need the same change in all MemorySegmentIndexInput in a 
similar way).


This would keep the NPE on wrong usage, but in the case of a closed 
ByteBufferIndexInput / MemorySegmentIndexInput it would throw plain 
AlreadyClosedEx.


I can provide a PR, but give me a week, I am very busy.

Uwe

Am 21.10.2023 um 10:01 schrieb Uwe Schindler:

Hi,

please don't add the NPE here as cause (except for debugging). The 
NPE is only catched to NOT add extra checks in the highly performance 
sensitive code. Actually the NPE is catched to detect the case where 
the bytebuffer was already unset to trigger the already closed. The 
code uses setting the buffers to NULL to signal cause, but it does 
NOT add a NULL check everywhere. This allows Hotspot to compile this 
code without any bounds checks and signal the AlreadyClosedException 
only when a NPE happens. Adding the NPE as cause would bring 
confusion to end user, as we only want to tell that IndexInput was 
closed, but the NPE should be kept behind scenes as it would be a 
support nightmare ("your code has no good null checks, it is 
broken"). The NPE is a signal here, not the cause.


I think the issue you have seen may be cause by passing a NULL 
parameter to one of the methods like a float array to readFloats(). 
This is not detected (P.S.: this also affects MemorySegmentIndexInput).


I can looks at the code to figure out how to better detect the NPE 
when parameters of methods are NULL, but no way to add the cause 
here. I would say: If you have to debug, do it temporarily or ose 
another dircetory impl.


Uwe

Am 20.10.2023 um 21:20 schrieb Chris Hostetter:

FWIW: The choice to ignore the original exception goes back to here...

https://issues.apache.org/jira/browse/LUCENE-3588

...circa 2011, where it was focused on catching NPE and throwing
AlreadyClosedException instead, w/o any particular discussion as to 
why to

throw away the original NPE.

If i had to guess it's simply because at that time 
AlreadyClosedException

didn't support wrapping any other Throwable.  That wasn't added until
LUCENE-5958 (circa 2014) which was focused on making sure "tragic" 
errors

kept a record of what caused the tragedy, and then include that as the
'cause' of the AlreadyClosedExceptions throw by 'ensureOpen()'

There didn't seem to be any discussion at that time about reviewing 
other

code that might be throwing AlreadyClosedException from a 'catch' block
that could also be updated to include the cause.

I'd say open a PR to review & update all code that results in
AlreadyClosedException originating from a catch block?



: Date: Tue, 17 Oct 2023 11:24:03 -0400
: From: Michael Sokolov 
: Reply-To: dev@lucene.apache.org
: To: Lucene Dev 
: Subject: ByteBufferIndexInput.alreadyClosed creates an exception 
that doesn't

: track its cause
:
: I was messing around with something that was resulting in
: AlreadyClosedException being thrown and I noticed that we weren't
: tracking the exception that caused it. I found this in
: ByteBufferIndexInput:
:
:    // the unused parameter is just to silence javac about unused 
variables

:    AlreadyClosedException alreadyClosed(RuntimeException unused) {
: -    return new AlreadyClosedException("Already closed: " + this);
: +    return new AlreadyClosedException("Already closed: " + this, 
unused);

:    }
:
: and added the cause there, which helped me find and fix my wicked
: ways. Is there a reason we decided not to wrap the "unused"
: RuntimeException there?
:
: -
: To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
: For additional commands, e-mail: dev-h...@lucene.apache.org
:
:

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Update TermInSetQuery Example?

2023-10-21 Thread Michael Wechner

Hi Uwe

Thanks for the hints re the other source code samples, will do this and 
will create a PR.


Thanks

Michael

Am 21.10.23 um 09:52 schrieb Uwe Schindler:

Hi Michael,

Go ahead. Maybe scan through the remaining source files with a 
grep/regex:


$ fgrep -R 'new BooleanQuery(' *
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: 
return new BooleanQuery(minimumNumberShouldMatch, clauses.toArray(new 
BooleanClause[0]));
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: * 
BooleanQuery bq = new BooleanQuery();
lucene/queries/src/java/org/apache/lucene/queries/spans/package-info.java: 
* Query query = new BooleanQuery();
lucene/spatial-extras/src/java/org/apache/lucene/spatial/bbox/BBoxStrategy.java: 
// BooleanQuery qNotDisjoint = new BooleanQuery();


The first one is a false positive (the builder calls the BQ ctor, but 
all others should possibly be fixed. There may be other combinations 
not detected because of source code formatting.


Uwe

Am 20.10.2023 um 23:46 schrieb Michael Wechner:

Hi

I recently found TermInSetQuery example at

https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/TermInSetQuery.html 



but if I understand correctly one should use now BooleanQuery.Builder 
instead BooleanQuery itself, right?


BooleanQuery.Builder bqb = new BooleanQuery.Builder();
bqb.add(new TermQuery(new Term("field", "foo")), 
BooleanClause.Occcur.SHOULD);
bqb.add(new TermQuery(new Term("field", "bar")), 
BooleanClause.Occcur.SHOULD);

Query q2 = new ConstantScoreQuery(bqb.build());

If so, I would be happy to do a minor pull request or feel free to 
update it directly.


Thanks

Michael

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Uwe Schindler

Hi,

I have a good idea: We should only wrap as AlreadyClosedException if and 
only if the bytebuffers/memorysegemnts are null (see . In all other 
cases rethrow the NPE:


  AlreadyClosedException alreadyClosed(RuntimeException npe) {

    if (npe == null || this.buffers == null) { // buffers == null if input 
closed!
  return new AlreadyClosedException("Already closed: " + this);
}
throw npe;

  }

(this would need the same change in all MemorySegmentIndexInput in a 
similar way).


This would keep the NPE on wrong usage, but in the case of a closed 
ByteBufferIndexInput / MemorySegmentIndexInput it would throw plain 
AlreadyClosedEx.


I can provide a PR, but give me a week, I am very busy.

Uwe

Am 21.10.2023 um 10:01 schrieb Uwe Schindler:

Hi,

please don't add the NPE here as cause (except for debugging). The NPE 
is only catched to NOT add extra checks in the highly performance 
sensitive code. Actually the NPE is catched to detect the case where 
the bytebuffer was already unset to trigger the already closed. The 
code uses setting the buffers to NULL to signal cause, but it does NOT 
add a NULL check everywhere. This allows Hotspot to compile this code 
without any bounds checks and signal the AlreadyClosedException only 
when a NPE happens. Adding the NPE as cause would bring confusion to 
end user, as we only want to tell that IndexInput was closed, but the 
NPE should be kept behind scenes as it would be a support nightmare 
("your code has no good null checks, it is broken"). The NPE is a 
signal here, not the cause.


I think the issue you have seen may be cause by passing a NULL 
parameter to one of the methods like a float array to readFloats(). 
This is not detected (P.S.: this also affects MemorySegmentIndexInput).


I can looks at the code to figure out how to better detect the NPE 
when parameters of methods are NULL, but no way to add the cause here. 
I would say: If you have to debug, do it temporarily or ose another 
dircetory impl.


Uwe

Am 20.10.2023 um 21:20 schrieb Chris Hostetter:

FWIW: The choice to ignore the original exception goes back to here...

https://issues.apache.org/jira/browse/LUCENE-3588

...circa 2011, where it was focused on catching NPE and throwing
AlreadyClosedException instead, w/o any particular discussion as to 
why to

throw away the original NPE.

If i had to guess it's simply because at that time 
AlreadyClosedException

didn't support wrapping any other Throwable.  That wasn't added until
LUCENE-5958 (circa 2014) which was focused on making sure "tragic" 
errors

kept a record of what caused the tragedy, and then include that as the
'cause' of the AlreadyClosedExceptions throw by 'ensureOpen()'

There didn't seem to be any discussion at that time about reviewing 
other

code that might be throwing AlreadyClosedException from a 'catch' block
that could also be updated to include the cause.

I'd say open a PR to review & update all code that results in
AlreadyClosedException originating from a catch block?



: Date: Tue, 17 Oct 2023 11:24:03 -0400
: From: Michael Sokolov 
: Reply-To: dev@lucene.apache.org
: To: Lucene Dev 
: Subject: ByteBufferIndexInput.alreadyClosed creates an exception 
that doesn't

: track its cause
:
: I was messing around with something that was resulting in
: AlreadyClosedException being thrown and I noticed that we weren't
: tracking the exception that caused it. I found this in
: ByteBufferIndexInput:
:
:    // the unused parameter is just to silence javac about unused 
variables

:    AlreadyClosedException alreadyClosed(RuntimeException unused) {
: -    return new AlreadyClosedException("Already closed: " + this);
: +    return new AlreadyClosedException("Already closed: " + this, 
unused);

:    }
:
: and added the cause there, which helped me find and fix my wicked
: ways. Is there a reason we decided not to wrap the "unused"
: RuntimeException there?
:
: -
: To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
: For additional commands, e-mail: dev-h...@lucene.apache.org
:
:

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Uwe Schindler

Hi Hoss,

Hi, see my other response, this is not the reason why it isn't wrapped:


If i had to guess it's simply because at that time AlreadyClosedException
didn't support wrapping any other Throwable.  That wasn't added until
LUCENE-5958 (circa 2014) which was focused on making sure "tragic" errors
kept a record of what caused the tragedy, and then include that as the
'cause' of the AlreadyClosedExceptions throw by 'ensureOpen()'


See my other mail for an explanation, in short: The NPE is only used as 
signal and not as error condition and MUST be hidden from end user 
(except for debbugging).


Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Could we allow an IndexInput to read from a still writing IndexOutput?

2023-10-21 Thread Uwe Schindler
Hi, the biggest problem is with some IndexInputs that work on FS Cache 
(mmapdir). The file size changes while you are writing therefore it 
could cause strange issues. Especially the mapping of mmap may not see 
the changes you have already written as there is no happens-before 
relationship.


Basically the IO model of Lucene is WORM. So something thats visible to 
readers must never change anymore.


So as said by the others, if you need stuff already written, keep it in 
memory (like nodes). We should really not change our IO model for this 
singleton. 1% slowdown while writing due to some caching of buffering 
does not matter and risk us corrupting indexes or run into errors while 
reading.


Uwe

Am 19.10.2023 um 15:47 schrieb Michael McCandless:

Hi Team,

Today, Lucene's Directory abstraction does not allow opening an 
IndexInput on a file until the file is fully written and closed via 
IndexOutput.  We enforce this in tests, and some of our core Directory 
implementations demand this (e.g. caching the file's length on opening 
an IndexInput).


Yet, most filesystems will easily allow simultaneous read/append of a 
single file.  We just don't expose this IO semantics to Lucene, but 
could we allow random-access reads with append-only writes on one 
file?  Is there a strong reason that we don't allow this?


Quick TL/DR context: we are trying to enable FST compilation to write 
off-heap (directly to disk), enabling creating arbitrarily large FSTs 
with bounded heap, matching how FSTs can now be read off-heap, and it 
would be much much more RAM efficient if we could read/append the same 
file at once.


Full gory details context: inspired by how Tantivy 
 (awesome and fast Rust 
search engine!) writes its FSTs 
, over in this issue 
 and PR 
, 
we (thank you Dzung Bui / @dungba88!) are trying to fix Lucene's FST 
building to immediately stream the FST to disk, instead of buffering 
the whole thing in RAM and then writing to disk.


This would allow building arbitrarily large FSTs without using up 
heap, and symmetrically matches how we can now read FSTs off-heap, 
plus FST building is already (mostly) append-only. This would also 
allow removing some of the crazy abstractions we have for writing FST 
bytes into RAM (FSTStore, BytesStore).  It would enable interesting 
things like a Codec whose term dictionary is stored entirely in an FST 
 (also inspired by Tantivy).


The wrinkle is that, while the FST is building, it sometimes looks 
back and reads previously written bytes, to share suffixes and create 
a minimal (or near minimal) FST.  So if IndexInput could read those 
bytes, even as the FST is still appending to IndexOutput, it would 
"just work".


Failing that, our plan B is to wastefully duplicate the byte[] slices 
from the already written bytes into our own private (heap resident, 
boo) copy, which would use quite a bit more RAM while building the 
FST, and make less minimal FSTs for a given RAM budget.  I haven't 
measured the added wasted RAM if we have to go this route but I fear 
it is sizable in practice, i.e. it strongly negates the whole idea of 
writing an FST off-heap since its effectively storing a possibly large 
portion of the FST in many duplicated byte[] fragments (in the NodeHash).


So ... could we somehow relax Lucene's Directory semantics to allow 
opening an IndexInput on a still appending IndexOutput, since most 
filesystems are fine with this?


Mike McCandless

http://blog.mikemccandless.com


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't track its cause

2023-10-21 Thread Uwe Schindler

Hi,

please don't add the NPE here as cause (except for debugging). The NPE 
is only catched to NOT add extra checks in the highly performance 
sensitive code. Actually the NPE is catched to detect the case where the 
bytebuffer was already unset to trigger the already closed. The code 
uses setting the buffers to NULL to signal cause, but it does NOT add a 
NULL check everywhere. This allows Hotspot to compile this code without 
any bounds checks and signal the AlreadyClosedException only when a NPE 
happens. Adding the NPE as cause would bring confusion to end user, as 
we only want to tell that IndexInput was closed, but the NPE should be 
kept behind scenes as it would be a support nightmare ("your code has no 
good null checks, it is broken"). The NPE is a signal here, not the cause.


I think the issue you have seen may be cause by passing a NULL parameter 
to one of the methods like a float array to readFloats(). This is not 
detected (P.S.: this also affects MemorySegmentIndexInput).


I can looks at the code to figure out how to better detect the NPE when 
parameters of methods are NULL, but no way to add the cause here. I 
would say: If you have to debug, do it temporarily or ose another 
dircetory impl.


Uwe

Am 20.10.2023 um 21:20 schrieb Chris Hostetter:

FWIW: The choice to ignore the original exception goes back to here...

https://issues.apache.org/jira/browse/LUCENE-3588

...circa 2011, where it was focused on catching NPE and throwing
AlreadyClosedException instead, w/o any particular discussion as to why to
throw away the original NPE.

If i had to guess it's simply because at that time AlreadyClosedException
didn't support wrapping any other Throwable.  That wasn't added until
LUCENE-5958 (circa 2014) which was focused on making sure "tragic" errors
kept a record of what caused the tragedy, and then include that as the
'cause' of the AlreadyClosedExceptions throw by 'ensureOpen()'

There didn't seem to be any discussion at that time about reviewing other
code that might be throwing AlreadyClosedException from a 'catch' block
that could also be updated to include the cause.

I'd say open a PR to review & update all code that results in
AlreadyClosedException originating from a catch block?



: Date: Tue, 17 Oct 2023 11:24:03 -0400
: From: Michael Sokolov 
: Reply-To: dev@lucene.apache.org
: To: Lucene Dev 
: Subject: ByteBufferIndexInput.alreadyClosed creates an exception that doesn't
: track its cause
:
: I was messing around with something that was resulting in
: AlreadyClosedException being thrown and I noticed that we weren't
: tracking the exception that caused it. I found this in
: ByteBufferIndexInput:
:
:// the unused parameter is just to silence javac about unused variables
:AlreadyClosedException alreadyClosed(RuntimeException unused) {
: -return new AlreadyClosedException("Already closed: " + this);
: +return new AlreadyClosedException("Already closed: " + this, unused);
:}
:
: and added the cause there, which helped me find and fix my wicked
: ways. Is there a reason we decided not to wrap the "unused"
: RuntimeException there?
:
: -
: To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
: For additional commands, e-mail: dev-h...@lucene.apache.org
:
:

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Update TermInSetQuery Example?

2023-10-21 Thread Uwe Schindler

Hi Michael,

Go ahead. Maybe scan through the remaining source files with a grep/regex:

$ fgrep -R 'new BooleanQuery(' *
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: return 
new BooleanQuery(minimumNumberShouldMatch, clauses.toArray(new 
BooleanClause[0]));
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: * 
BooleanQuery bq = new BooleanQuery();
lucene/queries/src/java/org/apache/lucene/queries/spans/package-info.java: 
* Query query = new BooleanQuery();
lucene/spatial-extras/src/java/org/apache/lucene/spatial/bbox/BBoxStrategy.java: 
// BooleanQuery qNotDisjoint = new BooleanQuery();


The first one is a false positive (the builder calls the BQ ctor, but 
all others should possibly be fixed. There may be other combinations not 
detected because of source code formatting.


Uwe

Am 20.10.2023 um 23:46 schrieb Michael Wechner:

Hi

I recently found TermInSetQuery example at

https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/TermInSetQuery.html 



but if I understand correctly one should use now BooleanQuery.Builder 
instead BooleanQuery itself, right?


BooleanQuery.Builder bqb = new BooleanQuery.Builder();
bqb.add(new TermQuery(new Term("field", "foo")), 
BooleanClause.Occcur.SHOULD);
bqb.add(new TermQuery(new Term("field", "bar")), 
BooleanClause.Occcur.SHOULD);

Query q2 = new ConstantScoreQuery(bqb.build());

If so, I would be happy to do a minor pull request or feel free to 
update it directly.


Thanks

Michael

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org