Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
This is definitely a confusing error condition. If we can add more
information without creating an undue burden for the indexer, it would
be nice, but I think this will be very challenging here, since the
exception is thrown at a low level in the code where there might not
be a lot of useful info (i.e. the field name) to provide. And I expect
there are other places that make a similar assumption that we would
have to track down?
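
To illustrate the difficulty described above: the low-level call only sees a number, so any context such as a field name has to be attached by a caller further up the stack. The following is a minimal sketch using only the JDK, not Lucene code; `writeValue`, `writeForField`, and the field name `feature_type` are hypothetical names chosen for illustration.

```java
final class FieldContextDemo {

    /** Stand-in for a low-level writer: it narrows a long to an int and knows nothing about fields. */
    static int writeValue(long value) {
        return Math.toIntExact(value); // throws ArithmeticException("integer overflow") if value does not fit
    }

    /** Hypothetical higher-level caller that knows the field name and adds it to the failure. */
    static int writeForField(String fieldName, long value) {
        try {
            return writeValue(value);
        } catch (ArithmeticException e) {
            // Keep the original exception as the cause so the low-level stack trace is not lost.
            throw new IllegalStateException(
                "integer overflow while writing field \"" + fieldName + "\" (value=" + value + ")", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeForField("feature_type", 5L));
        try {
            writeForField("feature_type", 3_000_000_000L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The catch sits one level above the narrowing call, which is the design question raised here: every code path that reaches the low-level writer would need a similar wrapper at a point where the field is still known.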


Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Jerven Tjalling Bolleman

Dear Michael,

Looking deeper into this, I think we overflowed a term frequency field.
Looking at some statistics, in a previous release we had 1,288,526,281
instances of a certain field; this would be larger now. Each of these
would have had a limited set of values. But crucially, nearly all of
them would have had the term "positional" or "non-positional" added to
the document.
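
As a rough check on the figure above, the following sketch computes how much room is left before a count of that size would no longer fit in a signed 32-bit int. It assumes, purely for illustration, that the count in question is narrowed to an int somewhere; `HeadroomDemo` and `headroom` are hypothetical names.

```java
final class HeadroomDemo {

    /** Remaining growth before a count of this size exceeds Integer.MAX_VALUE (2,147,483,647). */
    static long headroom(long currentCount) {
        return (long) Integer.MAX_VALUE - currentCount;
    }

    public static void main(String[] args) {
        long previousRelease = 1_288_526_281L; // figure quoted in the message above
        System.out.println(headroom(previousRelease)); // prints 858957366
    }
}
```

In other words, the previous release's count by itself still fits in an int, so an overflow would require either substantial growth or a value derived from the count that is larger than the count itself.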


There is no good reason to do this today; we should just turn this into
a boolean field and update the UI. I will do this and report back.


Do you think a patch adding a try/catch to produce a more informative
log message would be appreciated by the community, e.g. one mentioning
the field name in the exception?


Regards,
Jerven

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Jerven Tjalling Bolleman

Dear Michael,

Thank you for your help.

We don't use custom term frequencies (I just double-checked with a code
search).
We also always merge down to one segment (partly for historical reasons,
but also because we index once, then there are no changes for a week to a
month, and then we reindex every document from scratch).


Your response is very helpful already and I very much appreciate it as
it cuts down the search space significantly.

Regards,
Jerven






Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
It seems as if the term frequency for some term exceeded the maximum.
This can happen if you supplied custom term frequencies, e.g. with
TermFrequencyAttribute:
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
The behavior hasn't changed since 8.x, but it's possible that the
merging brought together some very "high frequency" terms that were
previously not in the same segment?
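
For reference, the `Math.toIntExact` call at the top of the reported stack trace throws exactly this exception whenever a long value does not fit in an int. A minimal sketch of that behavior, assuming two hypothetical per-segment totals that each fit in an int but whose sum does not (the values are made up, not taken from the index; `TermFreqOverflowDemo` and `narrow` are illustrative names):

```java
final class TermFreqOverflowDemo {

    /** Narrows a long to an int with Math.toIntExact and reports the outcome as a string. */
    static String narrow(long total) {
        try {
            return "ok: " + Math.toIntExact(total);
        } catch (ArithmeticException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        long segmentA = 1_500_000_000L; // hypothetical total in one segment
        long segmentB = 1_500_000_000L; // hypothetical total in another segment

        System.out.println(narrow(segmentA));            // each total fits on its own
        System.out.println(narrow(segmentA + segmentB)); // the merged total does not
    }
}
```

This is only meant to show why a merge can surface a failure that never appeared while the values lived in separate segments; it does not reproduce Lucene's postings encoding.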





ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Jerven Tjalling Bolleman

Dear Lucene community,

This morning I found this exception in our logs. This was the first time
we indexed this data with Lucene 9.10; before, we were still on the
Lucene 8.x branch. Between the last indexing with 8.x and this one with
9.10 we gained a bit more data, so it could be something else that went
over a limit.


Unfortunately, from this log message I am at a loss as to what is going
on and what I could do to prevent it from happening. Does anyone have
any ideas?


Regards,
Jerven Bolleman


Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArithmeticException: integer overflow
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.toIntExact(Math.java:1135)
    at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
    at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
    at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)

