[jira] [Commented] (LUCENE-9037) ArrayIndexOutOfBoundsException due to repeated IOException during indexing

2019-11-09 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970986#comment-16970986
 ] 

Ilan Ginzburg commented on LUCENE-9037:
---

I've redone the fix and updated the pull request, which now includes the test as 
well: [https://github.com/apache/lucene-solr/pull/998].


Undid my changes to {{DefaultIndexingChain}} and instead made 
{{DocumentsWriter.updateDocument()}} and {{DocumentsWriter.updateDocuments()}} 
check whether a flush is needed even in the case of non-aborting exceptions. This 
makes the code a bit convoluted there, since we want to release the lock before 
calling {{postUpdate()}}.
Also removed some checks/asserts on {{getNumDocsInRAM() > 0}}, as we might now 
need to free a {{DocumentsWriterPerThread}} that holds no docs.

I'm not totally clear on flush vs. reset when there are no docs, but everything 
seems to work ok.
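
For illustration, here is a minimal sketch of the control flow described above. This is not the actual Lucene code; the class and helper names below ({{UpdateDocumentFlowSketch}}, the {{doAfterDocument()}}/{{postUpdate()}} stand-ins) are simplified assumptions. The point is only that the flush check runs even when the per-thread update throws a non-aborting exception, and that the per-thread lock is released before {{postUpdate()}} runs:

{code:java}
import java.io.IOException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-ins for the Lucene internals discussed above.
final class UpdateDocumentFlowSketch {
  interface PerThreadUpdate { void run() throws IOException; }
  static final class FlushTicket {}

  private final ReentrantLock perThreadLock = new ReentrantLock();

  void updateDocument(PerThreadUpdate update) throws IOException {
    FlushTicket ticket;
    perThreadLock.lock();
    try {
      try {
        update.run();                 // may throw a non-aborting IOException
      } finally {
        // Run the flush check even when the update failed, so a per-thread
        // buffer that only ever sees failing docs still gets flushed or freed.
        ticket = doAfterDocument();
      }
    } finally {
      perThreadLock.unlock();         // release the lock before postUpdate()
    }
    postUpdate(ticket);
  }

  // Stand-ins for DocumentsWriterFlushControl.doAfterDocument() and
  // DocumentsWriter.postUpdate(); the real methods do much more.
  private FlushTicket doAfterDocument() { return new FlushTicket(); }
  private void postUpdate(FlushTicket ticket) { /* process pending flushes */ }
}
{code}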

> ArrayIndexOutOfBoundsException due to repeated IOException during indexing
> --
>
> Key: LUCENE-9037
> URL: https://issues.apache.org/jira/browse/LUCENE-9037
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.1
>Reporter: Ilan Ginzburg
>Priority: Minor
> Attachments: TestIndexWriterTermsHashOverflow.java
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a limit to the number of tokens that Lucene can hold in memory while 
> docs are indexed through DocumentsWriter; once that limit is exceeded, bad 
> things happen. The limit can be reached by submitting a really large document, 
> by submitting a large number of documents without doing a commit (see 
> LUCENE-8118), or by repeatedly submitting documents that fail to get indexed 
> in some specific ways, leading to Lucene not cleaning up the in-memory data 
> structures that eventually overflow.
> The overflow is due to a 32 bit (signed) integer wrapping around to negative 
> territory, then causing an ArrayIndexOutOfBoundsException. 
> The failure path that we are reliably hitting is due to an IOException during 
> doc tokenization. A tokenizer implementing TokenStream throws an exception 
> from incrementToken(), which causes indexing of that doc to fail. 
> The IOException bubbles back up to DocumentsWriter.updateDocument() (or 
> DocumentsWriter.updateDocuments() in some other cases) where it is not 
> treated as an AbortingException and therefore does not cause a reset of the 
> DocumentsWriterPerThread. On repeated failures (without any successful 
> indexing in between), if the upper layer (a client via Solr) resubmits the doc 
> and it fails again, DocumentsWriterPerThread will eventually cause the 
> TermsHashPerField data structures to grow and overflow, leading to an 
> exception stack similar to the one in LUCENE-8118 (the stack trace below was 
> copied from a test run reproducing the issue on 7.1):
> java.lang.ArrayIndexOutOfBoundsException: -65536
>   at __randomizedtesting.SeedInfo.seed([394FAB2B91B1D90A:C86FB3F3CE001AA8]:0)
>   at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>   at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
>   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:792)
>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
>   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:481)
>   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1717)
>   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1462)
> Using tokens composed only of lowercase letters, it takes less than 
> 130,000,000 different tokens (the shortest ones) to overflow 
> TermsHashPerField.
> With a single document (composed of the 20,000 shortest lowercase tokens) 
> submitted repeatedly for indexing, it takes 6352 submissions, all failing with 
> an IOException on incrementToken(), to trigger the 
> ArrayIndexOutOfBoundsException.
> A proposed fix is to treat an IOException in DocumentsWriter.updateDocument() 
> and DocumentsWriter.updateDocuments() the same way we treat an 
> AbortingException.
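
As a rough illustration of the arithmetic (not Lucene code, and the exact byte counts are an assumption): 6352 submissions of a 20,000-token document buffer roughly 127 million token occurrences, in line with the ~130,000,000 figure above, and the bytes they write into the term pools eventually push a signed 32-bit offset past Integer.MAX_VALUE, after which it is used as a negative array index:

{code:java}
// Hedged illustration of the 32-bit wrap-around, not Lucene code.
public class IntWrapDemo {
  public static void main(String[] args) {
    int offset = Integer.MAX_VALUE - 10_000; // byte-pool offset after ~2 GB of buffered term data
    offset += 65_536;                        // one more block of writes wraps it negative
    System.out.println(offset);              // prints a large negative value
    byte[] pool = new byte[1 << 15];
    byte b = pool[offset];                   // ArrayIndexOutOfBoundsException with a negative index
  }
}
{code}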





[jira] [Commented] (LUCENE-9037) ArrayIndexOutOfBoundsException due to repeated IOException during indexing

2019-11-08 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970622#comment-16970622
 ] 

Ilan Ginzburg commented on LUCENE-9037:
---

Thanks [~mikemccand].

What about moving the call to {{DocumentsWriterFlushControl.doAfterDocument()}} 
up into the {{finally}} of the block calling 
{{DocumentsWriterPerThread.updateDocument/s()}} in 
{{DocumentsWriter.updateDocument/s()}}?
Basically, consider {{DocumentsWriterFlushControl.doAfterDocument()}} as a "do 
after _successful or failed_ document".

Exploring that path to see if I can make it work (and keep existing tests passing).



[jira] [Commented] (LUCENE-9037) ArrayIndexOutOfBoundsException due to repeated IOException during indexing

2019-11-07 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969319#comment-16969319
 ] 

Michael McCandless commented on LUCENE-9037:


What a fun test case :) This is indeed a bug in {{IndexWriter}} ... we 
already added "best effort" checks to detect when a single in-memory segment 
({{DocumentsWriterPerThread}}) was close to its limit, through 
{{IndexWriterConfig.setRAMPerThreadHardLimitMB}}, but obviously they don't 
detect this case properly.

I don't think we should make all {{IOException}} aborting – that's overkill and 
would cause "normal" cases of {{IOException}} to abort your {{IndexWriter}} 
unexpectedly.  On {{IOException}} I think IW should simply delete that one 
document because something went wrong while iterating its tokens.

I think, instead, we should fix {{DocumentsWriterPerThread}} to better detect 
when it has hit the {{setRAMPerThreadHardLimitMB}} and throw a meaningful 
exception, deleting the unlucky document that ran into that limit.  We should 
improve the best effort check we have today.
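
For reference, the existing knob mentioned above is configured on {{IndexWriterConfig}}; a minimal usage sketch (the directory path, analyzer, and 512 MB value are arbitrary choices for the example, not recommendations):

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class HardLimitExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Per-thread (DocumentsWriterPerThread) hard limit on buffered RAM; the
    // "best effort" check discussed above is supposed to trip before the
    // in-memory term data structures can overflow.
    iwc.setRAMPerThreadHardLimitMB(512);
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/test-index"));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      writer.addDocument(new Document());
    }
  }
}
{code}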


[jira] [Commented] (LUCENE-9037) ArrayIndexOutOfBoundsException due to repeated IOException during indexing

2019-11-06 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968784#comment-16968784
 ] 

Ilan Ginzburg commented on LUCENE-9037:
---

[^TestIndexWriterTermsHashOverflow.java]

I'm trying to port my fix and test from 7.1 to master, but the code is 
different.

The test class (attached) works great. I'm seeing the same issue as in 7.1 (i.e. 
able to reproduce the ArrayIndexOutOfBoundsException in about 2:30 minutes).

But in Solr 8 as opposed to 7.1, there is no exception catching in 
DocumentsWriter (AbortingException has vanished altogether). A call to 
DocumentsWriterPerThread.onAbortingException() is used to notify of an aborting 
exception (and DocumentsWriterPerThread.hasHitAbortingException() to later 
check if one was hit).

The effect of an exception causing an abort (a difference from 7.1) is that the 
IndexWriter is now closed (and with it the DocumentsWriterPerThread). So the 
tests in the attached class fail on stock Solr 8 with 
ArrayIndexOutOfBoundsException, and with the patch that makes IOException be 
considered aborting ([https://github.com/apache/lucene-solr/pull/998]), they 
fail too, with "AlreadyClosedException: this IndexWriter is closed" (except 
testSingleLargeDocFails, which verifies a failure that happens with or without 
the patch).

Changing the test to reopen an IndexWriter after each failure is not an option 
(it creates a new DocumentsWriterPerThread). So basically I have this test 
showing a failure in Solr 8, but I have no way of showing what my proposed fix 
fixes.
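
For readers without the attachment, a rough sketch of the kind of failing analyzer such a test can rely on (an assumption about its shape, not the attached TestIndexWriterTermsHashOverflow.java itself): a filter lets most of a document's tokens through, filling the per-thread term buffers, then throws an IOException from incrementToken() so the doc fails without aborting the writer.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Hypothetical sketch, not the attached test: fails tokenization near the end of
// each document so buffered term data accumulates across failed submissions.
final class FailNearEndAnalyzer extends Analyzer {
  private final int failAfter;

  FailNearEndAnalyzer(int failAfter) {
    this.failAfter = failAfter;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    WhitespaceTokenizer source = new WhitespaceTokenizer();
    return new TokenStreamComponents(source, new FailNearEndFilter(source, failAfter));
  }

  private static final class FailNearEndFilter extends TokenFilter {
    private final int failAfter;
    private int seen;

    FailNearEndFilter(TokenStream in, int failAfter) {
      super(in);
      this.failAfter = failAfter;
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (++seen > failAfter) {
        throw new IOException("simulated tokenization failure"); // doc fails, buffered bytes remain
      }
      return input.incrementToken();
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      seen = 0; // token streams are reused across documents
    }
  }
}
{code}

Repeatedly calling {{IndexWriter.addDocument()}} with a ~20,000-token document through such an analyzer is roughly how the scenario in the description can be driven.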

 
