[jira] [Comment Edited] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-12-11 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248133#comment-17248133
 ] 

Michael Froh edited comment on LUCENE-9621 at 12/11/20, 7:21 PM:
-

Regarding the assertion failure, it looks like the call to 
{{adjustPendingNumDocs}} in {{rollbackInternalNoCommit}} is being called with 0 
(as {{totalMaxDoc}} and {{rollbackMaxDoc}} are both 0).

It feels to me like when we roll back on tragedy, the {{IndexWriter}} is known 
to be in a bad state, so it's not really surprising that {{pendingNumDocs}} and 
{{segmentInfos.totalMaxDoc()}} are out of sync. Maybe the fix is to skip that 
assertion when called from {{maybeCloseOnTragicEvent}}, so that it doesn't mask 
the real tragedy?
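As a concrete illustration of that suggestion, here is a minimal standalone model (plain Java, not Lucene's actual code; the class name and the {{fromTragedy}} flag are assumptions) in which the doc-count invariant is only checked on a normal rollback:

```java
// Toy model of the proposed fix: skip the pendingNumDocs/totalMaxDoc
// consistency check when rollback was triggered by a tragic event,
// since the writer is already known to be in a bad state.
public class RollbackSketch {
    long pendingNumDocs;
    long totalMaxDoc;

    public RollbackSketch(long pendingNumDocs, long totalMaxDoc) {
        this.pendingNumDocs = pendingNumDocs;
        this.totalMaxDoc = totalMaxDoc;
    }

    /** Returns true if rollback finished without tripping the invariant check. */
    public boolean rollbackInternal(boolean fromTragedy) {
        if (!fromTragedy && pendingNumDocs != totalMaxDoc) {
            // In IndexWriter this would be the AssertionError seen above.
            return false;
        }
        pendingNumDocs = 0;
        totalMaxDoc = 0;
        return true;
    }
}
```

With this model, a mismatched writer rolled back via the tragedy path does not fail the check, so the original OOME would surface instead of the secondary assertion.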


was (Author: msfroh):
Regarding the assertion failure, it looks like the call to 
{{adjustPendingNumDocs}} in {{rollbackInternalNoCommit}} is being called with 0 
(as {{totalMaxDoc}} and {{rollbackMaxDoc}} are both 0).

It feels to me like when we roll back on tragedy, the {{IndexWriter}} is known 
to be in a bad state, so it's not really surprising that {{pendingNumDocs}} and 
{{segmentInfos.totalMaxDoc()}} are out of sync. Maybe the fix is to skip that 
assertion when called from {{maybeCloseOnTragicEvent}}, so that it doesn't mask 
the real tragedy?

> pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()
> --
>
> Key: LUCENE-9621
> URL: https://issues.apache.org/jira/browse/LUCENE-9621
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 8.6.3
>Reporter: Michael Froh
>Priority: Major
>
> While implementing a test to trigger an OutOfMemoryError on flush() in 
> https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
> followed by an assertion failure on rollback with the following stacktrace:
> {code:java}
> java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
>   at 
> __randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
>   at 
> org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
>   at 
> org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
>   at 
> org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
>   at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
>   at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
>   at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
>   at 
> org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)
> {code}
> We should probably look into how exactly we behave with this kind of tragedy 
> on flush().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-12-11 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248133#comment-17248133
 ] 

Michael Froh commented on LUCENE-9621:
--

Regarding the assertion failure, it looks like the call to 
{{adjustPendingNumDocs}} in {{rollbackInternalNoCommit}} is being called with 0 
(as {{totalMaxDoc}} and {{rollbackMaxDoc}} are both 0).

It feels to me like when we roll back on tragedy, the {{IndexWriter}} is known 
to be in a bad state, so it's not really surprising that {{pendingNumDocs}} and 
{{segmentInfos.totalMaxDoc()}} are out of sync. Maybe the fix is to skip that 
assertion when called from {{maybeCloseOnTragicEvent}}, so that it doesn't mask 
the real tragedy?







[jira] [Comment Edited] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-12-11 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248119#comment-17248119
 ] 

Michael Froh edited comment on LUCENE-9621 at 12/11/20, 6:55 PM:
-

I added a {{printStackTrace}} to {{onTragicEvent}} and got the following:
{code:java}
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:125)
at 
org.apache.lucene.index.FieldInfos$Builder.finish(FieldInfos.java:645)
at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:291)
at 
org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480)
at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:660)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3899)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:499)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
{code}
This is the leak that I called out and fixed in 
https://issues.apache.org/jira/browse/LUCENE-9617. If we add documents and call 
{{deleteAll}} on the same {{IndexWriter}} repeatedly, it leaks field numbers 
and tries allocating a huge array in {{FieldInfos}} to accommodate the largest 
known field number.



[jira] [Commented] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-12-11 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248119#comment-17248119
 ] 

Michael Froh commented on LUCENE-9621:
--

I added a {{printStackTrace}} to {{onTragicEvent}} and got the following:
{code:java}
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:125)
at 
org.apache.lucene.index.FieldInfos$Builder.finish(FieldInfos.java:645)
at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:291)
at 
org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480)
at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:660)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3899)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:499)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
{code}
This is the leak that I called out and fixed in 
https://issues.apache.org/jira/browse/LUCENE-9617. If we call {{deleteAll}} on 
the same {{IndexWriter}} repeatedly, it leaks field numbers and tries 
allocating a huge array in {{FieldInfos}} to accommodate the largest known 
field number.


[jira] [Updated] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-11-27 Thread Michael Froh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Froh updated LUCENE-9621:
-
Description: 
While implementing a test to trigger an OutOfMemoryError on flush() in 
https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
followed by an assertion failure on rollback with the following stacktrace:


{{java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
at 
__randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
at 
org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
at 
org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
at 
org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)}}


We should probably look into how exactly we behave with this kind of tragedy on 
flush().

  was:
While implementing a test to trigger an OutOfMemoryError on flush() in 
https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
followed by an assertion failure on rollback with the following stacktrace:

{{
java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
at 
__randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
at 
org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
at 
org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
at 
org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)
}}

We should probably look into how exactly we behave with this kind of tragedy on 
flush().








[jira] [Updated] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-11-27 Thread Michael Froh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Froh updated LUCENE-9621:
-
Description: 
While implementing a test to trigger an OutOfMemoryError on flush() in 
https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
followed by an assertion failure on rollback with the following stacktrace:



{code:java}
java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
at 
__randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
at 
org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
at 
org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
at 
org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)
{code}



We should probably look into how exactly we behave with this kind of tragedy on 
flush().

  was:
While implementing a test to trigger an OutOfMemoryError on flush() in 
https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
followed by an assertion failure on rollback with the following stacktrace:


{{java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
at 
__randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
at 
org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
at 
org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
at 
org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)}}


We should probably look into how exactly we behave with this kind of tragedy on 
flush().








[jira] [Created] (LUCENE-9621) pendingNumDocs doesn't match totalMaxDoc if tragedy on flush()

2020-11-27 Thread Michael Froh (Jira)
Michael Froh created LUCENE-9621:


 Summary: pendingNumDocs doesn't match totalMaxDoc if tragedy on 
flush()
 Key: LUCENE-9621
 URL: https://issues.apache.org/jira/browse/LUCENE-9621
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 8.6.3
Reporter: Michael Froh


While implementing a test to trigger an OutOfMemoryError on flush() in 
https://github.com/apache/lucene-solr/pull/2088, I noticed that the OOME was 
followed by an assertion failure on rollback with the following stacktrace:

{{
java.lang.AssertionError: pendingNumDocs 1 != 0 totalMaxDoc
at 
__randomizedtesting.SeedInfo.seed([ABBF17C4E0FCDEE5:DDC8E99910AFC8FF]:0)
at 
org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2398)
at 
org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:5196)
at 
org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:5186)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3932)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3874)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3853)
at 
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllRepeated(TestIndexWriterDelete.java:496)
}}

We should probably look into how exactly we behave with this kind of tragedy on 
flush().






[jira] [Created] (LUCENE-9617) FieldNumbers.clear() should reset lowestUnassignedFieldNumber

2020-11-18 Thread Michael Froh (Jira)
Michael Froh created LUCENE-9617:


 Summary: FieldNumbers.clear() should reset 
lowestUnassignedFieldNumber
 Key: LUCENE-9617
 URL: https://issues.apache.org/jira/browse/LUCENE-9617
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 8.7
Reporter: Michael Froh


A call to IndexWriter.deleteAll() should completely reset the state of the 
index. Part of that is a call to globalFieldNumbersMap.clear(), which purges 
all knowledge of fields by clearing the name -> number and number -> name maps. 
However, it does not reset lowestUnassignedFieldNumber.

If we have a loop that adds some documents, calls deleteAll(), adds more 
documents, and so on, lowestUnassignedFieldNumber keeps counting up. Since 
FieldInfos allocates an array for the number -> FieldInfo mapping, this array 
gets larger and larger, effectively leaking memory.

We can fix this by resetting lowestUnassignedFieldNumber to -1 in 
FieldNumbers.clear().

I'll write a unit test and attach a patch.
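The leak and the fix can be modeled in a few lines (a self-contained sketch: the class is a drastic simplification of Lucene's FieldNumbers, and the resetOnClear switch exists only to contrast the broken and fixed behaviors):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the LUCENE-9617 leak: clear() wipes the name->number
// map but (before the fix) does not reset the counter, so the array
// sized by the largest field number keeps growing across deleteAll()
// cycles. Names are illustrative, not Lucene's actual fields.
public class FieldNumbersLeak {
    private final Map<String, Integer> nameToNumber = new HashMap<>();
    private int lowestUnassignedFieldNumber = -1;
    private final boolean resetOnClear;

    public FieldNumbersLeak(boolean resetOnClear) {
        this.resetOnClear = resetOnClear;
    }

    /** Returns the existing number for this field, or assigns the next one. */
    public int addOrGet(String name) {
        return nameToNumber.computeIfAbsent(name, n -> ++lowestUnassignedFieldNumber);
    }

    public void clear() {
        nameToNumber.clear();
        if (resetOnClear) {
            lowestUnassignedFieldNumber = -1; // the LUCENE-9617 fix
        }
    }

    /** Size of the number -> FieldInfo array that FieldInfos would allocate. */
    public int arraySize() {
        return lowestUnassignedFieldNumber + 2;
    }
}
```

Without the reset, the counter (and hence the array that FieldInfos sizes from it) grows by the number of distinct fields on every add/deleteAll cycle; with the reset, it stays bounded by the fields seen in the current cycle.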






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184699#comment-17184699
 ] 

Michael Froh commented on LUCENE-8962:
--

8.6 added the ability to merge small segments on commit.

The more recent changes add the ability to merge on getReader calls (which is 
what the original issue was asking for -- merging on commit was a slightly 
easier step on the way there).

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053640#comment-17053640
 ] 

Michael Froh commented on LUCENE-8962:
--

bq. With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself; since we are talking about very small segments, the 
overhead is very small. This would in turn mean that we are doing the work 
twice, i.e. the IW would do its normal work and might merge later etc.

Just to provide a bit more context, for the case where my team uses this 
change, we're replicating the index (think Solr master/slave) from "writers" to 
many "searchers", so we're avoiding doing the work many times.

An earlier (less invasive) approach I tried to address the small flushed 
segments problem was roughly: call commit on writer, hard link the commit files 
to another filesystem directory to "clone" the index, open an IW on that 
directory, merge small segments on the clone, let searchers replicate from the 
clone. That approach does mean that the merging work happens twice (since the 
"real" index doesn't benefit from the merge on the clone), but it doesn't 
involve any changes in Lucene.

Maybe that less-invasive approach is a better way to address this. It's 
certainly more consistent with [~simonw]'s suggestion above.
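The "clone the index via hard links" step of that approach can be sketched as follows (illustrative Java using only java.nio; the directory layout is an assumption, and in practice you would link only the files of a specific IndexCommit rather than the whole directory):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "hard-link clone" approach described above: link every
// file of a committed index into a side directory so a second writer can
// merge the clone without copying data or touching the original index.
public class IndexCloner {
    /** Hard-links every regular file in src into dst; returns the count. */
    public static int cloneByHardLink(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        int linked = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // A hard link shares the underlying blocks, so the clone
                    // is cheap regardless of segment size (same filesystem).
                    Files.createLink(dst.resolve(file.getFileName()), file);
                    linked++;
                }
            }
        }
        return linked;
    }
}
```

Because Lucene index files are write-once, the clone stays valid even as the original index continues to receive flushes and merges.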

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051571#comment-17051571
 ] 

Michael Froh commented on LUCENE-8962:
--

I updated https://github.com/apache/lucene-solr/pull/1313 with that proposed 
fix (adding a {{boolean}} field to {{OneMerge}} that gets set once a merge is 
successfully committed).

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051558#comment-17051558
 ] 

Michael Froh edited comment on LUCENE-8962 at 3/4/20, 7:32 PM:
---

It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
properly handle the case where, say, an IOException occurred when committing 
the merge (after {{SegmentInfo.setFiles()}} was called, but before the files 
were actually written to disk).

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.
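
To make the idea concrete, here is a minimal, self-contained sketch of that flag. These are hypothetical stand-ins for Lucene's {{OneMerge}} and the {{mergeFinished}} override, not the real API; the point is only that the callback can now distinguish a committed merge from a failed or aborted one before touching its files.

```java
import java.util.List;

// Hypothetical, simplified stand-in for Lucene's OneMerge; the real class
// carries far more state. The "committed" flag is the proposed addition.
class OneMerge {
    private volatile boolean committed = false;
    private List<String> mergedFiles; // stays null if the merge fails early

    void setFiles(List<String> files) { this.mergedFiles = files; }

    // Proposed: set just before closeMergeReaders() in commitMerge().
    void markCommitted() { this.committed = true; }

    boolean isCommitted() { return committed; }

    List<String> files() {
        if (mergedFiles == null) {
            // This is the IllegalStateException the test currently trips on.
            throw new IllegalStateException("files were never set");
        }
        return mergedFiles;
    }
}

public class MergeFinishedDemo {
    // Models a mergeFinished override that increments reference counts:
    // with the flag, it only touches the files of a committed merge.
    static boolean refCountsIncremented(OneMerge merge) {
        if (merge.isCommitted()) {
            merge.files(); // safe: setFiles() ran before markCommitted()
            return true;
        }
        return false; // failed or aborted merge: skip the files entirely
    }

    public static void main(String[] args) {
        OneMerge failed = new OneMerge(); // died before setting its files
        System.out.println("failed merge: " + refCountsIncremented(failed));

        OneMerge committed = new OneMerge();
        committed.setFiles(List.of("_3.cfs", "_3.cfe", "_3.si"));
        committed.markCommitted();
        System.out.println("committed merge: " + refCountsIncremented(committed));
    }
}
```

Unlike the try-catch approach, this also handles the case where the files were set but the commit itself failed afterwards, since the flag is only raised once the commit succeeds.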


was (Author: msfroh):
It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
catch an IOException when e.g. committing the merge.

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051558#comment-17051558
 ] 

Michael Froh commented on LUCENE-8962:
--

It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
catch an IOException when e.g. committing the merge.

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050607#comment-17050607
 ] 

Michael Froh commented on LUCENE-8962:
--

I ended up splitting {{testMergeOnCommit}} into two test cases.

One runs through the basic invariants on a single thread and confirms that 
everything behaves as expected.

The other tries indexing and committing from multiple threads, but doesn't 
really make any assumptions about the segment topology in the end (since 
randomness and concurrency can lead to all kinds of possible valid segment 
counts). Instead it just verifies that it doesn't fail and doesn't lose any 
documents.

https://github.com/apache/lucene-solr/pull/1313

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049700#comment-17049700
 ] 

Michael Froh commented on LUCENE-8962:
--

Posted a PR with fixes for the above test failures: 
[https://github.com/apache/lucene-solr/pull/1307]

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049666#comment-17049666
 ] 

Michael Froh commented on LUCENE-8962:
--

I was able to reproduce the {{testMergeOnCommit}} failure on master sometimes 
with the following options:

{{-Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit 
-Dtests.seed=F8DD5AD20994FDDF -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=fi-FI -Dtests.timezone=America/Danmarkshavn -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8}}

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049651#comment-17049651
 ] 

Michael Froh commented on LUCENE-8962:
--

Regarding {{TestIndexWriter.testThreadInterruptDeadlock}}, I think that's a bug 
in the implementation. 

When waiting for merges to complete, I added a {{catch}} for 
{{InterruptedException}} that sets the interrupt flag and throws an 
{{IOException}}. The documented behavior of {{IndexWriter}} is to clear the 
interrupt flag and throw {{ThreadInterruptedException}}. 

Again, not sure why the tests on master didn't fail. Maybe we just got lucky 
with the branch_8x tests.
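
The two conventions can be contrasted in a small self-contained sketch. The nested {{ThreadInterruptedException}} below is a stand-in for {{org.apache.lucene.util.ThreadInterruptedException}}; this is illustrative code, not the actual {{IndexWriter}} implementation. The key difference is what the thread's interrupt flag looks like afterwards.

```java
import java.io.IOException;

public class InterruptDemo {
    // Stand-in for org.apache.lucene.util.ThreadInterruptedException.
    static class ThreadInterruptedException extends RuntimeException {
        ThreadInterruptedException(InterruptedException cause) { super(cause); }
    }

    // The buggy handler described above: restore the interrupt flag,
    // then throw a checked IOException.
    static void buggyWait() throws IOException {
        try {
            Thread.sleep(5);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // flag is set again
            throw new IOException(ie);
        }
    }

    // Documented IndexWriter behavior: the catch of InterruptedException
    // already cleared the flag; leave it cleared and propagate an
    // unchecked ThreadInterruptedException.
    static void documentedWait() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException ie) {
            throw new ThreadInterruptedException(ie);
        }
    }

    public static void main(String[] args) {
        // Interrupt ourselves so sleep() throws immediately.
        Thread.currentThread().interrupt();
        try { buggyWait(); } catch (IOException expected) { }
        // true: the buggy handler left the interrupt flag set
        System.out.println("flag after buggy handler: " + Thread.interrupted());

        Thread.currentThread().interrupt();
        try { documentedWait(); } catch (ThreadInterruptedException expected) { }
        // false: the documented behavior leaves the flag cleared
        System.out.println("flag after documented handler: "
            + Thread.currentThread().isInterrupted());
    }
}
```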

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049641#comment-17049641
 ] 

Michael Froh commented on LUCENE-8962:
--

I think the failure in {{testMergeOnCommit}} occurs because of a difference in 
the random behavior of the test. 

Specifically, sometimes the last writing thread happens to choose to 
{{commit()}} at the end, so there are no pending changes by the time we do the 
last {{commit()}}, which should merge all segments (or abandon the merge if it 
takes too long).

If we add one more doc before that last commit (ensuring that the 
{{anyChanges}} check in {{IndexWriter.prepareCommitInternal()}} is {{true}}), 
the test passes consistently. 

I'm not sure why we don't see the same failure sometimes on master, though.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049564#comment-17049564
 ] 

Michael Froh commented on LUCENE-8962:
--

I'm looking into the {{branch_8x}} failures. 

I'm able to reproduce on my machine and will step through to see what's 
different.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-02-02 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028727#comment-17028727
 ] 

Michael Froh commented on LUCENE-8962:
--

bq. Yeah I think you are right!  That would be a nice simplification.  Probably 
this can just be folded into the existing MergePolicy API as a different 
MergeTrigger.  Though then I wonder why e.g. forceMerge or expungeDeletes are 
not also simply different triggers ... Michael Froh what do you think?

As I was first writing this, I added a {{MergeTrigger.COMMIT}} value and used 
that, rather than adding a dedicated method. 

Then I realized that any time I've ever written a custom implementation of 
{{MergePolicy.findMerges()}}, I've ignored the {{MergeTrigger}} value, because 
I didn't really care what triggered the merge -- I just wanted to define the 
{{MergeSpecification}}. Even {{TieredMergePolicy.findMerges()}} doesn't look 
at the {{MergeTrigger}} parameter. 

If I had made {{IndexWriter}} call {{findMerges}} with a 
{{MergeTrigger.COMMIT}} trigger, anyone with a similar {{MergePolicy}} would 
have probably ended up running (and blocking on) some pretty expensive merges 
on commit. The best way I could think of to be backwards compatible with the 
"old" behavior by default was to add a no-op method to the base class.

Looking through the history, it looks like {{forceMerge}} and 
{{expungeDeletes}} predate {{MergeTrigger}}, so that could explain them.

I really like the idea of controlling this with a {{MergeTrigger}}, but I'm 
concerned about breaking existing {{MergePolicy}} implementations that ignore 
the {{MergeTrigger}} (which I suspect may be most of them).
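
The compatibility concern can be sketched with toy stand-ins (these are not the real Lucene classes, and the method and enum names are simplified). A policy that ignores the trigger would schedule expensive merges if commit reused {{findMerges}}; a separate no-op base method keeps the old behavior by default.

```java
import java.util.List;

// Simplified stand-ins for Lucene's MergeTrigger / MergePolicy API.
enum MergeTrigger { SEGMENT_FLUSH, FULL_FLUSH, COMMIT /* hypothetical new value */ }

abstract class MergePolicy {
    abstract List<String> findMerges(MergeTrigger trigger, List<String> segments);

    // Proposed no-op base method: subclasses that never inspect the
    // trigger keep the old behavior (no merges on commit) unless they
    // deliberately override this.
    List<String> findFullFlushMerges(List<String> segments) {
        return List.of(); // default: do nothing at commit time
    }
}

// A typical custom policy that ignores the trigger entirely, as described
// above. If IndexWriter passed MergeTrigger.COMMIT into findMerges(), this
// policy would schedule (and block the commit on) merging everything.
class MergeEverythingPolicy extends MergePolicy {
    @Override
    List<String> findMerges(MergeTrigger trigger, List<String> segments) {
        return segments; // merge everything, regardless of trigger
    }
}

public class TriggerDemo {
    public static void main(String[] args) {
        MergePolicy policy = new MergeEverythingPolicy();
        List<String> segs = List.of("_0", "_1", "_2");
        // Routed through findMerges, commit would pick up every segment...
        System.out.println("via findMerges(COMMIT): "
            + policy.findMerges(MergeTrigger.COMMIT, segs));
        // ...while the dedicated no-op method leaves commit unchanged.
        System.out.println("via findFullFlushMerges: "
            + policy.findFullFlushMerges(segs));
    }
}
```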

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-01-30 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027019#comment-17027019
 ] 

Michael Froh commented on LUCENE-8962:
--

[~dsmiley] – in your test, the merge executes after the commit updates the 
{{IndexWriter}}'s live {{SegmentInfos}}. When you call 
{{DirectoryReader.open}}, it takes another clone of that live {{SegmentInfos}} 
(which has 1 segment).

However, the clone of the {{SegmentInfos}} that was written in the commit is 
from before the merge. If you were to open a fresh {{DirectoryReader}} from the 
on-disk directory, I believe you would still see 9 segments.

With the approach I took, the cheap merge (or merges) asynchronously updates 
the commit's {{SegmentInfos}} clone before the commit happens.
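
A toy model of the point-in-time clone semantics (plain lists standing in for {{SegmentInfos}}; this is not Lucene code, just the shape of the argument):

```java
import java.util.ArrayList;
import java.util.List;

public class CloneDemo {
    public static void main(String[] args) {
        // Toy model of IndexWriter's live SegmentInfos: just segment names.
        List<String> liveSegmentInfos = new ArrayList<>(
            List.of("_0", "_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8"));

        // commit() takes a point-in-time clone of the live SegmentInfos.
        List<String> commitClone = new ArrayList<>(liveSegmentInfos);

        // A merge that completes after the clone was taken updates only
        // the live SegmentInfos (9 segments -> 1), not the commit's clone.
        liveSegmentInfos.clear();
        liveSegmentInfos.add("_9"); // the merged segment

        // What DirectoryReader.open on the live writer sees: 1 segment.
        System.out.println("live segments: " + liveSegmentInfos.size());
        // What a fresh reader on the on-disk commit would see: 9 segments.
        System.out.println("committed segments: " + commitClone.size());
    }
}
```

The proposed change effectively lets the cheap commit-time merges rewrite {{commitClone}} before it is written, rather than only updating the live view.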

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-01-28 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025313#comment-17025313
 ] 

Michael Froh commented on LUCENE-8962:
--

Thanks [~msoko...@gmail.com] for the feedback on the PR! I've updated it to 
incorporate your suggestions.







[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-01-19 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019012#comment-17019012
 ] 

Michael Froh commented on LUCENE-8962:
--

Here's a before-and-after comparison of the average number of segments searched 
per request since I applied this change (with a {{TieredMergePolicy}} subclass that 
tries to merge all segments smaller than 100MB into a single segment on commit, 
with {{floorSegmentMB}} of 500). It lowers the overall count and, in particular, 
significantly reduces the variance.

 

!LUCENE-8962_demo.png!
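The selection logic behind that subclass can be sketched in plain Java (no Lucene dependency; {{selectSmall}} is a hypothetical stand-in for the decision the merge policy makes, not Lucene's actual API):

```java
import java.util.ArrayList;
import java.util.List;

public class SmallSegmentSelector {
    // Hypothetical stand-in for the TieredMergePolicy subclass described
    // above: pick every segment under the size threshold so they can be
    // merged into a single segment at commit time.
    static List<Integer> selectSmall(List<Integer> segmentSizesMB, int thresholdMB) {
        List<Integer> toMerge = new ArrayList<>();
        for (int size : segmentSizesMB) {
            if (size < thresholdMB) {
                toMerge.add(size);
            }
        }
        // Only propose a merge if it actually reduces the segment count.
        return toMerge.size() > 1 ? toMerge : List.of();
    }

    public static void main(String[] args) {
        // Eight tiny flush segments plus two big ones; the tiny ones collapse
        // into a single commit-time merge, leaving 3 segments instead of 10.
        List<Integer> sizes = List.of(2, 3, 1, 5, 850, 4, 2, 1200, 3, 2);
        System.out.println(selectSmall(sizes, 100)); // [2, 3, 1, 5, 4, 2, 3, 2]
    }
}
```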







[jira] [Updated] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-01-19 Thread Michael Froh (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Froh updated LUCENE-8962:
-
Attachment: LUCENE-8962_demo.png







[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-01-07 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010210#comment-17010210
 ] 

Michael Froh commented on LUCENE-8962:
--

I ended up needing something like this, not for NRT readers, but rather on 
commit.

 

I added a mechanism to compute cheap "commit merges" from within the 
{{prepareCommitInternal()}} call, and to block until they complete (updating the 
"toCommit" {{SegmentInfos}} as they finish). I posted a PR for that here: 
[https://github.com/apache/lucene-solr/pull/1155]

 

I think we could do something similar from {{IndexWriter.getReader()}} to handle 
the NRT case, but I haven't tried working on that yet.
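The "block until they complete" part can be illustrated with a small, self-contained concurrency sketch (hypothetical names; the real PR wires this into Lucene's merge scheduler rather than raw threads):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class CommitMergeWait {
    // Sketch of the commit-time wait: each background "commit merge" counts
    // the latch down once it has updated the toCommit view, and the committing
    // thread blocks until every merge has finished.
    static int runCommitMerges(int mergeCount) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(mergeCount);
        AtomicInteger merged = new AtomicInteger();
        for (int i = 0; i < mergeCount; i++) {
            new Thread(() -> {
                merged.incrementAndGet(); // simulate updating toCommit SegmentInfos
                done.countDown();
            }).start();
        }
        done.await(); // prepareCommit blocks here until the cheap merges finish
        return merged.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runCommitMerges(4)); // 4
    }
}
```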



