[ 
https://issues.apache.org/jira/browse/LUCENE-9337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-9337.
-------------------------------------
    Fix Version/s: 8.6
                   master (9.0)
         Assignee: Simon Willnauer
       Resolution: Fixed

> CMS might miss to pickup pending merges when maxMergeCount changes while 
> merges are running
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9337
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9337
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Major
>             Fix For: master (9.0), 8.6
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> We found a test hanging in IW#forceMerge on Elastic's CI, in an 
> innocent-looking test:
> {noformat}
> 14:52:06    [junit4]   2>         at java.base@11.0.2/java.lang.Object.wait(Native Method)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4722)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2034)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1960)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:500)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1301)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1258)
> 14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.testZeroOrMin(BaseDocValuesFormatTestCase.java:2423)
> {noformat}
> After spending quite some time trying to reproduce this without any luck, I 
> reviewed all the code involved again to understand possible threading issues. 
> What I found is that if maxMergeCount is changed on CMS while merges are 
> running, and the forceMerge is kicked off at the same time the running merges 
> return, we might fail to pick up the final pending merges, which causes the 
> forceMerge to hang. I was able to build a test case that is very likely to 
> fail on every run without the fix. While I don't think this is a critical 
> bug, judging by how likely it is to happen in practice, when it does happen 
> it's basically a deadlock unless the IW sees some other change that kicks 
> off a merge.
> Let me walk through the issue. Let's say we have 1 pending merge and 2 merge 
> threads running on CMS. The forceMerge is already waiting for merges to 
> finish. Once the first merge thread finishes, we check whether we need to 
> stall it 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580]
>  but since it's a merge thread we return 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L596]
>  and don't pick up another merge 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L526].
>  
> Now the second running merge thread checks the condition 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580]
>  while the first one is finishing up. But before the first thread can 
> actually update the internal data structures 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L688]
>  it releases the CMS lock, so the calculation in the stall method of how many 
> threads are running is off. This causes the second thread to also step out 
> of the maybeStall method without picking up the pending merge.
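
The interleaving described above can be replayed in a small single-threaded toy model. This is only a sketch, not Lucene code: the class StallRaceSketch, its fields, and its methods are hypothetical names invented for illustration. It assumes the limit is already 2 while two merge threads are registered, and it replays both threads finishing before either one has deregistered.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the described interleaving. Hypothetical names, not the Lucene API. */
class StallRaceSketch {
    static int pendingMerges;                       // merges waiting to be picked up
    static int maxMergeCount;                       // limit after it was changed
    static final Set<String> runningThreads = new HashSet<>();

    /** Models the described check: a merge thread is never stalled, it just returns false. */
    static boolean maybeStall(String caller) {
        if (pendingMerges > 0 && runningThreads.size() >= maxMergeCount) {
            if (runningThreads.contains(caller)) {
                return false; // merge thread steps out without picking up work
            }
            // A non-merge thread would block here in a real scheduler; the toy model does not.
        }
        return true;
    }

    /** Called by a merge thread once its own merge is done. */
    static void onMergeFinished(String caller) {
        if (maybeStall(caller)) {
            pendingMerges--; // stands in for starting the pending merge
        }
    }

    /** The bookkeeping step that happens only after the lock was released. */
    static void deregister(String caller) {
        runningThreads.remove(caller);
    }

    /** Replays the interleaving and returns how many merges are still pending. */
    static int simulate() {
        pendingMerges = 1;
        maxMergeCount = 2;
        runningThreads.clear();
        runningThreads.add("merge-1");
        runningThreads.add("merge-2");

        onMergeFinished("merge-1"); // sees 2 >= 2 and returns without picking up
        // merge-1 has released the lock here but has not deregistered yet
        onMergeFinished("merge-2"); // still sees 2 >= 2 and also returns
        deregister("merge-1");
        deregister("merge-2");
        return pendingMerges;
    }

    public static void main(String[] args) {
        int pending = simulate();
        System.out.println("running=" + runningThreads.size() + " pending=" + pending);
    }
}
```

Running it prints running=0 pending=1: both threads have exited, yet the pending merge was never started, which is the state in which the forceMerge keeps waiting.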



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
