Simon Willnauer created LUCENE-9337:
---------------------------------------

             Summary: CMS might miss to pickup pending merges when 
maxMergeCount changes while merges are running
                 Key: LUCENE-9337
                 URL: https://issues.apache.org/jira/browse/LUCENE-9337
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Simon Willnauer


We found a test hanging in IW#forceMerge on Elastic's CI, in an 
innocent-looking test:
{noformat}
14:52:06    [junit4]   2>         at java.base@11.0.2/java.lang.Object.wait(Native Method)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4722)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2034)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1960)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:500)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1301)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1258)
14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.testZeroOrMin(BaseDocValuesFormatTestCase.java:2423)
{noformat}
After spending quite some time trying to reproduce it without any luck, I 
reviewed all the code involved again to understand possible threading issues. 
What I found is that if maxMergeCount gets changed on CMS while merges are 
running, and the forceMerge gets kicked off at the same time the running merges 
return, we might fail to pick up the final pending merges, which causes the 
forceMerge to hang. I was able to build a test case that is very likely to fail 
on every run without the fix. While I don't think this is a critical bug, given 
how unlikely it is to happen in practice, if it does happen it's basically a 
deadlock unless the IW sees some other change that kicks off a merge.

Let me walk through the issue. Let's say we have 1 pending merge and 2 merge 
threads running on CMS. The forceMerge is already waiting for merges to finish. 
Once the first merge thread finishes, we check whether we need to stall it 
[here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580]
 but since it's a merge thread we return 
[here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L596]
 and don't pick up another merge 
[here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L526].
 
Now the second running merge thread checks the condition 
[here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L580]
  while the first one is finishing up. But before the first thread can actually 
update the internal data structures 
[here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.5.1/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L688]
 it releases the CMS lock, so the calculation in the stall method of how many 
threads are running is off, causing the second thread to also step out of the 
maybeStall method without picking up the pending merge.
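
The interleaving above can be sketched with a small, single-threaded model. This is not Lucene code: the class MissedMergeModel, its fields, and its methods are hypothetical stand-ins, and the limit of 2 is an assumed value of maxMergeCount after the change. The model only replays the two orderings of "second thread checks" versus "first thread deregisters" to show why the pending merge can be left behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, single-threaded model of the interleaving; not Lucene code.
final class MissedMergeModel {
    int maxMergeCount = 2; // assumed limit after the change
    int pendingMerges = 1; // one merge still waiting to be picked up
    final List<String> mergeThreads = new ArrayList<>(List.of("A", "B"));

    // Same shape as the check described above: a merge thread is never
    // stalled, it returns false and leaves without taking another merge.
    boolean maybeStall(String thread) {
        if (pendingMerges > 0 && mergeThreads.size() >= maxMergeCount) {
            if (mergeThreads.contains(thread)) {
                return false;
            }
            throw new IllegalStateException("a non-merge thread would stall here");
        }
        return true;
    }

    // A merge thread has finished its merge and looks for more work.
    void finishMerge(String thread) {
        if (maybeStall(thread) && pendingMerges > 0) {
            pendingMerges--; // stands in for starting the pending merge
        }
    }

    // The thread removes itself from the scheduler's bookkeeping.
    void deregister(String thread) {
        mergeThreads.remove(thread);
    }

    // Replays one ordering and returns how many merges are left pending.
    static int pendingAfter(boolean firstDeregistersBeforeSecondChecks) {
        MissedMergeModel cms = new MissedMergeModel();
        cms.finishMerge("A"); // A is a merge thread at the limit: no pickup
        if (firstDeregistersBeforeSecondChecks) {
            cms.deregister("A");
            cms.finishMerge("B"); // count is already down to 1
        } else {
            cms.finishMerge("B"); // count is stale: A is still registered
            cms.deregister("A");
        }
        cms.deregister("B");
        return cms.pendingMerges;
    }

    public static void main(String[] args) {
        System.out.println("stale count:   " + pendingAfter(false));
        System.out.println("updated count: " + pendingAfter(true));
    }
}
```

In the stale-count ordering both threads leave while the merge is still pending, and in this model nothing else will pick it up, which matches the hang seen in forceMerge. The model says nothing about how the actual fix is implemented.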



