[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190870#comment-17190870 ] Kezhu Wang commented on FLINK-18815: [~trohrmann] Your are right, I made mistake. After revise, I found {{AbstractCloseableRegistryTest.testClose}} itself is faulty for {{SafetyNetCloseableRegistry}}. The failed assertion could happens before {{innerCloseable.close}}. This could be easily verified by place endless loop before {{innerCloseable.close}}. I think we can fix this with following changes: 1. Join {{CloseableReaperThread}} before assertion for {{SafetyNetCloseableRegistryTest}}. This should fix recently failed cases. 2. Replace {{CloseableReaperThread.interrupt}} with {{CloseableReaperThread.stop}}, eg. set {{CloseableReaperThread.running}} to false without interrupting {{CloseableReaperThread}}. Replace {{CloseableReaperThread.referenceQueue.remove()}} with {{CloseableReaperThread.referenceQueue.remove(long timeout)}}. This prevent blocking {{Closeable.close}} from interruption, though I think it is not a must since {{Closeable.close}} are not allowed to throw {{InterruptedException}} in spec. It is a friendly improvement for registered {{Closeable}}. > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190687#comment-17190687 ] Till Rohrmann commented on FLINK-18815: --- Thanks for the analysis [~kezhuw]. I am not totally convinced that what you describe is the problem we are seeing here. The problem with your explanation is that {{SafetyNetCloseableRegistry.close()}} should explicitly close all registered closeables which are still contained in {{AbstractCloseableRegistry.closeableToRef}}. Hence, it should not matter whether the reaper thread still have some phantom references enqueued in its {{referenceQueue}}. Once {{close()}} is called, no further {{Closeables}} should be able to register at the registry. The only way I could see this problem happening is after we have removed the closeable from {{closeableRegistry}} in {{PhantomDelegatingCloseableRef.close()}} we interrupt the {{innerCloseable.close()}} call (https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/SafetyNetCloseableRegistry.java#L183). However, in the test implementation we are using {{TestStream}} as {{Closeable}} and the {{close()}} implementation is not blocking. > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190472#comment-17190472 ] Dian Fu commented on FLINK-18815: - Another instance: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6180=logs=3b6ec2fd-a816-5e75-c775-06fb87cb6670=b33fdd4f-3de5-542e-2624-5d53167bb672] > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190276#comment-17190276 ] Kezhu Wang commented on FLINK-18815: Seems that recently two cases are leaking while previous cases are duplicated closing. I think these two cases are caused by {{SafetyNetCloseableRegistry.close}} which interrupt reaper thread. Suppose that: 1. A closeable became phantom reachable and queued in {{CloseableReaperThread.referenceQueue}} but did not get a chance to close. 2. {{SafetyNetCloseableRegistry.close}} calls {{CloseableReaperThread.interrupt}} which set {{CloseableReaperThread.running}} to false and interrupt that java thread. 3. {{CloseableReaperThread}} terminates due to false {{CloseableReaperThread.running}} or {{InterruptedException}}. 4. That enqueued closeable leaks. I think there are two different approaches to fix this issue: * Use at most one {{CloseableReaperThread}}, and don't close it. This may cause leaking if Flink is embedded as guest in other host application. * Count registered phantom references, and close reaper thread only if all registered phantom references are popped and {{CloseableReaperThread}} is dropped by caller. Since Flink is not an end stop application, I think the counting approach maybe more appropriate ? As a analogy, {{java.lang.ref.Cleaner}} has no close-like method, it [tracks all registered referents|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java#L65], its underlying thread will terminate after [itself|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java#L101] and [all registered references|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java#L133] are cleaned. [~kevin.cyj] [~dian.fu] [~trohrmann] Any thoughts ? > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190076#comment-17190076 ] Yingjie Cao commented on FLINK-18815: - Another instance: [https://dev.azure.com/kevin-flink/flink/_build/results?buildId=77=logs=6e58d712-c5cc-52fb-0895-6ff7bd56c46b=f30a8e80-b2cf-535c-9952-7f521a4ae374] > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189796#comment-17189796 ] Dian Fu commented on FLINK-18815: - This issue reproduced on master: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6116=logs=3b6ec2fd-a816-5e75-c775-06fb87cb6670=2aff8966-346f-518f-e6ce-de64002a5034] > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Robert Metzger >Assignee: Kezhu Wang >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.10.2, 1.12.0, 1.11.2 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175360#comment-17175360 ] Till Rohrmann commented on FLINK-18815: --- Hi [~kezhuw], thanks a lot for this detailed analysis. I think it is correct. As you suggested the proper fix would be to let {{PhantomDelegatingCloseableRef}} call {{innerCloseable.close()}} only iff {{innerCloseable}} could be successfully removed from the closeable registry via {{closeableRegistry.removeCloseableInternal(innerCloseable)}}. Please go ahead and provide a fix for it. I've assigned this ticket to you. > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable
[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173926#comment-17173926 ] Kezhu Wang commented on FLINK-18815: I think this is caused by {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}}. It unconditionally closes underlying {{innerCloseable}} without checking whether it was closed or unregistered before. Suppose that: # At some point, {{innerCloseable}}'s wrapping closeable was finalized and become phantom reference, its phantom reference was pushed to {{SafetyNetCloseableRegistry#REAPER_THREAD#referenceQueue}}. # The closeable registry containing {{innerCloseable}} was closed thus {{innerCloseable}} got closed and {{AbstractCloseableRegistry#closeableToRef}} got cleared. # {{SafetyNetCloseableRegistry#REAPER_THREAD}} executes {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}} which close {{innerCloseable}} unconditionally. If we change {{Thread.sleep(2)}} from {{AbstractCloseableRegistryTest.ProducerThread}} to {{Thread.sleep(0)}}, then {{AbstractCloseableRegistryTest#testClose}} fails quite often. Though {{Closeable#close}} declares itself be idempotent, but it does not specify any thread-safe guarantee, so I think this should be treated as bug and thus need to fix. I think this could be fixed by closing {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#innerCloseable}} iff pair of {{innerCloseable}} and {{PhantomDelegatingCloseableRef}} was successfully from {{AbstractCloseableRegistry#closeableToRef}}. I think this should be sufficient to fix concurrent removing/closing of {{Closeable}} in {{SafetyNetCloseableRegistry}} due to following observation: # There are only four removing paths: {{SafetyNetCloseableRegistry#close}}, {{ClosingFSDataInputStream#close}}, {{SafetyNetCloseableRegistry#unregisterCloseable}} and {{PhantomDelegatingCloseableRef#close}}. # I think the first three paths by design should have no concurrent contention. # {{PhantomDelegatingCloseableRef#close}} has no concurrent contention with {{SafetyNetCloseableRegistry#unregisterCloseable}} and {{ClosingFSDataInputStream#close}} since it is phantom reference of later removing/closing stream. [~rmetzger] [~dianfu] [~aljoscha] [~srichter] Any opinions on this ? If we tend to fix this, could I take over it ? > AbstractCloseableRegistryTest.testClose unstable > > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)