[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-04 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190870#comment-17190870
 ] 

Kezhu Wang commented on FLINK-18815:


[~trohrmann] Your are right, I made mistake. After revise, I found 
{{AbstractCloseableRegistryTest.testClose}} itself is faulty for 
{{SafetyNetCloseableRegistry}}. The failed assertion could happens before 
{{innerCloseable.close}}. This could be easily verified by place endless loop 
before {{innerCloseable.close}}.

I think we can fix this with following changes:
 1. Join {{CloseableReaperThread}} before assertion for 
{{SafetyNetCloseableRegistryTest}}.  This should fix recently failed cases.
 2. Replace {{CloseableReaperThread.interrupt}} with 
{{CloseableReaperThread.stop}}, eg. set {{CloseableReaperThread.running}} to 
false without interrupting {{CloseableReaperThread}}. Replace 
{{CloseableReaperThread.referenceQueue.remove()}} with 
{{CloseableReaperThread.referenceQueue.remove(long timeout)}}. This prevent 
blocking {{Closeable.close}} from interruption, though I think it is not a must 
since {{Closeable.close}} are not allowed to throw {{InterruptedException}} in 
spec. It is a friendly improvement for registered {{Closeable}}.

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-04 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190687#comment-17190687
 ] 

Till Rohrmann commented on FLINK-18815:
---

Thanks for the analysis [~kezhuw]. I am not totally convinced that what you 
describe is the problem we are seeing here. The problem with your explanation 
is that {{SafetyNetCloseableRegistry.close()}} should explicitly close all 
registered closeables which are still contained in 
{{AbstractCloseableRegistry.closeableToRef}}. Hence, it should not matter 
whether the reaper thread still have some phantom references enqueued in its 
{{referenceQueue}}. Once {{close()}} is called, no further {{Closeables}} 
should be able to register at the registry.

The only way I could see this problem happening is after we have removed the 
closeable from {{closeableRegistry}} in 
{{PhantomDelegatingCloseableRef.close()}} we interrupt the 
{{innerCloseable.close()}} call 
(https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/SafetyNetCloseableRegistry.java#L183).
 However, in the test implementation we are using {{TestStream}} as 
{{Closeable}} and the {{close()}} implementation is not blocking.

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-03 Thread Dian Fu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190472#comment-17190472
 ] 

Dian Fu commented on FLINK-18815:
-

Another instance: 
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6180=logs=3b6ec2fd-a816-5e75-c775-06fb87cb6670=b33fdd4f-3de5-542e-2624-5d53167bb672]

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-03 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190276#comment-17190276
 ] 

Kezhu Wang commented on FLINK-18815:


Seems that recently two cases are leaking while previous cases are duplicated 
closing. I think these two cases are caused by 
{{SafetyNetCloseableRegistry.close}} which interrupt reaper thread. Suppose 
that:
 1. A closeable became phantom reachable and queued in 
{{CloseableReaperThread.referenceQueue}} but did not get a chance to close.
 2. {{SafetyNetCloseableRegistry.close}} calls 
{{CloseableReaperThread.interrupt}} which set {{CloseableReaperThread.running}} 
to false and interrupt that java thread.
 3. {{CloseableReaperThread}} terminates due to false 
{{CloseableReaperThread.running}} or {{InterruptedException}}.
 4. That enqueued closeable leaks.

I think there are two different approaches to fix this issue:
 * Use at most one {{CloseableReaperThread}}, and don't close it. This may 
cause leaking if Flink is embedded as guest in other host application.
 * Count registered phantom references, and close reaper thread only if all 
registered phantom references are popped and {{CloseableReaperThread}} is 
dropped by caller.

Since Flink is not an end stop application, I think the counting approach maybe 
more appropriate ?

As a analogy, {{java.lang.ref.Cleaner}} has no close-like method, it [tracks 
all registered 
referents|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java#L65],
 its underlying thread will terminate after 
[itself|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java#L101]
 and [all registered 
references|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java#L133]
 are cleaned.

[~kevin.cyj] [~dian.fu] [~trohrmann] Any thoughts ?

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-03 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190076#comment-17190076
 ] 

Yingjie Cao commented on FLINK-18815:
-

Another instance: 
[https://dev.azure.com/kevin-flink/flink/_build/results?buildId=77=logs=6e58d712-c5cc-52fb-0895-6ff7bd56c46b=f30a8e80-b2cf-535c-9952-7f521a4ae374]

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-09-02 Thread Dian Fu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189796#comment-17189796
 ] 

Dian Fu commented on FLINK-18815:
-

This issue reproduced on master:
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6116=logs=3b6ec2fd-a816-5e75-c775-06fb87cb6670=2aff8966-346f-518f-e6ce-de64002a5034]

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Robert Metzger
>Assignee: Kezhu Wang
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.2, 1.12.0, 1.11.2
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-08-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175360#comment-17175360
 ] 

Till Rohrmann commented on FLINK-18815:
---

Hi [~kezhuw], thanks a lot for this detailed analysis. I think it is correct. 
As you suggested the proper fix would be to let 
{{PhantomDelegatingCloseableRef}} call {{innerCloseable.close()}} only iff 
{{innerCloseable}} could be successfully removed from the closeable registry 
via {{closeableRegistry.removeCloseableInternal(innerCloseable)}}.

Please go ahead and provide a fix for it. I've assigned this ticket to you.

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18815) AbstractCloseableRegistryTest.testClose unstable

2020-08-09 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173926#comment-17173926
 ] 

Kezhu Wang commented on FLINK-18815:


I think this is caused by 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}}. It 
unconditionally closes underlying {{innerCloseable}} without checking whether 
it was closed or unregistered before. Suppose that:
 # At some point, {{innerCloseable}}'s wrapping closeable was finalized and 
become phantom reference, its phantom reference was pushed to 
{{SafetyNetCloseableRegistry#REAPER_THREAD#referenceQueue}}.
 # The closeable registry containing {{innerCloseable}} was closed thus 
{{innerCloseable}} got closed and {{AbstractCloseableRegistry#closeableToRef}} 
got cleared.

 # {{SafetyNetCloseableRegistry#REAPER_THREAD}} executes 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}} which close 
{{innerCloseable}} unconditionally.

If we change {{Thread.sleep(2)}} from 
{{AbstractCloseableRegistryTest.ProducerThread}} to {{Thread.sleep(0)}}, then 
{{AbstractCloseableRegistryTest#testClose}} fails quite often.

Though {{Closeable#close}} declares itself be idempotent, but it does not 
specify any thread-safe guarantee, so I think this should be treated as bug and 
thus need to fix.

I think this could be fixed by closing 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#innerCloseable}} iff 
pair of {{innerCloseable}} and {{PhantomDelegatingCloseableRef}} was 
successfully from {{AbstractCloseableRegistry#closeableToRef}}.

I think this should be sufficient to fix concurrent removing/closing of 
{{Closeable}} in {{SafetyNetCloseableRegistry}} due to following observation:
 # There are only four removing paths: {{SafetyNetCloseableRegistry#close}}, 
{{ClosingFSDataInputStream#close}}, 
{{SafetyNetCloseableRegistry#unregisterCloseable}} and 
{{PhantomDelegatingCloseableRef#close}}.
 # I think the first three paths by design should have no concurrent contention.
 # {{PhantomDelegatingCloseableRef#close}} has no concurrent contention with 
{{SafetyNetCloseableRegistry#unregisterCloseable}} and 
{{ClosingFSDataInputStream#close}} since it is phantom reference of later 
removing/closing stream.

[~rmetzger] [~dianfu] [~aljoscha] [~srichter] Any opinions on this ? If we tend 
to fix this, could I take over it ?

> AbstractCloseableRegistryTest.testClose unstable
> 
>
> Key: FLINK-18815
> URL: https://issues.apache.org/jira/browse/FLINK-18815
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Tests
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)