[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Stefan Richter (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240153#comment-16240153
 ] 

Stefan Richter edited comment on FLINK-7880 at 11/6/17 11:20 AM:
-

Yes, but the test seems to assume that waiting for {{CancellationSuccess}} 
includes a successful cleanup, or its author was simply not aware of how 
important proper cleanup is with native resources. In any case, I think the 
origin of this problem might be that other IT cases were taken as a blueprint, 
and I have seen different patterns for this "wait until the job is gone" 
problem in different tests. Many of them might be similar to this one, but they 
will often look correct or cause no trouble as long as no native libraries are 
involved (e.g. the test only uses the heap backend). I would suggest that there 
should be one and only one simple, idiomatic way (maybe a single helper class) 
of waiting for a job to go away and release all its resources, and that it is 
used throughout all tests that actually want this behaviour. Otherwise, for 
example, extending an existing test to include a different backend can suddenly 
uncover the improper cleanup and make tests randomly fail with a JVM crash. 

Having a clear way to end IT cases could help us avoid chasing serious-looking 
but misleading test failures that seem to originate from the RocksDB backend 
code but are actually test problems caused by improper cleanup. What do you 
think?
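
As a rough illustration of the "one helper class" idea, here is a minimal 
sketch of a polling helper with a deadline. The class name 
{{JobShutdownWaiter}}, the method {{awaitFullyReleased}} and the 
{{isFullyReleased}} check are hypothetical and not part of Flink; how a test 
would actually determine that a job has released all of its resources is left 
open here.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper (not a Flink class): one shared way to wait for cleanup. */
final class JobShutdownWaiter {

    private JobShutdownWaiter() {}

    /**
     * Polls the given check until it reports true or the timeout expires.
     *
     * @param isFullyReleased assumed to return true only once the job is gone
     *                        AND its resources (including native ones) are released
     */
    static void awaitFullyReleased(
            BooleanSupplier isFullyReleased, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (!isFullyReleased.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                        "Job did not release its resources within " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

The point of the sketch is only that every test would go through the same 
call, so the definition of "the job is gone" lives in one place instead of 
being copied from test to test.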


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.4.0
> Reporter: Till Rohrmann
> Assignee: Kostas Kloudas
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-03 Thread Gary Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237616#comment-16237616
 ] 

Gary Yao edited comment on FLINK-7880 at 11/3/17 1:55 PM:
--

I have further isolated the problem. It still appears even when running a 
single test method. I tried 
{{NonHAQueryableStateRocksDBBackendITCase#testValueState}} and even replaced 
the state query with a {{Thread.sleep(400)}}. My changes to the code are 
documented here: 
https://github.com/apache/flink/compare/master...GJL:FLINK-7880?expand=1

To run the tests, I use the command below. Multiple iterations may be required, 
hence the {{while}} loop.
{noformat}
# Re-run the build until it fails: the loop ends on the first non-zero exit code from mvn.
while mvn -o clean verify -Dtest=NonHAQueryableStateRocksDBBackendITCase -DfailIfNoTests=false -Dcheckstyle.skip; do :; done
{noformat}

The failure I get is the same as the one mentioned by [~kkl0u]:
{noformat}
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
{noformat}






[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-03 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237584#comment-16237584
 ] 

Kostas Kloudas edited comment on FLINK-7880 at 11/3/17 1:38 PM:


I will keep investigating, *BUT* I commented out the {{queryable state}} code 
from the tests and they still seem to fail, even locally, with:

{code}
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
/bin/sh: line 1: 10553 Abort trap: 6   /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseSerialGC -jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefirebooter6530581495710923655.jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire7293759706604721607tmp /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire_87379374358455308985tmp
{code}

My code is at https://github.com/kl0u/flink/tree/more-than-qs. To reproduce, 
go to the {{flink-queryable-state}} directory and run {{mvn verify}} a few 
times. At least on my machine, this reproduces the problem.

So the root problem may not be in Queryable State itself but in the sequence 
of steps taken when shutting down the RocksDB state backend.
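
To make the "sequence of steps" concern concrete, here is a generic sketch of 
one way to enforce shutdown ordering: the owner of a native handle refuses new 
users once closing has begun and frees the handle only after every in-flight 
user has finished. This is an illustration under assumptions, not the RocksDB 
backend's actual code; {{GuardedNativeResource}} and {{Lease}} are hypothetical 
names, and this comment does not establish which shutdown step is actually out 
of order.

```java
/** Hypothetical guard (not a Flink class) around something that owns a native handle. */
final class GuardedNativeResource implements AutoCloseable {

    private final Object lock = new Object();
    private int activeUsers = 0;
    private boolean closing = false;
    private boolean disposed = false;

    /** Registers a user of the resource; rejected once shutdown has begun. */
    Lease acquire() {
        synchronized (lock) {
            if (closing) {
                throw new IllegalStateException("Resource is already closing");
            }
            activeUsers++;
            return new Lease();
        }
    }

    /** Blocks until all leases are released, then disposes exactly once. */
    @Override
    public void close() throws InterruptedException {
        synchronized (lock) {
            closing = true;
            while (activeUsers > 0) {
                lock.wait();
            }
            if (!disposed) {
                disposed = true;
                // Placeholder: the native handle would be freed here, after the last user.
            }
        }
    }

    boolean isDisposed() {
        synchronized (lock) {
            return disposed;
        }
    }

    /** Handed to each user; releasing it is idempotent. */
    final class Lease implements AutoCloseable {
        private boolean released = false;

        @Override
        public void close() {
            synchronized (lock) {
                if (released) {
                    return;
                }
                released = true;
                activeUsers--;
                lock.notifyAll();
            }
        }
    }
}
```

A background thread would wrap each access in 
{{try (GuardedNativeResource.Lease lease = resource.acquire()) { ... }}}, so 
that a concurrent {{close()}} waits for it instead of disposing the handle 
underneath it.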

