[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240153#comment-16240153 ]

Stefan Richter edited comment on FLINK-7880 at 11/6/17 11:20 AM:

Yes, but the test seems to expect that waiting for {{CancellationSuccess}} includes a successful cleanup, or it was simply not aware of how important proper cleanup is with native resources. In any case, I think this problem may originate from taking other IT cases as a blueprint, and I have seen different patterns for this "wait until the job is gone" problem in different tests. Many of them might be similar to this one, but they will often look correct or cause no trouble if no native libraries are involved (e.g. the test only uses the heap backend).

I would suggest that there should be one and only one simple, idiomatic way (maybe one helper class that does this) of waiting for a job to go away and release all its resources, used throughout all tests that actually want this behaviour. Otherwise, for example, extending an existing test to include a different backend can suddenly uncover the improper cleanup and make tests randomly fail with a JVM crash. Having a clear way to end IT cases could help to avoid chasing serious-looking, misleading test failures that seem to originate from the RocksDB backend code but are actually test problems caused by improper cleanup.

What do you think?

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.4.0
> Reporter: Till Rohrmann
> Assignee: Kostas Kloudas
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
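Stefan Richter's comment suggests a single helper class for waiting until a job has gone away and released its resources. The following is only a minimal sketch of what such a helper might look like, not an existing Flink utility: the names {{JobTeardown}} and {{awaitReleased}}, the polling approach, and the timeout parameters are all assumptions made for illustration. The caller has to supply the actual "job is gone" check, which the thread does not specify.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical test helper (not part of Flink): blocks until a caller-supplied
 * "job is gone and its resources are released" check passes, or fails the test.
 */
final class JobTeardown {

    private JobTeardown() {}

    /**
     * Polls {@code released} every {@code pollMillis} until it returns true.
     * Throws AssertionError if the check does not pass within {@code timeoutMillis}
     * or if the waiting thread is interrupted.
     */
    static void awaitReleased(BooleanSupplier released, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!released.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError(
                        "job resources were not released within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller and fail the wait.
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for job teardown", e);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real check, e.g. "no task of the job is running any more".
        final long start = System.nanoTime();
        BooleanSupplier released =
                () -> System.nanoTime() - start > TimeUnit.MILLISECONDS.toNanos(50);

        awaitReleased(released, 5_000, 10);
        System.out.println("job is gone, safe to shut down the cluster");
    }
}
```

A test would call {{awaitReleased}} after cancelling the job and before tearing down the cluster, so that native resources are not closed while tasks may still be using them. Failing with an error on timeout, rather than continuing silently, keeps an incomplete cleanup from showing up later as an unrelated JVM crash.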
[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237616#comment-16237616 ]

Gary Yao edited comment on FLINK-7880 at 11/3/17 1:55 PM:

I have further isolated the problem. It still appears even when running a single test method. I tried {{NonHAQueryableStateRocksDBBackendITCase#testValueState}} and even replaced the state query with a {{Thread.sleep(400)}}. My changes to the code are documented here: https://github.com/apache/flink/compare/master...GJL:FLINK-7880?expand=1

To run the tests, I use the command below. Multiple iterations may be required, hence the {{while}} loop, which repeats the build until a run fails.

{noformat}
while mvn -o clean verify -Dtest=NonHAQueryableStateRocksDBBackendITCase -DfailIfNoTests=false -Dcheckstyle.skip; do :; done
{noformat}

The failure I get is the same as the one mentioned by [~kkl0u]:

{noformat}
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
{noformat}
[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237584#comment-16237584 ]

Kostas Kloudas edited comment on FLINK-7880 at 11/3/17 1:38 PM:

I will keep on investigating, *BUT* I commented out the {{queryable state}} code from the tests and they still seem to fail, even locally, with:

{code}
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
/bin/sh: line 1: 10553 Abort trap: 6 /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseSerialGC -jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefirebooter6530581495710923655.jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire7293759706604721607tmp /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire_87379374358455308985tmp
{code}

You can find my code here: https://github.com/kl0u/flink/tree/more-than-qs
To reproduce, go to the {{flink-queryable-state}} dir and run {{mvn verify}} a couple of times. At least on my machine this reproduces the problem.

So the root problem may not be in Queryable State itself but in the sequence of steps taken when shutting down the RocksDB state backend.