[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909013#comment-16909013 ]

Till Rohrmann commented on FLINK-9900:

Great to hear [~SleePy]. I'll take a look at your PR.

> Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec <<< FAILURE! - in org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) Time elapsed: 120.036 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 12 milliseconds
>   at sun.misc.Unsafe.park(Native Method)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>   at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>   at org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error:
>   ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
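The frames in the quoted stack trace (CompletableFuture.get calling waitingGet) are consistent with the test thread blocking in an untimed get() until the JUnit timeout fired after roughly 120 seconds. As a minimal sketch of the alternative, and not the actual test code, the following shows a bounded wait that reports a stalled future directly. The class name BoundedWaitSketch and the helper awaitResult are hypothetical; only the java.util.concurrent calls are real API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not taken from ZooKeeperHighAvailabilityITCase.
class BoundedWaitSketch {

    /**
     * Waits for the future for at most timeoutMillis and describes the outcome,
     * instead of parking the calling thread indefinitely.
     */
    public static String awaitResult(CompletableFuture<String> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            // get(long, TimeUnit) throws TimeoutException if the future is still pending.
            return "completed: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timed out after " + timeoutMillis + " ms";
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that nobody completes, standing in for a job result that never arrives.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(awaitResult(neverCompleted, 100));

        // A future that is already done returns immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("FINISHED");
        System.out.println(awaitResult(done, 100));
    }
}
```

A bounded wait does not fix a race by itself; it only turns a generic TestTimedOutException into a failure at the exact call that stalled.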
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908843#comment-16908843 ]

Biao Liu commented on FLINK-9900:

Hi [~till.rohrmann], there is another race condition in this test case. The good news, however, is that it only concerns the test case; it is not a bug. I have opened a PR to fix it.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907873#comment-16907873 ]

Biao Liu commented on FLINK-9900:

Hi [~till.rohrmann], thanks for the feedback. It seems to be a different failure scenario for this test case. The stack trace is different from the prior one. I'll check it later. Let's keep the "reopened" status for now.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907864#comment-16907864 ]

Till Rohrmann commented on FLINK-9900:

You are right [~SleePy]. However, I found another instance from the latest master cron job where it failed: https://api.travis-ci.org/v3/job/571776566/log.txt

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907343#comment-16907343 ]

Biao Liu commented on FLINK-9900:

Hi [~till.rohrmann], I have just checked the failed build. This PR is a bit old and does not include the fix for this ticket. Could you rebase onto the master branch and try again?

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907287#comment-16907287 ]

Biao Liu commented on FLINK-9900:

[~till.rohrmann] I will check it later.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.10.0, 1.9.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897844#comment-16897844 ]

Biao Liu commented on FLINK-9900:

Another instance: https://travis-ci.com/flink-ci/flink/jobs/221451189

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896376#comment-16896376 ]

Till Rohrmann commented on FLINK-9900:

Another instance: https://api.travis-ci.com/v3/job/220853113/log.txt

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896182#comment-16896182 ]

Till Rohrmann commented on FLINK-9900:

I think this test instability actually points towards an inconsistency which we introduced with FLINK-12364. It seems that it is now possible for checkpoints to complete after the {{CheckpointFailureManager}} has decided that the job should fail. I think this is somewhat unexpected behavior.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
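The ordering problem described in this comment (a checkpoint completing after the failure decision has been made) can be illustrated with a small model. This is only a sketch under stated assumptions: the class CheckpointRaceSketch and its methods are hypothetical, they are not Flink's CheckpointCoordinator or CheckpointFailureManager API, and this is not claimed to be the fix adopted in the PR. It shows one generic way to make "decide to fail" and "complete checkpoint" mutually exclusive, so that no completion is accepted once the failure decision is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the ordering issue; names are illustrative, not Flink classes.
class CheckpointRaceSketch {
    private final Object lock = new Object();
    private boolean failureDecided = false;
    private final List<Long> completed = new ArrayList<>();

    /** Stand-in for the failure manager deciding that the job should fail. */
    public void decideToFailJob() {
        synchronized (lock) {
            failureDecided = true;
        }
    }

    /**
     * Records a completed checkpoint unless the failure decision was already made.
     * Checking the flag and recording the checkpoint happen under the same lock,
     * so a completion cannot slip in after the decision.
     */
    public boolean tryCompleteCheckpoint(long checkpointId) {
        synchronized (lock) {
            if (failureDecided) {
                return false;
            }
            completed.add(checkpointId);
            return true;
        }
    }

    /** Returns a snapshot of the checkpoints that were accepted. */
    public List<Long> completedCheckpoints() {
        synchronized (lock) {
            return new ArrayList<>(completed);
        }
    }

    public static void main(String[] args) {
        CheckpointRaceSketch coordinator = new CheckpointRaceSketch();
        System.out.println(coordinator.tryCompleteCheckpoint(1L)); // accepted before the decision
        coordinator.decideToFailJob();
        System.out.println(coordinator.tryCompleteCheckpoint(2L)); // rejected after the decision
        System.out.println(coordinator.completedCheckpoints());
    }
}
```

Without the shared lock, the flag check and the completion could interleave with the failure decision, which is the kind of unexpected ordering the comment points at.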
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895873#comment-16895873 ]

Biao Liu commented on FLINK-9900:

Hi Gordon, I think it's already "In Progress". I am not in the office currently; I updated the state from my phone. I'll double-check when I'm back in the office.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895827#comment-16895827 ]

Tzu-Li (Gordon) Tai commented on FLINK-9900:

[~azagrebin] could you move this ticket to "In Progress"? That way we will be able to see that it is actively worked on in the 1.9 burndown chart.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Biao Liu
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895793#comment-16895793 ]

Biao Liu commented on FLINK-9900:

Hi [~till.rohrmann], I have opened a PR to fix this. The unstable scenario is a bit complicated. I described the details in the PR: https://github.com/apache/flink/pull/9269

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895155#comment-16895155 ]

Till Rohrmann commented on FLINK-9900:

Made this a blocker issue in order to understand what the cause is.

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895150#comment-16895150 ]

Stephan Ewen commented on FLINK-9900:

Another instance: https://api.travis-ci.org/v3/job/564847877/log.txt

> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.5.1, 1.6.0, 1.9.0
> Reporter: zhangminglei
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.9.0
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894711#comment-16894711 ] Congxian Qiu(klion26) commented on FLINK-9900: -- Another instance https://travis-ci.com/flink-ci/flink/jobs/220237256 > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893912#comment-16893912 ] Till Rohrmann commented on FLINK-9900: -- Another instance: https://api.travis-ci.org/v3/job/563889124/log.txt > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889976#comment-16889976 ] Andrey Zagrebin commented on FLINK-9900: Looks like one more: [https://api.travis-ci.com/v3/job/217584395/log.txt] > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.9.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880415#comment-16880415 ] Till Rohrmann commented on FLINK-9900: -- Another instance: https://api.travis-ci.org/v3/job/554991848/log.txt > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.7.3 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880107#comment-16880107 ] Till Rohrmann commented on FLINK-9900: -- Another instance: https://api.travis-ci.org/v3/job/555252049/log.txt > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.7.3 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869224#comment-16869224 ] Yun Tang commented on FLINK-9900: - Another instance, which exceeded 300 seconds: [https://api.travis-ci.org/v3/job/548326670/log.txt] > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0, 1.9.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.7.3 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865394#comment-16865394 ] Dawid Wysakowicz commented on FLINK-9900: - I saw this test failing also with 1.9 master: https://api.travis-ci.org/v3/job/546443720/log.txt The cause might be different though, as I got a different error. > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.7.3 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774231#comment-16774231 ] Stefan Richter commented on FLINK-9900: --- Saw another instance here: https://api.travis-ci.org/v3/job/495933065/log.txt > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.7.3, 1.8.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766966#comment-16766966 ] Chesnay Schepler commented on FLINK-9900: - [~sunjincheng121] No, because these only added more debugging details and did not fix the underlying issue. > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.4, 1.7.3, 1.8.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766913#comment-16766913 ] sunjincheng commented on FLINK-9900: Can we close the JIRA, since the PRs have been merged? [~Zentol] [~mingleizhang] > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.4, 1.7.3, 1.8.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586104#comment-16586104 ] ASF GitHub Bot commented on FLINK-9900: --- zentol closed pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase URL: https://github.com/apache/flink/pull/6468 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java b/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java index b83f89e4ca5..642af40aaf2 100644 --- a/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java +++ b/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java @@ -50,6 +50,7 @@ import org.apache.flink.streaming.api.functions.source.SourceFunction; import org.apache.flink.test.util.MiniClusterResource; import org.apache.flink.test.util.MiniClusterResourceConfiguration; +import org.apache.flink.util.ExceptionUtils; import org.apache.flink.util.Preconditions; import org.apache.flink.util.TestLogger; @@ -62,6 +63,11 @@ import java.io.File; import java.io.IOException; +import java.io.PrintWriter; +import java.io.StringWriter; +import java.lang.management.ManagementFactory; +import java.lang.management.ThreadInfo; +import java.lang.management.ThreadMXBean; import java.nio.file.FileVisitResult; import java.nio.file.Files; import java.nio.file.Path; @@ -163,7 +169,7 @@ public static void tearDown() throws Exception { * restored successfully * */ - @Test(timeout = 120_000L) + @Test public void testRestoreBehaviourWithFaultyStateHandles() throws Exception { 
CheckpointBlockingFunction.allowedInitializeCallsWithoutRestore.set(1); CheckpointBlockingFunction.successfulRestores.set(0); @@ -256,13 +262,54 @@ public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IO () -> clusterClient.getJobStatus(jobID), Time.milliseconds(50), deadline, - (jobStatus) -> jobStatus == JobStatus.FINISHED, + JobStatus::isGloballyTerminalState, TestingUtils.defaultScheduledExecutor()); - assertEquals(JobStatus.FINISHED, jobStatusFuture.get()); + try { + assertEquals(JobStatus.FINISHED, jobStatusFuture.get()); + } catch (Throwable e) { + // include additional debugging information + StringWriter error = new StringWriter(); + try (PrintWriter out = new PrintWriter(error)) { + out.println("The job did not finish in time."); + out.println("allowedInitializeCallsWithoutRestore= " + CheckpointBlockingFunction.allowedInitializeCallsWithoutRestore.get()); + out.println("illegalRestores= " + CheckpointBlockingFunction.illegalRestores.get()); + out.println("successfulRestores= " + CheckpointBlockingFunction.successfulRestores.get()); + out.println("afterMessWithZooKeeper= " + CheckpointBlockingFunction.afterMessWithZooKeeper.get()); + out.println("failedAlready= " + CheckpointBlockingFunction.failedAlready.get()); + out.println("currentJobStatus= " + clusterClient.getJobStatus(jobID).get()); + out.println("numRestarts= " + RestartReporter.numRestarts.getValue()); + out.println("threadDump= " + generateThreadDump()); + } + throw new AssertionError(error.toString(), ExceptionUtils.stripCompletionException(e)); + } assertThat("We saw illegal restores.", CheckpointBlockingFunction.illegalRestores.get(), is(0)); } + private static String generateThreadDump() { + final StringBuilder dump = new StringBuilder(); + final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean(); + final ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds(), 100); + for (ThreadInfo threadInfo : threadInfos) { + 
dump.append('"'); + dump.append(threadInfo.getThreadName()); + dump.append('"'); +
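Besides the extra debug output, the diff above swaps the polling predicate from `jobStatus == JobStatus.FINISHED` to `JobStatus::isGloballyTerminalState`. A plausible reading is that the status future now completes as soon as the job reaches any terminal state, so a failed job trips the assertion right away instead of being polled until the deadline. The sketch below illustrates that difference using only JDK classes; `PollUntilSketch`, `Status` and `pollUntil` are hypothetical stand-ins and not Flink APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class PollUntilSketch {

    /** Hypothetical stand-in for a job status; not Flink's JobStatus. */
    enum Status {
        RUNNING, FINISHED, FAILED;

        boolean isTerminal() {
            return this != RUNNING;
        }
    }

    /**
     * Polls the supplier every intervalMillis until the predicate accepts a value
     * or timeoutMillis has elapsed (assumed semantics, for illustration only).
     */
    static <T> CompletableFuture<T> pollUntil(
            Supplier<T> supplier,
            Predicate<T> accept,
            long intervalMillis,
            long timeoutMillis,
            ScheduledExecutorService executor) {
        final CompletableFuture<T> result = new CompletableFuture<>();
        final long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        final Runnable attempt = new Runnable() {
            @Override
            public void run() {
                try {
                    final T value = supplier.get();
                    if (accept.test(value)) {
                        // Accepted: hand the value to the caller.
                        result.complete(value);
                    } else if (System.nanoTime() - deadlineNanos >= 0) {
                        // Deadline passed without an accepted value.
                        result.completeExceptionally(
                                new TimeoutException("value not accepted before the deadline"));
                    } else {
                        // Not accepted yet: try again after the polling interval.
                        executor.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
                    }
                } catch (Throwable t) {
                    result.completeExceptionally(t);
                }
            }
        };
        executor.execute(attempt);
        return result;
    }
}
```

With a predicate such as `Status::isTerminal`, a status of `FAILED` completes the future on the next poll, and a following equality assertion reports the unexpected state. With a predicate that only accepts `FINISHED`, the same failed job is polled until the deadline and surfaces as a timeout.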
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579495#comment-16579495 ] ASF GitHub Bot commented on FLINK-9900: --- StefanRRichter commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase URL: https://github.com/apache/flink/pull/6468#discussion_r209881971 ## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ## @@ -256,13 +262,54 @@ public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IO () -> clusterClient.getJobStatus(jobID), Time.milliseconds(50), deadline, - (jobStatus) -> jobStatus == JobStatus.FINISHED, + JobStatus::isGloballyTerminalState, TestingUtils.defaultScheduledExecutor()); - assertEquals(JobStatus.FINISHED, jobStatusFuture.get()); + try { + assertEquals(JobStatus.FINISHED, jobStatusFuture.get()); + } catch (Throwable e) { + // include additional debugging information + StringWriter error = new StringWriter(); + try (PrintWriter out = new PrintWriter(error)) { + out.println("The job did not finish in time."); + out.println("allowedInitializeCallsWithoutRestore= " + CheckpointBlockingFunction.allowedInitializeCallsWithoutRestore.get()); + out.println("illegalRestores= " + CheckpointBlockingFunction.illegalRestores.get()); + out.println("successfulRestores= " + CheckpointBlockingFunction.successfulRestores.get()); + out.println("afterMessWithZooKeeper= " + CheckpointBlockingFunction.afterMessWithZooKeeper.get()); + out.println("failedAlready= " + CheckpointBlockingFunction.failedAlready.get()); + out.println("currentJobStatus= " + clusterClient.getJobStatus(jobID).get()); + out.println("numRestarts= " + RestartReporter.numRestarts.getValue()); + out.println("threadDump= " + generateThreadDump()); + } + throw new AssertionError(error.toString(), ExceptionUtils.stripCompletionException(e)); + } assertThat("We saw illegal 
restores.", CheckpointBlockingFunction.illegalRestores.get(), is(0)); } + private static String generateThreadDump() { + final StringBuilder dump = new StringBuilder(); + final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean(); + final ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds(), 100); + for (ThreadInfo threadInfo : threadInfos) { + dump.append('"'); + dump.append(threadInfo.getThreadName()); + dump.append('"'); + final Thread.State state = threadInfo.getThreadState(); + dump.append(System.lineSeparator()); + dump.append(" java.lang.Thread.State: "); + dump.append(state); + final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace(); + for (final StackTraceElement stackTraceElement : stackTraceElements) { Review comment: Fine with me, especially when we consider that the code is probably not permanently there. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.5.3, 1.6.1, 1.7.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598
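The helper quoted in this review is cut off at its inner loop. For readers who want the whole shape, here is a minimal, self-contained sketch of such a thread dump helper built only on `java.lang.management`. The loop body, the indentation of the output, and the `ThreadDumpSketch` class name are assumptions for illustration, not the code merged in the PR.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumpSketch {

    /** Builds a jstack-like text dump of all live threads. */
    static String generateThreadDump() {
        final StringBuilder dump = new StringBuilder();
        final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        // Cap every stack at 100 frames, matching the call in the quoted diff.
        final ThreadInfo[] threadInfos =
                threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds(), 100);
        for (ThreadInfo threadInfo : threadInfos) {
            if (threadInfo == null) {
                // The thread terminated between getAllThreadIds() and getThreadInfo().
                continue;
            }
            dump.append('"').append(threadInfo.getThreadName()).append('"');
            dump.append(System.lineSeparator());
            dump.append("   java.lang.Thread.State: ").append(threadInfo.getThreadState());
            // Append each frame directly instead of wrapping the frames in a Throwable.
            for (StackTraceElement element : threadInfo.getStackTrace()) {
                dump.append(System.lineSeparator());
                dump.append("        at ").append(element);
            }
            dump.append(System.lineSeparator()).append(System.lineSeparator());
        }
        return dump.toString();
    }

    public static void main(String[] args) {
        System.out.print(generateThreadDump());
    }
}
```

Appending every `StackTraceElement` by hand keeps the output free of the extra `java.lang.Throwable` line that printing the frames through a `Throwable` would add, which is the trade-off discussed in this review thread.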
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579487#comment-16579487 ]

ASF GitHub Bot commented on FLINK-9900:
---

zentol commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468#discussion_r209879523

## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ##
@@ -256,13 +262,54 @@ public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IO
             () -> clusterClient.getJobStatus(jobID),
             Time.milliseconds(50),
             deadline,
-            (jobStatus) -> jobStatus == JobStatus.FINISHED,
+            JobStatus::isGloballyTerminalState,
             TestingUtils.defaultScheduledExecutor());
-        assertEquals(JobStatus.FINISHED, jobStatusFuture.get());
+        try {
+            assertEquals(JobStatus.FINISHED, jobStatusFuture.get());
+        } catch (Throwable e) {
+            // include additional debugging information
+            StringWriter error = new StringWriter();
+            try (PrintWriter out = new PrintWriter(error)) {
+                out.println("The job did not finish in time.");
+                out.println("allowedInitializeCallsWithoutRestore= " + CheckpointBlockingFunction.allowedInitializeCallsWithoutRestore.get());
+                out.println("illegalRestores= " + CheckpointBlockingFunction.illegalRestores.get());
+                out.println("successfulRestores= " + CheckpointBlockingFunction.successfulRestores.get());
+                out.println("afterMessWithZooKeeper= " + CheckpointBlockingFunction.afterMessWithZooKeeper.get());
+                out.println("failedAlready= " + CheckpointBlockingFunction.failedAlready.get());
+                out.println("currentJobStatus= " + clusterClient.getJobStatus(jobID).get());
+                out.println("numRestarts= " + RestartReporter.numRestarts.getValue());
+                out.println("threadDump= " + generateThreadDump());
+            }
+            throw new AssertionError(error.toString(), ExceptionUtils.stripCompletionException(e));
+        }
         assertThat("We saw illegal restores.", CheckpointBlockingFunction.illegalRestores.get(), is(0));
     }
+    private static String generateThreadDump() {
+        final StringBuilder dump = new StringBuilder();
+        final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
+        final ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds(), 100);
+        for (ThreadInfo threadInfo : threadInfos) {
+            dump.append('"');
+            dump.append(threadInfo.getThreadName());
+            dump.append('"');
+            final Thread.State state = threadInfo.getThreadState();
+            dump.append(System.lineSeparator());
+            dump.append(" java.lang.Thread.State: ");
+            dump.append(state);
+            final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
+            for (final StackTraceElement stackTraceElement : stackTraceElements) {

Review comment:
The output is nicer if we go with the original option:

without Throwable (i.e. current PR revision):
```
"Monitor Ctrl-Break"
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.intellij.rt.execution.application.AppMainV2$1.run(AppMainV2.java:64)
```
with throwable:
```
"Monitor Ctrl-Break"
   java.lang.Thread.State: RUNNABLE
java.lang.Throwable
        at java.net.SocketInputStream.socketRead0(Native
```
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579434#comment-16579434 ]

ASF GitHub Bot commented on FLINK-9900:
---

zentol commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468#discussion_r209867134

## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ##
+            final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
+            for (final StackTraceElement stackTraceElement : stackTraceElements) {

Review comment:
ah ok, that makes sense.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.5.1, 1.6.0
> Reporter: zhangminglei
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec <<< FAILURE! - in
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579432#comment-16579432 ]

ASF GitHub Bot commented on FLINK-9900:
---

StefanRRichter commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468#discussion_r209866827

## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ##
+            final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
+            for (final StackTraceElement stackTraceElement : stackTraceElements) {

Review comment:
Sure, but you could use it in the loop that iterates all thread info objects to create the trace string per thread, which you can set via the `Throwable.setStackTrace` before printing?
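The suggestion in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the class name `ThreadDumpSketch` is hypothetical, and the null check is an addition because `ThreadMXBean.getThreadInfo` may return `null` entries for threads that have already terminated. Each thread's stack is copied into a carrier `Throwable` via `setStackTrace` and rendered with `printStackTrace(PrintWriter)`.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical class name; a sketch of the reviewer's idea, not the PR's implementation.
final class ThreadDumpSketch {

    /** Renders every live thread, printing each stack through a carrier Throwable. */
    static String generateThreadDump() {
        StringWriter buffer = new StringWriter();
        try (PrintWriter out = new PrintWriter(buffer)) {
            ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
            // 100 is the maximum stack depth, the same value the PR passes
            ThreadInfo[] threadInfos =
                threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds(), 100);
            for (ThreadInfo threadInfo : threadInfos) {
                if (threadInfo == null) {
                    continue; // thread ended between getAllThreadIds() and getThreadInfo()
                }
                out.println("\"" + threadInfo.getThreadName() + "\"");
                out.println("   java.lang.Thread.State: " + threadInfo.getThreadState());
                // The carrier is never thrown; it only borrows the JDK's trace formatting.
                Throwable carrier = new Throwable();
                carrier.setStackTrace(threadInfo.getStackTrace());
                carrier.printStackTrace(out);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(generateThreadDump());
    }
}
```

One consequence of this design is that `printStackTrace` first prints the carrier's own `toString()`, so every thread entry gains an extra `java.lang.Throwable` line ahead of its frames.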
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579425#comment-16579425 ]

ASF GitHub Bot commented on FLINK-9900:
---

zentol commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468#discussion_r209865380

## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ##
+            final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
+            for (final StackTraceElement stackTraceElement : stackTraceElements) {

Review comment:
but doesn't that just print the exception stack trace, whereas this method prints a full thread dump?
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576188#comment-16576188 ]

ASF GitHub Bot commented on FLINK-9900:
---

StefanRRichter commented on a change in pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468#discussion_r209236626

## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ##
+            final StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
+            for (final StackTraceElement stackTraceElement : stackTraceElements) {

Review comment:
Could probably use `Throwable.printStackTrace(PrintWriter)` instead.
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575789#comment-16575789 ] ASF GitHub Bot commented on FLINK-9900: --- yanghua commented on issue #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase URL: https://github.com/apache/flink/pull/6468#issuecomment-411985618 hi @zentol , checkstyle error : ``` src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java:[312] (regexp) RegexpSingleline: Trailing whitespace ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.5.3, 1.6.1, 1.7.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565539#comment-16565539 ] ASF GitHub Bot commented on FLINK-9900: --- yanghua commented on issue #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase URL: https://github.com/apache/flink/pull/6468#issuecomment-409626058 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available > Fix For: 1.5.3, 1.6.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565083#comment-16565083 ]

ASF GitHub Bot commented on FLINK-9900:
---

zentol opened a new pull request #6468: [FLINK-9900][tests] Include more information on timeout in Zookeeper HA ITCase
URL: https://github.com/apache/flink/pull/6468

## What is the purpose of the change

With this PR we include more debugging information in the error message if the `ZooKeeperHighAvailabilityITCase` times out.

## Brief change log

* remove `Timeout` annotation to a) deduplicate timeout logic and b) allow addition of new logic
* add utility method creating thread dump
* on failure to retrieve job information or if the job isn't in `FINISHED` state, print more debug info

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.1, 1.6.0
> Reporter: zhangminglei
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.5.3, 1.6.0
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec <<< FAILURE! - in org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) Time elapsed: 120.036 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error:
> ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
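The idea in the change log of this PR, replacing a bare timeout with a failure that carries diagnostics, can be illustrated with a small stand-alone sketch. Everything named here is hypothetical (`TimeoutDiagnosticsSketch`, `getOrFailWithDiagnostics`, and the single `successfulRestores` counter standing in for the test's bookkeeping fields). It also assumes a plain `CompletableFuture.get` with a timeout, whereas the PR itself passes a deadline to Flink's retry helper.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; a sketch of the pattern, not the PR's code.
final class TimeoutDiagnosticsSketch {

    // Stand-in for the static counters the real test keeps.
    static final AtomicInteger successfulRestores = new AtomicInteger();

    /** Waits for the future; on any failure, rethrows with debugging state attached. */
    static <T> T getOrFailWithDiagnostics(CompletableFuture<T> future, long timeoutMillis) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Throwable e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
            StringWriter error = new StringWriter();
            try (PrintWriter out = new PrintWriter(error)) {
                out.println("The job did not finish in time.");
                out.println("successfulRestores= " + successfulRestores.get());
            }
            // Keep the original failure as the cause so no information is lost.
            throw new AssertionError(error.toString(), e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        try {
            getOrFailWithDiagnostics(neverCompleted, 100);
        } catch (AssertionError expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```

Handling the timeout inside the test body, rather than through a test-runner timeout, is what makes it possible to collect this state at the moment of failure.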
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561643#comment-16561643 ]

ASF GitHub Bot commented on FLINK-9900:
---

zentol closed pull request #6395: [FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase
URL: https://github.com/apache/flink/pull/6395

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java b/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java
index e02ed010242..b83f89e4ca5 100644
--- a/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java
+++ b/flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java
@@ -31,7 +31,12 @@
 import org.apache.flink.configuration.HighAvailabilityOptions;
 import org.apache.flink.configuration.TaskManagerOptions;
 import org.apache.flink.core.testutils.OneShotLatch;
+import org.apache.flink.metrics.Metric;
+import org.apache.flink.metrics.MetricConfig;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.reporter.MetricReporter;
 import org.apache.flink.runtime.concurrent.FutureUtils;
+import org.apache.flink.runtime.executiongraph.metrics.NumberOfFullRestartsGauge;
 import org.apache.flink.runtime.jobgraph.JobGraph;
 import org.apache.flink.runtime.jobgraph.JobStatus;
 import org.apache.flink.runtime.state.FunctionInitializationContext;
@@ -56,6 +61,12 @@
 import org.junit.rules.TemporaryFolder;
 import java.io.File;
+import java.io.IOException;
+import java.nio.file.FileVisitResult;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.SimpleFileVisitor;
+import java.nio.file.attribute.BasicFileAttributes;
 import java.time.Duration;
 import java.util.UUID;
 import java.util.concurrent.CompletableFuture;
@@ -63,6 +74,7 @@
 import java.util.concurrent.atomic.AtomicInteger;
 import static org.hamcrest.core.Is.is;
+import static org.hamcrest.number.OrderingComparison.greaterThan;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertThat;
@@ -107,6 +119,10 @@ public static void setup() throws Exception {
         config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, zkServer.getConnectString());
         config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
+        config.setString(
+            ConfigConstants.METRICS_REPORTER_PREFIX + "restarts." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX,
+            RestartReporter.class.getName());
+
         // we have to manage this manually because we have to create the ZooKeeper server
         // ahead of this
         miniClusterResource = new MiniClusterResource(
@@ -184,58 +200,59 @@ public void testRestoreBehaviourWithFaultyStateHandles() throws Exception {
         // wait until we did some checkpoints
         waitForCheckpointLatch.await();
+        log.debug("Messing with HA directory");
         // mess with the HA directory so that the job cannot restore
         File movedCheckpointLocation = TEMPORARY_FOLDER.newFolder();
-        int numCheckpoints = 0;
-        File[] files = haStorageDir.listFiles();
-        assertNotNull(files);
-        for (File file : files) {
-            if (file.getName().startsWith("completedCheckpoint")) {
-                assertTrue(file.renameTo(new File(movedCheckpointLocation, file.getName())));
-                numCheckpoints++;
+        AtomicInteger numCheckpoints = new AtomicInteger();
+        Files.walkFileTree(haStorageDir.toPath(), new SimpleFileVisitor<Path>() {
+            @Override
+            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
+                if (file.getFileName().toString().startsWith("completedCheckpoint")) {
+                    log.debug("Moving original checkpoint file {}.", file);
+                    try {
+                        Files.move(file, movedCheckpointLocation.toPath().resolve(file.getFileName()));
+                        numCheckpoints.incrementAndGet();
+                    } catch (IOException ioe) {
+                        // previous checkpoint files may be deleted asynchronously
+
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559495#comment-16559495 ] ASF GitHub Bot commented on FLINK-9900: --- StefanRRichter commented on a change in pull request #6395: [FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase URL: https://github.com/apache/flink/pull/6395#discussion_r205721646 ## File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ZooKeeperHighAvailabilityITCase.java ## @@ -107,6 +119,8 @@ public static void setup() throws Exception { config.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, zkServer.getConnectString()); config.setString(HighAvailabilityOptions.HA_MODE, "zookeeper"); + config.setString(ConfigConstants.METRICS_REPORTER_PREFIX + "restarts." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, RestartReporter.class.getName()); Review comment: Break down line in two. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Failed to testRestoreBehaviourWithFaultyStateHandles > (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > --- > > Key: FLINK-9900 > URL: https://issues.apache.org/jira/browse/FLINK-9900 > Project: Flink > Issue Type: Bug >Affects Versions: 1.5.1, 1.6.0 >Reporter: zhangminglei >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.0 > > > https://api.travis-ci.org/v3/job/405843617/log.txt > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec > <<< FAILURE! - in > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase > > testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) > Time elapsed: 120.036 sec <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 > milliseconds > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > at > org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244) > Results : > Tests in error: > > ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 > » TestTimedOut > Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552939#comment-16552939 ]

ASF GitHub Bot commented on FLINK-9900:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/6395

[FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase

## What is the purpose of the change

This PR makes a few modifications to the `ZooKeeperHighAvailabilityITCase` to reduce the chance of intermittent test failures and timeouts.

Changes:

## 1)

The test moved files out of the HA storage directory with a simple loop using `File#renameTo`. The test enforced that every move succeeded; however, since old checkpoints may be deleted asynchronously, this may not always be the case. We now use a `FileVisitor` and ignore `IOExceptions` that occur while moving. If no checkpoint file could be moved, the test will still fail.

## 2)

After the checkpoint files are moved out of the HA storage directory, the job is thrown into a restart loop. To verify the restart behavior, the test polled the job state and checked for the `RESTARTING` and `FAILING` states. Because the job is small, it is in these states only for a short time, which effectively adds a race condition. Thus this loop may run for longer than anticipated; the largest outlier I got locally was 50 seconds, which isn't _that_ far off from the 2 minute timeout. I suspect this to be the failure cause raised in the JIRA, but I can't guarantee it.

Instead we now access the `fullRestarts` metric using a custom reporter to check how many restarts have occurred. The actual _state transitions_ should be irrelevant to the test.
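Change 1) above can be illustrated with a minimal sketch. This is not the code from the pull request: the class `CheckpointFileMover` and the method `moveAllFiles` are hypothetical names, and the sketch only assumes the behavior described above (walk the directory with a `FileVisitor`, ignore `IOException`s from individual moves, and let the caller fail if nothing was moved).

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration; not taken from the Flink test. */
final class CheckpointFileMover {

    /**
     * Moves every regular file found under sourceDir into targetDir (flattened by
     * file name) and returns how many files were actually moved. Files that vanish
     * while the walk is in progress are skipped instead of failing the whole walk.
     */
    static int moveAllFiles(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        AtomicInteger moved = new AtomicInteger();

        Files.walkFileTree(sourceDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                try {
                    Files.move(file, targetDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    moved.incrementAndGet();
                } catch (IOException ignored) {
                    // The file may have been deleted asynchronously; skip it.
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // A file that disappeared between listing and visiting is not an error here.
                return FileVisitResult.CONTINUE;
            }
        });
        return moved.get();
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempDirectory("ha-storage");
        Path target = Files.createTempDirectory("moved-checkpoints");
        Files.createFile(source.resolve("completedCheckpoint-1"));

        int moved = moveAllFiles(source, target);
        if (moved == 0) {
            // Mirrors the rule above: the test still fails if nothing could be moved.
            throw new IllegalStateException("no checkpoint file could be moved");
        }
        System.out.println("moved " + moved + " file(s)");
    }
}
```

Returning the count, rather than throwing on the first failed move, keeps the tolerant part (per-file errors) separate from the strict part (at least one file must have moved).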
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 9900

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6395

commit b8827dc3723558c52ad567bf88f24ae34129ea08
Author: zentol
Date: 2018-07-23T14:21:32Z

    [FLINK-9900][tests] Harden ZooKeeperHighAvailabilityITCase

> Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.1, 1.6.0
> Reporter: zhangminglei
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec <<< FAILURE! - in org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) Time elapsed: 120.036 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 12 milliseconds
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>     at org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error:
> ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
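Change 2) in the pull request description above replaces polling for short-lived job states with a check on a restart counter. The sketch below shows why that removes the race: a counter only grows, so a poll cannot miss it the way it can miss a brief `RESTARTING` state. It does not use Flink's metric or reporter API; `RestartCountWaiter`, `awaitAtLeast`, and the `LongSupplier` standing in for the `fullRestarts` value are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration; not taken from the Flink test. */
final class RestartCountWaiter {

    /**
     * Polls a monotonically increasing counter until it reaches the expected value
     * or the timeout elapses. Returns the value that satisfied the condition.
     */
    static long awaitAtLeast(LongSupplier restartCount, long expected,
                             Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            long current = restartCount.getAsLong();
            if (current >= expected) {
                return current;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                        "saw " + current + " restarts, expected at least " + expected);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the value a custom reporter would expose.
        AtomicLong fullRestarts = new AtomicLong();

        Thread job = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                fullRestarts.incrementAndGet();
            }
        });
        job.setDaemon(true);
        job.start();

        long seen = awaitAtLeast(fullRestarts::get, 3,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("observed " + seen + " restarts");
    }
}
```

The explicit deadline also turns a hang into a descriptive `TimeoutException` instead of relying on the test framework's global timeout.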
[jira] [Commented] (FLINK-9900) Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
[ https://issues.apache.org/jira/browse/FLINK-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552137#comment-16552137 ]

Till Rohrmann commented on FLINK-9900:
--

Another instance: https://api.travis-ci.org/v3/job/406882682/log.txt

> Failed to testRestoreBehaviourWithFaultyStateHandles (org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase)
> ---
>
> Key: FLINK-9900
> URL: https://issues.apache.org/jira/browse/FLINK-9900
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.1, 1.6.0
> Reporter: zhangminglei
> Priority: Critical
> Fix For: 1.6.0
>
>
> https://api.travis-ci.org/v3/job/405843617/log.txt
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 124.598 sec <<< FAILURE! - in org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase
> testRestoreBehaviourWithFaultyStateHandles(org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase) Time elapsed: 120.036 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 12 milliseconds
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>     at org.apache.flink.test.checkpointing.ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles(ZooKeeperHighAvailabilityITCase.java:244)
> Results :
> Tests in error:
> ZooKeeperHighAvailabilityITCase.testRestoreBehaviourWithFaultyStateHandles:244 » TestTimedOut
> Tests run: 1453, Failures: 0, Errors: 1, Skipped: 29

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)