[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344384#comment-16344384 ]

Appy commented on HBASE-19803:

Yeah, it's an infra issue. I'm not even able to access the site https://builds.apache.org/

> False positive for the HBASE-Find-Flaky-Tests job
>
>                 Key: HBASE-19803
>                 URL: https://issues.apache.org/jira/browse/HBASE-19803
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, HBASE-19803.master.001.patch
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the surefire output.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely killed in the middle of the run, within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed, within about 1 minute.
> The test is declared as LargeTests, so the time limit should be 10 minutes. It
> seems that the JVM may crash during the mvn test run, and then we kill
> all the running tests and may mark some of them as hung, which leads
> to the false positive.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344372#comment-16344372 ]

Duo Zhang commented on HBASE-19803:

Seems the infra sucks... I added the label Hadoop, but the two new builds still fail due to a disconnect from the build machine...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344357#comment-16344357 ]

stack commented on HBASE-19803:

{quote}[~appy] The flakey test finder job is hang? {quote}
I was wondering why it wasn't moving today...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344311#comment-16344311 ]

Duo Zhang commented on HBASE-19803:

I've changed the label from 'ubuntu' to 'ubuntu||Hadoop' so it can start a new build...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344305#comment-16344305 ]

Duo Zhang commented on HBASE-19803:

[~appy] Is the flaky test finder job hung?
https://builds.apache.org/job/HBASE-Find-Flaky-Tests/
The last build cannot start...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342309#comment-16342309 ]

stack commented on HBASE-19803:

{quote}Maybe a possible way is to do it at once first, and then add a UT to make sure that we always have a CategoryBasedTimeout ClassRule for every UTs. {quote}
I endorse this approach. Our test suite becomes more and more unruly as time goes by. Weeding and fixup consume weeks of developer time on an ongoing basis (witness the last few weeks of test fixup/flakie suppression). The only counter we have to the ever-increasing disorder is enforcing more order/constraints.

Let me comment on the HBASE-19873 approach over there. Great work lads.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342307#comment-16342307 ]

stack commented on HBASE-19803:

The issue by [~Apache9] that explores adding CategoryBasedTimeout as a ClassRule
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342071#comment-16342071 ]

Duo Zhang commented on HBASE-19803:

{quote} The easiest way to do might be using junit's RunListener#testStarted and using reflection on the test class (Description#getTestClass). If it doesn't have Timeout rule, throwing exception might fail the test. {quote}
Sounds good. Let me open an issue for it.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342022#comment-16342022 ]

Appy commented on HBASE-19803:

Sounds good. The easiest way to do it might be to use JUnit's RunListener#testStarted with reflection on the test class (Description#getTestClass). If the class doesn't have a Timeout rule, throwing an exception might fail the test.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341988#comment-16341988 ]

Duo Zhang commented on HBASE-19803:

Maybe a possible way is to do it all at once first, and then add a UT to make sure that we always have a CategoryBasedTimeout ClassRule for every UT. What do you think? [~stack] [~appy]. Thanks.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341985#comment-16341985 ]

Duo Zhang commented on HBASE-19803:

Using a ClassRule instead of a Rule would limit the total running time of a test class. But it seems we need to add a line to every test, which is really a pain...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341228#comment-16341228 ]

stack commented on HBASE-19803:

Great stuff lads.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340761#comment-16340761 ]

Duo Zhang commented on HBASE-19803:

{quote} For eg, a medium test class with 6 methods where each takes 2.5 minute each (within Timeout rule), will take total time 15min. {quote}
Bad news...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340757#comment-16340757 ]

Appy commented on HBASE-19803:

Btw, this can happen even when the category-based timeout is being correctly enforced, because the Timeout rule puts the timeout on each individual test method and not on the whole test class. For example, a medium test class with 6 methods that take 2.5 minutes each (within the Timeout rule) will take 15 minutes in total.
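The arithmetic in this comment can be sketched in a few lines of Python. This is only an illustration, not code from HBase: the function name `check_class` is made up, the per-method limit of 180 seconds is an assumed value (the comment only says 2.5 minutes is within the Timeout rule), the 900 second default is the forkedProcessTimeoutInSeconds value mentioned elsewhere in this thread, and the sketch assumes each test class runs alone in its forked JVM with methods executed one after another.

```python
def check_class(method_seconds, per_method_limit, fork_timeout=900):
    """Compare per-method timeouts with the per-fork timeout.

    method_seconds: runtime of each test method, in seconds.
    per_method_limit: assumed per-method Timeout rule limit, in seconds.
    fork_timeout: surefire fork timeout in seconds (900 per this thread).
    Returns (every method within its limit, total runtime, fork timeout reached).
    """
    within_rule = all(s <= per_method_limit for s in method_seconds)
    total = sum(method_seconds)
    return within_rule, total, total >= fork_timeout


# Six methods of 2.5 minutes (150 s) each, as in the comment above.
# The 180 s per-method limit is an assumption for illustration only.
print(check_class([150] * 6, per_method_limit=180))
```

Every method passes its own limit, yet the class as a whole reaches the fork timeout, which is why a class-level limit was discussed.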
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340652#comment-16340652 ]

Appy commented on HBASE-19803:

Fix for TestTokenAuthentication - HBASE-19862
Looking at TestRegionServerReportForDuty logs, the test was stuck and running until surefire killed the JVM.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340502#comment-16340502 ]

Appy commented on HBASE-19803:

Chatted with him, Duo was talking about CategoryBasedTimeout. Updated explanation above to cover that case.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340490#comment-16340490 ]

Appy commented on HBASE-19803:

Yes, but we need that timeout, no? You have a way around?
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340489#comment-16340489 ]

Duo Zhang commented on HBASE-19803:

The timeout limit from junit does not work? Damn...
Thanks [~appy]. Let's fix these three UTs first.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340485#comment-16340485 ]

Duo Zhang commented on HBASE-19803:

OK, so the problem is: we have a 15-minute timeout, and if a test hangs longer than this period, the surefire plugin will try to kill all the ongoing tests and report a failure to us?
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340483#comment-16340483 ]

Appy commented on HBASE-19803:
--

I think I have cracked it. It's basically this:
- Some test goes bad (which one? That's basically what we were trying to figure out).
- After 900 sec (forkedProcessTimeoutInSeconds), the surefire plugin issues 'shutdown' to *all* forked JVMs (that's why we see the *.dump files), basically killing every test running at that time.
** Side note: if findHangingTests.py reports X hanging tests and the surefire forkCount is Y, then there should be *at least* ceiling[X/Y] occurrences of the following message in each *.dump file.
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}

With that figured out, it was easy to find the culprit tests. Look for the timestamps of the "Killing self fork..." messages in the dump files and find the tests which started *exactly 900 sec before them*. Any hanging test (as reported by our script) with a start timestamp between these two times was just caught in the crossfire.

Applying the method to run #207 (https://builds.apache.org/job/HBase%20Nightly/job/master/207/) reveals these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty
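The method in the comment above can be sketched in a few lines. This is only an illustration, not what findHangingTests.py does: the test names and start times are hypothetical, and the 2-second tolerance is an assumption added because the start and kill timestamps come from different logs. Only the 900 sec timeout, the kill timestamp and the ceiling[X/Y] bound come from the comment.

```python
import math
from datetime import datetime, timedelta

# 900 s is the forkedProcessTimeoutInSeconds value quoted in the comment.
FORK_TIMEOUT = timedelta(seconds=900)


def min_kill_messages(hanging_tests, fork_count):
    """Lower bound on kill messages per dump file: ceiling(X / Y)."""
    return math.ceil(hanging_tests / fork_count)


def classify_hanging_tests(kill_time, test_starts, tolerance=timedelta(seconds=2)):
    """Split tests reported as hanging into culprits and bystanders.

    A culprit started FORK_TIMEOUT before the kill (within `tolerance`, an
    assumed slack for logging jitter). A bystander started later, inside the
    window, and was simply killed along with everything else.
    """
    window_start = kill_time - FORK_TIMEOUT
    culprits, bystanders = [], []
    for name, started in sorted(test_starts.items()):
        if abs(started - window_start) <= tolerance:
            culprits.append(name)
        elif window_start < started <= kill_time:
            bystanders.append(name)
    return culprits, bystanders


# Hypothetical start times; only the kill timestamp comes from the comment.
kill_time = datetime(2018, 1, 25, 12, 14, 47)
test_starts = {
    "TestSlowOne": datetime(2018, 1, 25, 11, 59, 47),
    "TestInnocent": datetime(2018, 1, 25, 12, 10, 3),
    "TestEarlier": datetime(2018, 1, 25, 11, 30, 0),
}
culprits, bystanders = classify_hanging_tests(kill_time, test_starts)
print("culprits:", culprits)
print("caught in the crossfire:", bystanders)
```

A test that started well before the window (TestEarlier here) falls into neither list, since by this reasoning it would have triggered an earlier kill of its own.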
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340184#comment-16340184 ]

Appy commented on HBASE-19803:
--

Pushing and starting another nightly run. Part of me says this might not be useful, since Java always tries to write a core dump on a VM crash (at worst, in the /tmp dir), and if any core dumps were happening, the surefire plugin should have caught them anyway (irrespective of location) and generated a *.dumpstream file in surefire-reports.

The test_logs.zip of https://builds.apache.org/job/HBase%20Nightly/job/branch-2/197 contains 5 .dump files (not .dumpstream):
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun1.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun2.dump
-rw-r--r-- 1 appy staff 331B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun3.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun4.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun5.dump

What I don't understand is why the surefire plugin tries to stop all 5 JVMs at exactly 21:55:35 and 22:43:34.
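To check whether every fork really was stopped at the same instant, one could pull the kill timestamps out of the .dump files and group them. The sketch below assumes the dump entry layout quoted elsewhere in this thread ("# Created on <timestamp>" followed by the "Killing self fork JVM" message); the sample file contents are made up for illustration, with only the two clock times taken from the comment.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches one kill entry; \s+ tolerates either a space or a newline
# between the timestamp and the message.
KILL_RE = re.compile(
    r"# Created on (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\s+Killing self fork JVM"
)


def kill_times_by_second(dumps):
    """Map each kill timestamp (second resolution) to the dump files that logged it."""
    seen = defaultdict(set)
    for filename, text in dumps.items():
        for stamp in KILL_RE.findall(text):
            seen[datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")].add(filename)
    return dict(seen)


# Hypothetical dump contents (dates and milliseconds are assumptions).
MESSAGE = "Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.\n"
dumps = {
    "jvmRun1.dump": "# Created on 2018-01-23T21:55:35.120\n" + MESSAGE
    + "\n# Created on 2018-01-23T22:43:34.310\n" + MESSAGE,
    "jvmRun2.dump": "# Created on 2018-01-23T21:55:35.480\n" + MESSAGE,
}
grouped = kill_times_by_second(dumps)
for when in sorted(grouped):
    print(when.time(), sorted(grouped[when]))
```

If every fork shows the same set of timestamps, the shutdown was broadcast from the parent process rather than triggered inside any one fork.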
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338855#comment-16338855 ]

Duo Zhang commented on HBASE-19803:
---

Checked the stack trace a bit:
{noformat}
org.apache.hadoop.hbase.HConstants$ExitException: There is no escape!
    at org.apache.hadoop.hbase.HConstants$NoExitSecurityManager.checkExit(HConstants.java:63)
    at java.lang.Runtime.halt(Runtime.java:273)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:300)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:294)
    at org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
    at org.apache.maven.surefire.booter.ForkedBooter$4.update(ForkedBooter.java:247)
    at org.apache.maven.surefire.booter.CommandReader$CommandRunnable.insertToListeners(CommandReader.java:475)
    at org.apache.maven.surefire.booter.CommandReader$CommandRunnable.run(CommandReader.java:421)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
https://github.com/apache/maven-surefire/blob/surefire-2.20.1/surefire-api/src/main/java/org/apache/maven/surefire/booter/CommandReader.java#L421
https://github.com/apache/maven-surefire/blob/surefire-2.20.1/surefire-booter/src/main/java/org/apache/maven/surefire/booter/ForkedBooter.java#L247

I think it is obvious that the UT is killed by surefire itself. CommandReader is a class that keeps reading data from stdin to get commands from the plugin process. In ForkedBooter we then actually do the kill by halting the JVM (Runtime.halt in the trace above, reached through kill()); see the code below.
{code}
DumpErrorSingleton.getSingleton()
    .dumpText( "Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook." );
kill();
{code}
This is not a kill by signal, since the command is read from stdin. I think the right direction is to find out why the plugin process (or should we call it the Maven master process?) wants to kill the forked test process.

Thanks.
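To make the "killed by a command on stdin, not by a signal" point concrete, here is a toy Python analogue of the pattern. It is not surefire's code or wire protocol: the command names and the read_commands helper are hypothetical, and a real fork would terminate the process in the callback instead of recording a message.

```python
import io

KILL_MESSAGE = "Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook."


def read_commands(stream, on_shutdown):
    """Read newline-delimited commands and stop at the first SHUTDOWN.

    `stream` stands in for the fork's stdin, `on_shutdown` for the kill step.
    Returns the commands handled, in order.
    """
    handled = []
    for line in stream:
        command = line.strip()
        if not command:
            continue
        handled.append(command)
        if command == "SHUTDOWN":
            on_shutdown(KILL_MESSAGE)
            break
    return handled


# Simulated parent-to-fork channel; anything after SHUTDOWN is never read.
events = []
handled = read_commands(io.StringIO("RUN_CLASS\nSHUTDOWN\nNOOP\n"), events.append)
print(handled)
print(events)
```

The point of the analogy is that the fork dies only because its parent wrote a command to it, so the interesting question is what made the parent send that command.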
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338779#comment-16338779 ]

stack commented on HBASE-19803:
---

What are you thinking then, [~appy]? The dumpstream is archived, IIRC. Is the mangled thread dump in it useful? Is it for the surefire driver or for the forked JVM?

I'd be +1 for trying it for a while to see if it turns up anything. Anything is better than driving blind, sir.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338603#comment-16338603 ]

Appy commented on HBASE-19803:
--

Arghh, setting -XX:ErrorFile doesn't work with surefire. Apparently the core dump uses the same output stream that Maven needs, which results in the following issue:
http://maven.apache.org/components/surefire/maven-failsafe-plugin/faq.html#corruptedstream

The resulting file is [^2018-01-24T17-45-37_000-jvmRun1.dumpstream] (I generated it by running a test and manually killing it with kill -9). It's not ideal, but it's better than nothing.

PTAL at [^HBASE-19803.master.001.patch]. Let's get it in and wait for more failures? [~Apache9] [~stack]
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338581#comment-16338581 ]

Appy commented on HBASE-19803:
--

So at this point, the main thing to demystify is whether it's System.exit calls that are screwing us up, or failed native calls crashing the JVM. Either way, setting the JVM option {{-XX:ErrorFile}} and archiving that location is the way to go, because:
- If we don't see that file in the artifacts, then it should have been a System.exit call.
- If we do see it, then it's likely some native call, and we can use the error file for further debugging.
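The two-way decision in the comment above could be automated over the archived artifact list. This is a sketch under assumptions: the triage_fork_death helper and the sample paths are hypothetical, and the default pattern is the JVM's standard hs_err_pid<pid>.log name, which would need to change to match whatever path -XX:ErrorFile is actually set to.

```python
from fnmatch import fnmatch


def triage_fork_death(artifact_names, error_file_pattern="hs_err_pid*.log"):
    """Apply the rule from the comment: error file present means a native crash
    is likely, error file absent points at a System.exit call."""
    error_logs = sorted(
        name for name in artifact_names
        if fnmatch(name.rsplit("/", 1)[-1], error_file_pattern)
    )
    if error_logs:
        return "native crash suspected", error_logs
    return "System.exit suspected", error_logs


# Hypothetical artifact listings.
with_crash = ["hbase-server/target/hs_err_pid4242.log", "hbase-server/target/test.log"]
without_crash = ["hbase-server/target/test.log"]
print(triage_fork_death(with_crash))
print(triage_fork_death(without_crash))
```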
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338521#comment-16338521 ]

Appy commented on HBASE-19803:
--

So to iterate faster in trying to understand what/when/how this happens, I switched to the hbase-http module and added System.exit(0) at the end of TestGlobalFilter#testServletFilter, with these settings:
- forkedProcessTimeoutInSeconds=10sec (should be fine since no http test takes more than 4 sec on my machine, see the list below)
- reuseForks=false (that's our default in the root pom)
{noformat}
--- T E S T S ---
Running org.apache.hadoop.hbase.http.conf.TestConfServlet
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.756 s - in org.apache.hadoop.hbase.http.conf.TestConfServlet
Running org.apache.hadoop.hbase.http.jmx.TestJMXJsonServlet
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.806 s - in org.apache.hadoop.hbase.http.jmx.TestJMXJsonServlet
Running org.apache.hadoop.hbase.http.lib.TestStaticUserWebFilter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.274 s - in org.apache.hadoop.hbase.http.lib.TestStaticUserWebFilter
Running org.apache.hadoop.hbase.http.log.TestLogLevel
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.786 s - in org.apache.hadoop.hbase.http.log.TestLogLevel
Running org.apache.hadoop.hbase.http.TestGlobalFilter
Running org.apache.hadoop.hbase.http.TestHtmlQuoting
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.291 s - in org.apache.hadoop.hbase.http.TestHtmlQuoting
Running org.apache.hadoop.hbase.http.TestHttpRequestLog
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.415 s - in org.apache.hadoop.hbase.http.TestHttpRequestLog
Running org.apache.hadoop.hbase.http.TestHttpRequestLogAppender
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.253 s - in org.apache.hadoop.hbase.http.TestHttpRequestLogAppender
Running org.apache.hadoop.hbase.http.TestHttpServer
Tests run: 15, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 3.275 s - in org.apache.hadoop.hbase.http.TestHttpServer
Running org.apache.hadoop.hbase.http.TestHttpServerLifecycle
Tests run: 6, Failures: 0, Errors: 0, Skipped: 6, Time elapsed: 0.002 s - in org.apache.hadoop.hbase.http.TestHttpServerLifecycle
Running org.apache.hadoop.hbase.http.TestHttpServerWebapps
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.557 s - in org.apache.hadoop.hbase.http.TestHttpServerWebapps
Running org.apache.hadoop.hbase.http.TestPathFilter
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.71 s - in org.apache.hadoop.hbase.http.TestPathFilter
Running org.apache.hadoop.hbase.http.TestServletFilter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.82 s - in org.apache.hadoop.hbase.http.TestServletFilter
Running org.apache.hadoop.hbase.http.TestSpnegoHttpServer
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.881 s - in org.apache.hadoop.hbase.http.TestSpnegoHttpServer
Running org.apache.hadoop.hbase.http.TestSSLHttpServer
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.33 s - in org.apache.hadoop.hbase.http.TestSSLHttpServer

Results:
Tests run: 48, Failures: 0, Errors: 0, Skipped: 9
{noformat}
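In the output above, TestGlobalFilter has a "Running" line but never a "Tests run" line. A small sketch of how such started-but-never-reported tests could be picked out of surefire console output follows. The unfinished_tests helper is hypothetical (it is not claimed to be how findHangingTests.py works), and the sample is three tests excerpted from the log in this comment.

```python
import re

RUNNING_RE = re.compile(r"^Running (\S+)$")
FINISHED_RE = re.compile(r"^Tests run: .* - in (\S+)$")


def unfinished_tests(console_output):
    """Return test classes that printed 'Running ...' but no result line."""
    started, finished = [], set()
    for raw in console_output.splitlines():
        line = raw.strip()
        match = RUNNING_RE.match(line)
        if match:
            started.append(match.group(1))
            continue
        match = FINISHED_RE.match(line)
        if match:
            finished.add(match.group(1))
    return [name for name in started if name not in finished]


# Excerpt of the console output quoted in the comment.
sample = """\
Running org.apache.hadoop.hbase.http.log.TestLogLevel
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.786 s - in org.apache.hadoop.hbase.http.log.TestLogLevel
Running org.apache.hadoop.hbase.http.TestGlobalFilter
Running org.apache.hadoop.hbase.http.TestHtmlQuoting
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.291 s - in org.apache.hadoop.hbase.http.TestHtmlQuoting
"""
missing = unfinished_tests(sample)
print(missing)
```

Note that the "Results: Tests run: 48 ..." summary does not start with "Tests run:", so it is ignored by the result-line pattern.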
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338516#comment-16338516 ]

Appy commented on HBASE-19803:
--

So that's because we set forkedProcessTimeoutInSeconds to {{${surefire.timeout}}}, where surefire.timeout is 900, i.e. 15 min. I tried setting it to 30 sec, but the tests still keep on running :-/
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338498#comment-16338498 ]

Appy commented on HBASE-19803:
--

This is weird. Trying to see a crash explicitly on my machine, I added {{System.exit(0)}} to TestEntityLocks#setUp and ran {{mvn test -PrunAllTests -pl hbase-server}}, but the tests just keep going for several minutes (at which point I stop them myself).
{noformat}
--- T E S T S ---
Running org.apache.hadoop.hbase.client.locking.TestEntityLocks
Running org.apache.hadoop.hbase.client.TestAllowPartialScanResultCache
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.293 s - in org.apache.hadoop.hbase.client.TestAllowPartialScanResultCache
Running org.apache.hadoop.hbase.client.TestBatchScanResultCache
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.285 s - in org.apache.hadoop.hbase.client.TestBatchScanResultCache
Running org.apache.hadoop.hbase.client.TestCompleteResultScanResultCache
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.183 s - in org.apache.hadoop.hbase.client.TestCompleteResultScanResultCache
Running org.apache.hadoop.hbase.client.TestConnectionUtils
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.547 s - in org.apache.hadoop.hbase.client.TestConnectionUtils
Running org.apache.hadoop.hbase.client.TestHBaseAdminNoCluster
Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 8.153 s - in org.apache.hadoop.hbase.client.TestHBaseAdminNoCluster
Running org.apache.hadoop.hbase.client.TestIntraRowPagination
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.268 s - in org.apache.hadoop.hbase.client.TestIntraRowPagination
Running org.apache.hadoop.hbase.client.TestPutDeleteEtcCellIteration
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.273 s - in org.apache.hadoop.hbase.client.TestPutDeleteEtcCellIteration
Running org.apache.hadoop.hbase.client.TestResult
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.187 s - in org.apache.hadoop.hbase.client.TestResult
Running org.apache.hadoop.hbase.codec.TestCellMessageCodec
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.654 s - in org.apache.hadoop.hbase.codec.TestCellMessageCodec
Running org.apache.hadoop.hbase.conf.TestConfigurationManager
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.467 s - in org.apache.hadoop.hbase.conf.TestConfigurationManager
Running org.apache.hadoop.hbase.constraint.TestConstraints
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.963 s - in org.apache.hadoop.hbase.constraint.TestConstraints
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorConfiguration
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.401 s - in org.apache.hadoop.hbase.coprocessor.TestCoprocessorConfiguration
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorHost
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.573 s - in org.apache.hadoop.hbase.coprocessor.TestCoprocessorHost
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.89 s - in org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
{noformat}
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338489#comment-16338489 ]

Duo Zhang commented on HBASE-19803:
---

I mean that the only place where System.exit is called for the crashed UTs is the surefire plugin itself. So it is very strange that surefire tells us the UT crashed, since it was killed by surefire itself.

As for a JNI crash: first, as you said, we should have an hs_err file; and second, the surefire plugin would not get the chance to call System.exit if the JVM had crashed in native code. But the result shows that it does call System.exit before the UT exits.

Thanks.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338478#comment-16338478 ]

Appy commented on HBASE-19803:
--

Oh oh, another way to narrow down the search space is to run small+medium tests separately from large tests.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338438#comment-16338438 ]

Appy commented on HBASE-19803:
--

bq. Notice here I only log the exception without throwing it out if it is called from the surefire plugin. So it is killed by the surefire plugin? And then surefire plugin tells us the VM exited abnormally...

Didn't understand what you meant.

Another possible cause for this one, from reading around, is a failed native call. That can happen for many reasons: running out of memory (but I think that one shows up explicitly as OOM), the Apache machine killing the JVM for overuse of resources (wild guess), etc. I think we really need hs_err files to help debug this one.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329832#comment-16329832 ]

Duo Zhang commented on HBASE-19803:
---

It seems to be a surefire issue. I ran mvn test locally in the hbase-server module. TestJMXConnectorServer fails, which is a known issue, and then lots of crashes follow. This is one of the failed UTs:
{noformat}
Error occurred in starting fork, check output in log
Process Exit Code: 1
Crashed tests:
org.apache.hadoop.hbase.master.balancer.TestRegionsOnMasterOptions
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:496)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:443)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:295)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1124)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:954)
    ... 23 more
Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/zhangduo/hbase/code/hbase-server && /home/zhangduo/opt/jdk1.8.0_151/jre/bin/java -enableassertions -Dhbase.build.id=2018-01-17T22:44:23Z -Xmx2800m -Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -jar /home/zhangduo/hbase/code/hbase-server/target/surefire/surefirebooter3125641250160453662.jar /home/zhangduo/hbase/code/hbase-server/target/surefire 2018-01-18T06-44-36_642-jvmRun2 surefire7506668156192398602tmp surefire_14263036952065448117423tmp
{noformat}
I then checked org.apache.hadoop.hbase.master.balancer.TestRegionsOnMasterOptions-output.txt, and the only place where we call System.exit is:
{noformat}
org.apache.hadoop.hbase.HConstants$ExitException: There is no escape!
    at org.apache.hadoop.hbase.HConstants$NoExitSecurityManager.checkExit(HConstants.java:63)
    at java.lang.Runtime.halt(Runtime.java:273)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:300)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:294)
    at org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
    at org.apache.maven.surefire.booter.ForkedBooter$4.update(ForkedBooter.java:247)
    at org.apache.maven.surefire.booter.CommandReader$CommandRunnable.insertToListeners(CommandReader.java:475)
    at org.apache.maven.surefire.booter.CommandReader$CommandRunnable.run(CommandReader.java:421)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
Notice that here I only log the exception, without throwing it, if the call comes from the surefire plugin. So the test is killed by the surefire plugin? And then the surefire plugin tells us the VM exited abnormally...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329638#comment-16329638 ] Duo Zhang commented on HBASE-19803: --- If I do not throw ExitException for ForkedBooter.kill then the test run will crash... Let me add more logs and try again.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329600#comment-16329600 ] Duo Zhang commented on HBASE-19803: --- This is only a temporary approach to find out who calls System.exit in the tests; once we know that, we can work out a proper solution.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329593#comment-16329593 ] Appy commented on HBASE-19803: -- I was thinking of replacing all System.exit(X) calls with a util function that dumps the stack trace at LOG.debug (testing level) before calling System.exit itself. Adding a SecurityManager globally for everything seems like a heavy-handed change to the JVM environment (even if just for tests).
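For illustration, the util-function idea in the comment above might look roughly like the sketch below. This is not code from the patch: the class name ExitTracer and its methods are hypothetical, the stack dump goes to System.err where a real version would presumably use the project's logger at debug level, and the swappable exit action is an added assumption so the helper can be exercised without killing the JVM.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper: funnels every exit through one place so the caller can be traced. */
final class ExitTracer {
  // The real exit action. It is swappable so a test can observe calls without stopping the JVM.
  private static volatile IntConsumer exitAction = System::exit;

  private ExitTracer() {}

  /** Test hook (an assumption of this sketch, not something described in the thread). */
  static void setExitActionForTesting(IntConsumer action) {
    exitAction = action;
  }

  /** Builds a readable dump of the current call stack for the given exit status. */
  static String describeCaller(int status) {
    StringBuilder sb = new StringBuilder("exit(" + status + ") requested from:");
    for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
      sb.append(System.lineSeparator()).append("    at ").append(frame);
    }
    return sb.toString();
  }

  /** Drop-in replacement for System.exit(status): record who asked, then exit. */
  static void exit(int status) {
    // Stand-in for a LOG.debug(...) call.
    System.err.println(describeCaller(status));
    exitAction.accept(status);
  }
}
```

Compared with a global SecurityManager this only catches exits that go through the helper, so calls made by third-party code (surefire, for example) would stay invisible.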
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328707#comment-16328707 ] Duo Zhang commented on HBASE-19803: --- I've registered a SecurityManager in HConstants. It needs some hacks so that the surefire framework can still call System.exit. First, surefire uses System.exit to exit even after a successful test. Second, ForkedBooter.kill uses System.exit; I believe it is used to kill tests that time out. I've already added hacks for these two cases and started the third try.
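A minimal sketch of the kind of SecurityManager described above, under stated assumptions: the class names here are made up for illustration (the stack trace elsewhere in this thread shows the actual patch using HConstants$NoExitSecurityManager and HConstants$ExitException, whose code is not shown), and the surefire exemption is approximated by scanning the call stack for classes in org.apache.maven.surefire.booter. Note that SecurityManager is deprecated for removal since Java 17; the thread itself runs on JDK 1.8.

```java
import java.security.Permission;

/** Hypothetical exception thrown instead of letting the JVM exit. */
final class ExitBlockedException extends SecurityException {
  final int status;

  ExitBlockedException(int status) {
    super("System.exit(" + status + ") blocked");
    this.status = status;
  }
}

/** Hypothetical manager: blocks exits unless the surefire booter asked for them. */
@SuppressWarnings({"deprecation", "removal"})
final class NoExitSecurityManager extends SecurityManager {
  // Assumed package prefix of the surefire classes that legitimately stop the forked JVM.
  private static final String SUREFIRE_PREFIX = "org.apache.maven.surefire.booter.";

  /** Returns true if any frame in the stack belongs to a class with the given prefix. */
  static boolean calledFrom(String prefix, StackTraceElement[] stack) {
    for (StackTraceElement frame : stack) {
      if (frame.getClassName().startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  @Override
  public void checkPermission(Permission perm) {
    // Permit everything else; only exits are of interest here.
  }

  @Override
  public void checkPermission(Permission perm, Object context) {
    // Permit everything else; only exits are of interest here.
  }

  @Override
  public void checkExit(int status) {
    ExitBlockedException trace = new ExitBlockedException(status);
    if (calledFrom(SUREFIRE_PREFIX, trace.getStackTrace())) {
      // Surefire must be able to stop the fork: log the caller but let the exit proceed.
      trace.printStackTrace();
      return;
    }
    throw trace;
  }
}
```

On a JDK that still permits it, the manager would be installed once with System.setSecurityManager(new NoExitSecurityManager()); after that, any non-surefire System.exit surfaces as an exception whose stack trace names the caller.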
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328249#comment-16328249 ] Duo Zhang commented on HBASE-19803: --- https://stackoverflow.com/questions/5401281/preventing-system-exit-from-api I think we could try this? Disable System.exit when running UTs, and log something whenever System.exit is called so we know who the culprit is.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327858#comment-16327858 ] Appy commented on HBASE-19803: -- Yeah, probably. I don't see any unit test calling it (although ITs do). In the non-test code, it's mostly the main() fn in tools. But digging around more:
- TestZKMainServer seems to be [handling System.exit() appropriately|https://github.com/apache/hbase/blob/master/hbase-zookeeper/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKMainServer.java#L85]. So do all other tests. Here's the method: https://stackoverflow.com/questions/309396/java-how-to-test-methods-that-call-system-exit
- The only case where it might be wrong is ImportTsv#createSubmittableJob calling System.exit(). TestImportTsv calls that fn multiple times.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327731#comment-16327731 ] stack commented on HBASE-19803: --- I don't think any of our tests call System#exit. Would be happy to be proven wrong...
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327681#comment-16327681 ] Appy commented on HBASE-19803: -- Oh, this looks promising: http://maven.apache.org/surefire/maven-surefire-plugin/faq.html#vm-termination And we have quite a few System.exit calls in our code. Since the failure is in the hbase-server tests, I'm only looking for System.exit calls in that module and the modules it depends on, and ignoring the calls from main() fns of tools. Here's the list of possible culprits:
- HMaster#InitializationMonitor#run()
...there may be others, but none obvious at first look.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327544#comment-16327544 ] Appy commented on HBASE-19803: -- But the [console output|https://builds.apache.org/job/HBASE-Flaky-Tests/24830/consoleFull] doesn't say anything about out-of-memory, which I have seen in some cases in the past. So maybe memory is not the issue.
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327526#comment-16327526 ] Appy commented on HBASE-19803: -- bq. It seems that the jvm may crash during the mvn test run and then we will kill all the running tests and then we may mark some of them as hang which leads to the false positive. Makes sense. This one suggests that it can be a memory issue: https://stackoverflow.com/questions/42298883/maven-build-failure-when-running-tests-due-to-jvm-crash Looking at the old nightly job (https://builds.apache.org/job/HBase-Trunk_matrix/configure), it was using -Xmx6100M. But the new jobs seem to be using just 3g (https://github.com/apache/hbase/blob/master/dev-support/docker/Dockerfile#L40)
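One way to check which heap limit the forked test JVM actually ends up with, rather than inferring it from job configuration, is to print it from inside the fork. The sketch below is only an illustration; the class name HeapReport is hypothetical, and it relies solely on the standard java.lang.Runtime API.

```java
/** Hypothetical diagnostic: reports the heap limits the current JVM is running with. */
final class HeapReport {
  private static final long MIB = 1024L * 1024L;

  private HeapReport() {}

  /** Maximum heap the JVM will attempt to use, in MiB (governed by -Xmx when it is set). */
  static long maxHeapMiB() {
    return Runtime.getRuntime().maxMemory() / MIB;
  }

  /** One-line summary suitable for printing at the start of a test run. */
  static String describe() {
    Runtime rt = Runtime.getRuntime();
    long usedMiB = (rt.totalMemory() - rt.freeMemory()) / MIB;
    return "max heap = " + maxHeapMiB() + " MiB, currently used = " + usedMiB + " MiB";
  }
}
```

Printing HeapReport.describe() from a test's setup would show whether the fork sees something close to the 3g setting or a different value.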
[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job
[ https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327126#comment-16327126 ] Duo Zhang commented on HBASE-19803: --- [~appy] FYI. Maybe the first thing we need to figure out is why the JVM crashes so easily when executing mvn test... The nightly job always fails in this way...