[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-29 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344384#comment-16344384
 ] 

Appy commented on HBASE-19803:
--

Yeah, it's infra issue. I'm not able to even access the site 
https://builds.apache.org/

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-29 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344372#comment-16344372
 ] 

Duo Zhang commented on HBASE-19803:
---

Seems the infra sucks... I added label Hadoop but the two new builds still fail 
due to disconnect from the build machine...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-29 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344357#comment-16344357
 ] 

stack commented on HBASE-19803:
---

{quote}[~appy] The flakey test finder job is hang?
{quote}
I was wondering why it wasn't moving today...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-29 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344311#comment-16344311
 ] 

Duo Zhang commented on HBASE-19803:
---

I've changed the label from 'ubuntu' to 'ubuntu||Hadoop' so it can start a new 
build...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-29 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344305#comment-16344305
 ] 

Duo Zhang commented on HBASE-19803:
---

[~appy] The flakey test finder job is hang?

https://builds.apache.org/job/HBASE-Find-Flaky-Tests/

The last build can not start...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-27 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342309#comment-16342309
 ] 

stack commented on HBASE-19803:
---

{quote}Maybe a possible way is to do it at once first, and then add a UT to 
make sure that we always have a CategoryBasedTimeout ClassRule for every UTs.
{quote}
I endorse this approach. Our test suite becomes more and more unruly as time 
goes by.  Weeding and fixup consumes weeks of developer time on an ongoing 
basis (witness the last few weeks of test fixup/flakie suppression). The only 
counter we have to the ever-increasing disorder is enforcing more 
order/constraints. Let me comment on HBASE-19873 approach over there. Great 
work lads.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-27 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342307#comment-16342307
 ] 

stack commented on HBASE-19803:
---

The issue by [~Apache9] that explores adding CategoryBasedTimer as a ClassRule

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-27 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342071#comment-16342071
 ] 

Duo Zhang commented on HBASE-19803:
---

{quote}
The easiest way to do might be using junit's RunListener#testStarted and using 
reflection on the test class (Description#getTestClass). If it doesn't have 
Timeout rule, throwing exception might fail the test.
{quote}

Sounds good. Let me open an issue for it.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-27 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342022#comment-16342022
 ] 

Appy commented on HBASE-19803:
--

Sounds good.
The easiest way to do might be using junit's RunListener#testStarted and using 
reflection on the test class (Description#getTestClass). If it doesn't have 
Timeout rule, throwing exception might fail the test.


> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-26 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341988#comment-16341988
 ] 

Duo Zhang commented on HBASE-19803:
---

Maybe a possible way is to do it at once first, and then add a UT to make sure 
that we always have a CategoryBasedTimeout ClassRule for every UTs.

What do you think? [~stack] [~appy].

Thanks.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-26 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341985#comment-16341985
 ] 

Duo Zhang commented on HBASE-19803:
---

Using ClassRule instead of Rule can change to limit the total time of running a 
test class. But seems we need to add a line to every tests, which is really a 
pain...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-26 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341228#comment-16341228
 ] 

stack commented on HBASE-19803:
---

 Great stuff lads.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-26 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340761#comment-16340761
 ] 

Duo Zhang commented on HBASE-19803:
---

{quote}
For eg, a medium test class with 6 methods where each takes 2.5 minute each 
(within Timeout rule), will take total time 15min.
{quote}
Bad news...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-26 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340757#comment-16340757
 ] 

Appy commented on HBASE-19803:
--

Btw, this can happen even when category based timeout is being correctly 
enforced because Timeout rule puts the timeout on each individual test method 
and not on whole test class.
For eg, a medium test class with 6 methods where each takes 2.5 minute each 
(within Timeout rule), will take total time 15min.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340652#comment-16340652
 ] 

Appy commented on HBASE-19803:
--

Fix for TestTokenAuthentication - HBASE-19862
Looking at TestRegionServerReportForDuty logs, the test was stuck and running 
until surefire killed the JVM.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340502#comment-16340502
 ] 

Appy commented on HBASE-19803:
--

Chatted with him, Duo was talking about CategoryBasedTimeout. Updated 
explanation above to cover that case.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340490#comment-16340490
 ] 

Appy commented on HBASE-19803:
--

Yes, but we need that timeout, no? You have a way around?

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340489#comment-16340489
 ] 

Duo Zhang commented on HBASE-19803:
---

The timeout limit from junit does not work? Damn...

Thanks [~appy]. Let's fix these three UTs first.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340485#comment-16340485
 ] 

Duo Zhang commented on HBASE-19803:
---

OK, so the problem is, we have a 15 minutes timeout, if there is a test that 
hangs longer than this period, the surefire plugin will try to kill all the 
ongoing tests and report a failure to us?

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340483#comment-16340483
 ] 

Appy commented on HBASE-19803:
--

I think i have cracked it, it's basically this:
- Some test goes bad (which one? - basically what we were trying to figure out)
- After 900 sec (forkedProcessTimeoutInSeconds), surefire plugin issues 
'shutdown' to *all* forked JVMs (that's what we see *.dump files), basically 
killing every test running at that time.
** Side note: if findHangingTests.py reports X number of hanging tests, and we 
have surefire forkcount as Y, then there should be *at least* ceiling[X/Y] 
count of the following message in each *.dump file.
{noformat}
# Created on 2018-01-25T12:14:47.114
Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook.
{noformat}

With that figured, it was easy to find culprit tests.
Look for timestamp of "Killed self fork..." messages in dump file and find the 
tests which started *exactly 900 sec before it* Any hanging test (as reported 
by our script) with start timestamp between these two times was just caught in 
cross fire.

Applying the method to #207 run 
(https://builds.apache.org/job/HBase%20Nightly/job/master/207/) will reveal 
these three culprit tests:
- security.token.TestTokenAuthentication
- master.balancer.TestStochasticLoadBalancer
- regionserver.TestRegionServerReportForDuty

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-25 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340184#comment-16340184
 ] 

Appy commented on HBASE-19803:
--

Pushing and starting another nightly run.

I have a part which tells that this might not be useful since java always tries 
to core dump on vm crash (at worst, in /tmp dir), and if there were any core 
dumps happening, surefire plugin should have caught them anyways (irrespective 
of location) and generated a *.dumpstream file in surefire-reports.

I see 5 .dump files (not .dumpstream) in test_logs.zip of 
[https://builds.apache.org/job/HBase%20Nightly/job/branch-2/197|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/197.].
 That 
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun1.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun2.dump
-rw-r--r-- 1 appy staff 331B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun3.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun4.dump
-rw-r--r-- 1 appy staff 226B Jan 23 14:43 2018-01-23T20-53-29_364-jvmRun5.dump

What i don't understand is, why does surefire plugin try to stop all 5 jvms at 
exactly 21:55:35 and 22:43:34.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338855#comment-16338855
 ] 

Duo Zhang commented on HBASE-19803:
---

Checked the stacktrace a bit
{noformat}
org.apache.hadoop.hbase.HConstants$ExitException: There is no escape!
at 
org.apache.hadoop.hbase.HConstants$NoExitSecurityManager.checkExit(HConstants.java:63)
at java.lang.Runtime.halt(Runtime.java:273)
at 
org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:300)
at 
org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:294)
at 
org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
at 
org.apache.maven.surefire.booter.ForkedBooter$4.update(ForkedBooter.java:247)
at 
org.apache.maven.surefire.booter.CommandReader$CommandRunnable.insertToListeners(CommandReader.java:475)
at 
org.apache.maven.surefire.booter.CommandReader$CommandRunnable.run(CommandReader.java:421)
at java.lang.Thread.run(Thread.java:748)
{noformat}

https://github.com/apache/maven-surefire/blob/surefire-2.20.1/surefire-api/src/main/java/org/apache/maven/surefire/booter/CommandReader.java#L421

https://github.com/apache/maven-surefire/blob/surefire-2.20.1/surefire-booter/src/main/java/org/apache/maven/surefire/booter/ForkedBooter.java#L247

I think it is obvious that the UT is killed by the surefire itself. 
CommandReader is a class that keep reading data from stdin to get commands from 
the plugin process. And in the ForkedBooter, we actually do the kill job by 
calling System.exit, see the below code.

{code}
DumpErrorSingleton.getSingleton()
.dumpText( "Killing self fork JVM. Received 
SHUTDOWN command from Maven shutdown hook." );
kill();
{code}

This is not a kill by signal since the command is gotten from stdin. I think 
this is correct direction to go, find out why the plugin process(or called 
maven master process?) wants to kill the forked test process.

Thanks.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338779#comment-16338779
 ] 

stack commented on HBASE-19803:
---

What you thinking then [~appy] The dumpStream is archived IIRC. The mangled 
thread dump in it is useful? Is it for the surefire driver or is it forked JVM? 
I'd be +1 for trying it for a while to see if it turns up anything. Anything 
better than driving blind sir.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338603#comment-16338603
 ] 

Appy commented on HBASE-19803:
--

Arghh, setting -XX:ErrorFile doesn't work with surefire. Apparently coredump 
uses the same output stream as required by maven, this results in following 
issue: 
http://maven.apache.org/components/surefire/maven-failsafe-plugin/faq.html#corruptedstream
The results file is  [^2018-01-24T17-45-37_000-jvmRun1.dumpstream]  (generated 
it by running a test and manually killing it using kill -9)
It's not ideal, but it's better than nothing.
Ptal at  [^HBASE-19803.master.001.patch] . Let's get it in and wait for more 
failures? [~Apache9] [~stack]

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
> Attachments: 2018-01-24T17-45-37_000-jvmRun1.dumpstream, 
> HBASE-19803.master.001.patch
>
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338581#comment-16338581
 ] 

Appy commented on HBASE-19803:
--

So at this point, main thing to demystify is - is it system.exit calls that's 
screwing us up, or some failed native calls crashing jvm.

Either way, setting jvm option {{-XX:ErrorFile}} and archiving that location is 
the way to go because:
- If we don't see that file in artifacts, then it should have been system.exit 
call
- If we do see it, then it's likely some native call and we can use the error 
file for further debug.


> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338521#comment-16338521
 ] 

Appy commented on HBASE-19803:
--

So to iterate faster in trying to understand what/when/how this happens, i 
switched to hbase-http module, added System.exit(0) at the end of 
TestGlobalFilter#testServletFilter.
forkedProcessTimeoutInSeconds=10sec (should be fine since no http test takes 
more than 4 sec on my machine, see list below)
reuseForks=false (that's our default in root pom)

{noformat}
---
 T E S T S
---
Running org.apache.hadoop.hbase.http.conf.TestConfServlet
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.756 s - in 
org.apache.hadoop.hbase.http.conf.TestConfServlet
Running org.apache.hadoop.hbase.http.jmx.TestJMXJsonServlet
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.806 s - in 
org.apache.hadoop.hbase.http.jmx.TestJMXJsonServlet
Running org.apache.hadoop.hbase.http.lib.TestStaticUserWebFilter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.274 s - in 
org.apache.hadoop.hbase.http.lib.TestStaticUserWebFilter
Running org.apache.hadoop.hbase.http.log.TestLogLevel
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.786 s - in 
org.apache.hadoop.hbase.http.log.TestLogLevel
Running org.apache.hadoop.hbase.http.TestGlobalFilter
Running org.apache.hadoop.hbase.http.TestHtmlQuoting
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.291 s - in 
org.apache.hadoop.hbase.http.TestHtmlQuoting
Running org.apache.hadoop.hbase.http.TestHttpRequestLog
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.415 s - in 
org.apache.hadoop.hbase.http.TestHttpRequestLog
Running org.apache.hadoop.hbase.http.TestHttpRequestLogAppender
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.253 s - in 
org.apache.hadoop.hbase.http.TestHttpRequestLogAppender
Running org.apache.hadoop.hbase.http.TestHttpServer
Tests run: 15, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 3.275 s - in 
org.apache.hadoop.hbase.http.TestHttpServer
Running org.apache.hadoop.hbase.http.TestHttpServerLifecycle
Tests run: 6, Failures: 0, Errors: 0, Skipped: 6, Time elapsed: 0.002 s - in 
org.apache.hadoop.hbase.http.TestHttpServerLifecycle
Running org.apache.hadoop.hbase.http.TestHttpServerWebapps
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.557 s - in 
org.apache.hadoop.hbase.http.TestHttpServerWebapps
Running org.apache.hadoop.hbase.http.TestPathFilter
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.71 s - in 
org.apache.hadoop.hbase.http.TestPathFilter
Running org.apache.hadoop.hbase.http.TestServletFilter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.82 s - in 
org.apache.hadoop.hbase.http.TestServletFilter
Running org.apache.hadoop.hbase.http.TestSpnegoHttpServer
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.881 s - in 
org.apache.hadoop.hbase.http.TestSpnegoHttpServer
Running org.apache.hadoop.hbase.http.TestSSLHttpServer
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.33 s - in 
org.apache.hadoop.hbase.http.TestSSLHttpServer

Results:

Tests run: 48, Failures: 0, Errors: 0, Skipped: 9
{noformat}



> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338516#comment-16338516
 ] 

Appy commented on HBASE-19803:
--

So that's because we set   
{{${surefire.timeout}}}
 where surefire.timeout is 900 i.e. 15 min.
I tried setting it to 30 sec, but the tests still keep on running :-/

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338498#comment-16338498
 ] 

Appy commented on HBASE-19803:
--

This is weird, trying to see a crash explicitly on my machine, i added 
{{System.exit(0)}} to TestEntityLocks#setUp and ran {{mvn test -PrunAllTests 
-pl hbase-server}}, but the tests just keep going on for several minutes (at 
which point i stop them myself).
{noformat}
---
 T E S T S
---
Running org.apache.hadoop.hbase.client.locking.TestEntityLocks
Running org.apache.hadoop.hbase.client.TestAllowPartialScanResultCache
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.293 s - in 
org.apache.hadoop.hbase.client.TestAllowPartialScanResultCache
Running org.apache.hadoop.hbase.client.TestBatchScanResultCache
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.285 s - in 
org.apache.hadoop.hbase.client.TestBatchScanResultCache
Running org.apache.hadoop.hbase.client.TestCompleteResultScanResultCache
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.183 s - in 
org.apache.hadoop.hbase.client.TestCompleteResultScanResultCache
Running org.apache.hadoop.hbase.client.TestConnectionUtils
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.547 s - in 
org.apache.hadoop.hbase.client.TestConnectionUtils
Running org.apache.hadoop.hbase.client.TestHBaseAdminNoCluster
Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 8.153 s - in 
org.apache.hadoop.hbase.client.TestHBaseAdminNoCluster
Running org.apache.hadoop.hbase.client.TestIntraRowPagination
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.268 s - in 
org.apache.hadoop.hbase.client.TestIntraRowPagination
Running org.apache.hadoop.hbase.client.TestPutDeleteEtcCellIteration
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.273 s - in 
org.apache.hadoop.hbase.client.TestPutDeleteEtcCellIteration
Running org.apache.hadoop.hbase.client.TestResult
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.187 s - in 
org.apache.hadoop.hbase.client.TestResult
Running org.apache.hadoop.hbase.codec.TestCellMessageCodec
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.654 s - in 
org.apache.hadoop.hbase.codec.TestCellMessageCodec
Running org.apache.hadoop.hbase.conf.TestConfigurationManager
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.467 s - in 
org.apache.hadoop.hbase.conf.TestConfigurationManager
Running org.apache.hadoop.hbase.constraint.TestConstraints
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.963 s - in 
org.apache.hadoop.hbase.constraint.TestConstraints
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorConfiguration
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.401 s - in 
org.apache.hadoop.hbase.coprocessor.TestCoprocessorConfiguration
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorHost
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.573 s - in 
org.apache.hadoop.hbase.coprocessor.TestCoprocessorHost
Running org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.89 s - in 
org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
{noformat}

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338489#comment-16338489
 ] 

Duo Zhang commented on HBASE-19803:
---

I mean the only place where we call System.exit for the crashed UTs is from the 
surefire plugin itself. So it is very strange that why surefire tells us the UT 
is crashed since it is killed by surefire itself.

And for the JNI crash, first as you said, we should have a hs_err file, and 
still, the surefire plugin will not have the chance to call System.exit if the 
JVM is crashed in native code. But the result shows that, it does call 
System.exit before the UT exits.

Thanks.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338478#comment-16338478
 ] 

Appy commented on HBASE-19803:
--

Oh oh, another way to narrow down search space is, run small+medium tests 
separately from large tests.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-24 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338438#comment-16338438
 ] 

Appy commented on HBASE-19803:
--

bq. Notice here I only log the exception without throwing it out if it is 
called from the surefire plugin. So it is killed by the surefire plugin? And 
then surefire plugin tells us the VM exited abnormally...
Didn't understand what you meant.

Another possible cause for this one, from reading around, is, failed native 
call. That can happen for so many reasons - over memory (but i think that one 
shows up explicitly as OOM), apache machine killing jvm for overuse of 
resources? (wild guess), etc.
I think we really need hs_err files to help debug this one.


> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329832#comment-16329832
 ] 

Duo Zhang commented on HBASE-19803:
---

It seems a surefire issue.

I run mvn test locally in hbase-server module, and TestJMXConnectorServer 
fails, this is a known issue, and then comes lots of crashes.

This is one of the failed UT
{noformat}
Error occurred in starting fork, check output in log
Process Exit Code: 1
Crashed tests:
org.apache.hadoop.hbase.master.balancer.TestRegionsOnMasterOptions
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:496)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:443)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:295)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1124)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:954)
... 23 more
Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The 
forked VM terminated without properly saying goodbye. VM crash or System.exit 
called?
Command was /bin/sh -c cd /home/zhangduo/hbase/code/hbase-server && 
/home/zhangduo/opt/jdk1.8.0_151/jre/bin/java -enableassertions 
-Dhbase.build.id=2018-01-17T22:44:23Z -Xmx2800m 
-Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true 
-Djava.awt.headless=true -jar 
/home/zhangduo/hbase/code/hbase-server/target/surefire/surefirebooter3125641250160453662.jar
 /home/zhangduo/hbase/code/hbase-server/target/surefire 
2018-01-18T06-44-36_642-jvmRun2 surefire7506668156192398602tmp 
surefire_14263036952065448117423tmp
{noformat}

And I checked 
org.apache.hadoop.hbase.master.balancer.TestRegionsOnMasterOptions-output.txt, 
the only place where we call System.exit is
{noformat}
org.apache.hadoop.hbase.HConstants$ExitException: There is no escape!
at 
org.apache.hadoop.hbase.HConstants$NoExitSecurityManager.checkExit(HConstants.java:63)
at java.lang.Runtime.halt(Runtime.java:273)
at 
org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:300)
at 
org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:294)
at 
org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
at 
org.apache.maven.surefire.booter.ForkedBooter$4.update(ForkedBooter.java:247)
at 
org.apache.maven.surefire.booter.CommandReader$CommandRunnable.insertToListeners(CommandReader.java:475)
at 
org.apache.maven.surefire.booter.CommandReader$CommandRunnable.run(CommandReader.java:421)
at java.lang.Thread.run(Thread.java:748)
{noformat}

Notice here I only log the exception without throwing it out if it is called 
from the surefire plugin. So it is killed by the surefire plugin? And then 
surefire plugin tells us the VM exited abnormally...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329638#comment-16329638
 ] 

Duo Zhang commented on HBASE-19803:
---

If I do not throw ExitException for ForkedBooter.kill then the test run will 
crash... Let me add more logs and try again.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329600#comment-16329600
 ] 

Duo Zhang commented on HBASE-19803:
---

This is only a temporary approach to find out who calls System.exit in test, 
and then we could find the solution.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-17 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329593#comment-16329593
 ] 

Appy commented on HBASE-19803:
--

I was thinking of going the way where we would replace all System.exit(X) calls 
with a util function which would additionally dump stack trace at LOG.debug 
(testing level) before calling System.exit itself.

Adding SecurityManager globally for everything seems like a strong and major 
change to jvm environment (even if just for tests).

 

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328707#comment-16328707
 ] 

Duo Zhang commented on HBASE-19803:
---

I've registered a SecurityManager in HConstants. Need some hack to let the 
surefire framework can still call System.exit.

First is that surefire will use System.exit to exit even for a successful test.
Second is that ForkedBooter.kill will use System.exit, I believe it is used to 
kill timeout test.

I've already hack for these two cases and started the third try.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328249#comment-16328249
 ] 

Duo Zhang commented on HBASE-19803:
---

https://stackoverflow.com/questions/5401281/preventing-system-exit-from-api

I think we could try this? Disable System.exit when running UTs, and we can 
output something when System.exit is called so we can know who is the criminal.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327858#comment-16327858
 ] 

Appy commented on HBASE-19803:
--

yeah probably, i don't see any unit test calling it (although ITs do). In the 
non -test code, it's mostly main() fn in tools.
But digging around more:
- TestZKMainServer seems to be [handling System.exit() 
appropriately|https://github.com/apache/hbase/blob/master/hbase-zookeeper/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKMainServer.java#L85].
 So do all other tests.
Here's what method: 
https://stackoverflow.com/questions/309396/java-how-to-test-methods-that-call-system-exit
- The only case where it might be wrong is, ImportTsv#createSubmittableJob 
calling System.exit(). TestImportTsv calls that fn multiple times.


> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327731#comment-16327731
 ] 

stack commented on HBASE-19803:
---

I don't think any of our tests call System#exit. Would be happy if was proven 
wrong...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327681#comment-16327681
 ] 

Appy commented on HBASE-19803:
--

Oh, this looks promising: 
http://maven.apache.org/surefire/maven-surefire-plugin/faq.html#vm-termination
And we have quite a few System.exit in our code.

Since the failure is in hbase-server tests, just looking for System.exit calls 
in that module and those on which it depends. Also, ignoring the calls from 
main() fns of tools. Here's list of possible culprits:
- HMaster#InitializationMonitor#run()
...there may be others, but not obvious at first look.


> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327544#comment-16327544
 ] 

Appy commented on HBASE-19803:
--

But the [console 
output|https://builds.apache.org/job/HBASE-Flaky-Tests/24830/consoleFull] 
doesn't say anything about out-of-memory, which i have seen in some cases in 
the past. So maybe it's not the issue.

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327526#comment-16327526
 ] 

Appy commented on HBASE-19803:
--

bq. It seems that the jvm may crash during the mvn test run and then we will 
kill all the running tests and then we may mark some of them as hang which 
leads to the false positive.
Makes sense.

This one suggests that it can be memory issue : 
https://stackoverflow.com/questions/42298883/maven-build-failure-when-running-tests-due-to-jvm-crash
Looking at old nighly job 
(https://builds.apache.org/job/HBase-Trunk_matrix/configure), it was using 
-Xmx6100M. But the new jobs seem to be using just 3g 
(https://github.com/apache/hbase/blob/master/dev-support/docker/Dockerfile#L40)

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19803) False positive for the HBASE-Find-Flaky-Tests job

2018-01-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327126#comment-16327126
 ] 

Duo Zhang commented on HBASE-19803:
---

[~appy] FYI.

Maybe the first thing is that we need to figure out why it is so easy to crash 
when executing mvn test... The nightly job always fail in this way...

> False positive for the HBASE-Find-Flaky-Tests job
> -
>
> Key: HBASE-19803
> URL: https://issues.apache.org/jira/browse/HBASE-19803
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> It reports two hangs for TestAsyncTableGetMultiThreaded, but I checked the 
> surefire output
> https://builds.apache.org/job/HBASE-Flaky-Tests/24830/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was likely to be killed in the middle of the run within 20 seconds.
> https://builds.apache.org/job/HBASE-Flaky-Tests/24852/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt
> This one was also killed within about 1 minutes.
> The test is declared as LargeTests so the time limit should be 10 minutes. It 
> seems that the jvm may crash during the mvn test run and then we will kill 
> all the running tests and then we may mark some of them as hang which leads 
> to the false positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)