[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371035#comment-16371035
 ] 

Duo Zhang commented on HBASE-19554:
---

Oh, maybe the same problem with numEntries. We should also reset 
sequenceIdAccounting before processing new requests. Let me file an issue to 
address it.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371024#comment-16371024
 ] 

Duo Zhang commented on HBASE-19554:
---

OK the numEntries is incorrect, will file an issue to address it, not critical, 
just logging.

And I think this is the problem
{noformat}
2018-02-20 16:33:41,183 INFO  [regionserver/asf903:0.logRoller] 
wal.AbstractFSWAL(690): Rolled WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
 with entries=0, filesize=35.46 KB; new WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421171
2018-02-20 16:33:41,184 DEBUG [regionserver/asf903:0.logRoller] 
wal.AbstractFSWAL(757): Create new FSHLog writer with pipeline: 
[DatanodeInfoWithStorage[127.0.0.1:60898,DS-fe8309dd-a316-4ef5-a474-7167557c5c76,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:53328,DS-0f9fbb3d-3cd1-47ea-9867-52f1ecac6f19,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:57523,DS-fc2097ed-4d98-40d7-a4a0-adfd31a5bf68,DISK]]
2018-02-20 16:33:41,184 INFO  [regionserver/asf903:0.logRoller] 
wal.AbstractFSWAL(652): Archiving 
hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
 to 
hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/oldWALs/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
{noformat}

Here we move the wal file to oldWALs soon after rolling. We write to WAL 
directly in the test, so this should not happen. Let me dig more.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370998#comment-16370998
 ] 

Duo Zhang commented on HBASE-19554:
---

Oh they are not same... But still a bit strange that the file sizes are same 
but the entries numbers are different...

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370818#comment-16370818
 ] 

Duo Zhang commented on HBASE-19554:
---

A bit strange, we roll wal with the same name twice...

{noformat}
2018-02-20 16:33:41,128 INFO  [regionserver/asf903:0.logRoller] 
wal.AbstractFSWAL(690): Rolled WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421057
 with entries=614, filesize=124.20 KB; new WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109
2018-02-20 16:33:41,164 INFO  [regionserver/asf903:0.logRoller] 
wal.AbstractFSWAL(690): Rolled WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109
 with entries=789, filesize=124.20 KB; new WAL 
/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
{noformat}

Will dig more.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370774#comment-16370774
 ] 

Duo Zhang commented on HBASE-19554:
---

https://builds.apache.org/job/HBASE-Flaky-Tests-branch2.0/2393/testReport/junit/org.apache.hadoop.hbase.master/TestDLSFSHLog/testRecoveredEdits/

{quote}
java.lang.AssertionError: expected:<1> but was:<9825>
{quote}

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370491#comment-16370491
 ] 

stack commented on HBASE-19554:
---

bq. But it seems not, it keeps retrying on the closed connection... Let me 
check what's wrong here...

Yes. I thought this addressed because we explicitly 'close' the Master 
Connection on shutdown as of about a week or so ago. But looks like you figured 
that we need more than that -- the check Connection is closed. Good one.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369800#comment-16369800
 ] 

Duo Zhang commented on HBASE-19554:
---

I think this is a problem which we have already addressed? We have already 
closed the ConnectionImplementation from outside, and then the initialize of 
HMaster should quit immediately. But it seems not, it keeps retrying on the 
closed connection... Let me check what's wrong here...

{noformat}
2018-02-20 02:31:12,578 DEBUG [M:0;asf911:38379] 
client.RpcRetryingCallerImpl(132): Call exception, tries=11, retries=11, 
started=48515 ms ago, cancelled=false, msg=hconnection-0x195e9495 closed, 
details=row 'default' on table 'hbase:namespace' at 
region=hbase:namespace,,1519093804365.3ff104b43ef4ce0a491e1f26b598813e., 
hostname=asf911.gq1.ygridcore.net,51776,1519093799312, seqNum=2, 
exception=java.io.IOException: hconnection-0x195e9495 closed
at 
org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1959)
at 
org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
at 
org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
at 
org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
at 
org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
at 
org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
at 
org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
at 
org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
at 
org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
at 
org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
at 
org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1053)
at 
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:919)
at 
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2017)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:559)
at java.lang.Thread.run(Thread.java:748)
{noformat}


> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369797#comment-16369797
 ] 

Duo Zhang commented on HBASE-19554:
---

OK the problem is testMasterStartsUpWithLogSplittingWork, it spends nearly 13 
minutes to finish

{noformat}
2018-02-20 02:29:56,586 INFO  [Time-limited test] hbase.ResourceChecker(148): 
before: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork 
Thread=133, OpenFileDescriptor=893, MaxFileDescriptor=6, 
SystemLoadAverage=819, ProcessCount=273, AvailableMemoryMB=3881

2018-02-20 02:42:38,021 INFO  [Time-limited test] hbase.ResourceChecker(172): 
after: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork 
Thread=233 (was 133)
{noformat}

Let me dig more.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-20 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369795#comment-16369795
 ] 

Duo Zhang commented on HBASE-19554:
---

Different problem

https://builds.apache.org/job/HBASE-Flaky-Tests/26171/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL-output.txt/*view*/

Now we time out when running testRecoveredEdits.

{noformat}
"Time-limited test" daemon prio=5 tid=23 timed_waiting
java.lang.Thread.State: TIMED_WAITING
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1378)
at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:677)
at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.close(RecoverableZooKeeper.java:711)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.close(ZKWatcher.java:596)
at 
org.apache.hadoop.hbase.master.AbstractTestDLS.testRecoveredEdits(AbstractTestDLS.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{noformat}

Let me dig more.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360743#comment-16360743
 ] 

Hudson commented on HBASE-19554:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4571 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/4571/])
HBASE-19554 Dump thread info before failure (zhangduo: rev 
8ff783f76d370199b81a24c48feaad1aa980285e)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java


> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360455#comment-16360455
 ] 

Duo Zhang commented on HBASE-19554:
---

After revisiting the log, I think I found a possible dead lock in recovery.

It is very simple, we only have 16 threads to run the procedures. During a 
failover, one region will lead to one AssignProcedure, and if meta region is 
also offline, then every AssignProcedure will be stuck at 
AssignProcedure.startTransition where we need to read something from meta 
region. So if the RecoverMetaProcedure can not be picked up by one of the 
worker thread before all of them are stuck at AssignProcedure.startTransition, 
then dead lock...

So I think we should have a special worker thread which can only execute meta 
related operations. And if a TableProcedureInterface is for meta table, then we 
should insert it into a special queue and let the special worker to run it 
immediately.

Let me open a new issue to address it.

Thanks.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360411#comment-16360411
 ] 

Duo Zhang commented on HBASE-19554:
---

What I pushed to master and branch-2.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-11 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360401#comment-16360401
 ] 

Duo Zhang commented on HBASE-19554:
---

https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

{noformat}
2018-02-12 04:56:20,895 DEBUG [PEWorker-16] 
procedure.ServerCrashProcedure(192): pid=132, 
state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure 
server=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true, meta=true; 
Processing hbase:meta that was on asf911.gq1.ygridcore.net,41715,1518411358686
2018-02-12 04:56:20,895 INFO  [PEWorker-16] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=135, ppid=132, 
state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure 
failedMetaServer=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true}]
{noformat}

Then there is no progress, so at last we time out. Let me add a thread dump 
when we are about to time out to see if we can find something.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-09 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358324#comment-16358324
 ] 

Duo Zhang commented on HBASE-19554:
---

I guess the problem is related to meta reassign. [~zghaobac] has already found 
something strange for AMv2 in HBASE-19965, so wait for his digging result.

And I found something strange in the output

{noformat}
java.lang.AssertionError
at 
org.apache.hadoop.hbase.wal.WALKeyImpl.getWriteEntry(WALKeyImpl.java:82)
at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:159)
at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:132)
at 
org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1103)
at 
org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1615)
at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
at 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

The assertion is
{code}
  public MultiVersionConcurrencyControl.WriteEntry getWriteEntry() throws 
InterruptedIOException {
assert this.writeEntry != null;
return this.writeEntry;
  }
{code}

I think the problem is introduced by HBASE-19929. Since we have closed WAL 
directly, then it is possible that we fail a WAL.append without assigning a 
mvcc writeEntry. Let me file a issue to fix this.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352034#comment-16352034
 ] 

Hudson commented on HBASE-19554:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/4528/])
HBASE-19554 reenable TestDLSAsyncFSWAL/TestDLSFSHLog for debugging (zhangduo: 
rev 88d6e06a1fe4130b96e0e9f0e2c2b189f0b6affd)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSFSHLog.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSAsyncFSWAL.java


> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-02-04 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351957#comment-16351957
 ] 

Duo Zhang commented on HBASE-19554:
---

Reenable for debugging.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Sub-task
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2018-01-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324612#comment-16324612
 ] 

stack commented on HBASE-19554:
---

I disabled TestDLSAsyncFSWAL and TestDLSFSHLog so I can cut the beta-1. They 
fail 40% and 33% respectively on branch-2. Disabled them in master too... since 
once fixed, we can add it in both places. Will be back to help out. Just 
disabling so can push out the beta. Thanks.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-24 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302983#comment-16302983
 ] 

Duo Zhang commented on HBASE-19554:
---

Still failed in countRows. There may be a problem with the assignments?

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-24 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302982#comment-16302982
 ] 

Duo Zhang commented on HBASE-19554:
---

https://builds.apache.org/job/PreCommit-HBASE-Build/10676/artifact/patchprocess/patch-unit-hbase-server.txt

{noformat}
[ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 406.921 
s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
[ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  
Time elapsed: 301.077 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 30 
milliseconds

[ERROR] Errors: 
[ERROR]   
TestDLSAsyncFSWAL>AbstractTestDLS.testThreeRSAbort:401->Object.wait:460->Object.wait:-2
 » TestTimedOut
{noformat}

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-21 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300816#comment-16300816
 ] 

Duo Zhang commented on HBASE-19554:
---

Checked recent pre commit building, seems much better. And this a failure case

https://builds.apache.org/job/PreCommit-HBASE-Build/10609/artifact/patchprocess/patch-unit-hbase-server.txt

The error message
{quote}
[ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 157.505 
s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
[ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  
Time elapsed: 49.462 s  <<< ERROR!
java.lang.RuntimeException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=11, exceptions:
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.io.IOException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.io.IOException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.io.IOException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.io.IOException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 604d085d7ec5/172.17.0.2:57745

Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: 
Failed after attempts=11, exceptions:
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, 
pause=100, maxAttempts=11}, java.net.ConnectException: Call to 
604d085d7ec5/172.17.0.2:57745 failed on connection exception: 
org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection r

[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299306#comment-16299306
 ] 

Hudson commented on HBASE-19554:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4262 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/4262/])
HBASE-19554 For debug: Modify HTU.waitUntilAllRegionsAssigned to handle (stack: 
rev 084e324fb9f769f221236e31aa97382d2677f911)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java


> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-20 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298870#comment-16298870
 ] 

stack commented on HBASE-19554:
---

I pushed these for you so you have debug when you get up [~Apache9]

The failure was TestLogRolling... but there are timeouts too. Random. We seem 
to be inching into fun test timeout territory again. Will have to dig in...

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Attachments: HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297018#comment-16297018
 ] 

Hadoop QA commented on HBASE-19554:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
49s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m  
5s{color} | {color:red} hbase-server: The patch generated 2 new + 269 unchanged 
- 2 fixed = 271 total (was 271) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
39s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
19m 20s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 24s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}146m 18s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.wal.TestLogRolling |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-19554 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12902859/HBASE-19554.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 091eb3fa7b75 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 
15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 03e79b7994 |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/diff-checkstyle-hbase-server.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10557/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console o

[jira] [Commented] (HBASE-19554) AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

2017-12-19 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296765#comment-16296765
 ] 

Duo Zhang commented on HBASE-19554:
---

The message means that we do not have the records of the given table in meta 
table. And I found that the HTU.waitUntilAllRegionsAssigned will return in this 
case. Let me modify HTU.waitUntilAllRegionsAssigned first to not return in this 
case.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --
>
> Key: HBASE-19554
> URL: https://issues.apache.org/jira/browse/HBASE-19554
> Project: HBase
>  Issue Type: Bug
>  Components: Recovery, wal
>Reporter: Duo Zhang
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)