[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-11 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20330:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+. Thanks [~uagashe] and [~appy]

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch, hbase-20330.master.003.patch, 
> hbase-20330.master.004.patch, hbase-20330.master.005.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-07 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Attachment: hbase-20330.master.005.patch

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch, hbase-20330.master.003.patch, 
> hbase-20330.master.004.patch, hbase-20330.master.005.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-06 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Attachment: hbase-20330.master.004.patch

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch, hbase-20330.master.003.patch, 
> hbase-20330.master.004.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-06 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Attachment: hbase-20330.master.003.patch

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch, hbase-20330.master.003.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-06 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Attachment: hbase-20330.master.002.patch

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-04 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Status: Patch Available  (was: In Progress)

rollWriter() fails after creating the file and returns false. In next iteration 
of while loop in recoverLease() file list is refreshed.

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-04 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Attachment: hbase-20330.master.001.patch

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---
>
> Key: HBASE-20330
> URL: https://issues.apache.org/jira/browse/HBASE-20330
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0-beta-2
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: hbase-20330.master.001.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
> already exists
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

2018-04-04 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-20330:
-
Description: 
We have instance in our internal testing where master log is getting filled 
with following messages:
{code}
2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover 
lease on dfs file hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
Recovered lease, attempt=0 on 
file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
2018-04-02 17:11:17,574 WARN 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to read 
tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log - 
Invalid Trailer version. got 111 expected 1
2018-04-02 17:11:17,576 ERROR 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
id=19 already exists
org.apache.hadoop.fs.FileAlreadyExistsException: 
/hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
already exists
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
{code}

Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
when WALProcedureStore#rollWriter() fails and returns false for some reason, it 
keeps looping continuously.


  was:
We have instance in our internal testing where master log is getting filled 
with following messages:
{code}
2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover 
lease on dfs file hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log
2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
Recovered lease, attempt=0 on 
file=hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log after 1ms
2018-04-02 17:11:17,574 WARN 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to read 
tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-0018.log - 
Invalid Trailer version. got 111 expected 1
2018-04-02 17:11:17,576 ERROR 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
id=19 already exists
org.apache.hadoop.fs.FileAlreadyExistsException: 
/hbase/MasterProcWALs/pv2-0019.log for client 10.17.202.11 
already exists
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
{code}

Debugging it f