[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-04-27 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516365#comment-14516365
 ] 

Nick Dimiduk commented on HBASE-12743:
--

Is this going to bite everyone on 1.1.0?

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 2.0.0, 1.1.0, 1.0.2
>
> Attachments: 12743.hack.txt
>
>
> Master is stuck for two days trying to rejoin cluster after monkey killed and 
> restarted it.
> After retrying to get namespace 350 times, Master goes down:
> {code}
> 2014-12-20 18:43:54,285 INFO  [c2020:16020.activeMasterManager] client.RpcRetryingCaller: Call exception, tries=349, retries=350, started=6885331 ms ago, cancelled=false, msg=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da., hostname=c2023.halxg.cloudera.com,16020,1418988286696, seqNum=600190
> 2014-12-20 18:43:54,303 WARN  [c2020:16020.activeMasterManager] master.TableNamespaceManager: Caught exception in initializing namespace table manager
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=350, exceptions:
> Sat Dec 20 16:49:08 PST 2014, RpcRetryingCaller{globalStartTime=1419122948954, pause=100, retries=350}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not online on c2023.halxg.cloudera.com,16020,1418988286696
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2722)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:851)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1695)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30434)
> {code}
> Seems like 2014-12-20 16:49:03,665 INFO  [RS_LOG_REPLAY_OPS-c2021:16020-0] 
> wal.WALSplitter: DistributedLogReplay = true
> Seems easy enough to reproduce.
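
For reference, the retry counters in the RpcRetryingCaller log line can be turned into wall-clock time. This is only an illustrative sketch: `RetryLogSummary` and its methods are hypothetical names, not part of HBase, and the regular expression assumes the field order shown in the quoted log.

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an HBase class): summarizes an RpcRetryingCaller log line.
class RetryLogSummary {
    // Assumes the fields appear as "tries=N, retries=N, started=N ms ago".
    private static final Pattern FIELDS =
        Pattern.compile("\\btries=(\\d+), retries=(\\d+), started=(\\d+) ms ago");

    // Returns {tries, retries, startedMillis} parsed from the log line.
    static long[] parse(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no retry fields in: " + line);
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    // Formats the counters together with the elapsed time in hours and minutes.
    static String describe(String line) {
        long[] f = parse(line);
        Duration elapsed = Duration.ofMillis(f[2]);
        return String.format("tries=%d of %d, elapsed=%dh%02dm",
            f[0], f[1], elapsed.toHours(), elapsed.toMinutes() % 60);
    }

    public static void main(String[] args) {
        String line = "client.RpcRetryingCaller: Call exception, tries=349, "
            + "retries=350, started=6885331 ms ago, cancelled=false";
        System.out.println(describe(line));
    }
}
```

For the log line above this prints `tries=349 of 350, elapsed=1h54m`, which agrees with the gap between the first exception (16:49:08) and the last log entry (18:43:54).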



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-03-11 Thread Gustavo Anatoly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357969#comment-14357969
 ] 

Gustavo Anatoly commented on HBASE-12743:
-

Hi, [~stack]

Sorry for the long delay in replying, and thanks for sharing how to reproduce 
the failure. I wasn't able to reproduce the exact exception, but I'm hunting 
for the root cause using ITBLL.

My first idea was to use mocks to reproduce the error, but I am going to 
continue with ITBLL.



> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 2.0.0, 1.0.1, 1.1.0
>
> Attachments: 12743.hack.txt
>
>


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-01-25 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291363#comment-14291363
 ] 

Enis Soztutar commented on HBASE-12743:
---

Moving to 1.0.1 for now. We can bring this back, if we can find the root cause.

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 2.0.0, 1.0.1, 1.1.0
>
> Attachments: 12743.hack.txt
>
>


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-01-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281147#comment-14281147
 ] 

stack commented on HBASE-12743:
---

bq. I'm trying to reproduce it from 
TableNamespaceManager.isTableAvailableAndInitialized(). Suggestions?

[~gustavoanatoly] Are you trying to reproduce the failure when DLR is running?  
If so, ITBLL + DLR + chaos monkey at a bit of scale on a cluster of 4/5 nodes 
seems to turn it up pretty easily.  But maybe you are after the particular 
exception posted?

[~jeffreyz] I will. My little cluster is currently occupied working on another 
issue.  I will be back to help on DLR once I am done with the current problem.

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
>


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-01-16 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280997#comment-14280997
 ] 

Jeffrey Zhong commented on HBASE-12743:
---

With the error "org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not 
online", the master won't start. But it should be unrelated to log recovery, 
whether splitting or replay.

[~saint@gmail.com] could you share more master logs so that I can check why 
hbase:namespace wasn't online & assigned for two hours? Thanks.
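
As a rough cross-check of the "two hours" figure, the time spent sleeping between retries can be estimated from pause=100 and retries=350 in the posted log. This is a sketch under a loud assumption: the backoff multiplier table below is what I believe `HConstants.RETRY_BACKOFF` held in the 1.x line and should be verified against the version in use. `BackoffEstimate` is a hypothetical name, and jitter and RPC time are ignored.

```java
// Hypothetical estimator (not an HBase class) for total client retry sleep time.
class BackoffEstimate {
    // ASSUMPTION: multiplier table believed to match HConstants.RETRY_BACKOFF
    // in HBase 1.x; verify against your version before relying on it.
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Sums pause * multiplier over the given number of sleeps; once the table
    // is exhausted, the last multiplier is reused for every further sleep.
    static long totalSleepMillis(long pauseMillis, int sleeps) {
        long total = 0;
        for (int i = 0; i < sleeps; i++) {
            int idx = Math.min(i, BACKOFF.length - 1);
            total += pauseMillis * BACKOFF[idx];
        }
        return total;
    }

    public static void main(String[] args) {
        // pause=100 and retries=350 are taken from the posted log.
        long ms = totalSleepMillis(100, 350);
        System.out.println(ms + " ms, about " + (ms / 60000) + " minutes");
    }
}
```

Under that assumption the sleeps alone add up to 6828100 ms (about 113 minutes), close to the started=6885331 ms reported in the log, so the window looks like one full retry cycle rather than anything specific to log recovery.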

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
>


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-01-16 Thread Gustavo Anatoly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280318#comment-14280318
 ] 

Gustavo Anatoly commented on HBASE-12743:
-

I'm trying to reproduce it from 
{{TableNamespaceManager.isTableAvailableAndInitialized()}}. Suggestions?

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
>


[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

2015-01-15 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279394#comment-14279394
 ] 

Enis Soztutar commented on HBASE-12743:
---

[~jeffreyz] do you have an idea about this?

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log 
> replay=true
> 
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
>