[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516365#comment-14516365 ]

Nick Dimiduk commented on HBASE-12743:
--------------------------------------

Is this going to bite everyone on 1.1.0?

> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-12743
>                 URL: https://issues.apache.org/jira/browse/HBASE-12743
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 2.0.0, 1.1.0, 1.0.2
>
>         Attachments: 12743.hack.txt
>
>
> Master is stuck for two days trying to rejoin cluster after monkey killed and restarted it.
> After retrying to get namespace 350 times, Master goes down:
> {code}
> 2014-12-20 18:43:54,285 INFO [c2020:16020.activeMasterManager] client.RpcRetryingCaller: Call exception, tries=349, retries=350, started=6885331 ms ago, cancelled=false, msg=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da., hostname=c2023.halxg.cloudera.com,16020,1418988286696, seqNum=600190
> 2014-12-20 18:43:54,303 WARN [c2020:16020.activeMasterManager] master.TableNamespaceManager: Caught exception in initializing namespace table manager
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=350, exceptions:
> Sat Dec 20 16:49:08 PST 2014, RpcRetryingCaller{globalStartTime=1419122948954, pause=100, retries=350}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not online on c2023.halxg.cloudera.com,16020,1418988286696
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2722)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:851)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1695)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30434)
> {code}
> Seems like 2014-12-20 16:49:03,665 INFO [RS_LOG_REPLAY_OPS-c2021:16020-0] wal.WALSplitter: DistributedLogReplay = true
> Seems easy enough to reproduce.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357969#comment-14357969 ]

Gustavo Anatoly commented on HBASE-12743:
-----------------------------------------

Hi [~stack],

Sorry for the long delay in replying, and thanks for sharing how to reproduce the failure.
I wasn't able to reproduce the exact exception, but I'm hunting for the root cause using ITBLL. My first idea was to use mocks to reproduce the error, but I'm going to continue with ITBLL.

>                 Key: HBASE-12743
>             Fix For: 2.0.0, 1.0.1, 1.1.0
>         Attachments: 12743.hack.txt
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291363#comment-14291363 ]

Enis Soztutar commented on HBASE-12743:
---------------------------------------

Moving to 1.0.1 for now. We can bring this back, if we can find the root cause.

>                 Key: HBASE-12743
>             Fix For: 2.0.0, 1.0.1, 1.1.0
>         Attachments: 12743.hack.txt
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281147#comment-14281147 ]

stack commented on HBASE-12743:
-------------------------------

bq. I'm trying to reproduce it from TableNamespaceManager.isTableAvailableAndInitialized(). Suggestions?

[~gustavoanatoly] Are you trying to reproduce the failure when DLR (distributed log replay) is running? If so, ITBLL + DLR + chaos monkey at a bit of scale on a cluster of 4/5 nodes seems to turn it up pretty easily. But maybe you are focused on the particular exception posted?

[~jeffreyz] I will. My little cluster is currently occupied working on another issue. I will be back to help on DLR once I am done with the current problem.

>                 Key: HBASE-12743
>             Fix For: 1.0.0, 2.0.0, 1.1.0
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280997#comment-14280997 ]

Jeffrey Zhong commented on HBASE-12743:
---------------------------------------

With the error "org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not online", the master won't start. But it should be unrelated to log recovery, whether splitting or replay.

[~saint@gmail.com] could you share more master logs so that I can check why hbase:namespace wasn't online and assigned for two hours? Thanks.

>                 Key: HBASE-12743
>             Fix For: 1.0.0, 2.0.0, 1.1.0
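The "two hours" figure above can be cross-checked against the retry parameters in the log quoted in the issue description (pause=100, retries=350, started=6885331 ms ago). The sketch below is only a rough estimate: it assumes the client sleeps for the base pause times a fixed multiplier table before each retry, and the table copied here is an assumption modeled on the HBase 1.x retry backoff, not something stated in this thread. The class and method names are hypothetical, and jitter and RPC time are ignored.

```java
public class RetryBudget {
    // ASSUMPTION: multiplier table modeled on the HBase 1.x client retry backoff;
    // it does not appear anywhere in this thread.
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Pause before the retry with 0-based index `tries`, ignoring jitter.
    // Indexes past the end of the table reuse the last multiplier.
    static long pauseMillis(long basePauseMillis, int tries) {
        int index = Math.min(tries, RETRY_BACKOFF.length - 1);
        return basePauseMillis * RETRY_BACKOFF[index];
    }

    // Total time spent sleeping across the first `attempts` failed tries.
    static long totalPauseMillis(long basePauseMillis, int attempts) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            total += pauseMillis(basePauseMillis, i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Values taken from the quoted log: pause=100, tries=349.
        long total = totalPauseMillis(100, 349);
        System.out.println("Estimated sleep time for 349 tries: " + total + " ms");
    }
}
```

Under these assumptions the sleeps alone add up to 6,808,100 ms, roughly 1.9 hours, which is in the same range as the logged started=6885331 ms ago. That would suggest the master spent its whole retry budget waiting for hbase:namespace to come online rather than failing fast.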
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280318#comment-14280318 ]

Gustavo Anatoly commented on HBASE-12743:
-----------------------------------------

I'm trying to reproduce it from {{TableNamespaceManager.isTableAvailableAndInitialized()}}. Suggestions?

>                 Key: HBASE-12743
>             Fix For: 1.0.0, 2.0.0, 1.1.0
[jira] [Commented] (HBASE-12743) [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true
[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279394#comment-14279394 ]

Enis Soztutar commented on HBASE-12743:
---------------------------------------

[~jeffreyz] do you have an idea about this?

>                 Key: HBASE-12743
>             Fix For: 1.0.0, 2.0.0, 1.1.0