bq. because 159 region(s) in transition

This case seems to be similar to the one I saw where user table region assignment blocked system table region assignment.

Can you take a look at the user regions which got stuck in transition? One or more of them might continuously fail to open. You should get some clue by checking the region server log(s).

Cheers
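For reference, the regions-in-transition view that the BalancerChore is logging below can also be pulled from a client. This is only a minimal sketch against the 0.98-era HBaseAdmin/ClusterStatus API (the version assumed here is the 0.98.14 mentioned later in the thread), not a prescribed procedure:

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.master.RegionState;

    public class ListRegionsInTransition {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          ClusterStatus status = admin.getClusterStatus();
          // Same view the BalancerChore logs: encoded region name -> transition state.
          Map<String, RegionState> rit = status.getRegionsInTransition();
          System.out.println(rit.size() + " region(s) in transition");
          for (Map.Entry<String, RegionState> entry : rit.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
          }
        } finally {
          admin.close();
        }
      }
    }

The encoded region names it prints (for example d93af1e3d8d460cf2ac980ad60ce3f3d in the log below) are what to grep for in the region server logs.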
On Wed, Nov 18, 2015 at 8:59 PM, Sumit Nigam <[email protected]> wrote:

> Hello Ted,
>
> I could finally replicate one of the issues below:
>
> 1. *Wed Nov 18* 02:27:36 EST 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@1a8bbdc9,
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException):
> java.io.IOException: org.apache.hadoop.hbase.master.*TableNamespaceManager isn't ready to serve*
>     at org.apache.hadoop.hbase.master.TableNamespaceManager.getNamespaceTable(TableNamespaceManager.java:112)
>     at org.apache.hadoop.hbase.master.TableNamespaceManager.list(TableNamespaceManager.java:211)
>     at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3473)
>     at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3367)
>     at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:43312)
>
> At the same time, the HMaster log shows the following lines:
>
> *2015-11-17 22:27:21,607* *WARN* [master:ip-172-31-23-41:48470] master.TableNamespaceManager: *Timedout* waiting for namespace table to be assigned.
> 2015-11-17 22:27:21,607 INFO [master:ip-172-31-23-41:48470] master.HMaster: *Master has completed initialization*
> 2015-11-17 22:31:21,616 DEBUG [ip-172-31-23-41.us-west-2.compute.internal,48470,1447827964772-BalancerChore] master.HMaster: Not running balancer because 159 region(s) in transition: {d93af1e3d8d460cf2ac980ad60ce3f3d={d93af1e3d8d460cf2ac980ad60ce3f3d state=PENDING_OPEN, ts=1447827986817, server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069}, 83fc50ab0413f4a0e7f71e072ccaa6f5={83fc50ab0413f4a0e7f71e072ccaa6f5 state=PE...
> 2015-11-17 22:36:21,616 DEBUG [ip-172-31-23-41.us-west-2.compute.internal,48470,1447827964772-BalancerChore] master.HMaster: *Not running balancer because 159 region(s) in transition*: {d93af1e3d8d460cf2ac980ad60ce3f3d={d93af1e3d8d460cf2ac980ad60ce3f3d state=PENDING_OPEN, ts=1447827986817, server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069}, 83fc50ab0413f4a0e7f71e072ccaa6f5={83fc50ab0413f4a0e7f71e072ccaa6f5 state=PE...
>
> I am not sure what makes it time out. I looked at that code and it seems to try to load all the regions for a given table, but it times out. I am not sure whether that points to a ZooKeeper problem, an HDFS problem, or something else.
>
> Would this give any clues?
>
> One more thing of interest is that the HBase client (which reports the error) and the HMaster machine in this particular case are not time-synced. I notice a day's gap, but I assume that NTP time-sync is only a requirement for the HBase master/region servers and not also for their clients.
>
> Thanks,
> Sumit
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* Sumit Nigam <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Sunday, November 15, 2015 9:14 PM
> *Subject:* Re: About exceptions
>
> bq. if we increase #retries from our end, is there a chance that it may get past the issue?
>
> Most likely the chance of getting past the issue would be low without manually fixing the condition.
>
> For #2, it is a mystery because the 0.98 master does not have Procedure V2 in Apache. What distro are you using?
>
> For #3, unclean shutdown could be one of the causes. To make further assessment, a log snippet from the master concerning the table would be desirable.
>
> Cheers
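On the #retries question quoted just above, the knobs involved are ordinary HBase client configuration properties. A minimal sketch of raising the client retry budget before creating a connection; the property names are standard, the values here are purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;

    public class RetryTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Standard client-side settings; the values below are only illustrative.
        conf.setInt("hbase.client.retries.number", 50); // attempts before RetriesExhaustedException
        conf.setLong("hbase.client.pause", 500);        // base pause in ms, grows with backoff

        HConnection connection = HConnectionManager.createConnection(conf);
        try {
          // Tables obtained from this connection pick up the settings above.
        } finally {
          connection.close();
        }
      }
    }

A larger retry budget only buys time, though: if the namespace table never gets assigned, the call still fails once the retries are exhausted, which matches the "low chance without manually fixing the condition" point above.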
> On Sun, Nov 15, 2015 at 2:25 AM, Sumit Nigam <[email protected]> wrote:
>
> Thank you Ted.
>
> I was unaware of both those issues. The problem with these exceptions is that they are intermittent and do not replicate easily, so let me see if I can replicate them with trace enabled. For #1, should retrying be attempted? Or, if we increase #retries from our end, is there a chance that it may get past the issue? I like the idea of the master having a WAL (HBASE-14190) to find/fix such inconsistencies.
>
> #2: That trace showed up in an HBase client.
>
> #3: Unclean shutdown is possibly one cause? I do not explicitly enable/disable tables, so I assume those reasons may be related to HBase code? Any advice on how I can avoid it in the first place?
>
> Thanks,
> Sumit
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* Sumit Nigam <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Sunday, November 15, 2015 3:34 PM
> *Subject:* Re: About exceptions
>
> Sumit:
> For #1, I have seen a similar issue (HBASE-14190, though on an hbase 1.x release). If you have debug logging enabled, please pastebin the relevant master log snippet so that we can take a closer look.
>
> For #2, I am a bit confused - I didn't find CreateTableProcedure.java in the 0.98 branch. To my knowledge, CreateTableProcedure is only in the hbase 1 release. Did you see the stack trace in the master log?
>
> For #3, there could be various reasons a table was not enabled. You can trace the table assignment in the master log, and check the log from the hbase:meta server, to see if you can find some clue.
>
> bq. Hbase fails only after it exhausts its attempts so retrying may not be helpful?
>
> Your understanding should be correct.
>
> I want to bring your attention to HBASE-12070, which helps you fix ZK inconsistencies.
>
> Cheers
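For the "table was not enabled" discussion above, the state a client sees can be checked directly through the admin API. A minimal sketch against the 0.98-era HBaseAdmin; the ldmns table name is taken from the traces that follow, purely as an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TableStateCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          TableName table = TableName.valueOf("ldmns", "exDocStore");
          if (!admin.tableExists(table)) {
            System.out.println(table + " does not exist");
          } else if (!admin.isTableEnabled(table)) {
            // A disabled table is what produces TableNotEnabledException on reads/writes.
            System.out.println(table + " is disabled");
          } else {
            // isTableAvailable() also requires every region of the table to be assigned.
            System.out.println(table + " available: " + admin.isTableAvailable(table));
          }
        } finally {
          admin.close();
        }
      }
    }

isTableAvailable() is the stricter check: it also requires the table's regions to be assigned, which is exactly what the "159 region(s) in transition" balancer log earlier says is not happening.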
> On Sun, Nov 15, 2015 at 12:29 AM, Sumit Nigam <[email protected]> wrote:
>
> Hi Ted,
>
> Thanks for your reply. I am using HBase 0.98.14. I have used hbck, but for some (unknown) reason it has not always resolved the inconsistencies.
>
> I have been able to get around these issues so far by deleting the ZK entries for the offending table and restarting HBase. But I am not sure what causes them in the first place, and whether I can avoid those issues through code or not. Also, upon getting these exceptions, is it a good idea to retry the operation? I think HBase fails only after it exhausts its attempts, so retrying may not be helpful?
>
> Here are 3 log snippets:
>
> 1. TableNamespaceManager isn't ready to serve:
>
> Fri Nov 13 17:47:19 IST 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@44726f67,
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException):
> java.io.*IOException*: org.apache.hadoop.hbase.master.*TableNamespaceManager isn't ready to serve*
>     at org.apache.hadoop.hbase.master.TableNamespaceManager.getNamespaceTable(TableNamespaceManager.java:112)
>     at org.apache.hadoop.hbase.master.TableNamespaceManager.list(TableNamespaceManager.java:211)
>     at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3473)
>     at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3367)
>
> 2. TableExistsException:
>
> Caused by: org.apache.hadoop.hbase.TableExistsException: org.apache.hadoop.hbase.*TableExistsException: ldmns:exDocStore*
>     at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.prepareCreate(CreateTableProcedure.java:300)
>     at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:106)
>     at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:58)
>     ...
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3403)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:632)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.*createTable*(HBaseAdmin.java:523)
>
> 3. TableNotEnabledException:
>
> Caused by: org.apache.hadoop.hbase.*TableNotEnabledException*: ldmns:DataDomain_stage is disabled.
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1139)
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:963)
>     at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:74)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:833)
>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:810)
>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:842)
>     at com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.getHelper(HBaseKeyColumnValueStore.java:155)
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* "[email protected]" <[email protected]>; Sumit Nigam <[email protected]>
> *Sent:* Sunday, November 15, 2015 10:50 AM
> *Subject:* Re: About exceptions
>
> bq. TableNotEnabledException / TableNotFoundException / IOException
>
> Can you show log snippets where these exceptions occurred?
> Which release of hbase are you using?
>
> Have you run hbck to repair the inconsistencies?
>
> See http://hbase.apache.org/book.html#hbck.in.depth
>
> Cheers
>
> On Sat, Nov 14, 2015 at 8:42 PM, Sumit Nigam <[email protected]> wrote:
>
> Hi,
>
> There are some exceptions which I face intermittently with HBase, and I thought some help from the experts online could really help me. These are:
>
> TableNotEnabledException
> TableNotFoundException
> IOException - TableNamespaceManager isn't ready to serve
>
> One of the reasons I can see for this seems to be ZooKeeper and HBase/HDFS data being out of sync due to an unclean shutdown.
>
> So, my questions are these:
> 1. Are these exceptions only related to unclean shutdowns?
> 2. Do I need to explicitly handle them and retry the operation, because they also seem to indicate some race condition between trying to access a table vs. HBase enabling it?
>
> Any help is greatly appreciated.
>
> Thanks,
> Sumit
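On the closing question about explicitly handling these exceptions, a bounded check-and-retry can be written against the client API. The sketch below is only illustrative (the table name comes from trace #3; the attempt count and sleep are arbitrary), and carries the caveat from earlier in the thread that a retry only helps while a table is genuinely in the middle of being enabled, not when the underlying condition needs manual fixing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.TableNotEnabledException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GuardedGet {
      // Retry a get a bounded number of times while the table is (still) disabled.
      static Result getWithRetry(Configuration conf, TableName table,
                                 byte[] row, int attempts) throws Exception {
        for (int i = 1; ; i++) {
          try {
            HTable htable = new HTable(conf, table);
            try {
              return htable.get(new Get(row));
            } finally {
              htable.close();
            }
          } catch (TableNotEnabledException e) {
            if (i >= attempts) {
              throw e;             // give up; likely needs manual fixing
            }
            Thread.sleep(2000L);   // wait for the table to finish enabling
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Result r = getWithRetry(conf, TableName.valueOf("ldmns", "DataDomain_stage"),
            Bytes.toBytes("some-row"), 3);
        System.out.println(r);
      }
    }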
