Hey Ted,
I finally got to look into the master and region server logs. I think some
regions have become orphaned because a region server went down, and they are
possibly stuck. Also, what can be done about stuck regions in general?
The master shows a region server registering with it (port 37544):
2015-11-17 22:26:14,105 INFO [FifoRpcScheduler.handler1-thread-2]
master.ServerManager: Registering
server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069
A little later, the master also shows this for a region server that was up
earlier at port 40359:
2015-11-17 22:26:15,637 INFO [master:ip-172-31-23-41:48470]
master.MasterFileSystem: Log folder
hdfs://ip-172-31-23-41.us-west-2.compute.internal:8020/Informatica/LDM/InfaCatalog1/hbase/WALs/ip-172-31-23-41.us-west-2.compute.internal,40359,1447816123180
doesn't belong to a known region server, splitting
2015-11-17 22:26:27,205 INFO
[MASTER_SERVER_OPERATIONS-ip-172-31-23-41:48470-0]
handler.ServerShutdownHandler: Finished processing of shutdown of
ip-172-31-23-41.us-west-2.compute.internal,40359,1447816123180
Finally, the namespace table assignment times out:
2015-11-17 22:27:21,607 WARN [master:ip-172-31-23-41:48470]
master.TableNamespaceManager: Timedout waiting for namespace table to be
assigned.
Meanwhile, in the same interval, the active region server (port 37544) keeps
logging the following (many times). The second entry, which mentions the dead
region server's port (40359), is not clear to me. Is the active RS trying to
contact the dead one and failing with ConnectException?
2015-11-17 22:26:55,945 INFO
[ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069-recovery-writer--pool11-t2]
client.AsyncProcess: #13, waiting for some tasks to finish. Expected max=0,
tasksSent=10, tasksDone=9, currentTasksDone=9, retries=9 hasError=false,
tableName=ldmns:indx_parameterstore
2015-11-17 22:26:55,946 INFO [htable-pool14-t2] client.AsyncProcess: #13,
table=ldmns:indx_parameterstore, attempt=10/400 failed 2 ops, last exception:
java.net.ConnectException: Connection refused on
ip-172-31-23-41.us-west-2.compute.internal,40359,1447816123180, tracking
started Tue Nov 17 22:26:27 PST 2015, retrying after 10046 ms, replay 2 ops.
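For reference, here is a small Python sketch of how the two server names in
the logs above can be told apart. It assumes the usual HBase
`host,port,startcode` server-name layout, where the start code is the server's
start time in epoch milliseconds. The class and function names are made up for
illustration and are not part of any HBase API.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ServerName(NamedTuple):
    host: str
    port: int
    startcode: int  # assumed: epoch milliseconds at which the server started


def parse_server_name(text: str) -> ServerName:
    """Split 'host,port,startcode' into its three parts."""
    host, port, startcode = text.strip().rsplit(",", 2)
    return ServerName(host, int(port), int(startcode))


def started_at(server: ServerName) -> datetime:
    """Convert the start code to a UTC timestamp."""
    return datetime.fromtimestamp(server.startcode / 1000, tz=timezone.utc)


# The two names that appear in the master and region server logs above.
old = parse_server_name(
    "ip-172-31-23-41.us-west-2.compute.internal,40359,1447816123180")
new = parse_server_name(
    "ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069")

# Same host, but the 40359 instance has the smaller (earlier) start code.
print(old.host == new.host, old.startcode < new.startcode)
print(started_at(old), started_at(new))
```

If the assumption holds, the name in the ConnectException line refers to the
earlier instance on the same host, not to the one that registered at 22:26:14.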
Thanks,
Sumit
From: Ted Yu <[email protected]>
To: Sumit Nigam <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Thursday, November 19, 2015 10:50 AM
Subject: Re: About exceptions
bq. because 159 region(s) in transition
This case seems to be similar to the one I saw where user table region
assignment blocked system table region assignment.
Can you take a look at the user regions which got stuck in transition?
One or more of them might continuously fail to open. You should get some
clue by checking region server log(s).
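One way to pull the stuck regions out of the balancer message is sketched
below in Python. The regex is an assumption based only on the (truncated)
"region(s) in transition" line quoted further down in this thread, and the
function names are hypothetical. Entries cut off by the "..." in the log are
skipped rather than guessed.

```python
import re
from collections import Counter

# Matches one complete entry such as:
#   {<32-hex region> state=PENDING_OPEN, ts=<millis>, server=<host,port,startcode>}
RIT_ENTRY = re.compile(
    r"\{(?P<region>[0-9a-f]{32})\s+state=(?P<state>\w+),"
    r"\s+ts=(?P<ts>\d+),\s+server=(?P<server>[^}\s]+)\}"
)


def regions_in_transition(log_text):
    """Return one dict per complete region-in-transition entry."""
    return [m.groupdict() for m in RIT_ENTRY.finditer(log_text)]


def count_by_state(entries):
    """Count entries per state, e.g. how many are PENDING_OPEN."""
    return Counter(e["state"] for e in entries)


sample = (
    "master.HMaster: Not running balancer because 159 region(s) in transition: "
    "{d93af1e3d8d460cf2ac980ad60ce3f3d={d93af1e3d8d460cf2ac980ad60ce3f3d "
    "state=PENDING_OPEN, ts=1447827986817, "
    "server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069}, "
    "83fc50ab0413f4a0e7f71e072ccaa6f5={83fc50ab0413f4a0e7f71e072ccaa6f5 state=PE..."
)

entries = regions_in_transition(sample)
for e in entries:
    print(e["region"], e["state"], e["server"])
print(count_by_state(entries))
```

The encoded region names it prints can then be searched for in the region
server log(s) to see why the open keeps failing.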
Cheers
On Wed, Nov 18, 2015 at 8:59 PM, Sumit Nigam <[email protected]> wrote:
> Hello Ted,
>
> I could finally replicate one of the issues below:
>
> 1. *Wed Nov 18* 02:27:36 EST 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@1a8bbdc9,
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException):
> java.io.IOException: org.apache.hadoop.hbase.master.*TableNamespaceManager
> isn't ready to serve*
> at
> org.apache.hadoop.hbase.master.TableNamespaceManager.getNamespaceTable(TableNamespaceManager.java:112)
> at
> org.apache.hadoop.hbase.master.TableNamespaceManager.list(TableNamespaceManager.java:211)
> at
> org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3473)
> at
> org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3367)
> at
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:43312)
>
>
>
> At the same time, the HMaster logs show the following lines:
>
> *2015-11-17* 22:27:21,607* WARN* [master:ip-172-31-23-41:48470]
> master.TableNamespaceManager: *Timedout* waiting for namespace table to
> be assigned.
> 2015-11-17 22:27:21,607 INFO [master:ip-172-31-23-41:48470]
> master.HMaster: *Master has completed* *initialization*
> 2015-11-17 22:31:21,616 DEBUG
> [ip-172-31-23-41.us-west-2.compute.internal,48470,1447827964772-BalancerChore]
> master.HMaster: Not running balancer because 159 region(s) in transition:
> {d93af1e3d8d460cf2ac980ad60ce3f3d={d93af1e3d8d460cf2ac980ad60ce3f3d
> state=PENDING_OPEN, ts=1447827986817,
> server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069},
> 83fc50ab0413f4a0e7f71e072ccaa6f5={83fc50ab0413f4a0e7f71e072ccaa6f5
> state=PE...
> 2015-11-17 22:36:21,616 DEBUG
> [ip-172-31-23-41.us-west-2.compute.internal,48470,1447827964772-BalancerChore]
> master.HMaster: *Not running balancer because 159 region(s) in transition*:
> {d93af1e3d8d460cf2ac980ad60ce3f3d={d93af1e3d8d460cf2ac980ad60ce3f3d
> state=PENDING_OPEN, ts=1447827986817,
> server=ip-172-31-23-41.us-west-2.compute.internal,37544,1447827973069},
> 83fc50ab0413f4a0e7f71e072ccaa6f5={83fc50ab0413f4a0e7f71e072ccaa6f5
> state=PE...
>
>
> I am not sure what makes it time out. I looked at that code, and it seems it
> tries to load all the regions for a given table but times out. I am not sure
> whether this points to a ZooKeeper or HDFS problem or something else.
>
> Would this give any clues?
>
> One more thing of interest is that the HBase client (which shows the
> error) and the HMaster machine in this particular case are not time-synced. I
> notice a day's gap, but I assume that NTP time sync is only a requirement
> for the HBase master and region servers, not for their clients.
>
> Thanks,
> Sumit
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* Sumit Nigam <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Sunday, November 15, 2015 9:14 PM
> *Subject:* Re: About exceptions
>
> bq. if we increase #retries from our end, is there a chance that it may
> get past the issue?
>
> Most likely the chance of getting past the issue would be low without
> manually fixing the condition.
>
> For #2, it is a mystery because 0.98 master does not have Procedure V2 in
> Apache. What distro are you using?
>
> For #3, unclean shutdown could be one of the causes. To make further
> assessment, log snippet from master concerning the table is desirable.
>
> Cheers
>
>
>
> On Sun, Nov 15, 2015 at 2:25 AM, Sumit Nigam <[email protected]> wrote:
>
> Thank you Ted.
>
> I was unaware of both those issues. The issue with these exceptions is
> that they are intermittent and do not replicate easily. So, let me see if I
> can replicate it with trace enabled. For #1, should retrying be attempted?
> Or possibly, if we increase #retries from our end, is there a chance that
> it may get past the issue? I like the idea of master having a WAL (
> HBASE-14190) to find/ fix such inconsistencies.
>
> #2 That trace showed up in a hbase client.
>
> #3 Is unclean shutdown possibly one case? I do not explicitly enable or
> disable tables. So, I assume those reasons may be related to HBase code?
> And any advice on whether I can somehow avoid it in the first place?
>
> Thanks,
> Sumit
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* Sumit Nigam <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Sunday, November 15, 2015 3:34 PM
> *Subject:* Re: About exceptions
>
> Sumit:
> For #1, I have seen a similar issue (HBASE-14190, though on hbase 1.x
> release).
> If you have debug logging enabled, please pastebin relevant master log
> snippet so that we can take a closer look.
>
> For #2, I am a bit confused - I didn't find CreateTableProcedure.java in
> 0.98 branch. To my knowledge, CreateTableProcedure is only in hbase 1
> release.
> Did you see the stack trace in master log?
>
> For #3, there could be various reasons a table was not enabled.
> You can trace the table assignment in master log, check log from
> hbase:meta server to see if you can find some clue.
>
> bq. Hbase fails only after it exhausts its attempts so retrying may not
> be helpful?
>
> Your understanding should be correct.
>
> I want to bring your attention to HBASE-12070 which helps you fix ZK
> inconsistencies.
>
> Cheers
>
>
>
> On Sun, Nov 15, 2015 at 12:29 AM, Sumit Nigam <[email protected]>
> wrote:
>
> Hi Ted,
>
> Thanks for your reply. I am using Hbase 0.98.14. I have used hbck, but for
> some (unknown) reason it has not always resolved inconsistencies.
>
> I have been able to get around these issues so far by deleting ZK entries
> for the offending table and restarting HBase. But I am not sure what causes
> them in the first place, or whether I can avoid them through code.
> Also, upon getting these exceptions, is it a good idea to retry the
> operation? I think Hbase fails only after it exhausts its attempts so
> retrying may not be helpful?
>
>
> Here are 3 log snippets:
>
> 1. TableNamespaceManager isn't ready to serve:
>
> Fri Nov 13 17:47:19 IST 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@44726f67,
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException):
> java.io.*IOException*: org.apache.hadoop.hbase.master.*TableNamespaceManager
> isn't ready to serve*
> at
> org.apache.hadoop.hbase.master.TableNamespaceManager.getNamespaceTable(TableNamespaceManager.java:112)
> at
> org.apache.hadoop.hbase.master.TableNamespaceManager.list(TableNamespaceManager.java:211)
> at
> org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3473)
> at
> org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:3367)
>
>
>
> 2. TableExistsException:
>
> Caused by: org.apache.hadoop.hbase.TableExistsException:
> org.apache.hadoop.hbase.*TableExistsException: ldmns:exDocStore*
> at
> org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.prepareCreate(CreateTableProcedure.java:300)
> at
> org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:106)
> at
> org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:58)
> ...
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3403)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:632)
> at org.apache.hadoop.hbase.client.HBaseAdmin.*createTable*
> (HBaseAdmin.java:523)
>
>
> 3. TableNotEnabledException:
>
> Caused by: org.apache.hadoop.hbase.*TableNotEnabledException*:
> ldmns:DataDomain_stage is disabled.
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1139)
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:963)
> at
> org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:74)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:833)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:810)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:842)
> at
> com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.getHelper(HBaseKeyColumnValueStore.java:155)
>
> ------------------------------
> *From:* Ted Yu <[email protected]>
> *To:* "[email protected]" <[email protected]>; Sumit Nigam <
> [email protected]>
> *Sent:* Sunday, November 15, 2015 10:50 AM
> *Subject:* Re: About exceptions
>
> bq. TableNotEnabledException, TableNotFoundException, IOException
>
> Can you show log snippets where these exceptions occurred?
> Which release of hbase are you using?
>
> Have you run hbck to repair the inconsistencies?
>
> See http://hbase.apache.org/book.html#hbck.in.depth
>
> Cheers
>
>
>
> On Sat, Nov 14, 2015 at 8:42 PM, Sumit Nigam <[email protected]
> > wrote:
>
> Hi,
> There are some exceptions which I face intermittently with HBase, and I
> thought some help from experts online could really help me. These are:
>   TableNotEnabledException
>   TableNotFoundException
>   IOException - TableNamespaceManager isn't ready to serve
>
> One of the reasons I can see for this seems to be ZooKeeper and HBase/HDFS
> data being out of sync due to an unclean shutdown.
> So, my questions are these:
> 1. Are these exceptions only related to unclean shutdowns?
> 2. Do I need to explicitly handle them and retry the operation, because they
> also seem to indicate a race condition between trying to access a table
> and HBase enabling it?
> Any help is greatly appreciated.
> Thanks,
> Sumit