If the node is actually down, it's fine. But the node may not really be down (CAP theorem here), and then you're asking for trouble. HDFS, by default, declares a node as dead after 10 minutes 30 seconds. The 15 minutes is an extra safety margin. It seems your hdfs settings are different (or there is a bug...). There should be some info in the hdfs logs.
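For reference, that 10:30 default falls out of the NameNode's dead-node check: 2 x dfs.namenode.heartbeat.recheck-interval + 10 x dfs.heartbeat.interval, which with the stock defaults (300000 ms recheck, 3 s heartbeat) works out to 2 x 300 s + 10 x 3 s = 630 s, i.e. 10 minutes 30 seconds.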
On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <[email protected]> wrote:

> Will take a look.
>
> Actually, if a node is down (someone unplugged the network cable, it just died, whatever) what's the chance it's going to become live again so the write can continue? On the other side, HBase is now starting recovery trying to contact a node which is not there anymore, and which is even marked as dead on the Namenode side (another thing I didn't understand quite well).
>
> So what I was expecting is that as soon as the Namenode decided the node is dead, that would be enough for the RegionServer to stop trying to recover from the dead node, but it wasn't the case. Also, this whole MTTR article in the HBase book doesn't work at all with this parameter set to its default value (15 minutes).
>
> So I'm still struggling to figure out what exactly the drawback of this is?
>
> On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <[email protected]> wrote:
>
> > Thanks for the explanation. There is an issue if you modify this setting however.
> > hbase tries to recover the lease (i.e. be sure that nobody is writing).
> > If you change hbase.lease.recovery.timeout hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That means there is a dataloss risk.
> > Basically, you must not see this warning: WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for[]
> >
> > The recoverLease must succeed. The fact that it does not after X tries is strange.
> > There may be a mismatch between the hbase parameters and the hdfs ones.
> > You may need to have a look at the comments in FSHDFSUtils.java
> >
> > On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <[email protected]> wrote:
> >
> > > I found the issue and fixed it, and will try to explain here what exactly it was in our case, in case someone else finds this interesting too.
> > >
> > > So initially, we had (a couple of times) some network and hardware issues in our datacenters. When one server would die (literally die, we had some issues with PSUs) we saw issues with overall cluster performance on the HBase side. As the cluster is quite big and live, it was also quite hard to figure out the exact root cause and how to fix the stuff we wanted to fix.
> > >
> > > So I set up another cluster: four nodes (with DataNode and RegionServer) and two other nodes with HMaster and Namenode in HA, using the same stuff we use on production. We pumped some data into it, and I was able to reproduce the same issue on it last week. What I tried to do is to cut one server (shut down its interface) when all is good with the cluster, when we have load, and see what's going to happen.
> > >
> > > On Friday, after what Nicolas mentioned, I started taking a look at the HBase logs on the node which was mentioned in the HMaster log as the one taking over regions for the dead server. Basically what I was able to observe was 15 minutes (+- a couple of seconds, which was also interesting, and I will get to that later) between the moment HMaster figures out that one of its RegionServers is dead and the time one of the mentioned nodes starts taking over those regions and they start appearing in HMaster's Web UI.
> > >
> > > I then set up everything as mentioned here - http://hbase.apache.org/book.html#mttr - but still had exactly the same issues.
> > > Went over (again and again) all the currently configured stuff, but still had the same issue.
> > >
> > > Then I started looking into HDFS. Opened the NameNode UI, saw all is good, took one node down, and was also looking at the RegionServer logs at the same time - and I also saw that it took ~15 minutes for the Namenode to declare the dead node as dead. Somehow at the same moment regions started getting back to life. I remembered the dfs timeout checks and check retries from some older versions. Looked into the defaults for our Hadoop version - http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml - and saw there that there's no recheck parameter anymore. Strange, as on StackOverflow I found a post from a month ago, for a newer version than we use (we use 2.4.1, the guy was using 2.6), mentioning dfs.namenode.heartbeat.recheck-interval. I set it to 10 seconds as he suggested, keeping heartbeat checks at every three seconds (the default), and got the DataNode declared dead in ~45 seconds, as he described. Not sure why this parameter is not documented, but obviously it works.
> > >
> > > Then I figured out it still didn't fix our HBase failover issue. I was looking into the HBase book again and again, and then saw this part:
> > >
> > > "How much time we allow elapse between calls to recover lease. Should be larger than the dfs timeout."
> > >
> > > This was the description for hbase.lease.recovery.dfs.timeout. It wasn't clear from the comment which of all the timeouts that can be set in Hadoop/HBase, and that have something to do with DFS, this was about. But I picked hbase.lease.recovery.timeout, and that was the catch.
> > >
> > > Initially, by default, hbase.lease.recovery.timeout is set to 15 minutes. Not sure why, and I wasn't able to find out why yet, but getting this down to one minute (which is more than OK for us) I was able to get rid of our issue. I'm also not sure why this is not mentioned in the MTTR section of the HBase book, as obviously MTTR doesn't work at all with this default timeout, at least it doesn't work the way we expected it to work.
> > >
> > > So thanks again to everyone being spammed with this, and special thanks to Nicolas for pointing me in the right direction.
> > >
> > > On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <[email protected]> wrote:
> > >
> > > > the attachments are rejected by the mailing list, can you put them on pastebin?
> > > >
> > > > stale is mandatory (so it's good), but the issue here is just before. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", which means ensuring that nobody is writing the file. The regionserver calls the namenode to do this recoverLease. So there should be some info in the namenode logs. Do you have HDFS-4721 on your hdfs? The hbase book details (more or less...) this recoverLease stuff.
> > > >
> > > > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <[email protected]> wrote:
> > > >
> > > > > And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to the default 30000.
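To put the tuning described above in one place, here is a minimal sketch of the relevant hdfs-site.xml and hbase-site.xml properties. The values are the ones used in this thread, not general recommendations, and lowering hbase.lease.recovery.timeout carries the data-loss risk Nicolas describes, so treat it as a trade-off rather than a fix.

<!-- hdfs-site.xml -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- mark a DataNode stale after 30 s without a heartbeat (the default) -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>
<property>
  <!-- not listed in the 2.4.1 hdfs-default.xml but still honoured; 10 s here,
       with the 3 s heartbeat, got the dead DataNode declared dead in ~45 s -->
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>10000</value>
</property>

<!-- hbase-site.xml -->
<property>
  <!-- default is 900000 (15 minutes); lowered to one minute in this thread -->
  <name>hbase.lease.recovery.timeout</name>
  <value>60000</value>
</property>
<property>
  <!-- one minute, as used in this thread per the MTTR guidance -->
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>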
> > > > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <[email protected]> wrote:
> > > > >
> > > > > > Hi Nicolas,
> > > > > >
> > > > > > Please find the log attached.
> > > > > >
> > > > > > As I see it now more clearly, it was trying to recover HDFS WALs from the node that's dead:
> > > > > >
> > > > > > 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms
> > > > > >
> > > > > > And as you can see from the log, it tried 40 times, which took exactly 15 minutes.
> > > > > >
> > > > > > There's probably some parameter to tune to cut it down from 40 times / 15 minutes to something more useful, as for 15 minutes we don't have our regions available, and HDFS however has replication factor 3.
> > > > > >
> > > > > > Googling; if I figure out what this is I will post it here. Will also appreciate it if someone knows how to cut this down.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Dejan
> > > > > >
> > > > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <[email protected]> wrote:
> > > > > >
> > > > > >> The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?
> > > > > >>
> > > > > >> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <[email protected]> wrote:
> > > > > >>
> > > > > >> > With the client, the issue was that it was retrying connecting to the same region servers even when they were reassigned. Lowering it down helped in this specific use case, but yes, we still want servers to reallocate quickly.
> > > > > >> >
> > > > > >> > What got me here:
> > > > > >> >
> > > > > >> > http://hbase.apache.org/book.html#mttr
> > > > > >> >
> > > > > >> > I basically set the configuration exactly the same way as it's explained here. *zookeeper.session.timeout* is (and was before) 60000 (one minute).
> > > > > >> >
> > > > > >> > So basically what happens is:
> > > > > >> >
> > > > > >> > - I have a test cluster consisting of four nodes, every node being DataNode and RegionServer.
> > > > > >> > - I simulate a network partition on one (connect to it through the console and take the network interface down - simulating network issues we had recently).
> > > > > >> > - After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node hosting the RegionServer that just 'disappeared'.
> > > > > >> > - At this point I want my cluster to recover as soon as possible, to start serving the missing regions.
> > > > > >> > - First thing I see in HMaster logs are: > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,015 INFO > > > > > >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: > > > RegionServer > > > > > >> > ephemeral node deleted, processing expiration > > > > > >> > [{name_of_node_I_took_down},60020,1426860403261] > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,067 INFO > > > > > >> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: > > > > > Splitting > > > > > >> > logs for {name_of_node_I_took_down},60020,1426860403261 before > > > > > >> assignment. > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,105 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog > > > > workers > > > > > [ > > > > > >> > {name_of_node_I_took_down},60020,1426860403261] > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,107 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: started > > splitting > > > 1 > > > > > >> logs in > > > > > >> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_ > > > > took_down} > > > > > >> > ,60020,1426860403261-splitting] > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,150 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task > > > > > >> > /hbase/splitWAL/WALs%2F > > > > > >> > > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F > > > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268 > > > > 60404905 > > > > > >> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623 > > > > > >> > > > > > > >> > 2015-03-20 14:17:26,182 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = > 1 > > > > > >> unassigned > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_ > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status = > > > > in_progress > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 > > > error = > > > > > 0} > > > > > >> > > > > > > >> > 2015-03-20 14:17:31,183 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = > 1 > > > > > >> unassigned > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_ > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status = > > > > in_progress > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 > > > error = > > > > > 0} > > > > > >> > > > > > > >> > 2015-03-20 14:17:36,184 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = > 1 > > > > > >> unassigned > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_ > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status = > > > > in_progress > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 > > > error = > > > > > 0} > > > > > >> > > > > 
> > >> > 2015-03-20 14:17:42,185 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = > 1 > > > > > >> unassigned > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_ > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status = > > > > in_progress > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 > > > error = > > > > > 0} > > > > > >> > > > > > > >> > 2015-03-20 14:17:48,184 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = > 1 > > > > > >> unassigned > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_ > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status = > > > > in_progress > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 > > > error = > > > > > 0} > > > > > >> > In the meantime, In hbase...out log I got this: > > > > > >> > > > > > > >> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <== > > > > > >> > > > > > > >> > java.io.IOException: Call to > > > > > >> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_dow > > > > n}:60020 > > > > > >> > failed on local exception: > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: > > Call > > > > > >> > id=93152, waitTime=60044, rpcTimeout=60000 > > > > > >> > > > > > > >> > at > org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClien > > > > > >> t.java:1532) > > > > > >> > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java: > > 1502) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(Rpc > > > > > >> Client.java:1684) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImpl > > > > ementati > > > > > >> on.callBlockingMethod(RpcClient.java:1737) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$ > > > > AdminService$ > > > > > >> BlockingStub.getRegionInfo(AdminProtos.java:20806) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState > > > > > >> (HBaseAdmin.java:2524) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.generated.master.table_jsp._jspServi > > > > > >> ce(table_jsp.java:167) > > > > > >> > > > > > > >> > at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase. 
> > > > java:98) > > > > > >> > > > > > > >> > at > javax.servlet.http.HttpServlet.service(HttpServlet.java:770) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder > > > > > >> .java:511) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte > > > > > >> r(ServletHandler.java:1221) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFil > > > > > >> ter.doFilter(StaticUserWebFilter.java:109) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte > > > > > >> r(ServletHandler.java:1212) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilte > > > > > >> r(HttpServer.java:1081) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte > > > > > >> r(ServletHandler.java:1212) > > > > > >> > > > > > > >> > at > > > > > > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte > > > > > >> r(ServletHandler.java:1212) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandl > > > > > >> er.java:399) > > > > > >> > > > > > > >> > at > > > > > >> > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHa > > > > > >> ndler.java:216) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandl > > > > > >> er.java:182) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandl > > > > > >> er.java:766) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext. 
> > > > > >> java:450) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.handler.ContextHandlerCollection.handle(Co > > > > > >> ntextHandlerCollection.java:230) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapp > > > > > >> er.java:152) > > > > > >> > > > > > > >> > at org.mortbay.jetty.Server.handle(Server.java:326) > > > > > >> > > > > > > >> > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnectio > > > > > >> n.java:542) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.jetty.HttpConnection$RequestHandler.headerComple > > > > > >> te(HttpConnection.java:928) > > > > > >> > > > > > > >> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) > > > > > >> > > > > > > >> > at > > > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > > > > > >> > > > > > > >> > at > > > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEn > > > > > >> dPoint.java:410) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.mortbay.thread.QueuedThreadPool$PoolThread.run( > > > > > >> QueuedThreadPool.java:582) > > > > > >> > > > > > > >> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$ > > > > CallTimeoutException: > > > > > >> Call > > > > > >> > id=93152, waitTime=60044, rpcTimeout=60000 > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCall > > > > > >> s(RpcClient.java:1234) > > > > > >> > > > > > > >> > at > > > > > >> > > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.readRespons > > > > > >> e(RpcClient.java:1171) > > > > > >> > > > > > > >> > at > org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClie > > > > > >> nt.java:751) > > > > > >> > Beside this same issue, please note that first message was at > > > > > 2015-03-20 > > > > > >> > 14:17:26,015. 
And then (we got to the point when it started > > > > > transition): > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,059 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task > > > > > >> > /hbase/splitWAL/WALs%2F > > > > > >> > > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F > > > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268 > > > > 60404905 > > > > > >> > entered state: DONE > {fqdn_of_new_live_node},60020,1426859445623 > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,109 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: Done splitting > > > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down} > > > > > >> > > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down} > > > > > >> > %252C60020%252C1426860403261.1426860404905 > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,190 INFO > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: finished > > splitting > > > > > >> (more > > > > > >> > than or equal to) 9 bytes in 1 log files in > > > > > >> > > > > > > >> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_ > > > > > >> node_I_took_down},60020,1426860403261-splitting] > > > > > >> > in 909083ms > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master > > > > > >> .RegionStates: > > > > > >> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, > > > > > >> > ts=1426860639088, > > > > > server={name_of_node_I_took_down},60020,1426860403261} > > > > > >> to > > > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, > > ts=1426861955191, > > > > > >> server= > > > > > >> > {name_of_node_I_took_down},60020,1426860403261} > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master > > > > > >> .RegionStates: > > > > > >> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from > > > > > >> {name_of_node_I_took_down} > > > > > >> > ,60020,1426860403261 > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master > > > > > >> .RegionStates: > > > > > >> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, > > > > > >> > ts=1426860641783, > > > > > server={name_of_node_I_took_down},60020,1426860403261} > > > > > >> to > > > > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, > > ts=1426861955191, > > > > > >> server= > > > > > >> > {name_of_node_I_took_down},60020,1426860403261} > > > > > >> > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master > > > > > >> .RegionStates: > > > > > >> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from > > > > > >> {name_of_node_I_took_down} > > > > > >> > ,60020,1426860403261 > > > > > >> > At this point, note that it finished SplitLogManager task at > > > > 14:32:35 > > > > > >> and > > > > > >> > started transitioning just after that. So this is 15 minutes > > that > > > > I'm > > > > > >> > talking about. > > > > > >> > > > > > > >> > What am I missing? > > > > > >> > > > > > > >> > > > > > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon < > > > [email protected]> > > > > > >> wrote: > > > > > >> > > > > > > >> > > You've changed the value of hbase.zookeeper.timeout to 15 > > > > minutes? A > > > > > >> very > > > > > >> > > reasonable target is 1 minute before relocating the regions. > > > > That's > > > > > >> the > > > > > >> > > default iirc. You can push it to 20s, but then > > > > gc-stopping-the-world > > > > > >> > > becomes more of an issue. 15 minutes is really a lot. 
The > hdfs > > > > stale > > > > > >> mode > > > > > >> > > must always be used, with a lower timeout than the hbase > one. > > > > > >> > > > > > > > >> > > Client side there should be nothing to do (excepted for > > advanced > > > > > >> stuff); > > > > > >> > at > > > > > >> > > each retry the client checks the location of the regions. If > > you > > > > > lower > > > > > >> > the > > > > > >> > > number of retry the client will fail sooner, but usually you > > > don't > > > > > >> want > > > > > >> > the > > > > > >> > > client to fail, you want the servers to reallocate quickly. > > > > > >> > > > > > > > >> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges < > > > > > [email protected] > > > > > >> > > > > > > >> > > wrote: > > > > > >> > > > > > > > >> > > > Hi, > > > > > >> > > > > > > > > >> > > > Sorry for little bit late update, but managed to narrow it > > > > little > > > > > >> bit > > > > > >> > > down. > > > > > >> > > > > > > > > >> > > > We didn't update yet, as we are using Hortonworks > > distribution > > > > > right > > > > > >> > now, > > > > > >> > > > and even if we update we will get 0.98.4. However, looks > > that > > > > > issue > > > > > >> > here > > > > > >> > > > was in our use case and configuration (still looking into > > it). > > > > > >> > > > > > > > > >> > > > Basically, initially I saw that when one server goes down, > > we > > > > > start > > > > > >> > > having > > > > > >> > > > performance issues in general, but it managed to be on our > > > > client > > > > > >> side, > > > > > >> > > due > > > > > >> > > > to caching, and clients were trying to reconnect to nodes > > that > > > > > were > > > > > >> > > offline > > > > > >> > > > and later trying to get regions from those nodes too. This > > is > > > > > >> basically > > > > > >> > > why > > > > > >> > > > on server side I didn't manage to see anything in logs > that > > > > would > > > > > >> be at > > > > > >> > > > least little bit interesting and point me into desired > > > > direction. > > > > > >> > > > > > > > > >> > > > Another question that popped up to me is - in case server > is > > > > down > > > > > >> (and > > > > > >> > > with > > > > > >> > > > it DataNode and HRegionServer it was hosting) - what's > > optimal > > > > > time > > > > > >> to > > > > > >> > > set > > > > > >> > > > for HMaster to consider server dead reassign regions > > somewhere > > > > > >> else, as > > > > > >> > > > this is another performance bottleneck we hit during > > inability > > > > to > > > > > >> > access > > > > > >> > > > regions? In our case it's configured to 15 minutes, and > > simple > > > > > logic > > > > > >> > > tells > > > > > >> > > > me if you want it earlier then configure lower number of > > > > retries, > > > > > >> but > > > > > >> > > issue > > > > > >> > > > is as always in details, so not sure if anyone knows some > > > better > > > > > >> math > > > > > >> > for > > > > > >> > > > this? > > > > > >> > > > > > > > > >> > > > And last question - is it possible to manually force HBase > > to > > > > > >> reassign > > > > > >> > > > regions? In this case, while HMaster is retrying to > contact > > > node > > > > > >> that's > > > > > >> > > > dead, it's impossible to force it using 'balancer' > command. > > > > > >> > > > > > > > > >> > > > Thanks a lot! 
> > > > > >> > > > > > > > > >> > > > Dejan > > > > > >> > > > > > > > > >> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges < > > > > > >> [email protected]> > > > > > >> > > > wrote: > > > > > >> > > > > > > > > >> > > > > Hi, > > > > > >> > > > > > > > > > >> > > > > To be very honest - there's no particular reason why we > > > stick > > > > to > > > > > >> this > > > > > >> > > > one, > > > > > >> > > > > beside just lack of time currently to go through upgrade > > > > > process, > > > > > >> but > > > > > >> > > > looks > > > > > >> > > > > to me that's going to be next step. > > > > > >> > > > > > > > > > >> > > > > Had a crazy day, didn't have time to go through all logs > > > > again, > > > > > >> plus > > > > > >> > > one > > > > > >> > > > > of the machines (last one where we had this issue) is > > fully > > > > > >> > > reprovisioned > > > > > >> > > > > yesterday so I don't have logs from there anymore. > > > > > >> > > > > > > > > > >> > > > > Beside upgrading, what I will talk about today, can you > > > just > > > > > >> point > > > > > >> > me > > > > > >> > > to > > > > > >> > > > > the specific RPC issue in 0.98.0? Thing is that we have > > some > > > > > >> strange > > > > > >> > > > > moments with RPC in this case, and just want to see if > > > that's > > > > > the > > > > > >> > same > > > > > >> > > > > thing (and we were even suspecting to RPC). > > > > > >> > > > > > > > > > >> > > > > Thanks a lot! > > > > > >> > > > > Dejan > > > > > >> > > > > > > > > > >> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell < > > > > > >> [email protected] > > > > > >> > > > > > > > >> > > > > wrote: > > > > > >> > > > > > > > > > >> > > > >> Is there a particular reason why you are using HBase > > > 0.98.0? > > > > > The > > > > > >> > > latest > > > > > >> > > > >> 0.98 release is 0.98.11. There's a known performance > > issue > > > > with > > > > > >> > 0.98.0 > > > > > >> > > > >> pertaining to RPC that was fixed in later releases, you > > > > should > > > > > >> move > > > > > >> > up > > > > > >> > > > >> from > > > > > >> > > > >> 0.98.0. In addition hundreds of improvements and bug > > fixes > > > > have > > > > > >> gone > > > > > >> > > > into > > > > > >> > > > >> the ten releases since 0.98.0. > > > > > >> > > > >> > > > > > >> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges < > > > > > >> > [email protected] > > > > > >> > > > > > > > > >> > > > >> wrote: > > > > > >> > > > >> > > > > > >> > > > >> > Hi All, > > > > > >> > > > >> > > > > > > >> > > > >> > We have a strange issue with HBase performance > (overall > > > > > cluster > > > > > >> > > > >> > performance) in case one of datanodes in the cluster > > > > > >> unexpectedly > > > > > >> > > goes > > > > > >> > > > >> > down. > > > > > >> > > > >> > > > > > > >> > > > >> > So scenario is like follows: > > > > > >> > > > >> > - Cluster works fine, it's stable. > > > > > >> > > > >> > - One DataNode unexpectedly goes down (PSU issue, > > network > > > > > >> issue, > > > > > >> > > > >> anything) > > > > > >> > > > >> > - Whole HBase cluster goes down (performance becomes > so > > > bad > > > > > >> that > > > > > >> > we > > > > > >> > > > >> have to > > > > > >> > > > >> > restart all RegionServers to get it back to life). 
> > > > > >> > > > >> > > > > > > >> > > > >> > Most funny and latest issue that happened was that we > > > added > > > > > new > > > > > >> > node > > > > > >> > > > to > > > > > >> > > > >> the > > > > > >> > > > >> > cluster (having 8 x 4T SATA disks) and we left just > > > > DataNode > > > > > >> > running > > > > > >> > > > on > > > > > >> > > > >> it > > > > > >> > > > >> > to give it couple of days to get some data. At some > > point > > > > in > > > > > >> time, > > > > > >> > > due > > > > > >> > > > >> to > > > > > >> > > > >> > hardware issue, server rebooted (twice during three > > > hours) > > > > in > > > > > >> > moment > > > > > >> > > > >> when > > > > > >> > > > >> > it had maybe 5% of data it would have in a couple of > > > days. > > > > > >> Nothing > > > > > >> > > > else > > > > > >> > > > >> > beside DataNode was running, and once it went down, > it > > > > > affected > > > > > >> > > > literary > > > > > >> > > > >> > everything, and restarting RegionServers in the end > > fixed > > > > it. > > > > > >> > > > >> > > > > > > >> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0 > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> -- > > > > > >> > > > >> Best regards, > > > > > >> > > > >> > > > > > >> > > > >> - Andy > > > > > >> > > > >> > > > > > >> > > > >> Problems worthy of attack prove their worth by hitting > > > back. > > > > - > > > > > >> Piet > > > > > >> > > Hein > > > > > >> > > > >> (via Tom White) > > > > > >> > > > >> > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >
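On the side question in the thread about manually forcing HBase to reassign regions: once the master has finished splitting the dead server's WALs, regions can be moved or assigned by hand from the hbase shell. A rough sketch follows; the region and server names are placeholders, not values from this thread.

# from the hbase shell (0.98):
# close a region and let the master assign it elsewhere; 'true' forces the close
unassign 'REGION_NAME', true
# explicitly assign a region that is stuck offline
assign 'REGION_NAME'
# move a region (encoded name) to a chosen region server
move 'ENCODED_REGION_NAME', 'destination-host,60020,1426859445623'
# re-run the balancer once everything is back online
balancer

As Dejan notes above, this does not help during the recovery window itself: while the master is still splitting logs and recovering the lease from the dead node, the affected regions stay in transition and even the balancer command is refused.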
