So it is safe to set hbase.lease.recovery.timeout lower if you also set
dfs.namenode.heartbeat.recheck-interval lower (lowering that 10.5 minute
dead-node timer)? Or is it recommended not to touch either of those?
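
For concreteness, a minimal sketch of the two knobs in question (the values
are the illustrative ones discussed later in this thread, not
recommendations; verify the defaults against your own hbase-site.xml and
hdfs-site.xml):

  <!-- hbase-site.xml: how long HBase keeps retrying recoverLease before
       giving up and reading the WAL anyway (default 900000 ms = 15 min) -->
  <property>
    <name>hbase.lease.recovery.timeout</name>
    <value>60000</value>
  </property>

  <!-- hdfs-site.xml: how often the NameNode re-checks for dead DataNodes
       (default 300000 ms; dead-node detection is roughly
       2 * recheck-interval + 10 * dfs.heartbeat.interval, i.e. ~10.5 min
       with the defaults) -->
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>10000</value>
  </property>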

Reading the above with interest, thanks for digging in here guys.

On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon <nkey...@gmail.com> wrote:

> If the node is actually down it's fine. But the node may not be that down
> (CAP theorem here); and then it's looking for trouble.
> HDFS, by default, declares a node as dead after 10:30. The 15 minutes is an
> extra safety margin. It seems your hdfs settings are different (or there is
> a bug...). There should be some info in the hdfs logs.
>
> On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <dejan.men...@gmail.com>
> wrote:
>
> > Will take a look.
> >
> > Actually, if the node is down (someone unplugged a network cable, it just
> > died, whatever), what's the chance it's going to come back to life so the
> > write can continue? On the other side, HBase keeps trying to recover by
> > contacting a node which is not there anymore, and which is even marked as
> > dead on the Namenode side (another thing I didn't understand very well).
> >
> > So what I was expecting is that as soon as the Namenode decided the node
> > is dead, that would be enough for the RegionServer to stop trying to
> > recover from the dead node, but that wasn't the case. Also, this whole
> > MTTR article in the HBase book doesn't work at all with this parameter
> > set to its default value (15 minutes).
> >
> > So I'm still struggling to figure out what exactly the drawback of this is?
> >
> > On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <nkey...@gmail.com>
> wrote:
> >
> > > Thanks for the explanation. There is an issue if you modify this
> > > setting, however.
> > > hbase tries to recover the lease (i.e. be sure that nobody is writing).
> > > If you change hbase.lease.recovery.timeout, hbase will start the
> > > recovery (i.e. start to read) even if it's not sure that nobody's
> > > writing. That means there is a data loss risk.
> > > Basically, you must not see this warning: WARN
> > > org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after
> > > trying for[]
> > >
> > > The recoverLease must succeed. The fact that it does not after X tries
> > > is strange.
> > > There may be a mismatch between the hbase parameters and the hdfs ones.
> > > You may need to have a look at the comments in FSHDFSUtils.java
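
(For reference, a hedged sketch of the HDFS-side values those comments
compare against; the property names are real Hadoop 2.x settings, the
values are the usual defaults and should be checked against your
hdfs-site.xml:)

  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>        <!-- seconds between DataNode heartbeats -->
  </property>
  <property>
    <name>dfs.client.socket-timeout</name>
    <value>60000</value>    <!-- ms; commonly taken as the "dfs timeout" that
                                 hbase.lease.recovery.dfs.timeout should exceed -->
  </property>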
> > >
> > >
> > >
> > >
> > > On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.men...@gmail.com>
> > > wrote:
> > >
> > > > I found the issue and fixed it, and will try to explain here what
> > > > exactly happened in our case, in case someone else finds this
> > > > interesting too.
> > > >
> > > > So initially, we had (a couple of times) some network and hardware
> > > > issues in our datacenters. When one server would die (literally die,
> > > > we had some issues with PSUs) we saw issues with overall cluster
> > > > performance on the HBase side. As the cluster is quite big and live,
> > > > it was also quite hard to figure out the exact root cause and how to
> > > > fix the stuff we wanted to fix.
> > > >
> > > > So I set up another cluster: four nodes (with DataNode and
> > > > RegionServer) and two other nodes with HMaster and Namenode in HA,
> > > > using the same stuff we use in production. We pumped some data into
> > > > it, and I was able to reproduce the same issue on it last week. What
> > > > I tried to do is cut off one server (shut down its interface) when
> > > > all is good with the cluster and we have load, and see what happens.
> > > >
> > > > On Friday, after Nicolas mentioned it, I started taking a look at the
> > > > HBase logs on the node which was mentioned in the HMaster log as the
> > > > one taking over regions for the dead server. Basically what I was
> > > > able to observe was a 15 minute gap (+- a couple of seconds, which
> > > > was also interesting, and I will get to that later) between the
> > > > moment HMaster figures out that one of its RegionServers is dead,
> > > > and the time one of the mentioned nodes starts taking over those
> > > > regions and they start appearing in HMaster's Web UI.
> > > >
> > > > I then set up everything as mentioned here
> > > > http://hbase.apache.org/book.html#mttr - but still had exactly the
> > > > same issues. Went over (again and again) all the currently configured
> > > > stuff, but still had the same issue.
> > > >
> > > > Then I started looking into HDFS. Opened the NameNode UI, saw all was
> > > > good, took one node down, and was also watching the RegionServer logs
> > > > at the same time - and I saw that it also took ~15 minutes for the
> > > > Namenode to declare the dead node as dead. Somehow in the same moment
> > > > the regions started coming back to life. I remembered the dfs timeout
> > > > checks and check retries from some older versions. Looked into the
> > > > defaults for our Hadoop version -
> > > >
> > > > http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> > > > - and saw there that there's no recheck parameter anymore. Strange,
> > > > as on StackOverflow I found a post from a month ago, for a newer
> > > > version than we use (we use 2.4.1, the guy was using 2.6 -
> > > > dfs.namenode.heartbeat.recheck-interval). I set it to 10 seconds as
> > > > he mentioned, keeping heartbeats every three seconds (the default),
> > > > and got the DataNode declared dead in ~45 seconds, as he mentioned.
> > > > Not sure why this parameter is not documented, but obviously it
> > > > works.
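
(The arithmetic behind that ~45 seconds, assuming the usual HDFS dead-node
formula of 2 x recheck interval + 10 x heartbeat interval:)

  <!-- hdfs-site.xml: values used in this test -->
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>10000</value>   <!-- ms -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>       <!-- seconds (the default) -->
  </property>
  <!-- dead-node timeout = 2 * 10000 ms + 10 * 3000 ms = 50 s (~45 s observed);
       with the default recheck-interval of 300000 ms the same formula gives
       2 * 300000 + 10 * 3000 = 630000 ms = 10.5 minutes -->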
> > > >
> > > > Then I figured out it still didn't fix our HBase failover issue. I
> > > > was looking into the HBase book again and again, and then saw this
> > > > part:
> > > >
> > > > "How much time we allow elapse between calls to recover lease. Should
> > > > be larger than the dfs timeout."
> > > >
> > > > This was the description for hbase.lease.recovery.dfs.timeout. It
> > > > wasn't clear from that comment which of all the timeouts you can set
> > > > in Hadoop/HBase that have something to do with DFS it was referring
> > > > to. But I picked hbase.lease.recovery.timeout, and that was the catch.
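
(To make the distinction concrete, a sketch of the two similarly named
properties as they would appear in hbase-site.xml, with their defaults; the
64000 ms default for the dfs variant is from memory and worth verifying for
your release:)

  <property>
    <name>hbase.lease.recovery.dfs.timeout</name>
    <value>64000</value>    <!-- ms between successive recoverLease attempts -->
  </property>
  <property>
    <name>hbase.lease.recovery.timeout</name>
    <value>900000</value>   <!-- ms before HBase gives up on lease recovery
                                 and proceeds anyway - the 15 minutes seen here -->
  </property>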
> > > >
> > > > By default, hbase.lease.recovery.timeout is set to 15 minutes. Not
> > > > sure why, and I wasn't able to find out why yet, but by getting this
> > > > down to one minute (which is more than OK for us) I was able to get
> > > > rid of our issue. I'm also not sure why this is not mentioned in the
> > > > MTTR section of the HBase book, as obviously MTTR doesn't work at all
> > > > with this default timeout, at least not the way we expected it to
> > > > work.
> > > >
> > > > So thanks again to everyone who got spammed with this, and special
> > > > thanks to Nicolas for pointing me in the right direction.
> > > >
> > > > On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <nkey...@gmail.com>
> > > wrote:
> > > >
> > > > > the attachments are rejected by the mailing list, can you put them
> > > > > on pastebin?
> > > > >
> > > > > stale is mandatory (so it's good), but the issue here is just
> > > > > before that. The region server needs to read the file. In order to
> > > > > be sure that there is no data loss, it needs to "recover the
> > > > > lease", which means ensuring that nobody is writing the file. The
> > > > > regionserver calls the namenode to do this recoverLease. So there
> > > > > should be some info in the namenode logs. Do you have HDFS-4721 on
> > > > > your hdfs? The hbase book details (more or less...) this
> > > > > recoverLease stuff.
> > > > >
> > > > >
> > > > > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <
> > dejan.men...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > And also, just checked - dfs.namenode.avoid.read.stale.datanode
> > > > > > and dfs.namenode.avoid.write.stale.datanode are both true, and
> > > > > > dfs.namenode.stale.datanode.interval is set to the default of
> > > > > > 30000.
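
(For readers following along, those three settings look like this in
hdfs-site.xml, with the values reported above:)

  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>   <!-- ms before the NameNode marks a DataNode stale -->
  </property>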
> > > > > >
> > > > > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <
> > > dejan.men...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Nicolas,
> > > > > > >
> > > > > > > Please find log attached.
> > > > > > >
> > > > > > > As I see it now more clearly, it was trying to recover HDFS
> > > > > > > WALs from the node that's dead:
> > > > > > >
> > > > > > > 2015-03-23 08:53:44,381 WARN
> > > > > > > org.apache.hadoop.hbase.util.FSHDFSUtils:
> > > > > > > Cannot recoverLease after trying for 900000ms
> > > > > > > (hbase.lease.recovery.timeout); continuing, but may be
> > > > > > > DATALOSS!!!; attempt=40 on
> > > > > > > file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508
> > > > > > > after 908210ms
> > > > > > >
> > > > > > > And as you can see from the log, it tried 40 times, which took
> > > > > > > exactly 15 minutes.
> > > > > > >
> > > > > > > There's probably some parameter to tune to cut it down from 40
> > > > > > > tries / 15 minutes to something more useful, as for those 15
> > > > > > > minutes we don't have our regions available, even though HDFS
> > > > > > > has a replication factor of 3.
> > > > > > >
> > > > > > > Googling; if I figure out what this is I will post it here.
> > > > > > > Will also appreciate it if someone knows how to cut this down.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Dejan
> > > > > > >
> > > > > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <
> > nkey...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> The split is done by the region servers (the master
> > > > > > >> coordinates). Is there some interesting stuff in their logs?
> > > > > > >>
> > > > > > >> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <
> > > > dejan.men...@gmail.com
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > The issue with the client was that it was retrying
> > > > > > >> > connections to the same region servers even after the
> > > > > > >> > regions were reassigned. Lowering the retries helped in this
> > > > > > >> > specific use case, but yes, we still want the servers to
> > > > > > >> > reallocate quickly.
> > > > > > >> >
> > > > > > >> > What got me here:
> > > > > > >> >
> > > > > > >> > http://hbase.apache.org/book.html#mttr
> > > > > > >> >
> > > > > > >> > I basically set the configuration exactly the same way as
> > > > > > >> > it's explained here. *zookeeper.session.timeout* is (and was
> > > > > > >> > before) 60000 (one minute).
> > > > > > >> >
> > > > > > >> > So basically what happens is:
> > > > > > >> >
> > > > > > >> > - I have a test cluster consisting of four nodes, every node
> > > > > > >> > being DataNode and RegionServer.
> > > > > > >> > - I simulate a network partition on one of them (connect to
> > > > > > >> > it through the console and take the network interface down -
> > > > > > >> > simulating the network issues we had recently).
> > > > > > >> > - After a short time I see in HBase that my RegionServer is
> > > > > > >> > dead, and as the total number of regions I see the previous
> > > > > > >> > total minus the number of regions that were hosted on the
> > > > > > >> > RegionServer that just 'disappeared'.
> > > > > > >> > - At this point I want my cluster to recover as soon as
> > > > > > >> > possible, to start serving the missing regions.
> > > > > > >> > - First thing I see in HMaster logs are:
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,015 INFO
> > > > > > >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
> > > > RegionServer
> > > > > > >> > ephemeral node deleted, processing expiration
> > > > > > >> > [{name_of_node_I_took_down},60020,1426860403261]
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,067 INFO
> > > > > > >> >
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > > > > > Splitting
> > > > > > >> > logs for {name_of_node_I_took_down},60020,1426860403261
> before
> > > > > > >> assignment.
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,105 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: dead
> splitlog
> > > > > workers
> > > > > > [
> > > > > > >> > {name_of_node_I_took_down},60020,1426860403261]
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,107 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: started
> > > splitting
> > > > 1
> > > > > > >> logs in
> > > > > > >> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_
> > > > > took_down}
> > > > > > >> > ,60020,1426860403261-splitting]
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,150 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > > > >> > /hbase/splitWAL/WALs%2F
> > > > > > >> >
> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > > > > 60404905
> > > > > > >> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:26,182 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
> =
> > 1
> > > > > > >> unassigned
> > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > > > in_progress
> > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> > > > error =
> > > > > > 0}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:31,183 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
> =
> > 1
> > > > > > >> unassigned
> > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > > > in_progress
> > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> > > > error =
> > > > > > 0}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:36,184 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
> =
> > 1
> > > > > > >> unassigned
> > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > > > in_progress
> > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> > > > error =
> > > > > > 0}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:42,185 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
> =
> > 1
> > > > > > >> unassigned
> > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > > > in_progress
> > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> > > > error =
> > > > > > 0}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:17:48,184 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
> =
> > 1
> > > > > > >> unassigned
> > > > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > > > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > > > in_progress
> > > > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> > > > error =
> > > > > > 0}
> > > > > > >> > In the meantime, in the hbase...out log I got this:
> > > > > > >> >
> > > > > > >> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
> > > > > > >> >
> > > > > > >> > java.io.IOException: Call to
> > > > > > >> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_dow
> > > > > n}:60020
> > > > > > >> > failed on local exception:
> > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException:
> > > Call
> > > > > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > >> >
> > > > > > >> > at
> > org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClien
> > > > > > >> t.java:1532)
> > > > > > >> >
> > > > > > >> > at
> org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:
> > > 1502)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(Rpc
> > > > > > >> Client.java:1684)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImpl
> > > > > ementati
> > > > > > >> on.callBlockingMethod(RpcClient.java:1737)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$
> > > > > AdminService$
> > > > > > >> BlockingStub.getRegionInfo(AdminProtos.java:20806)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState
> > > > > > >> (HBaseAdmin.java:2524)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.generated.master.table_jsp._jspServi
> > > > > > >> ce(table_jsp.java:167)
> > > > > > >> >
> > > > > > >> > at
> org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.
> > > > > java:98)
> > > > > > >> >
> > > > > > >> > at
> > javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder
> > > > > > >> .java:511)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > > > >> r(ServletHandler.java:1221)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFil
> > > > > > >> ter.doFilter(StaticUserWebFilter.java:109)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > > > >> r(ServletHandler.java:1212)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilte
> > > > > > >> r(HttpServer.java:1081)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > > > >> r(ServletHandler.java:1212)
> > > > > > >> >
> > > > > > >> > at
> > > > > >
> > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > > > >> r(ServletHandler.java:1212)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandl
> > > > > > >> er.java:399)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHa
> > > > > > >> ndler.java:216)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandl
> > > > > > >> er.java:182)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandl
> > > > > > >> er.java:766)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.
> > > > > > >> java:450)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.handler.ContextHandlerCollection.handle(Co
> > > > > > >> ntextHandlerCollection.java:230)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapp
> > > > > > >> er.java:152)
> > > > > > >> >
> > > > > > >> > at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > > >> >
> > > > > > >> > at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnectio
> > > > > > >> n.java:542)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.jetty.HttpConnection$RequestHandler.headerComple
> > > > > > >> te(HttpConnection.java:928)
> > > > > > >> >
> > > > > > >> > at
> org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > > >> >
> > > > > > >> > at
> > > > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > > >> >
> > > > > > >> > at
> > > > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEn
> > > > > > >> dPoint.java:410)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.mortbay.thread.QueuedThreadPool$PoolThread.run(
> > > > > > >> QueuedThreadPool.java:582)
> > > > > > >> >
> > > > > > >> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$
> > > > > CallTimeoutException:
> > > > > > >> Call
> > > > > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCall
> > > > > > >> s(RpcClient.java:1234)
> > > > > > >> >
> > > > > > >> > at
> > > > > > >> >
> > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.readRespons
> > > > > > >> e(RpcClient.java:1171)
> > > > > > >> >
> > > > > > >> > at
> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClie
> > > > > > >> nt.java:751)
> > > > > > >> > Besides this same issue, please note that the first message
> > > > > > >> > was at 2015-03-20 14:17:26,015. And then (we get to the point
> > > > > > >> > when it started the transition):
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,059 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > > > >> > /hbase/splitWAL/WALs%2F
> > > > > > >> >
> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > > > > 60404905
> > > > > > >> > entered state: DONE
> > {fqdn_of_new_live_node},60020,1426859445623
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,109 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: Done
> splitting
> > > > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > > > >> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}
> > > > > > >> > %252C60020%252C1426860403261.1426860404905
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,190 INFO
> > > > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: finished
> > > splitting
> > > > > > >> (more
> > > > > > >> > than or equal to) 9 bytes in 1 log files in
> > > > > > >> >
> > > > > > >> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_
> > > > > > >> node_I_took_down},60020,1426860403261-splitting]
> > > > > > >> > in 909083ms
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > > > >> .RegionStates:
> > > > > > >> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN,
> > > > > > >> > ts=1426860639088,
> > > > > > server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > >> to
> > > > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE,
> > > ts=1426861955191,
> > > > > > >> server=
> > > > > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > > > >> .RegionStates:
> > > > > > >> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from
> > > > > > >> {name_of_node_I_took_down}
> > > > > > >> > ,60020,1426860403261
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > > > >> .RegionStates:
> > > > > > >> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN,
> > > > > > >> > ts=1426860641783,
> > > > > > server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > >> to
> > > > > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE,
> > > ts=1426861955191,
> > > > > > >> server=
> > > > > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > > > > >> >
> > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > > > >> .RegionStates:
> > > > > > >> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from
> > > > > > >> {name_of_node_I_took_down}
> > > > > > >> > ,60020,1426860403261
> > > > > > >> > At this point, note that it finished the SplitLogManager
> > > > > > >> > task at 14:32:35 and started transitioning just after that.
> > > > > > >> > So this is the 15 minutes I'm talking about.
> > > > > > >> >
> > > > > > >> > What am I missing?
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <
> > > > nkey...@gmail.com>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > You've changed the value of hbase.zookeeper.timeout to 15
> > > > > > >> > > minutes? A very reasonable target is 1 minute before
> > > > > > >> > > relocating the regions. That's the default iirc. You can
> > > > > > >> > > push it to 20s, but then gc-stopping-the-world becomes
> > > > > > >> > > more of an issue. 15 minutes is really a lot. The hdfs
> > > > > > >> > > stale mode must always be used, with a lower timeout than
> > > > > > >> > > the hbase one.
> > > > > > >> > >
> > > > > > >> > > Client side there should be nothing to do (except for
> > > > > > >> > > advanced stuff); at each retry the client checks the
> > > > > > >> > > location of the regions. If you lower the number of
> > > > > > >> > > retries the client will fail sooner, but usually you don't
> > > > > > >> > > want the client to fail, you want the servers to
> > > > > > >> > > reallocate quickly.
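
(A minimal sketch of the settings being referred to, assuming the usual
property names; the retry count default is illustrative and should be
checked for your release:)

  <!-- hbase-site.xml -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>   <!-- ms; ~1 minute before the master starts
                                reassigning a dead server's regions -->
  </property>
  <property>
    <name>hbase.client.retries.number</name>
    <value>35</value>      <!-- client-side only: lowering this makes the
                                client fail sooner, it does not speed up
                                server-side recovery -->
  </property>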
> > > > > > >> > >
> > > > > > >> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <
> > > > > > dejan.men...@gmail.com
> > > > > > >> >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi,
> > > > > > >> > > >
> > > > > > >> > > > Sorry for the slightly late update, but I managed to
> > > > > > >> > > > narrow it down a little bit.
> > > > > > >> > > >
> > > > > > >> > > > We didn't upgrade yet, as we are using the Hortonworks
> > > > > > >> > > > distribution right now, and even if we upgrade we will
> > > > > > >> > > > get 0.98.4. However, it looks like the issue here was in
> > > > > > >> > > > our use case and configuration (still looking into it).
> > > > > > >> > > >
> > > > > > >> > > > Basically, initially I saw that when one server goes
> > > > > > >> > > > down, we start having performance issues in general, but
> > > > > > >> > > > it turned out to be on our client side, due to caching:
> > > > > > >> > > > clients were trying to reconnect to nodes that were
> > > > > > >> > > > offline and later trying to get regions from those nodes
> > > > > > >> > > > too. This is basically why on the server side I didn't
> > > > > > >> > > > manage to see anything in the logs that would be even a
> > > > > > >> > > > little bit interesting and point me in the desired
> > > > > > >> > > > direction.
> > > > > > >> > > >
> > > > > > >> > > > Another question that popped up for me is - in case a
> > > > > > >> > > > server is down (and with it the DataNode and
> > > > > > >> > > > HRegionServer it was hosting) - what's the optimal time
> > > > > > >> > > > to set for HMaster to consider the server dead and
> > > > > > >> > > > reassign its regions somewhere else, as this is another
> > > > > > >> > > > performance bottleneck we hit while unable to access the
> > > > > > >> > > > regions? In our case it's configured to 15 minutes, and
> > > > > > >> > > > simple logic tells me that if you want it earlier you
> > > > > > >> > > > configure a lower number of retries, but the issue is,
> > > > > > >> > > > as always, in the details, so I'm not sure if anyone
> > > > > > >> > > > knows some better math for this?
> > > > > > >> > > >
> > > > > > >> > > > And the last question - is it possible to manually force
> > > > > > >> > > > HBase to reassign regions? In this case, while HMaster
> > > > > > >> > > > is retrying to contact the node that's dead, it's
> > > > > > >> > > > impossible to force it using the 'balancer' command.
> > > > > > >> > > >
> > > > > > >> > > > Thanks a lot!
> > > > > > >> > > >
> > > > > > >> > > > Dejan
> > > > > > >> > > >
> > > > > > >> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <
> > > > > > >> dejan.men...@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi,
> > > > > > >> > > > >
> > > > > > >> > > > > To be very honest - there's no particular reason why
> > > > > > >> > > > > we stick to this one, besides just a lack of time to
> > > > > > >> > > > > go through the upgrade process at the moment, but it
> > > > > > >> > > > > looks to me like that's going to be the next step.
> > > > > > >> > > > >
> > > > > > >> > > > > Had a crazy day, didn't have time to go through all
> > > > > > >> > > > > the logs again, plus one of the machines (the last one
> > > > > > >> > > > > where we had this issue) was fully reprovisioned
> > > > > > >> > > > > yesterday, so I don't have logs from there anymore.
> > > > > > >> > > > >
> > > > > > >> > > > > Besides upgrading, which I will bring up today, can
> > > > > > >> > > > > you just point me to the specific RPC issue in 0.98.0?
> > > > > > >> > > > > The thing is that we see some strange moments with RPC
> > > > > > >> > > > > in this case, and I just want to see if that's the
> > > > > > >> > > > > same thing (we were even suspecting RPC).
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks a lot!
> > > > > > >> > > > > Dejan
> > > > > > >> > > > >
> > > > > > >> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <
> > > > > > >> apurt...@apache.org
> > > > > > >> > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > >> Is there a particular reason why you are using HBase
> > > > 0.98.0?
> > > > > > The
> > > > > > >> > > latest
> > > > > > >> > > > >> 0.98 release is 0.98.11. There's a known performance
> > > issue
> > > > > with
> > > > > > >> > 0.98.0
> > > > > > >> > > > >> pertaining to RPC that was fixed in later releases,
> you
> > > > > should
> > > > > > >> move
> > > > > > >> > up
> > > > > > >> > > > >> from
> > > > > > >> > > > >> 0.98.0. In addition hundreds of improvements and bug
> > > fixes
> > > > > have
> > > > > > >> gone
> > > > > > >> > > > into
> > > > > > >> > > > >> the ten releases since 0.98.0.
> > > > > > >> > > > >>
> > > > > > >> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <
> > > > > > >> > dejan.men...@gmail.com
> > > > > > >> > > >
> > > > > > >> > > > >> wrote:
> > > > > > >> > > > >>
> > > > > > >> > > > >> > Hi All,
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > We have a strange issue with HBase performance
> > > > > > >> > > > >> > (overall cluster performance) when one of the
> > > > > > >> > > > >> > datanodes in the cluster unexpectedly goes down.
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > So the scenario is as follows:
> > > > > > >> > > > >> > - Cluster works fine, it's stable.
> > > > > > >> > > > >> > - One DataNode unexpectedly goes down (PSU issue,
> > > > > > >> > > > >> > network issue, anything).
> > > > > > >> > > > >> > - The whole HBase cluster goes down (performance
> > > > > > >> > > > >> > becomes so bad that we have to restart all
> > > > > > >> > > > >> > RegionServers to get it back to life).
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > The funniest and latest issue that happened was
> > > > > > >> > > > >> > that we added a new node to the cluster (having 8 x
> > > > > > >> > > > >> > 4T SATA disks) and left just the DataNode running
> > > > > > >> > > > >> > on it, to give it a couple of days to get some
> > > > > > >> > > > >> > data. At some point in time, due to a hardware
> > > > > > >> > > > >> > issue, the server rebooted (twice during three
> > > > > > >> > > > >> > hours) at a moment when it had maybe 5% of the data
> > > > > > >> > > > >> > it would have in a couple of days. Nothing else
> > > > > > >> > > > >> > besides the DataNode was running, and once it went
> > > > > > >> > > > >> > down it affected literally everything, and
> > > > > > >> > > > >> > restarting the RegionServers in the end fixed it.
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0
> > > > > > >> > > > >> >
> > > > > > >> > > > >>
> > > > > > >> > > > >>
> > > > > > >> > > > >>
> > > > > > >> > > > >> --
> > > > > > >> > > > >> Best regards,
> > > > > > >> > > > >>
> > > > > > >> > > > >>    - Andy
> > > > > > >> > > > >>
> > > > > > >> > > > >> Problems worthy of attack prove their worth by
> hitting
> > > > back.
> > > > > -
> > > > > > >> Piet
> > > > > > >> > > Hein
> > > > > > >> > > > >> (via Tom White)
> > > > > > >> > > > >>
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
