Also, I just checked: dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both set to true, and dfs.namenode.stale.datanode.interval is left at its default of 30000 ms.
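For reference, this is roughly how those settings sit in hdfs-site.xml (just a sketch using the values mentioned above, not a copy-paste of our real file):

<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- default: a DataNode with no heartbeat for 30 seconds is considered stale -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>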
On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <[email protected]> wrote:

> Hi Nicolas,
>
> Please find the log attached.
>
> As I see it more clearly now, it was trying to recover HDFS WALs from the node that's dead:
>
> 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms
>
> And as you can see from the log, it tried 40 times, which took exactly 15 minutes.
>
> There's probably some parameter to tune to cut this down from 40 attempts / 15 minutes to something more useful, because for those 15 minutes our regions are not available, even though HDFS has a replication factor of 3.
>
> I'm googling; if I figure out what it is I will post it here. I would also appreciate it if someone knows how to cut this down.
>
> Thanks,
>
> Dejan
>
> On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <[email protected]> wrote:
>
>> The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?
>>
>> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <[email protected]> wrote:
>>
>> > The issue with the client was that it kept retrying connections to the same region servers even after the regions were reassigned. Lowering the number of retries helped in this specific use case, but yes, we still want the servers to reallocate quickly.
>> >
>> > What got me here:
>> >
>> > http://hbase.apache.org/book.html#mttr
>> >
>> > I basically set the configuration exactly the way it's explained here. *zookeeper.session.timeout* is (and was before) 60000 (one minute).
>> >
>> > So basically what happens is:
>> > - I have a test cluster consisting of four nodes, every node being a DataNode and a RegionServer.
>> > - I simulate a network partition on one of them (connect to it through the console and take the network interface down - simulating the network issues we had recently).
>> > - After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node whose RegionServer just 'disappeared'.
>> > - At this point I want my cluster to recover as soon as possible and start serving the missing regions.
>> > - The first thing I see in the HMaster logs is:
>> >
>> > 2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
>> >
>> > 2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
>> >
>> > 2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
>> >
>> > 2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
>> >
>> > 2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
>> >
>> > 2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > In the meantime, in the hbase...out log I got this:
>> >
>> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
>> >
>> > java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
>> >   at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
>> >   at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
>> >   at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
>> >   at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>> >   at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
>> >   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>> >   at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>> >   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>> >   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>> >   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> >   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>> >   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>> >   at org.mortbay.jetty.Server.handle(Server.java:326)
>> >   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>> >   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>> >   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> >   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> >   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> >   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> >   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
>> >
>> > Beside this same issue, please note that first message was at 2015-03-20 14:17:26,015. And then (we got to the point when it started transition):
>> >
>> > 2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
>> >
>> > 2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
>> >
>> > 2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261
>> >
>> > At this point, note that it finished the SplitLogManager task at 14:32:35 and started the transitions just after that. So this is the 15 minutes I'm talking about.
>> >
>> > What am I missing?
>> >
>> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <[email protected]> wrote:
>> >
>> > > You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions. That's the default iirc. You can push it to 20s, but then gc-stopping-the-world becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode must always be used, with a lower timeout than the hbase one.
>> > >
>> > > Client side there should be nothing to do (except for advanced stuff); at each retry the client checks the location of the regions. If you lower the number of retries the client will fail sooner, but usually you don't want the client to fail, you want the servers to reallocate quickly.
>> > >
>> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <[email protected]> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > Sorry for the slightly late update, but I managed to narrow it down a little bit.
>> > > >
>> > > > We didn't upgrade yet, as we are using the Hortonworks distribution right now, and even if we upgrade we will get 0.98.4. However, it looks like the issue here was in our use case and configuration (still looking into it).
>> > > >
>> > > > Basically, initially I saw that when one server goes down we start having performance issues in general, but that turned out to be on our client side, due to caching: clients were trying to reconnect to nodes that were offline and later trying to get regions from those nodes too. This is basically why, on the server side, I didn't see anything in the logs that was even a little bit interesting or pointed me in the desired direction.
>> > > >
>> > > > Another question that popped up for me: in case a server is down (and with it the DataNode and the HRegionServer it was hosting), what's the optimal time to set for the HMaster to consider the server dead and reassign its regions somewhere else? This is another performance bottleneck we hit while the regions are inaccessible. In our case it's configured to 15 minutes, and simple logic tells me that if you want it to happen earlier you configure a lower number of retries, but the issue is, as always, in the details, so I'm not sure if anyone knows some better math for this?
>> > > >
>> > > > And the last question - is it possible to manually force HBase to reassign regions? In this case, while the HMaster is retrying to contact the dead node, it's impossible to force it using the 'balancer' command.
>> > > >
>> > > > Thanks a lot!
>> > > >
>> > > > Dejan
>> > > >
>> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <[email protected]> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > To be very honest - there's no particular reason why we stick to this one, beside just a lack of time at the moment to go through the upgrade process, but it looks to me like that's going to be the next step.
>> > > > >
>> > > > > Had a crazy day, didn't have time to go through all the logs again, plus one of the machines (the last one where we had this issue) was fully reprovisioned yesterday, so I don't have logs from there anymore.
>> > > > >
>> > > > > Besides upgrading, which I will bring up today, can you just point me to the specific RPC issue in 0.98.0? The thing is that we have some strange moments with RPC in this case, and I just want to see if it's the same thing (we were even suspecting RPC).
>> > > > >
>> > > > > Thanks a lot!
>> > > > > Dejan
>> > > > >
>> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <[email protected]> wrote:
>> > > > >
>> > > > >> Is there a particular reason why you are using HBase 0.98.0? The latest 0.98 release is 0.98.11. There's a known performance issue with 0.98.0 pertaining to RPC that was fixed in later releases; you should move up from 0.98.0. In addition, hundreds of improvements and bug fixes have gone into the ten releases since 0.98.0.
>> > > > >>
>> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <[email protected]> wrote:
>> > > > >>
>> > > > >> > Hi All,
>> > > > >> >
>> > > > >> > We have a strange issue with HBase performance (overall cluster performance) when one of the datanodes in the cluster unexpectedly goes down.
>> > > > >> >
>> > > > >> > The scenario is as follows:
>> > > > >> > - The cluster works fine, it's stable.
>> > > > >> > - One DataNode unexpectedly goes down (PSU issue, network issue, anything).
>> > > > >> > - The whole HBase cluster goes down (performance becomes so bad that we have to restart all RegionServers to get it back to life).
>> > > > >> >
>> > > > >> > The funniest and latest occurrence was when we added a new node to the cluster (with 8 x 4T SATA disks) and left just the DataNode running on it to give it a couple of days to pick up some data. At some point, due to a hardware issue, the server rebooted (twice during three hours) at a moment when it had maybe 5% of the data it would have after a couple of days. Nothing else besides the DataNode was running on it, and once it went down, it affected literally everything, and restarting the RegionServers in the end fixed it.
>> > > > >> >
>> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0
>> > > > >>
>> > > > >> --
>> > > > >> Best regards,
>> > > > >>
>> > > > >>   - Andy
>> > > > >>
>> > > > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>> > > > >
>> > > >
>> > >
>> >
>>
>
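P.S. For anyone tuning the same thing who lands on this thread: here is a rough hbase-site.xml sketch of the knobs discussed above. The property names are the ones mentioned in the thread, plus the standard client retry setting; the values are illustrative only (whether it's safe to lower hbase.lease.recovery.timeout below its 900000 ms default is exactly what I'm still trying to figure out), so please don't read them as recommendations:

<property>
  <!-- 60000 ms = 1 minute; what we already run with -->
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
<property>
  <!-- the timeout behind the "Cannot recoverLease after trying for 900000ms ... attempt=40" message above -->
  <name>hbase.lease.recovery.timeout</name>
  <value>900000</value>
</property>
<property>
  <!-- client-side retries; lowering this makes clients fail faster but does not make the servers reassign regions faster -->
  <name>hbase.client.retries.number</name>
  <value>10</value>
</property>

On the "manually force reassignment" question: the HBase shell does have an assign '<region name>' command, but I haven't verified how it behaves while the master is still busy splitting the dead server's logs, so treat that as an untested idea.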
