Also, I just checked: dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both set to true, and dfs.namenode.stale.datanode.interval is left at its default of 30000 ms.
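For reference, this is roughly how those settings sit in hdfs-site.xml (just a sketch using the values mentioned above, not a copy-paste of our real file):

<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- default: a DataNode with no heartbeat for 30 seconds is considered stale -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>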
On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <[email protected]> wrote:

> Hi Nicolas,
>
> Please find the log attached.
>
> As I see it more clearly now, it was trying to recover HDFS WALs from the node that's dead:
>
> 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms
>
> And as you can see from the log, it tried 40 times, which took exactly 15 minutes.
>
> There's probably some parameter to tune to cut this down from 40 attempts / 15 minutes to something more useful, because for those 15 minutes our regions are not available, even though HDFS has a replication factor of 3.
>
> I'm googling; if I figure out what it is I will post it here. I would also appreciate it if someone knows how to cut this down.
>
> Thanks,
>
> Dejan
>
> On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <[email protected]> wrote:
>
>> The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?
>>
>> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <[email protected]> wrote:
>>
>> > The issue with the client was that it kept retrying connections to the same region servers even after the regions were reassigned. Lowering the number of retries helped in this specific use case, but yes, we still want the servers to reallocate quickly.
>> >
>> > What got me here:
>> >
>> > http://hbase.apache.org/book.html#mttr
>> >
>> > I basically set the configuration exactly the way it's explained here. *zookeeper.session.timeout* is (and was before) 60000 (one minute).
>> >
>> > So basically what happens is:
>> > - I have a test cluster consisting of four nodes, every node being a DataNode and a RegionServer.
>> > - I simulate a network partition on one of them (connect to it through the console and take the network interface down - simulating the network issues we had recently).
>> > - After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node whose RegionServer just 'disappeared'.
>> > - At this point I want my cluster to recover as soon as possible and start serving the missing regions.
>> > - The first thing I see in the HMaster logs is:
>> >
>> > 2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
>> >
>> > 2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
>> >
>> > 2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
>> >
>> > 2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
>> >
>> > 2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
>> >
>> > 2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > 2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >
>> > In the meantime, in the hbase...out log I got this:
>> >
>> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
>> >
>> > java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
>> >   at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
>> >   at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
>> >   at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
>> >   at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>> >   at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
>> >   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>> >   at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>> >   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>> >   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>> >   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>> >   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> >   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>> >   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>> >   at org.mortbay.jetty.Server.handle(Server.java:326)
>> >   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>> >   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>> >   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> >   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> >   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> >   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> >   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
>> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
>> >
>> > Beside this same issue, please note that first message was at 2015-03-20 14:17:26,015. And then (we got to the point when it started transition):
>> >
>> > 2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
>> >
>> > 2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
>> >
>> > 2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>> >
>> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261
>> >
>> > At this point, note that it finished the SplitLogManager task at 14:32:35 and started the transitions just after that. So this is the 15 minutes I'm talking about.
>> >
>> > What am I missing?
>> >
>> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <[email protected]> wrote:
>> >
>> > > You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions. That's the default iirc. You can push it to 20s, but then gc-stopping-the-world becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode must always be used, with a lower timeout than the hbase one.
>> > >
>> > > Client side there should be nothing to do (except for advanced stuff); at each retry the client checks the location of the regions. If you lower the number of retries the client will fail sooner, but usually you don't want the client to fail, you want the servers to reallocate quickly.
>> > >
>> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <[email protected]> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > Sorry for the slightly late update, but I managed to narrow it down a little bit.
>> > > >
>> > > > We didn't upgrade yet, as we are using the Hortonworks distribution right now, and even if we upgrade we will get 0.98.4. However, it looks like the issue here was in our use case and configuration (still looking into it).
>> > > >
>> > > > Basically, initially I saw that when one server goes down we start having performance issues in general, but that turned out to be on our client side, due to caching: clients were trying to reconnect to nodes that were offline and later trying to get regions from those nodes too. This is basically why, on the server side, I didn't see anything in the logs that was even a little bit interesting or pointed me in the desired direction.
>> > > >
>> > > > Another question that popped up for me: in case a server is down (and with it the DataNode and the HRegionServer it was hosting), what's the optimal time to set for the HMaster to consider the server dead and reassign its regions somewhere else? This is another performance bottleneck we hit while the regions are inaccessible. In our case it's configured to 15 minutes, and simple logic tells me that if you want it to happen earlier you configure a lower number of retries, but the issue is, as always, in the details, so I'm not sure if anyone knows some better math for this?
>> > > >
>> > > > And the last question - is it possible to manually force HBase to reassign regions? In this case, while the HMaster is retrying to contact the dead node, it's impossible to force it using the 'balancer' command.
>> > > >
>> > > > Thanks a lot!
>> > > >
>> > > > Dejan
>> > > >
>> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <[email protected]> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > To be very honest - there's no particular reason why we stick to this one, beside just a lack of time at the moment to go through the upgrade process, but it looks to me like that's going to be the next step.
>> > > > >
>> > > > > Had a crazy day, didn't have time to go through all the logs again, plus one of the machines (the last one where we had this issue) was fully reprovisioned yesterday, so I don't have logs from there anymore.
>> > > > >
>> > > > > Besides upgrading, which I will bring up today, can you just point me to the specific RPC issue in 0.98.0? The thing is that we have some strange moments with RPC in this case, and I just want to see if it's the same thing (we were even suspecting RPC).
>> > > > >
>> > > > > Thanks a lot!
>> > > > > Dejan
>> > > > >
>> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <[email protected]> wrote:
>> > > > >
>> > > > >> Is there a particular reason why you are using HBase 0.98.0? The latest 0.98 release is 0.98.11. There's a known performance issue with 0.98.0 pertaining to RPC that was fixed in later releases; you should move up from 0.98.0. In addition, hundreds of improvements and bug fixes have gone into the ten releases since 0.98.0.
>> > > > >>
>> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <[email protected]> wrote:
>> > > > >>
>> > > > >> > Hi All,
>> > > > >> >
>> > > > >> > We have a strange issue with HBase performance (overall cluster performance) when one of the datanodes in the cluster unexpectedly goes down.
>> > > > >> >
>> > > > >> > The scenario is as follows:
>> > > > >> > - The cluster works fine, it's stable.
>> > > > >> > - One DataNode unexpectedly goes down (PSU issue, network issue, anything).
>> > > > >> > - The whole HBase cluster goes down (performance becomes so bad that we have to restart all RegionServers to get it back to life).
>> > > > >> >
>> > > > >> > The funniest and latest occurrence was when we added a new node to the cluster (with 8 x 4T SATA disks) and left just the DataNode running on it to give it a couple of days to pick up some data. At some point, due to a hardware issue, the server rebooted (twice during three hours) at a moment when it had maybe 5% of the data it would have after a couple of days. Nothing else besides the DataNode was running on it, and once it went down, it affected literally everything, and restarting the RegionServers in the end fixed it.
>> > > > >> >
>> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0
>> > > > >>
>> > > > >> --
>> > > > >> Best regards,
>> > > > >>
>> > > > >>   - Andy
>> > > > >>
>> > > > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>> > > > >
>> > > >
>> > >
>> >
>>
>
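P.S. For anyone tuning the same thing who lands on this thread: here is a rough hbase-site.xml sketch of the knobs discussed above. The property names are the ones mentioned in the thread, plus the standard client retry setting; the values are illustrative only (whether it's safe to lower hbase.lease.recovery.timeout below its 900000 ms default is exactly what I'm still trying to figure out), so please don't read them as recommendations:

<property>
  <!-- 60000 ms = 1 minute; what we already run with -->
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
<property>
  <!-- the timeout behind the "Cannot recoverLease after trying for 900000ms ... attempt=40" message above -->
  <name>hbase.lease.recovery.timeout</name>
  <value>900000</value>
</property>
<property>
  <!-- client-side retries; lowering this makes clients fail faster but does not make the servers reassign regions faster -->
  <name>hbase.client.retries.number</name>
  <value>10</value>
</property>

On the "manually force reassignment" question: the HBase shell does have an assign '<region name>' command, but I haven't verified how it behaves while the master is still busy splitting the dead server's logs, so treat that as an untested idea.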
