We have determined that this issue is caused by our HDFS layer (which runs on an Isilon cluster). For some reason, the Isilon HDFS daemon gets overwhelmed and simply locks up; when we configure it to use more threads for handling HDFS connections, everything unblocks. Aside from the fact that this behavior is apparently a bug on their side, I would like to understand a little more about how HBase interacts with the HDFS cluster, so we can make sure we're configuring our stuff appropriately.
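
In case the context helps anyone on stock Apache HDFS: our change was made on the Isilon/OneFS side, and I'm assuming (not asserting) that the closest stock equivalents are the server-side RPC handler counts in hdfs-site.xml. A sketch, with illustrative values rather than anything we actually set:

    <!-- hdfs-site.xml (sketch only; we tuned the equivalent on the Isilon side) -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>  <!-- namenode RPC handler threads; default is 10 -->
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>  <!-- datanode RPC handler threads; default is 3 -->
    </property>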

How many connections will a single regionserver make to HDFS? Is it one shared connection for the whole RS, one connection for each region served, multiple connections per region, or some other formula?
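
My working assumption, and I'd welcome a correction here, is that the answer is closer to "one client per regionserver" than "one per region": Hadoop's FileSystem.get() caches client instances by scheme, authority, and user, so every region on an RS should share a single DFS client (and its namenode RPC connection), with data-transfer sockets to datanodes opened per open file rather than per region. A minimal sketch of that caching behavior (FsCacheCheck is just an illustrative name, not code from our deployment):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsCacheCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // FileSystem.get() caches instances by (scheme, authority, user),
            // so two lookups in the same JVM return the same client object.
            FileSystem fs1 = FileSystem.get(conf);
            FileSystem fs2 = FileSystem.get(conf);
            System.out.println(fs1 == fs2); // prints "true" with the default cache
        }
    }

If that model is wrong, and each region holds its own long-lived connections, that would change how we size the Isilon thread pool.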

--Tom

On Wed, Jun 25, 2014 at 1:48 PM, Ted Yu <[email protected]> wrote:

> Can you look at the tail of the master log to see which WAL takes a long
> time to split?
>
> Check the Namenode log if needed.
>
> Cheers
>
> On Wed, Jun 25, 2014 at 12:09 PM, Tom Brown <[email protected]> wrote:
>
> > Yes, that stack is still there:
> >
> > java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >     at java.lang.Object.wait(Native Method)
> >     at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:372)
> >     - locked <0x00000000bfa0a068> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
> >     at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
> >     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
> >     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:337)
> >
> > What does that mean? That HDFS is behaving badly, or something else
> > entirely?
> >
> > --Tom
> >
> > On Wed, Jun 25, 2014 at 11:45 AM, Ted Yu <[email protected]> wrote:
> >
> > > Looks like the master was stuck in a FileSystem.listStatus() call.
> > > I noticed the following - does this show up if you take a jstack one
> > > more time?
> > >
> > >     at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:372)
> > >     - locked <0x00000000bfa0a068> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
> > >     at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
> > >     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
> > >     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:337)
> > >     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:282)
> > >     at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:127)
> > >
> > > On Wed, Jun 25, 2014 at 10:09 AM, Tom Brown <[email protected]> wrote:
> > >
> > > > Before I was able to acquire a stack trace, I restarted the master.
> > > > However, the issue has just happened again and I was able to get a
> > > > stack trace:
> > > >
> > > > http://pastebin.com/Mz5c6AML
> > > >
> > > > (The pastebin is set to never expire, so anyone viewing an archived
> > > > version of this message should still be able to see the stack.)
> > > >
> > > > The version of hbase is 0.94.10.
> > > >
> > > > Thanks!
> > > >
> > > > --Tom
> > > >
> > > > On Wed, Jun 18, 2014 at 8:55 PM, Qiang Tian <[email protected]> wrote:
> > > >
> > > > > Hi Tom,
> > > > > Can you collect your master JVM stack trace when the problem happens
> > > > > and put it on pastebin?
> > > > > What is your hbase version?
> > > > >
> > > > > On Thu, Jun 19, 2014 at 1:34 AM, Tom Brown <[email protected]> wrote:
> > > > >
> > > > > > Could this happen if the master is running too many RPC tasks and
> > > > > > can't keep up? What about if there are too many connections to the
> > > > > > server?
> > > > > >
> > > > > > --Tom
> > > > > >
> > > > > > On Wed, Jun 18, 2014 at 11:33 AM, Tom Brown <[email protected]> wrote:
> > > > > >
> > > > > > > That server is the master and is not a regionserver.
> > > > > > >
> > > > > > > --Tom
> > > > > > >
> > > > > > > On Wed, Jun 18, 2014 at 11:29 AM, Ted Yu <[email protected]> wrote:
> > > > > > >
> > > > > > > > Have you checked the region server log on 10.100.101.221?
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > On Wed, Jun 18, 2014 at 10:19 AM, Tom Brown <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello all,
> > > > > > > > >
> > > > > > > > > I'm trying to view the master status of a 6 node (0.94.10;
> > > > > > > > > hadoop 1.1.2) cluster, but I keep getting a timeout exception.
> > > > > > > > >
> > > > > > > > > The rest of the cluster is operating quite normally. From the
> > > > > > > > > exception, it seems like the "list tables" function (required
> > > > > > > > > to display the web UI) is timing out for some reason.
> > > > > > > > >
> > > > > > > > > From the shell, I'm able to scan the entire .META. table, so
> > > > > > > > > the table information is conceivably available. I don't
> > > > > > > > > understand the rest of the architecture well enough to know
> > > > > > > > > what might be causing this timeout during "list".
> > > > > > > > >
> > > > > > > > > Any suggestions?
> > > > > > > > >
> > > > > > > > > java.net.SocketTimeoutException: Call to hdpmgr001.pse.movenetworks.com/10.100.101.221:60000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36722 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1026)
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:999)
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
> > > > > > > > >     at $Proxy11.getHTableDescriptors(Unknown Source)
> > > > > > > > >     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.listTables(HConnectionManager.java:1870)
> > > > > > > > >     at org.apache.hadoop.hbase.client.HBaseAdmin.listTables(HBaseAdmin.java:279)
> > > > > > > > >     at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.__jamon_innerUnit__userTables(MasterStatusTmplImpl.java:504)
> > > > > > > > >     at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:297)
> > > > > > > > >     at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:399)
> > > > > > > > >     at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:389)
> > > > > > > > >     at org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:82)
> > > > > > > > >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> > > > > > > > >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> > > > > > > > >     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> > > > > > > > >     at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:101)
> > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > >     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
> > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > >     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > > > > > > > >     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > > > > > >     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > > > > > > > >     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > > > > > > > >     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > > > > > > > >     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> > > > > > > > >     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> > > > > > > > >     at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > > > > >     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> > > > > > > > >     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> > > > > > > > >     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > > > > >     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > > > > >     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > > > > >     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> > > > > > > > >     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > > > > > > > > Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36722 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > > > > >     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> > > > > > > > >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > > > > > > > >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > > > > > > > >     at java.io.FilterInputStream.read(Unknown Source)
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:373)
> > > > > > > > >     at java.io.BufferedInputStream.fill(Unknown Source)
> > > > > > > > >     at java.io.BufferedInputStream.read(Unknown Source)
> > > > > > > > >     at java.io.DataInputStream.readInt(Unknown Source)
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:646)
> > > > > > > > >     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:580)
> > > > > > > > >
> > > > > > > > > The master log file is unhelpful. Almost all of it is notices
> > > > > > > > > about skipping load balancing, but at least the exception
> > > > > > > > > appears in the log:
> > > > > > > > >
> > > > > > > > > 2014-06-18 16:18:50,359 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=20 average=3.3333333 mostloaded=4 leastloaded=3
> > > > > > > > > 2014-06-18 16:18:50,359 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=1 average=0.16666667 mostloaded=1 leastloaded=0
> > > > > > > > > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=17 average=2.8333333 mostloaded=3 leastloaded=2
> > > > > > > > > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=12 average=2.0 mostloaded=2 leastloaded=2
> > > > > > > > > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=68 average=11.333333 mostloaded=12 leastloaded=11
> > > > > > > > > 2014-06-18 16:20:36,118 WARN org.mortbay.log: /master-status: java.net.SocketTimeoutException: Call to hdpmgr001.pse.movenetworks.com/10.100.101.221:60000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36674 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > > > > >
> > > > > > > > > --Tom
