Can you look at the tail of the master log to see which WAL is taking a long
time to split?

Check the NameNode log as well if needed.
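
If you want to rule HDFS in or out directly, something like the sketch below
(untested; it assumes the 0.94 default WAL layout under /hbase/.logs, so
adjust the path for your hbase.rootdir) times the same kind of
FileSystem.listStatus() call the master blocks in during log splitting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListWalDirs {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long start = System.currentTimeMillis();
        // The master issues listings like this while splitting logs; if
        // this is slow here too, look at the NameNode rather than HBase.
        for (FileStatus dir : fs.listStatus(new Path("/hbase/.logs"))) {
          System.out.println(dir.getPath() + ": "
              + fs.listStatus(dir.getPath()).length + " WAL(s)");
        }
        System.out.println("listStatus took "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }

If that returns quickly, the slowness is more likely in the split tasks
themselves than in the NameNode.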

Cheers

On Wed, Jun 25, 2014 at 12:09 PM, Tom Brown <[email protected]> wrote:

> Yes, that stack is still there:
>
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:372)
> - locked <0x00000000bfa0a068> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
> at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
> at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
> at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:337)
>
> What does that mean? That HDFS is behaving badly, or something else
> entirely?
>
> --Tom
>
>
>
> On Wed, Jun 25, 2014 at 11:45 AM, Ted Yu <[email protected]> wrote:
>
> > Looks like the master was stuck in a FileSystem.listStatus() call.
> > I noticed the following - did this show up if you take jstack one more time?
> >
> >    at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:372)
> >    - locked <0x00000000bfa0a068> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
> >    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
> >    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
> >    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:337)
> >    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:282)
> >    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:127)
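> >
> > The wait above is just SplitLogManager polling its task batch until every
> > split task is accounted for - roughly this shape (a paraphrased sketch,
> > not the exact 0.94 source):
> >
> >     // Hypothetical stand-in for the real TaskBatch fields.
> >     static class TaskBatch { int installed, done, error; }
> >
> >     static void waitForSplittingCompletion(TaskBatch batch)
> >         throws InterruptedException {
> >       synchronized (batch) {
> >         while (batch.done + batch.error < batch.installed) {
> >           batch.wait(100); // shows up as TIMED_WAITING in the jstack
> >         }
> >       }
> >     }
> >
> > So the question is why the split tasks never complete, which usually
> > points at the split workers or at HDFS.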
> >
> >
> >
> > On Wed, Jun 25, 2014 at 10:09 AM, Tom Brown <[email protected]> wrote:
> >
> > > Before I was able to acquire a stack trace, I restarted the master.
> > > However, the issue has just happened again and I was able to get a stack trace:
> > >
> > > http://pastebin.com/Mz5c6AML
> > >
> > > (The pastebin is set to never expire, so anyone viewing an archived version
> > > of this message should still be able to see the stack)
> > >
> > > The version of hbase is 0.94.10.
> > >
> > > Thanks!
> > >
> > > --Tom
> > >
> > >
> > > On Wed, Jun 18, 2014 at 8:55 PM, Qiang Tian <[email protected]> wrote:
> > >
> > > > Hi Tom,
> > > > Can you collect your master JVM stack trace when the problem happens and
> > > > put it on pastebin?
> > > > What is your HBase version?
> > > >
> > > >
> > > > On Thu, Jun 19, 2014 at 1:34 AM, Tom Brown <[email protected]> wrote:
> > > >
> > > > > Could this happen if the master is running too many RPC tasks and can't
> > > > > keep up? What about if there are too many connections to the server?
> > > > >
> > > > > --Tom
> > > > >
> > > > >
> > > > > On Wed, Jun 18, 2014 at 11:33 AM, Tom Brown <[email protected]> wrote:
> > > > >
> > > > > > That server is the master and is not a regionserver.
> > > > > >
> > > > > > --Tom
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 18, 2014 at 11:29 AM, Ted Yu <[email protected]> wrote:
> > > > > >
> > > > > >> Have you checked the region server log on 10.100.101.221?
> > > > > >>
> > > > > >> Cheers
> > > > > >>
> > > > > >>
> > > > > >> On Wed, Jun 18, 2014 at 10:19 AM, Tom Brown <[email protected]> wrote:
> > > > > >>
> > > > > >> > Hello all,
> > > > > >> >
> > > > > >> > I'm trying to view the master status of a 6-node (0.94.10; Hadoop 1.1.2)
> > > > > >> > cluster, but I keep getting a timeout exception.
> > > > > >> >
> > > > > >> > The rest of the cluster is operating quite normally. From the exception,
> > > > > >> > it seems like the "list tables" function (required to display the web UI)
> > > > > >> > is timing out for some reason.
> > > > > >> >
> > > > > >> > From the shell, I'm able to scan the entire .META. table, so the table
> > > > > >> > information is conceivably available. I don't understand the rest of the
> > > > > >> > architecture well enough to know what might be causing this timeout
> > > > > >> > during "list".
> > > > > >> >
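> > > > > >> > In case it helps, a minimal client along these lines (an untested
> > > > > >> > sketch; the raised timeout value is only an example) exercises the same
> > > > > >> > listTables/getHTableDescriptors RPC outside the web UI:
> > > > > >> >
> > > > > >> >     import org.apache.hadoop.conf.Configuration;
> > > > > >> >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > > > > >> >     import org.apache.hadoop.hbase.HTableDescriptor;
> > > > > >> >     import org.apache.hadoop.hbase.client.HBaseAdmin;
> > > > > >> >
> > > > > >> >     public class ListTablesCheck {
> > > > > >> >       public static void main(String[] args) throws Exception {
> > > > > >> >         Configuration conf = HBaseConfiguration.create();
> > > > > >> >         // Raise the RPC timeout past the 60s default to tell a slow
> > > > > >> >         // call apart from a permanently stuck one.
> > > > > >> >         conf.setInt("hbase.rpc.timeout", 300000);
> > > > > >> >         HBaseAdmin admin = new HBaseAdmin(conf);
> > > > > >> >         long start = System.currentTimeMillis();
> > > > > >> >         // Same RPC the master-status page issues.
> > > > > >> >         HTableDescriptor[] tables = admin.listTables();
> > > > > >> >         System.out.println(tables.length + " tables in "
> > > > > >> >             + (System.currentTimeMillis() - start) + " ms");
> > > > > >> >         admin.close();
> > > > > >> >       }
> > > > > >> >     }
> > > > > >> >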
> > > > > >> > Any suggestions?
> > > > > >> >
> > > > > >> > java.net.SocketTimeoutException: Call to hdpmgr001.pse.movenetworks.com/10.100.101.221:60000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36722 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > >> >         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1026)
> > > > > >> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:999)
> > > > > >> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
> > > > > >> >         at $Proxy11.getHTableDescriptors(Unknown Source)
> > > > > >> >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.listTables(HConnectionManager.java:1870)
> > > > > >> >         at org.apache.hadoop.hbase.client.HBaseAdmin.listTables(HBaseAdmin.java:279)
> > > > > >> >         at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.__jamon_innerUnit__userTables(MasterStatusTmplImpl.java:504)
> > > > > >> >         at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:297)
> > > > > >> >         at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:399)
> > > > > >> >         at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:389)
> > > > > >> >         at org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:82)
> > > > > >> >         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> > > > > >> >         at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> > > > > >> >         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> > > > > >> >         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> > > > > >> >         at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:101)
> > > > > >> >         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > >> >         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
> > > > > >> >         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > >> >         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > > > > >> >         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > > >> >         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > > > > >> >         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > > > > >> >         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > > > > >> >         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> > > > > >> >         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> > > > > >> >         at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > >> >         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> > > > > >> >         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> > > > > >> >         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > >> >         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > >> >         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > >> >         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> > > > > >> >         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > > > > >> > Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36722 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > >> >         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> > > > > >> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > > > > >> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > > > > >> >         at java.io.FilterInputStream.read(Unknown Source)
> > > > > >> >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:373)
> > > > > >> >         at java.io.BufferedInputStream.fill(Unknown Source)
> > > > > >> >         at java.io.BufferedInputStream.read(Unknown Source)
> > > > > >> >         at java.io.DataInputStream.readInt(Unknown Source)
> > > > > >> >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:646)
> > > > > >> >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:580)
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > The master log file is unhelpful. Almost all of it is notices about
> > > > > >> > skipping load balancing, but at least the exception appears in the log:
> > > > > >> >
> > > > > >> >
> > > > > >> > 2014-06-18 16:18:50,359 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=20 average=3.3333333 mostloaded=4 leastloaded=3
> > > > > >> > 2014-06-18 16:18:50,359 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=1 average=0.16666667 mostloaded=1 leastloaded=0
> > > > > >> > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=17 average=2.8333333 mostloaded=3 leastloaded=2
> > > > > >> > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=12 average=2.0 mostloaded=2 leastloaded=2
> > > > > >> > 2014-06-18 16:18:50,360 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=6 regions=68 average=11.333333 mostloaded=12 leastloaded=11
> > > > > >> > 2014-06-18 16:20:36,118 WARN org.mortbay.log: /master-status: java.net.SocketTimeoutException: Call to hdpmgr001.pse.movenetworks.com/10.100.101.221:60000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.101.221:36674 remote=hdpmgr001.pse.movenetworks.com/10.100.101.221:60000]
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --Tom
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
