Hi,
It doesn't look like the servers are loaded; we're not passing that much traffic through the cluster at the moment. Can you explain how to take the dump from the Thrift server? I couldn't find how to do that.
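(For reference, a thread dump of a Thrift gateway JVM can usually be taken with jstack, or with SIGQUIT if jstack isn't handy. A minimal sketch — the `ThriftServer` process name and the output filename are assumptions, adjust to your deployment:)

```shell
#!/bin/sh
# Sketch: capture a thread dump of the HBase Thrift gateway JVM.
# Assumes the process shows up in `jps` output as "ThriftServer";
# adjust the awk pattern to match your deployment.
THRIFT_PID=$(jps 2>/dev/null | awk '/ThriftServer/ {print $1}')

if [ -n "$THRIFT_PID" ]; then
  # jstack ships with the JDK and writes the dump to stdout.
  jstack "$THRIFT_PID" > thrift-threaddump.txt
  echo "dump written to thrift-threaddump.txt"
else
  # Fallback: sending SIGQUIT makes a JVM print its thread dump to its
  # own stdout (typically the gateway's .out log) without killing it.
  echo "ThriftServer not found in jps; run: kill -QUIT <thrift pid>"
fi
```

Either way the dump lands in a plain text file/log that can be attached to the thread.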
At the moment we have only 1 Thrift gateway; I'm going to add some more with load balancing. Thanks again.

On Wed, Feb 1, 2012 at 6:57 PM, Stack <[email protected]> wrote:
> On Wed, Feb 1, 2012 at 1:00 AM, Galed Friedmann
> <[email protected]> wrote:
> > 1. I've taken a dump from the HMaster when we felt some timeouts, I hope
> > that's what you're looking for, attached.
>
> I was looking for dumps of the hung up thrift server.
>
> The master dump shows it idle.
>
> > 2. The timeout occurs around 10-12 hours after the ZK established the
> > connection with the Thrift server so it's not immediate. In the Thrift logs
> > you see that nothing happened, and you only see the timeouts in the ZK logs.
> > Actually we hadn't had errors in the last 15 hours nor ZK timeouts for
> > Thrift but it'll happen again I'm sure..
>
> OK. Thread dump it when it's hung up. Thrift is getting stuck going
> against the cluster it seems. How many gateways are you running? Run
> more?
>
> > 3. The lease expiration happens all the time; we're using mostly JRuby
> > scripts and closing the scans when we're done.
>
> Could it be the client is taking a long time to get back to the
> server? Or maybe the server is taking a long time to respond because
> it's heavily loaded (is it?).
>
> St.Ack
>
> > Thanks again,
> > Galed.
> >
> > On Tue, Jan 31, 2012 at 10:51 PM, Stack <[email protected]> wrote:
> >>
> >> On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann
> >> <[email protected]> wrote:
> >> > Lately we're having weird issues with Thrift: after several hours the
> >> > Thrift server "hangs" - the scripts that are using it to access HBase
> >> > get connection timeouts. We're also using Heroku and Ruby on Rails apps
> >> > that use Thrift and they simply get stuck. Only when restarting the
> >> > Thrift process does everything go back to normal.
> >>
> >> Can you thread dump the thrift server when it's all hung up?
> >>
> >> Have you enabled
> >>
> >> > 2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
> >> > Established session 0x1352a393d18051e with negotiated timeout 90000 for
> >> > client /10.217.55.193:35940
> >> > 2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
> >> > Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
> >> > 2012-01-30 10:52:28,001 INFO
> >> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> >> > termination for sessionid: 0x1352a393d18051b
> >>
> >> ZK is establishing a session w/ 90second timeout and then timing out
> >> immediately?
> >>
> >> > 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> >> > listener on 60020: readAndProcess threw exception java.io.IOException:
> >> > Connection reset by peer. Count of bytes read: 0
> >> > java.io.IOException: Connection reset by peer
> >> >     at sun.nio.ch.FileDispatcher.read0(Native Method)
> >> >     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >> >     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
> >> >     at sun.nio.ch.IOUtil.read(IOUtil.java:210)
> >> >     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> >> >     at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
> >> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
> >> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> >> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> >> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> >     at java.lang.Thread.run(Thread.java:619)
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -4511393305838866925 lease expired
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -5818959718437063034 lease expired
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -1408921590864341720 lease expired
> >>
> >> Client went away? Do the lease expireds happen all the time, or just around
> >> the time of the hangup? (You are closing scanners when done?)
> >>
> >> St.Ack
