Excellent. I updated versions and the rate of these errors has dropped significantly. Thanks!

On Mon, Aug 11, 2014 at 2:03 AM, Qiang Tian <[email protected]> wrote:

> update - it has been fixed in 0.98.4, HBASE-11374
>
> On Sat, Aug 9, 2014 at 5:53 PM, Qiang Tian <[email protected]> wrote:
>
>> please see HBASE-11714
>>
>> On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian <[email protected]> wrote:
>>
>>> took a quick look. it looks like a bug, will dig more and update later.
>>>
>>> On Sat, Aug 9, 2014 at 1:45 AM, Walter King <[email protected]> wrote:
>>>
>>>> awesome, thanks!
>>>> The resets and such were actually what I've been trying to investigate.
>>>> We recently switched from 0.94 to 0.98 and are finding that things
>>>> periodically have issues - lots of retry exceptions. I don't have much
>>>> information yet; I was still looking into it when I saw this.
>>>>
>>>> By timestamp, this seems to be the corresponding client exception:
>>>>
>>>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops.
>>>>
>>>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for some tasks to finish. Expected max=0, tasksSent=11, tasksDone=10, currentTasksDone=10, retries=10 hasError=false, tableName=rt_global_monthly_campaign_deliveries
>>>>
>>>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <[email protected]> wrote:
>>>>
>>>>> The RequestContext seems to be ok since it hasn't been set. So just
>>>>> decrementing the call size should be enough.
>>>>>
>>>>> cheers,
>>>>> esteban.
>>>>>
>>>>> --
>>>>> Cloudera, Inc.
>>>>>
>>>>> On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <[email protected]> wrote:
>>>>>
>>>>>> Looks like after HBASE-10506 we didn't do what CallRunner used to do
>>>>>> before: clear the RequestContext and decrement callQueueSize.
>>>>>>
>>>>>> Created HBASE-11705 to address this issue.
>>>>>>
>>>>>> --
>>>>>> Cloudera, Inc.
>>>>>>
>>>>>> On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <[email protected]> wrote:
>>>>>>
>>>>>>> good point. that is a big suspect.
>>>>>>>
>>>>>>> I checked your log. The ClosedChannelException should be triggered by
>>>>>>> call.sendResponseIfReady() (it is the only request in the queue, so the
>>>>>>> handler sends the response directly), but at that point the
>>>>>>> callQueueSize has already been decremented.
>>>>>>>
>>>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020] ipc.RpcServer (RpcServer.java:processResponse(1041)) - RpcServer.respondercallId: 118504 service: ClientService methodName: Multi size: 141.9 K connection: 10.248.134.67:55347: output error
>>>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020] ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=57,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
>>>>>>>
>>>>>>> it looks like you have got the fix, would you file a jira?
>>>>>>> thanks.
>>>>>>>
>>>>>>> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <[email protected]> wrote:
>>>>>>>
>>>>>>>> I've only looked at the code a little, and likely missed something,
>>>>>>>> but does this if block decrement the call queue if the client already
>>>>>>>> closed the connection?
>>>>>>>>
>>>>>>>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>>>>>>>
>>>>>>>> On Thu, Aug 7, 2014 at 11:32 PM, Walter King <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, sorry, CallQueueTooBigException. But that value never returns to
>>>>>>>>> zero, even when the number of requests goes to zero. The "call queue
>>>>>>>>> too big" condition happens if any regionserver is up for a long
>>>>>>>>> enough period of time, so I have to restart them periodically. Also,
>>>>>>>>> at that 15:30 time I wasn't seeing that exception, but it seems like
>>>>>>>>> that is one time when a call didn't properly decrement the
>>>>>>>>> callQueueSize, because it was at zero before and has never hit zero
>>>>>>>>> again - today the minimum is even higher.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> bq. "Eventually we ran into ipc queue size full messages being
>>>>>>>>>> returned to clients trying large batch puts, as it approaches a
>>>>>>>>>> gigabyte."
>>>>>>>>>>
>>>>>>>>>> Do you mean CallQueueTooBigException? That limit is not on the
>>>>>>>>>> number of queued calls but on the accumulated size of the data
>>>>>>>>>> clients send, configured by "hbase.ipc.server.max.callqueue.size".
>>>>>>>>>>
>>>>>>>>>> I guess when your client got the exception, it closed the
>>>>>>>>>> connection, causing other RPCs on the shared connection to fail.
>>>>>>>>>>
>>>>>>>>>> 2014-08-06 22:27:57,253 WARN [RpcServer.reader=9,port=60020] ipc.RpcServer (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of bytes read: 0
>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>         at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> 2014-08-06 22:27:57,257 WARN [RpcServer.handler=18,port=60020] ipc.RpcServer (RpcServer.java:processResponse(1041)) - RpcServer.respondercallId: 84968 service: ClientService methodName: Multi size: 17.7 K connection: 10.248.130.152:49780: output error
>>>>>>>>>> 2014-08-06 22:27:57,258 WARN [RpcServer.handler=18,port=60020] ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=18,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
>>>>>>>>>> 2014-08-06 22:27:57,260 WARN [RpcServer.handler=61,port=60020] ipc.RpcServer (RpcServer.java:processResponse(1041)) - RpcServer.respondercallId: 83907 service: ClientService methodName: Multi size: 17.1 K connection: 10.248.1.56:53615: output error
>>>>>>>>>> 2014-08-06 22:27:57,263 WARN [RpcServer.handler=61,port=60020] ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=61,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>>>>>>>>>>> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>>>>>>>>>>>
>>>>>>>>>>> These are logs from that particular server, and the debug dump from
>>>>>>>>>>> now (no restart in between). The times in the graph are Pacific, so
>>>>>>>>>>> it should be around 2014-08-06 22:25:00. I do see some exceptions
>>>>>>>>>>> around there.
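
For anyone landing on this thread later: below is a minimal, standalone sketch of the accounting leak described above. It is not HBase source; the class and field names are invented for the illustration. The shape matches the discussion, though: the real counter is RpcServer's callQueueSize, the early return in question is the closed-connection check in CallRunner.run(), and HBASE-11705 is the authoritative fix.

import java.util.concurrent.atomic.AtomicLong;

// Standalone illustration of the leak: the server adds a call's size to a
// shared counter when the call is queued, but an early return on a dead
// client connection skips the decrement, so the counter creeps upward until
// CallQueueTooBigException is thrown even when traffic stops.
public class CallQueueAccountingSketch {

  // Shared counter of bytes currently held by queued calls, bounded elsewhere
  // by hbase.ipc.server.max.callqueue.size.
  static final AtomicLong callQueueSize = new AtomicLong();

  // Minimal stand-in for an RPC call a handler picked off the queue.
  static final class Call {
    final long size;           // bytes added to callQueueSize on enqueue
    final boolean clientGone;  // did the client close the channel already?
    Call(long size, boolean clientGone) {
      this.size = size;
      this.clientGone = clientGone;
    }
  }

  // Leaky shape: the early return skips the decrement.
  static void runLeaky(Call call) {
    if (call.clientGone) {
      return;                               // call.size is never given back
    }
    // ... execute the call, send the response ...
    callQueueSize.addAndGet(-call.size);
  }

  // Balanced shape: decrement on every exit path, including the early return.
  static void runBalanced(Call call) {
    try {
      if (call.clientGone) {
        return;
      }
      // ... execute the call, send the response ...
    } finally {
      callQueueSize.addAndGet(-call.size);  // always give the bytes back
    }
  }

  public static void main(String[] args) {
    Call dead = new Call(145_306, true);    // roughly 141.9 K, like the Multi in the log
    callQueueSize.addAndGet(dead.size);     // enqueue: bytes are counted
    runLeaky(dead);
    System.out.println("leaky:    " + callQueueSize.get());  // stuck at 145306

    callQueueSize.set(0);
    callQueueSize.addAndGet(dead.size);
    runBalanced(dead);
    System.out.println("balanced: " + callQueueSize.get());  // back to 0
  }
}

The only point of the balanced variant is that the decrement happens on every exit path; whether that is done with a finally block or by decrementing inside the early-return branch, as suggested in the thread, is an implementation detail.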

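Related, on the "hbase.ipc.server.max.callqueue.size" remark above: the snippet below is one way to check what a server is actually configured with, assuming the stock HBaseConfiguration API is on the classpath. The 1 GB fallback is my recollection of the 0.98-era default, not something stated in the thread, so verify it against your release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CallQueueSizeConfig {
  public static void main(String[] args) {
    // Loads hbase-default.xml and hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();

    // Limit on the accumulated byte size of queued call data, not on the
    // number of queued calls. The 1 GB fallback here is an assumption.
    long maxCallQueueSize = conf.getLong("hbase.ipc.server.max.callqueue.size",
        1024L * 1024L * 1024L);

    System.out.println("hbase.ipc.server.max.callqueue.size = " + maxCallQueueSize);
  }
}

Because the limit is on accumulated bytes rather than call count, a leak like the one sketched earlier eventually surfaces as CallQueueTooBigException even after the request rate drops to zero, which matches the symptom reported in the thread.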