hi all:
We have seen two conditions that cause a region server to crash:
1. NIO running out of direct memory
2. ZooKeeper session timeout
You can find the reason in the region server log (or the .out file) or in the GC log.
If it is "out of direct memory", you will see "kill -9 XXX" in regionserverXX.out.
In that case, change "-XX:+DisableExplicitGC" to "-XX:+ExplicitGCInvokesConcurrent".
If it is a "zookeeper session timeout", you will see "XX seconds timeout to zookeeper" in regionserverXX.log.
Here is the point: "zookeeper.session.timeout" in hbase-site.xml does not take effect if you use a ZooKeeper ensemble that is not managed by HBase.
"maxSessionTimeout" in zoo.cfg is the property that really controls the timeout; by default it is 40 s. That may be the reason why "zookeeper.session.timeout" does not work.
GC can stop the world. You can tune GC, or raise "maxSessionTimeout", so that HBase will not shut down the region server within an acceptable timeout.
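As a sketch of what the two settings above might look like (the values here are illustrative, not taken from this thread):

```
# zoo.cfg on the external ZooKeeper ensemble -- example values only.
tickTime=2000
# Upper bound the ensemble will grant for any client session (ms).
# If unset, it defaults to 20 * tickTime, i.e. 40 s with tickTime=2000.
maxSessionTimeout=90000
```

and the matching request on the HBase side, in hbase-site.xml:

```xml
<!-- Requested session timeout (ms). The ensemble caps it at its
     maxSessionTimeout, so the two must be raised together. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```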
On Aug 6, 2016, at 2:08 PM, kiran <[email protected]> wrote:
We are also facing the same issue. Please tell us the solution. I have
increased the RPC timeout and reduced the caching, but with no effect.
We are using HBase 0.98.7. Please suggest a workaround, as we are hitting
the issue very frequently now and having downtime in production.
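For reference, the client-side knobs usually adjusted for this symptom look roughly like the following in hbase-site.xml (property names per the 0.98-era configuration; the values are examples, not taken from this thread):

```xml
<!-- Example values only -->
<property>
  <!-- How long the server keeps an idle scanner lease before expiring it -->
  <name>hbase.client.scanner.timeout.period</name>
  <value>120000</value>
</property>
<property>
  <!-- Per-RPC timeout; a next() call that takes longer than this fails -->
  <name>hbase.rpc.timeout</name>
  <value>120000</value>
</property>
<property>
  <!-- Rows fetched per next() call; smaller values keep each RPC short -->
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```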
On Fri, Oct 30, 2015 at 9:21 AM, 聪聪 <[email protected]> wrote:
The code that loops listing the child nodes is at http://paste2.org/zm8GE7xH
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Friday, October 30, 2015, 10:24 AM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
The client code is http://paste2.org/p3BXkKtV
Is the client version compatible with it?
I see that the client version is hbase-0.96.1.1-hadoop2.
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Friday, October 30, 2015, 12:08 AM
To: "[email protected]" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
On the client side, have they tried increasing the direct memory size?
-XX:MaxDirectMemorySize=
Do you know how wide the returned rows may get?
Cheers
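A sketch of where that flag would typically go on the client side (the value is only an example, not a recommendation from this thread):

```shell
# hbase-env.sh, or the client application's JVM startup flags.
# -XX:MaxDirectMemorySize caps NIO direct buffers; when unset, the JVM
# effectively allows roughly as much direct memory as the max heap.
# 512m is an example value; size it to the workload's peak buffer demand.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=512m"
```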
On Thu, Oct 29, 2015 at 9:03 AM, 聪聪 <[email protected]> wrote:
The developers report that their client gets the following error:
[2015/10/29 19:20:42.260][WARN][RpcClient:724] IPC Client (1904394969) connection to l-hbase28.data.cn8.qunar.com/192.168.44.32:60020 from tomcat: unexpected exception receiving call responses
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:633) ~[na:1.6.0_20]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95) ~[na:1.6.0_20]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288) ~[na:1.6.0_20]
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:57) ~[na:1.6.0_20]
        at sun.nio.ch.IOUtil.read(IOUtil.java:205) ~[na:1.6.0_20]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) ~[na:1.6.0_20]
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) ~[hadoop-common-2.2.0.jar:na]
        at java.io.FilterInputStream.read(FilterInputStream.java:116) ~[na:1.6.0_20]
        at java.io.FilterInputStream.read(FilterInputStream.java:116) ~[na:1.6.0_20]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:555) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) ~[na:1.6.0_20]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317) ~[na:1.6.0_20]
        at java.io.DataInputStream.read(DataInputStream.java:132) ~[na:1.6.0_20]
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1101) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:721) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Thursday, October 29, 2015, 10:48 PM
To: "[email protected]" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
I took a look at the jstack.
The threads involving RpcServer$Connection.readAndProcess() were in RUNNABLE state, not the BLOCKED or IN_NATIVE state described in HBASE-11277.
The protobuf exception shown in your earlier email corresponds to the following in hbase-protocol/src/main/protobuf/Client.proto:
message GetRequest {
required RegionSpecifier region = 1;
required Get get = 2;
}
Are all your HBase clients running the same version?
Cheers
On Thu, Oct 29, 2015 at 7:28 AM, 聪聪 <[email protected]> wrote:
The regionserver jstack log is at http://paste2.org/yLDJeXgL
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Thursday, October 29, 2015, 9:10 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
hi Ted:
Yesterday around 14:40, one of the regionservers hung once again. At that
time the web UI would not open and the HBase cluster was unable to respond.
Very anxious, hoping to get help!
The jstack log is as follows:
"RpcServer.reader=4,port=60020" daemon prio=10 tid=0x00007f4466146800 nid=0x12d3 runnable [0x00007f3bebe58000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:325)
        - locked <0x00007f3d27360fb0> (a java.lang.Object)
        - locked <0x00007f3d27360f90> (a java.lang.Object)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2402)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1476)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        - locked <0x00007f3c584ce990> (a org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"RpcServer.reader=3,port=60020" daemon prio=10 tid=0x00007f4466145000 nid=0x12d2 runnable [0x00007f3bebf59000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:325)
        - locked <0x00007f3d27360530> (a java.lang.Object)
        - locked <0x00007f3d27360510> (a java.lang.Object)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2402)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1476)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        - locked <0x00007f3c584cf7d8> (a org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Region server log:
2015-10-28 14:38:19,801 INFO [Thread-15] regionserver.HRegion: Finished memstore flush of ~3.6 M/3820648, currentsize=536/536 for region order_history,2801xyz140618175732642$3,1418829598639.afc853471a8cd4184bc9e7be00b8eea0. in 45ms, sequenceid=9599960557, compaction requested=true
2015-10-28 14:38:32,693 ERROR [RpcServer.handler=3,port=60020] regionserver.HRegionServer: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 16740356019163164014 number_of_rows: 10 close_scanner: false next_call_seq: 0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3007)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26929)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
        at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)
2015-10-28 14:38:32,693 ERROR [RpcServer.handler=9,port=60020] regionserver.HRegionServer: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 16740356019163164014 number_of_rows: 10 close_scanner: false next_call_seq: 0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3007)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26929)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
        at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)
2015-10-28 14:38:32,696 WARN [RpcServer.reader=2,port=60020] ipc.RpcServer: Unable to read call parameter from client 192.168.37.135
com.google.protobuf.UninitializedMessageException: Message missing required fields: region, get
        at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4474)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4406)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processRequest(RpcServer.java:1689)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processOneRpc(RpcServer.java:1631)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1491)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2015-10-28 14:38:32,696 WARN [RpcServer.reader=1,port=60020] ipc.RpcServer: Unable to read call parameter from client 192.168.37.135
com.google.protobuf.UninitializedMessageException: Message missing required fields: region, get
        at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4474)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4406)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processRequest(RpcServer.java:1689)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processOneRpc(RpcServer.java:1631)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1491)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Monday, October 26, 2015, 9:28 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
Thank you very much!
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Monday, October 26, 2015, 8:28 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
The fix from HBASE-11277 may solve your problem - if you collect a stack
trace during the hang, we will have more clues.
I suggest upgrading to a newer release such as 1.1.2 or 0.98.15.
Cheers
On Oct 26, 2015, at 12:42 AM, 聪聪 <[email protected]> wrote:
hi, Ted:
The HBase version I use is hbase-0.96.0.
Around 17:33, the other region servers also showed this warn log. I don't
know whether that is normal or not. At that time the web UI would not open.
I restarted the regionserver and then HBase went back to normal. Could this
be bug HBASE-11277?
The regionserver log consists almost entirely of this warn log.
The master log is as follows:
2015-10-21 22:15:43,575 INFO [CatalogJanitor-l-namenode2:60000] master.CatalogJanitor: Scanned 672 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
2015-10-23 17:47:25,617 INFO [RpcServer.handler=28,port=60000] master.HMaster: Client=hbase//192.168.39.19 set balanceSwitch=false
2015-10-23 17:49:45,513 WARN [RpcServer.handler=24,port=60000] ipc.RpcServer: (responseTooSlow): {"processingtimems":70266,"call":"ListTableDescriptorsByNamespace(org.apache.hadoop.hbase.protobuf.generated.MasterProtos$ListTableDescriptorsByNamespaceRequest)","client":"192.168.39.22:60292","starttimems":1445593715207,"queuetimems":0,"class":"HMaster","responsesize":704,"method":"ListTableDescriptorsByNamespace"}
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: (responseTooSlow): {"processingtimems":130525,"call":"ListTableDescriptorsByNamespace(org.apache.hadoop.hbase.protobuf.generated.MasterProtos$ListTableDescriptorsByNamespaceRequest)","client":"192.168.39.22:60286","starttimems":1445593654945,"queuetimems":0,"class":"HMaster","responsesize":704,"method":"ListTableDescriptorsByNamespace"}
2015-10-23 17:49:45,513 WARN [RpcServer.handler=24,port=60000] ipc.RpcServer: RpcServer.responder callId: 130953 service: MasterService methodName: ListTableDescriptorsByNamespace size: 48 connection: 192.168.39.22:60292: output error
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: RpcServer.responder callId: 130945 service: MasterService methodName: ListTableDescriptorsByNamespace size: 48 connection: 192.168.39.22:60286: output error
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: RpcServer.handler=6,port=60000: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Friday, October 23, 2015, 11:39 PM
To: "[email protected]" <[email protected]>;
Subject: Re: Hbase cluster is suddenly unable to respond
Were the other region servers functioning normally around 17:33?
Which HBase release are you using?
Can you pastebin more of the region server log?
Thanks
On Fri, Oct 23, 2015 at 8:28 AM, 聪聪 <[email protected]> wrote:
hi, all:
This afternoon the whole HBase cluster was suddenly unable to respond.
After I restarted a regionserver, the cluster recovered. I don't know the
cause of the trouble. I hope I can get help from you.
The regionserver log is as follows:
2015-10-23 17:28:49,335 INFO [regionserver60020.logRoller] wal.FSHLog: moving old hlog file /hbase/WALs/l-hbase30.data.cn8.qunar.com,60020,1442810406218/l-hbase30.data.cn8.qunar.com%2C60020%2C1442810406218.1445580462689 whose highest sequenceid is 9071525521 to /hbase/oldWALs/l-hbase30.data.cn8.qunar.com%2C60020%2C1442810406218.1445580462689
2015-10-23 17:33:31,375 WARN [RpcServer.reader=8,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1403)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2015-10-23 17:33:31,779 WARN [RpcServer.reader=2,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1403)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
--
Thank you
Kiran Sarvabhotla
-----Even a correct decision is wrong when it is taken late