hi all:
We have seen two conditions that cause a region server to crash:
1. NIO running out of direct memory
2. ZooKeeper session timeout
You can find the reason in the region server log (or the .out file) or in the GC log.
If it is "out of direct memory", you will see "kill -9 XXX" in regionserverXX.out.
In that case, change "-XX:+DisableExplicitGC" to "-XX:+ExplicitGCInvokesConcurrent".
If it is a "zookeeper session timeout", you will see "XX seconds timeout to zookeeper" in regionserverXX.log.
Here is the point: "zookeeper.session.timeout" in hbase-site.xml does not take effect if you use a ZooKeeper ensemble that is not managed by HBase.
"maxSessionTimeout" in zoo.cfg is the property that really controls the timeout; by default it is 40 s. That may be the reason why "zookeeper.session.timeout" does not work.
GC can stop the world. You can tune GC, or raise "maxSessionTimeout", so that HBase will not shut down the region server within an acceptable timeout.
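As a sketch of what the two settings above might look like (the values here are illustrative, not taken from this thread):

```
# zoo.cfg on the external ZooKeeper ensemble -- example values only.
tickTime=2000
# Upper bound the ensemble will grant for any client session (ms).
# If unset, it defaults to 20 * tickTime, i.e. 40 s with tickTime=2000.
maxSessionTimeout=90000
```

and the matching request on the HBase side, in hbase-site.xml:

```xml
<!-- Requested session timeout (ms). The ensemble caps it at its
     maxSessionTimeout, so the two must be raised together. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```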
On Aug 6, 2016, at 2:08 PM, kiran <[email protected]> wrote:
We are also facing the same issue. Please tell us the solution. I have
increased the RPC timeout and reduced the caching, but with no effect.
We are using HBase 0.98.7. Please suggest a workaround, as we are hitting
the issue very frequently now and having downtime in production.
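For reference, the client-side knobs usually adjusted for this symptom look roughly like the following in hbase-site.xml (property names per the 0.98-era configuration; the values are examples, not taken from this thread):

```xml
<!-- Example values only -->
<property>
  <!-- How long the server keeps an idle scanner lease before expiring it -->
  <name>hbase.client.scanner.timeout.period</name>
  <value>120000</value>
</property>
<property>
  <!-- Per-RPC timeout; a next() call that takes longer than this fails -->
  <name>hbase.rpc.timeout</name>
  <value>120000</value>
</property>
<property>
  <!-- Rows fetched per next() call; smaller values keep each RPC short -->
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```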
On Fri, Oct 30, 2015 at 9:21 AM, 聪聪 <[email protected]> wrote:
The code that loops listing the child nodes is at http://paste2.org/zm8GE7xH
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Friday, October 30, 2015, 10:24 AM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
The client code is http://paste2.org/p3BXkKtV
Is the client version compatible with it?
I see that the client version is hbase-0.96.1.1-hadoop2.
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Friday, October 30, 2015, 12:08 AM
To: "[email protected]" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
On the client side, have they tried increasing the direct memory size?
-XX:MaxDirectMemorySize=
Do you know how wide the returned rows may get?
Cheers
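A sketch of where that flag would typically go on the client side (the value is only an example, not a recommendation from this thread):

```shell
# hbase-env.sh, or the client application's JVM startup flags.
# -XX:MaxDirectMemorySize caps NIO direct buffers; when unset, the JVM
# effectively allows roughly as much direct memory as the max heap.
# 512m is an example value; size it to the workload's peak buffer demand.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=512m"
```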
On Thu, Oct 29, 2015 at 9:03 AM, 聪聪 <[email protected]> wrote:
The developers report that their client gets the following error:
[2015/10/29 19:20:42.260][WARN][RpcClient:724] IPC Client (1904394969) connection to l-hbase28.data.cn8.qunar.com/192.168.44.32:60020 from tomcat: unexpected exception receiving call responses
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:633) ~[na:1.6.0_20]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95) ~[na:1.6.0_20]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288) ~[na:1.6.0_20]
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:57) ~[na:1.6.0_20]
        at sun.nio.ch.IOUtil.read(IOUtil.java:205) ~[na:1.6.0_20]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) ~[na:1.6.0_20]
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) ~[hadoop-common-2.2.0.jar:na]
        at java.io.FilterInputStream.read(FilterInputStream.java:116) ~[na:1.6.0_20]
        at java.io.FilterInputStream.read(FilterInputStream.java:116) ~[na:1.6.0_20]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:555) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) ~[na:1.6.0_20]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317) ~[na:1.6.0_20]
        at java.io.DataInputStream.read(DataInputStream.java:132) ~[na:1.6.0_20]
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) ~[hadoop-common-2.2.0.jar:na]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1101) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:721) ~[hbase-client-0.96.1.1-hadoop2.jar:0.96.1.1-hadoop2]
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Thursday, October 29, 2015, 10:48 PM
To: "[email protected]" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
I took a look at the jstack.
The threads involving RpcServer$Connection.readAndProcess() were in RUNNABLE state, not the BLOCKED or IN_NATIVE state described in HBASE-11277.
The protobuf exception shown in your earlier email corresponds to the following in hbase-protocol/src/main/protobuf/Client.proto:
message GetRequest {
required RegionSpecifier region = 1;
required Get get = 2;
}
Are all your HBase clients running the same version?
Cheers
On Thu, Oct 29, 2015 at 7:28 AM, 聪聪 <[email protected]> wrote:
The regionserver jstack log is at http://paste2.org/yLDJeXgL
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Thursday, October 29, 2015, 9:10 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
hi Ted:
Yesterday around 14:40, one of the regionservers hung once again. At that
time the web UI would not open and the HBase cluster was unable to respond.
Very anxious, hoping to get help!
The jstack log is as follows:
"RpcServer.reader=4,port=60020" daemon prio=10 tid=0x00007f4466146800 nid=0x12d3 runnable [0x00007f3bebe58000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:325)
        - locked <0x00007f3d27360fb0> (a java.lang.Object)
        - locked <0x00007f3d27360f90> (a java.lang.Object)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2402)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1476)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        - locked <0x00007f3c584ce990> (a org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"RpcServer.reader=3,port=60020" daemon prio=10 tid=0x00007f4466145000 nid=0x12d2 runnable [0x00007f3bebf59000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:325)
        - locked <0x00007f3d27360530> (a java.lang.Object)
        - locked <0x00007f3d27360510> (a java.lang.Object)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2402)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1476)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        - locked <0x00007f3c584cf7d8> (a org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Region server log:
2015-10-28 14:38:19,801 INFO [Thread-15] regionserver.HRegion: Finished memstore flush of ~3.6 M/3820648, currentsize=536/536 for region order_history,2801xyz140618175732642$3,1418829598639.afc853471a8cd4184bc9e7be00b8eea0. in 45ms, sequenceid=9599960557, compaction requested=true
2015-10-28 14:38:32,693 ERROR [RpcServer.handler=3,port=60020] regionserver.HRegionServer: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 16740356019163164014 number_of_rows: 10 close_scanner: false next_call_seq: 0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3007)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26929)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
        at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)
2015-10-28 14:38:32,693 ERROR [RpcServer.handler=9,port=60020] regionserver.HRegionServer: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 16740356019163164014 number_of_rows: 10 close_scanner: false next_call_seq: 0
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3007)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26929)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
        at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)
2015-10-28 14:38:32,696 WARN [RpcServer.reader=2,port=60020] ipc.RpcServer: Unable to read call parameter from client 192.168.37.135
com.google.protobuf.UninitializedMessageException: Message missing required fields: region, get
        at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4474)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4406)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processRequest(RpcServer.java:1689)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processOneRpc(RpcServer.java:1631)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1491)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2015-10-28 14:38:32,696 WARN [RpcServer.reader=1,port=60020] ipc.RpcServer: Unable to read call parameter from client 192.168.37.135
com.google.protobuf.UninitializedMessageException: Message missing required fields: region, get
        at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4474)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest$Builder.build(ClientProtos.java:4406)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processRequest(RpcServer.java:1689)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.processOneRpc(RpcServer.java:1631)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1491)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
------------------ Original Message ------------------
From: "蒲聪-北京" <[email protected]>;
Sent: Monday, October 26, 2015, 9:28 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
Thank you very much!
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Monday, October 26, 2015, 8:28 PM
To: "user" <[email protected]>;
Subject: Re: Re: Hbase cluster is suddenly unable to respond
The fix from HBASE-11277 may solve your problem - if you collect a stack
trace during the hang, we will have more clues.
I suggest upgrading to a newer release such as 1.1.2 or 0.98.15.
Cheers
On Oct 26, 2015, at 12:42 AM, 聪聪 <[email protected]> wrote:
hi, Ted:
The HBase version I use is hbase-0.96.0.
Around 17:33, the other region servers also showed this warn log. I don't
know whether that is normal or not. At that time the web UI would not open.
I restarted the regionserver and then HBase went back to normal. Could this
be bug HBASE-11277?
The regionserver log consists almost entirely of this warn log.
The master log is as follows:
2015-10-21 22:15:43,575 INFO [CatalogJanitor-l-namenode2:60000] master.CatalogJanitor: Scanned 672 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
2015-10-23 17:47:25,617 INFO [RpcServer.handler=28,port=60000] master.HMaster: Client=hbase//192.168.39.19 set balanceSwitch=false
2015-10-23 17:49:45,513 WARN [RpcServer.handler=24,port=60000] ipc.RpcServer: (responseTooSlow): {"processingtimems":70266,"call":"ListTableDescriptorsByNamespace(org.apache.hadoop.hbase.protobuf.generated.MasterProtos$ListTableDescriptorsByNamespaceRequest)","client":"192.168.39.22:60292","starttimems":1445593715207,"queuetimems":0,"class":"HMaster","responsesize":704,"method":"ListTableDescriptorsByNamespace"}
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: (responseTooSlow): {"processingtimems":130525,"call":"ListTableDescriptorsByNamespace(org.apache.hadoop.hbase.protobuf.generated.MasterProtos$ListTableDescriptorsByNamespaceRequest)","client":"192.168.39.22:60286","starttimems":1445593654945,"queuetimems":0,"class":"HMaster","responsesize":704,"method":"ListTableDescriptorsByNamespace"}
2015-10-23 17:49:45,513 WARN [RpcServer.handler=24,port=60000] ipc.RpcServer: RpcServer.responder callId: 130953 service: MasterService methodName: ListTableDescriptorsByNamespace size: 48 connection: 192.168.39.22:60292: output error
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: RpcServer.responder callId: 130945 service: MasterService methodName: ListTableDescriptorsByNamespace size: 48 connection: 192.168.39.22:60286: output error
2015-10-23 17:49:45,513 WARN [RpcServer.handler=6,port=60000] ipc.RpcServer: RpcServer.handler=6,port=60000: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
------------------ Original Message ------------------
From: "Ted Yu" <[email protected]>;
Sent: Friday, October 23, 2015, 11:39 PM
To: "[email protected]" <[email protected]>;
Subject: Re: Hbase cluster is suddenly unable to respond
Were the other region servers functioning normally around 17:33?
Which HBase release are you using?
Can you pastebin more of the region server log?
Thanks
On Fri, Oct 23, 2015 at 8:28 AM, 聪聪 <[email protected]> wrote:
hi, all:
This afternoon the whole HBase cluster was suddenly unable to respond.
After I restarted a regionserver, the cluster recovered. I don't know the
cause of the trouble. I hope I can get help from you.
The regionserver log is as follows:
2015-10-23 17:28:49,335 INFO [regionserver60020.logRoller] wal.FSHLog: moving old hlog file /hbase/WALs/l-hbase30.data.cn8.qunar.com,60020,1442810406218/l-hbase30.data.cn8.qunar.com%2C60020%2C1442810406218.1445580462689 whose highest sequenceid is 9071525521 to /hbase/oldWALs/l-hbase30.data.cn8.qunar.com%2C60020%2C1442810406218.1445580462689
2015-10-23 17:33:31,375 WARN [RpcServer.reader=8,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1403)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2015-10-23 17:33:31,779 WARN [RpcServer.reader=2,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2368)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1403)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:770)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:563)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:538)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
--
Thank you
Kiran Sarvabhotla
-----Even a correct decision is wrong when it is taken late