Hi all,
I have a 4-node HDP big data cluster created with Ambari; the HBase version
is 1.1.2.
While running YCSB as a benchmark, the RegionServer instance or the HMaster
instance crashes. Its log shows:
---------------------log start ---------------------
2016-10-12 23:48:13,591 INFO [main-SendThread(Node1:2181)]
zookeeper.ClientCnxn: Unable to read additional data from server sessionid
0x157b7f5f0bc0005, likely server has closed socket, closing socket connection
and attempting reconnect
2016-10-12 23:48:13,595 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink
timeline started
2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl:
Scheduled snapshot period at 10 second(s).
2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase
metrics system started
2016-10-12 23:48:14,496 INFO [main-SendThread(Node4:2181)]
zookeeper.ClientCnxn: Opening socket connection to server Node4/1.1.6.104:2181.
Will not attempt to authenticate using SASL (unknown error)
2016-10-12 23:48:14,506 INFO [main-SendThread(Node4:2181)]
zookeeper.ClientCnxn: Socket connection established to Node4/1.17.6.104:2181,
initiating session
2016-10-12 23:48:14,517 INFO [main-SendThread(Node4:2181)]
zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session
0x157b7f5f0bc0005 has expired, closing socket connection
2016-10-12 23:48:14,517 FATAL [main-EventThread] regionserver.HRegionServer:
ABORTING region server node1,16020,1476260847716:
regionserver:16020-0x157b7f5f0bc0005, quorum=node2:2181,node1:2181,node4:2181,
baseZNode=/hbase-unsecure regionserver:16020-0x157b7f5f0bc0005 received expired
from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2016-10-12 23:48:14,518 FATAL [main-EventThread] regionserver.HRegionServer:
RegionServer abort: loaded coprocessors are:
[org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
---------------------log end---------------------
After checking the log, it looks like the region server JVM paused for a long
time, so the ZooKeeper client could not send heartbeats and the session timed
out, as described in the reference guide:
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired . So I read the
log in detail looking for Java GC events, but no full GC occurred.
I also found the same symptom in the DataNode instance.
The node OS is CentOS 7, so I suspected the kernel futex bug, but after
checking, that bug is already fixed in my OS.
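For anyone who wants to check for pauses that are not visible as GC events,
here is a minimal sketch (not what was actually run on this cluster) that
scans a log for large gaps between consecutive timestamps. It assumes the
"yyyy-MM-dd HH:mm:ss,SSS" prefix seen in the log above; the function name
find_gaps and the 30-second threshold are hypothetical and should be adapted,
for example to a value near the configured zookeeper.session.timeout. A gap
only marks a candidate: a process that simply had nothing to log produces gaps
too.

```python
import re
from datetime import datetime

# Matches the timestamp prefix used in the log excerpt above,
# e.g. "2016-10-12 23:48:13,591 INFO ...".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def find_gaps(lines, threshold_s=30.0):
    """Return (previous_ts, current_ts, gap_seconds) for every pair of
    consecutive timestamped lines that are at least threshold_s apart.

    threshold_s is an illustrative value, not an HBase or ZooKeeper default.
    """
    gaps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            # Continuation lines (stack traces, wrapped messages) carry
            # no timestamp and are skipped.
            continue
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta >= threshold_s:
                gaps.append((prev, ts, delta))
        prev = ts
    return gaps
```

Usage would be along the lines of find_gaps(open(path)) with path pointing at
the RegionServer or DataNode log; comparing the gaps found on both processes
may show whether the whole host stalled rather than a single JVM.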
Is there any other factor besides Java GC that could cause this problem?
Has anyone run into the same problem? Any ideas?
Thank you.