Can you show more of the region server log prior to 23:48:13 (including the pause)?
Was the region server under heavy load during the pause?
Consider turning on DEBUG logging if you haven't already.
Please also share your GC parameters.

Thanks

On Tue, Oct 18, 2016 at 7:58 PM, who.cat <[email protected]> wrote:

> Hi all:
> I have an HDP big data cluster with 4 nodes, created by Ambari. The HBase
> version is 1.1.2.
> When running YCSB as a benchmark, the RegionServer instance or the HMaster
> instance crashes, and its log shows:
>
> ---------------------log start ---------------------
> 2016-10-12 23:48:13,591 INFO [main-SendThread(Node1:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x157b7f5f0bc0005, likely server has closed socket, closing socket connection and attempting reconnect
> 2016-10-12 23:48:13,595 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
> 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
> 2016-10-12 23:48:14,496 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Opening socket connection to server Node4/1.1.6.104:2181. Will not attempt to authenticate using SASL (unknown error)
> 2016-10-12 23:48:14,506 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Socket connection established to Node4/1.17.6.104:2181, initiating session
> 2016-10-12 23:48:14,517 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x157b7f5f0bc0005 has expired, closing socket connection
> 2016-10-12 23:48:14,517 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server node1,16020,1476260847716: regionserver:16020-0x157b7f5f0bc0005, quorum=node2:2181,node1:2181,node4:2181, baseZNode=/hbase-unsecure regionserver:16020-0x157b7f5f0bc0005 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2016-10-12 23:48:14,518 FATAL [main-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
> ---------------------log end---------------------
>
> After checking the log, it appears that the region server JVM paused for a
> long time, so the ZooKeeper client could not send heartbeats and the session
> timed out, as the reference guide describes:
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> So I read the log in detail to find the Java GC events, but no full GC
> occurred.
> I also found the same symptom in the DataNode instance.
>
> The node OS is CentOS 7, so I suspected the kernel futex bug, but after
> checking, that bug is already fixed in my OS.
> Is there any other factor besides Java GC that could cause this problem?
> Has anyone run into the same problem? Any ideas?
> Thank you.
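
To locate the pause in the full region server log, one option is to scan for large gaps between consecutive timestamps. Below is a minimal Python sketch, assuming the timestamp layout seen in the quoted log (`yyyy-MM-dd HH:mm:ss,SSS` at the start of each entry). The function name `find_log_gaps` and the 10-second default threshold are illustrative choices, not part of HBase. A gap only shows that nothing was logged; it may be a JVM or host pause, or simply an idle period, so the lines around each gap still need to be read.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the quoted log, e.g. "2016-10-12 23:48:13,591"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def find_log_gaps(lines, threshold_seconds=10.0):
    """Return (previous_ts, current_ts, gap_seconds) for each gap above the threshold.

    threshold_seconds is a hypothetical default; pick a value relative to
    the configured ZooKeeper session timeout.
    """
    gaps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            # Continuation lines (stack traces, wrapped text) have no timestamp.
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(m.group(2)) * 1000)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_seconds:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

To use it, pass an open log file (any iterable of lines works) and print each returned tuple; the log entries just before and after each reported gap are the ones worth posting.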
