Hi J-D, Thank you for response. Right, I see in logs the pause. I see it in both ZK log and regionserver (I pasted respective parts of logs with previous message). Just not sure about the cause of the pause.
Anyways, I restarted things with GC being logged this time. Will inspect the issue if I face it again. Alex Baranau ---- Sematext :: http://sematext.com/ On Thu, Dec 9, 2010 at 7:44 PM, Jean-Daniel Cryans <[email protected]>wrote: > Hey Alex, > > It's either the client or the server box that wasn't responding, > either way they didn't talk for 52 seconds when your session timeout > was set at 40 seconds. I suggest you also take a look at the ZK log, > it should tell you exactly when the session was expired. > > J-D > > On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau <[email protected]> > wrote: > > Hello guys, > > > > I got regionserver crash and trying to find out why. I found > > * in regionserver log [1] : ZK session expired, and before that slow hlog > > edits; > > * nothing in DataNode log [2] and HMaster[4] log; > > * some warns in ZK log [3] with EndOfStreamException. > > > > I wonder can it be because of long GC and thus "stop the world" pause for > > about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an > > old test box). If it can be caused by GC I'd restart the server with GC > > logging, but if not, I'd like to figure out the cause... > > > > HBase version is 0.89.20100924. > > > > Thank you, > > > > Alex Baranau > > ---- > > Sematext :: http://sematext.com/ > > > > [1] Regionserver log: ZK session expired, before that slow hlog edits > > > > 2010-12-08 06:43:15,403 INFO > org.apache.hadoop.hbase.regionserver.HRegion: > > Starting compaction on region agg2,system|tq|s\x00|in licensed License > > project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c. > > 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store: > > Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed > > License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c., > > hasReferences=true, into > > hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp, > > sequenceid=1153119174 > > 2010-12-08 06:43:24,825 DEBUG > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > > started; Attempting to free 39.93 MB of total=339.31 MB > > 2010-12-08 06:43:24,829 DEBUG > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > > completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB, > multi=154.82 > > MB, memory=33.28 MB > > 2010-12-08 06:43:50,871 DEBUG > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > > started; Attempting to free 40 MB of total=339.39 MB > > 2010-12-08 06:43:50,875 DEBUG > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > > completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB, > multi=154.96 > > MB, memory=33.28 MB > > 2010-12-08 06:43:57,095 WARN > org.apache.hadoop.hbase.regionserver.wal.HLog: > > IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog; > > editcount=8882 > > 2010-12-08 06:44:25,086 WARN > org.apache.hadoop.hbase.regionserver.wal.HLog: > > IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog; > > editcount=13054 > > 2010-12-08 06:44:55,815 WARN > org.apache.hadoop.hbase.regionserver.wal.HLog: > > IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog; > > editcount=13472 > > 2010-12-08 06:45:33,003 DEBUG > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5 > MB, > > free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227, > > hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453, > > evictedPerRun=636.77880859375 > > 2010-12-08 06:45:45,333 INFO > > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of > > raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global > > memstore limit of 798.4m exceeded; currently 812.8m and flushing till > 698.6m > > 2010-12-08 06:45:45,363 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: > > Started memstore flush for region > > raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region > > memstore size 45.0m > > 2010-12-08 06:45:45,364 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: > > Finished snapshotting, commencing flushing stores > > 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store: > > Renaming flushed file at > > > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831 > > to > > > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580 > > 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store: > > Added > > > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580, > > entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to > > raw3,,1291780273350.73d076a68191712c68d16c4805973561. > > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client > session > > timed out, have not heard from server in 52481ms for sessionid > > 0x12c34f1783c0002, closing socket connection and attempting reconnect > > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client > session > > timed out, have not heard from server in 52481ms for sessionid > > 0x12c34f1783c00bf, closing socket connection and attempting reconnect > > 2010-12-08 06:46:44,200 INFO > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > > state: Disconnected, type: None, path: null > > 2010-12-08 06:46:44,200 INFO > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > > state: Disconnected, type: None, path: null > > 2010-12-08 06:46:44,201 WARN > > org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No > longer > > connected to ZooKeeper, current state: Disconnected > > 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening > socket > > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > > 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket > > connection established to ip-10-245-202-207.ec2.internal/ > 10.245.202.207:2181, > > initiating session > > 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session > > establishment complete on server ip-10-245-202-207.ec2.internal/ > > 10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout = > > 40000 > > 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session: > > 0x12c34f1783c00bf closed > > 2010-12-08 06:46:51,264 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: > > > <localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed > > connection with ZooKeeper; /hbase/root-region-server > > 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening > socket > > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > > 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket > > connection established to ip-10-245-202-207.ec2.internal/ > 10.245.202.207:2181, > > initiating session > > 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to > > reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired, > > closing socket connection > > 2010-12-08 06:46:58,491 INFO > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > > state: Expired, type: None, path: null > > 2010-12-08 06:46:58,491 ERROR > > org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session > > expired > > 2010-12-08 06:46:58,491 FATAL > > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region > server > > serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235, > > load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper > > session expired > > > > > > [2] DataNode log: nothing, just noticeable pause for almost 1 minute > > > > 2010-12-08 06:45:58,120 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > > 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, > cliID: > > > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > > offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > > blockid: blk_6449721912917715607_913675, duration: 3639142 > > 2010-12-08 06:45:58,134 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > > 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, > cliID: > > > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > > offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > > blockid: blk_6449721912917715607_913675, duration: 427441 > > 2010-12-08 06:45:58,148 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > > 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ, > > cliID: > > > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > > offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > > blockid: blk_825672696763040280_900371, duration: 6352102264 > > 2010-12-08 06:46:44,192 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > > blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: / > 127.0.0.1:50010 > > 2010-12-08 06:46:51,263 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > > 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ, > > cliID: > > > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > > offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > > blockid: blk_8449446729722678098_531164, duration: 401874426400 > > 2010-12-08 06:46:51,333 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > > 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ, > > cliID: > > > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > > offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > > blockid: blk_-4815658898725325280_624191, duration: 401614939902 > > 2010-12-08 06:46:58,266 INFO > > org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls > > recoverBlock(block=blk_7609568532460932683_912614, targets=[ > 127.0.0.1:50010 > > ]) > > > > [3] Zookeeper log: some warns > > > > 2010-12-08 02:50:46,100 - INFO > [ProcessThread:-1:preprequestproces...@385] > > - Processed session termination for sessionid: 0x12c34f1783d6eba > > 2010-12-08 02:50:46,104 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > > client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba > > 2010-12-08 06:46:44,092 - WARN [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable > to > > read additional data from client sessionid 0x12c34f1783c00bf, likely > client > > has closed socket > > 2010-12-08 06:46:44,661 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > > client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf > > 2010-12-08 06:46:44,662 - WARN [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable > to > > read additional data from client sessionid 0x12c34f1783c0002, likely > client > > has closed socket > > 2010-12-08 06:46:44,662 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > > client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002 > > 2010-12-08 06:46:51,219 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket > connection > > from /10.245.202.207:33388 > > 2010-12-08 06:46:51,219 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew > session > > 0x12c34f1783c00bf at /10.245.202.207:33388 > > 2010-12-08 06:46:51,220 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1579] - Established session > > 0x12c34f1783c00bf with negotiated timeout 40000 for client / > > 10.245.202.207:33388 > > 2010-12-08 06:46:51,220 - INFO > [ProcessThread:-1:preprequestproces...@385] > > - Processed session termination for sessionid: 0x12c34f1783c00bf > > 2010-12-08 06:46:51,223 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > > client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf > > 2010-12-08 06:46:58,001 - INFO [SessionTracker:zookeeperser...@315] - > > Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded > > 2010-12-08 06:46:58,002 - INFO > [ProcessThread:-1:preprequestproces...@385] > > - Processed session termination for sessionid: 0x12c34f1783c0002 > > 2010-12-08 06:46:58,376 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket > connection > > from /10.245.202.207:33394 > > 2010-12-08 06:46:58,489 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew > session > > 0x12c34f1783c0002 at /10.245.202.207:33394 > > 2010-12-08 06:46:58,489 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1572] - Invalid session > 0x12c34f1783c0002 > > for client /10.245.202.207:33394, probably expired > > 2010-12-08 06:46:58,490 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > > client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002 > > > > [4] HMaster log: nothing > > > > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.metaScanner scan of 648 row(s) of meta region {server: > > 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>} > > complete > > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: > All > > 1 .META. region(s) scanned > > 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.rootScanner scanning meta region {server: > 10.245.202.207:60020, > > regionname: -ROOT-,,0.70236052, startKey: <>} > > 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} > complete > > 2010-12-08 06:45:35,367 INFO > org.apache.hadoop.hbase.master.ServerManager: 1 > > region servers, 0 dead, average load 649.0 > > 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.metaScanner scanning meta region {server: > 10.245.202.207:60020, > > regionname: .META.,,1.1028785192, startKey: <>} > > 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.rootScanner scanning meta region {server: > 10.245.202.207:60020, > > regionname: -ROOT-,,0.70236052, startKey: <>} > > 2010-12-08 06:46:35,369 INFO > org.apache.hadoop.hbase.master.ServerManager: 1 > > region servers, 0 dead, average load 649.0 > > 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner: > > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} > complete > > 2010-12-08 06:46:58,005 INFO > org.apache.hadoop.hbase.master.ServerManager: > > ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired > > 2010-12-08 06:46:58,005 DEBUG > org.apache.hadoop.hbase.master.ServerManager: > > Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers, > > added shutdown processing operation > > 2010-12-08 06:46:58,006 INFO > org.apache.hadoop.hbase.master.RegionManager: > > -ROOT- region unset (but not set to be reassigned) > > 2010-12-08 06:46:58,006 INFO > org.apache.hadoop.hbase.master.RegionManager: > > META region whose startkey is removed from onlineMetaRegions > > 2010-12-08 06:46:58,006 DEBUG > > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing > todo: > > ProcessServerShutdown of > ip-10-245-202-207.ec2.internal,60020,1289379196235 > > 2010-12-08 06:46:58,006 INFO > > org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of > > server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: > false, > > rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0 > > 2010-12-08 06:46:58,012 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > > Splitting 34 hlog(s) in > > > hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235 > > >
