Hi Kaveh, the answer may already be in the logs you sent ;)
"This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment."

Do you have GC logs? Have you tried anything to solve that?

JM

2013/4/22 kaveh minooie <[email protected]>:
>
> Hi
>
> after a few mapreduce jobs my regionservers shut themselves down. this is
> the latest time that this has happened:
>
> 2013-04-22 16:47:21,843 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> This client just lost it's session with ZooKeeper, trying to reconnect.
> 2013-04-22 16:47:21,843 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
> 392, regions=196, usedHeap=1063, maxHeap=3966):
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired fr
> om ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
> 2013-04-22 16:47:21,843 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Trying to reconnect to zookeeper.
> 2013-04-22 16:47:21,844 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> requests=1794, regions=196, stores=1561, storefiles=1585,
> storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
> flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
> blockCacheFree=169901776, blockCacheCount=7242, blockCacheHitCount=910925,
> blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
> blockCacheHitRatio=36, blockCacheHitCachingRatio=40
> 2013-04-22 16:47:21,844 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from
> ZooKeeper, aborting
> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> Too many consecutive RollWriter requests, it's a sign of the total number of
> live datanodes is lower than the tolerable replicas.
> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection, connectString=zk1:2181 sessionTimeout=180000
> watcher=hconnection
> 2013-04-22 16:47:22,357 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to
> close
> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not attempt
> to authenticate using SASL (unknown error)
> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating
> session
> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
> sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
> 2013-04-22 16:47:22,400 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Reconnected successfully. This disconnect could have been caused by a
> network partition or a long-running GC pause, either way it's recommended
> that you verify your environment.
> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> compaction interrupted by user:
> java.io.InterruptedIOException: Aborting compaction of store f in region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> because user requested stop.
> at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> aborted compaction on region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> after 5mins, 58sec
> 2013-04-22 16:47:56,830 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2013-04-22 16:47:56,832 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Closed
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> 2013-04-22 16:47:57,363 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer exiting
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020 closing leases
> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020 closed leases
> 2013-04-22 16:47:57,366 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting;
> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2013-04-22 16:47:57,497 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook
> thread.
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closing leases
> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closed leases
> 2013-04-22 16:47:57,598 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> I would appreciate it very much if someone could explain to me what just
> happened here.
>
> thanks,
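
PS: one detail in the quoted log may be worth checking. The client asks for sessionTimeout=180000 but the server reports "negotiated timeout = 40000". A ZooKeeper server clamps the requested timeout between its minSessionTimeout and maxSessionTimeout, which default to 2x and 20x tickTime. The sketch below only illustrates that clamping; it assumes tickTime=2000 and no explicit min/max overrides, which I cannot confirm from the logs, and the function name is made up for the example.

```python
def negotiated_timeout(requested_ms, tick_time_ms=2000, min_ms=None, max_ms=None):
    """Illustrative model of how a ZooKeeper server bounds a session timeout.

    tick_time_ms=2000 is an assumption (the value in the sample zoo.cfg),
    not something read from the cluster in this thread.
    """
    # ZooKeeper defaults: minSessionTimeout = 2 * tickTime,
    #                     maxSessionTimeout = 20 * tickTime
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    # The requested value is clamped into [lower, upper].
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Value requested in the log: sessionTimeout=180000
    print(negotiated_timeout(180000))
```

If that assumption holds, a GC pause or network stall longer than about 40 seconds (not 180) would be enough to expire the session, so the effective limit is set on the ZooKeeper side as well as in the HBase config.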
