How much heap are you running on your RegionServers? 6GB of total RAM is on the low end. For high-throughput applications, I would recommend at least 6-8GB of heap (so 8+ GB of RAM).
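For reference, a rough sketch of where that heap setting lives, assuming the stock conf/hbase-env.sh layout. HBASE_HEAPSIZE is the standard variable for the daemon heap and is given in MB. The 8000 figure is only an illustration for a machine with more than 8 GB of RAM; it would not fit on a 6 GB box, so pick a value that leaves room for the OS and anything else running on the node.

```shell
# conf/hbase-env.sh (fragment)
# Maximum heap for the HBase daemons, in MB.
# 8000 is an illustrative value for a node with 8+ GB of RAM,
# not a tested recommendation for this particular cluster.
export HBASE_HEAPSIZE=8000
```

The RegionServers need a restart to pick up the new value.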

> -----Original Message-----
> From: charan kumar [mailto:charan.ku...@gmail.com]
> Sent: Thursday, February 03, 2011 11:47 AM
> To: user@hbase.apache.org
> Subject: Region Servers Crashing during Random Reads
>
> Hello,
>
> I am using hbase 0.90.0 with hadoop-append. h/w ( Dell 1950, 2 CPU, 6 GB RAM)
>
> I had 9 Region Servers crash (out of 30) in a span of 30 minutes during a heavy
> reads. It looks like a GC, ZooKeeper Connection Timeout thingy to me.
> I did all recommended configuration from the Hbase wiki... Any other
> suggestions?
>
>
> 2011-02-03T09:43:07.890-0800: 70693.632: [GC 70693.632: [ParNew (promotion failed): 5555K->5540K(5568K), 0.0280950 secs]70693.660: [CMS2011-02-03T09:43:16.864-0800: 70702.606: [CMS-concurrent-mark: 12.549/69.323 secs] [Times: user=11.90 sys=1.26, real=69.31 secs]
>
> 2011-02-03T09:53:35.165-0800: 71320.785: [GC 71320.785: [ParNew (promotion failed): 5568K->5568K(5568K), 0.4384530 secs]71321.224: [CMS2011-02-03T09:53:45.111-0800: 71330.731: [CMS-concurrent-mark: 17.511/51.564 secs] [Times: user=38.72 sys=5.67, real=51.60 secs]
>
> 2011-02-03T09:43:07.890-0800: 70693.632: [GC 70693.632: [ParNew (promotion failed): 5555K->5540K(5568K), 0.0280950 secs]70693.660: [CMS2011-02-03T09:43:16.864-0800: 70702.606: [CMS-concurrent-mark: 12.549/69.323 secs] [Times: user=11.90 sys=1.26, real=69.31 secs]
>
>
> The following is the log entry in region Server
>
> 2011-02-03 10:37:43,946 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 47172ms for sessionid 0x12db9f722421ce3, closing socket connection and attempting reconnect
> 2011-02-03 10:37:43,947 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 48159ms for sessionid 0x22db9f722501d93, closing socket connection and attempting reconnect
> 2011-02-03 10:37:44,401 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server XXXXXXXXXXXXXXXX
> 2011-02-03 10:37:44,402 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to XXXXXXXXX, initiating session
> 2011-02-03 10:37:44,709 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server XXXXXXXXXXXXXXX
> 2011-02-03 10:37:44,709 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to XXXXXXXXXXXXXXXXXXXXX, initiating session
> 2011-02-03 10:37:44,767 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 81.93 MB of total=696.25 MB
> 2011-02-03 10:37:44,784 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=81.94 MB, total=614.81 MB, single=379.98 MB, multi=309.77 MB, memory=0 KB
> 2011-02-03 10:37:45,205 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x22db9f722501d93 has expired, closing socket connection
> 2011-02-03 10:37:45,206 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
> 2011-02-03 10:37:45,453 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper
> 2011-02-03 10:37:45,206 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x12db9f722421ce3 has expired, closing socket connection
> gionserver:60020-0x22db9f722501d93 regionserver:60020-0x22db9f722501d93 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> handled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing XXXXXXXXXXXX,60020,1296684296172 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing XXXXXXXXXXXX,60020,1296684296172 as dead server
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
>     at java.lang.Thread.run(Thread.java:619)
>
>
> 2011-02-03T09:53:35.165-0800: 71320.785: [GC 71320.785: [ParNew (promotion failed): 5568K->5568K(5568K), 0.4384530 secs]71321.224: [CMS2011-02-03T09:53:45.111-0800: 71330.731: [CMS-concurrent-mark: 17.511/51.564 secs] [Times: user=38.72 sys=5.67, real=51.60 secs]
>
>
>
> Thanks,
> Charan