What I'm seeing looks more like GC issues. Start reading this: http://hbase.apache.org/book.html#gc
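The chain in your logs below is the classic GC signature: a long stop-the-world pause makes the region server miss its ZooKeeper heartbeats, the master declares it dead, and when the JVM finally wakes up its report is rejected (the YouAreDeadException on node1) or its session has expired (node2). A starting-point sketch for hbase-env.sh — the values are illustrative, not tuned for your cluster, and the GC log path is an assumption:

    # hbase-env.sh -- starting points only, assuming your ~4 GB region server heap
    # CMS keeps stop-the-world pauses much shorter than the default collector:
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled"
    # Start CMS early so the old gen never fills up under compaction load:
    export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=70"
    # GC logging: the cheapest way to confirm or rule out long pauses --
    # grep for multi-second pauses around the crash timestamps.
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"

The same chapter covers the MSLAB allocator (hbase.hregion.memstore.mslab.enabled), which exists in 0.90 but is off by default; it reduces the old-gen fragmentation that memstores cause.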
J-D

On Thu, May 31, 2012 at 12:10 PM, Cyril Scetbon <[email protected]> wrote:
> I'm still having region nodes that crash :(
>
> You can see reasons on 2 region nodes:
>
> *node1:*
>
> 2012-05-31 17:57:03,937 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region ise,ID-TST-100-1Rw6ue7L8iFMpWMqMmZYxDdoRLCIzI70e5kVMrgLrXW,1338486988233.aaaa5b0661020d1107eb31e982d71126.
> 2012-05-31 17:57:03,938 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 2 file(s) in cf=core, hasReferences=true, into hdfs://hb-zk1:54310/hbase/ise/aaaa5b0661020d1107eb31e982d71126/.tmp, seqid=995472, totalSize=256.5m
> 2012-05-31 17:58:24,860 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_989994740052943354_38499
> java.net.SocketTimeoutException: 72000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.16:45314 remote=/10.0.0.16:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3049)
> 2012-05-31 17:59:11,655 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x337a359bb980015, likely server has closed socket, closing socket connection and attempting reconnect
> 2012-05-31 17:59:11,687 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 bad datanode[0] 10.0.0.16:50010
> 2012-05-31 17:59:11,689 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 in pipeline 10.0.0.16:50010, 10.0.0.13:50010, 10.0.0.11:50010, 10.0.0.15:50010: bad datanode 10.0.0.16:50010
> 2012-05-31 17:59:11,688 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d6,60020,1338475673130, load=(requests=0, regions=13, usedHeap=434, maxHeap=4091): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:749)
>
> *node2:*
>
> 2012-05-31 18:26:44,339 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk1/10.0.0.50:2181
> 2012-05-31 18:26:44,340 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk1/10.0.0.50:2181, initiating session
> 2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost its session with ZooKeeper, trying to reconnect.
> 2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
> 2012-05-31 18:26:44,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x237a35960010023 closed
> 2012-05-31 18:26:44,450 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
> 2012-05-31 18:26:44,451 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hb-zk2:2181,hb-zk1:2181,hb-zk3:2181 sessionTimeout=180000 watcher=hconnection
> 2012-05-31 18:26:44,459 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk3/10.0.1.50:2181
> 2012-05-31 18:26:44,490 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk3/10.0.1.50:2181, initiating session
> 2012-05-31 18:26:44,552 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted:
> java.io.InterruptedIOException: Aborting compaction of store core in region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. because it was interrupted.
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1008)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:789)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:810)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:750)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> 2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. after 2mins, 12sec
> 2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2012-05-31 18:26:44,555 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24.
> 2012-05-31 18:26:44,650 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hb-zk3/10.0.1.50:2181, sessionid = 0x337a359bb980024, negotiated timeout = 40000
> 2012-05-31 18:26:44,659 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
> 2012-05-31 18:26:44,659 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2012-05-31 18:26:45,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk2/10.0.0.51:2181
> 2012-05-31 18:26:45,009 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk2/10.0.0.51:2181, initiating session
> 2012-05-31 18:26:45,012 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x337a359bb980001 has expired, closing socket connection
> 2012-05-31 18:26:45,012 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d1,60020,1338475437868, load=(requests=145, regions=17, usedHeap=499, maxHeap=4091): regionserver:60020-0x337a359bb980001 regionserver:60020-0x337a359bb980001 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:361)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:279)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
>
> As I understand it, node 2 failed because of a long compaction, but I don't know how I can fix that. And for node 1, I don't really understand the cause nor how to fix it either :(
>
> I'm using hbase 0.90 from cdh3 packages (cloudera). I don't understand why a node that is considered dead after an error can't come back afterwards...
>
> thanks
>
>
> On 5/29/12 5:17 PM, Cyril Scetbon wrote:
>>
>> Hi,
>>
>> I've installed hbase on the following configuration:
>>
>> 12 x (rest hbase + regionserver hbase + datanode hadoop)
>> 2 x (zookeeper + hbase master)
>> 1 x (zookeeper + hbase master + namenode hadoop)
>>
>> OS used is ubuntu lucid (10.04)
>>
>> The issue is that when I try to load data using the REST API, some hosts become unreachable even though I can still ping them. I can no longer connect to them, and even monitoring tools stop working for a period of time.
>> For example, I use SAR on each host, and you can see that between 7:10 and 7:35 pm the host does not write any information (columns: time, CPU, %user, %nice, %system, %iowait, %steal, %idle):
>>
>> 06:45:01 PM  all   0.18   0.00   0.37   3.61   0.25  95.58
>> 06:45:01 PM    0   0.24   0.00   0.54   6.62   0.35  92.25
>> 06:45:01 PM    1   0.12   0.00   0.20   0.61   0.15  98.92
>> 06:50:02 PM  all   5.69   0.00   1.79   4.23   1.94  86.36
>> 06:50:02 PM    0   5.68   0.00   3.00   7.91   2.21  81.21
>> 06:50:02 PM    1   5.70   0.00   0.59   0.55   1.66  91.51
>> 06:55:01 PM  all   0.68   0.00   0.14   1.62   0.23  97.33
>> 06:55:01 PM    0   0.87   0.00   0.20   3.19   0.31  95.44
>> 06:55:01 PM    1   0.49   0.00   0.08   0.05   0.15  99.22
>> 06:58:36 PM  all   0.03   0.00   0.02   0.45   0.07  99.43
>> 06:58:36 PM    0   0.01   0.00   0.02   0.40   0.13  99.43
>> 06:58:36 PM    1   0.04   0.00   0.01   0.51   0.00  99.43
>> 07:05:01 PM  all   0.03   0.00   0.00   0.10   0.07  99.80
>> 07:05:01 PM    0   0.02   0.00   0.00   0.10   0.10  99.78
>> 07:05:01 PM    1   0.04   0.00   0.01   0.09   0.03  99.83  <--- last measure before the host becomes unreachable
>> 07:40:07 PM  all  14.72   0.00  17.93   0.02  13.31  54.02  <--- first measure after the host becomes reachable again
>> 07:40:07 PM    0  29.43   0.00  35.87   0.00  26.57   8.13
>> 07:40:07 PM    1   0.00   0.00   0.00   0.04   0.04  99.91
>> 07:45:01 PM  all   0.55   0.00   0.25   0.04   0.27  98.89
>> 07:45:01 PM    0   0.54   0.00   0.14   0.05   0.21  99.07
>> 07:45:01 PM    1   0.55   0.00   0.36   0.04   0.33  98.72
>> 07:50:01 PM  all   0.11   0.00   0.05   0.18   0.06  99.60
>> 07:50:01 PM    0   0.12   0.00   0.06   0.13   0.09  99.60
>> 07:50:01 PM    1   0.11   0.00   0.04   0.23   0.04  99.59
>> 07:55:01 PM  all   0.00   0.00   0.01   0.05   0.07  99.88
>> 07:55:01 PM    0   0.00   0.00   0.01   0.01   0.13  99.84
>> 07:55:01 PM    1   0.00   0.00   0.00   0.08   0.00  99.91
>> 08:05:01 PM  all   0.01   0.00   0.00   0.00   0.05  99.94
>> 08:05:01 PM    0   0.00   0.00   0.00   0.00   0.08  99.91
>> 08:05:01 PM    1   0.03   0.00   0.00   0.00   0.01  99.96
>>
>> I suppose it's caused by high load, but I don't have any proof :( Is there a known bug about that? I had a similar issue with Cassandra that forced me to upgrade to a linux kernel > 3.0
>>
>> thanks.
>>
>
>
> --
> Cyril SCETBON
>
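One more thing worth pulling out of the node2 log above: the client asks for sessionTimeout=180000 but the server answers "negotiated timeout = 40000". ZooKeeper caps sessions at its maxSessionTimeout, which defaults to 20 x tickTime (2000 ms), i.e. 40 s, so any pause approaching 40 s expires the session no matter what HBase requests. A sketch of the two sides of that negotiation (values illustrative; fixing the GC pauses is still the real cure, a bigger timeout only buys headroom):

    <!-- hbase-site.xml on the region servers: the timeout the client requests;
         yours already asks for 180000 per the "Initiating client connection" line -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>180000</value>
    </property>

    # zoo.cfg on every ZooKeeper server: the cap on what actually gets granted.
    # Default maxSessionTimeout is 20 * tickTime = 40000 ms, which is exactly
    # the 40000 your log shows being negotiated.
    tickTime=2000
    maxSessionTimeout=180000

Also check whether those machines swap under load: the ~35-minute sar blackout looks like the whole box stalled, and a JVM heap that gets paged out turns even a modest GC into a multi-minute freeze. The usual advice is to keep vm.swappiness low and make sure the combined daemon heaps fit comfortably in RAM.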
