I'm still having region nodes that crash :(

You can see the reasons on 2 region nodes:

*node1:*

2012-05-31 17:57:03,937 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region ise,ID-TST-100-1Rw6ue7L8iFMpWMqMmZYxDdoRLCIzI70e5kVMrgLrXW,1338486988233.aaaa5b0661020d1107eb31e982d71126.
2012-05-31 17:57:03,938 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 2 file(s) in cf=core, hasReferences=true, into hdfs://hb-zk1:54310/hbase/ise/aaaa5b0661020d1107eb31e982d71126/.tmp, seqid=995472, totalSize=256.5m
2012-05-31 17:58:24,860 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_989994740052943354_38499java.net.SocketTimeoutException: 72000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.16:45314 remote=/10.0.0.16:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3049)
2012-05-31 17:59:11,655 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x337a359bb980015, likely server has closed socket, closing socket connection and attempting reconnect
2012-05-31 17:59:11,687 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 bad datanode[0] 10.0.0.16:50010
2012-05-31 17:59:11,689 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 in pipeline 10.0.0.16:50010, 10.0.0.13:50010, 10.0.0.11:50010, 10.0.0.15:50010: bad datanode 10.0.0.16:50010
2012-05-31 17:59:11,688 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d6,60020,1338475673130, load=(requests=0, regions=13, usedHeap=434, maxHeap=4091): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:749)

*node2:*
2012-05-31 18:26:44,339 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk1/10.0.0.50:2181
2012-05-31 18:26:44,340 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk1/10.0.0.50:2181, initiating session
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
2012-05-31 18:26:44,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x237a35960010023 closed
2012-05-31 18:26:44,450 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
2012-05-31 18:26:44,451 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hb-zk2:2181,hb-zk1:2181,hb-zk3:2181 sessionTimeout=180000 watcher=hconnection
2012-05-31 18:26:44,459 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk3/10.0.1.50:2181
2012-05-31 18:26:44,490 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk3/10.0.1.50:2181, initiating session
2012-05-31 18:26:44,552 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted: java.io.InterruptedIOException: Aborting compaction of store core in region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. because it was interrupted.
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1008)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:789)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:810)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:750)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. after 2mins, 12sec
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2012-05-31 18:26:44,555 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24.
2012-05-31 18:26:44,650 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hb-zk3/10.0.1.50:2181, sessionid = 0x337a359bb980024, negotiated timeout = 40000
2012-05-31 18:26:44,659 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
2012-05-31 18:26:44,659 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-05-31 18:26:45,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk2/10.0.0.51:2181
2012-05-31 18:26:45,009 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk2/10.0.0.51:2181, initiating session
2012-05-31 18:26:45,012 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x337a359bb980001 has expired, closing socket connection
2012-05-31 18:26:45,012 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d1,60020,1338475437868, load=(requests=145, regions=17, usedHeap=499, maxHeap=4091): regionserver:60020-0x337a359bb980001 regionserver:60020-0x337a359bb980001 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:361)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:279)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

As I understand it, node 2 failed because of a long compaction, but I don't know how to fix that. And for node 1 I don't really understand the cause, nor how to fix it either :(
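Since the reconnect message in the node2 log itself blames "a network partition or a long-running GC pause", my next step is to enable GC logging on the region servers to check for long pauses. Something like this in hbase-env.sh should do it (the log path is just an example, adjust to your layout):

```shell
# hbase-env.sh -- turn on verbose GC logging so multi-second
# stop-the-world pauses become visible in the GC log
# (/var/log/hbase/gc-hbase.log is an example path, not a default)
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

If a full GC in that log lines up with the timestamp of the session loss, that would point to heap tuning rather than the network.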

I'm using hbase 0.90 from cdh3 packages (cloudera). I also don't understand why a node that gets marked as dead after an error can't come back afterwards ...
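One detail I notice in the node2 log: the client asks for sessionTimeout=180000 but the server answers "negotiated timeout = 40000". If I read the ZooKeeper docs correctly, the server caps sessions at maxSessionTimeout, which defaults to 20 * tickTime (40s with tickTime=2000), so raising the timeout on the HBase side alone is not enough. Something like this on each ZooKeeper server might be needed (example values, not tested here):

```
# zoo.cfg -- let clients actually get the longer session they ask for
# default maxSessionTimeout is 20 * tickTime, i.e. 40000 ms here
tickTime=2000
maxSessionTimeout=180000
```

With that cap raised, the 180000 ms requested by the region servers should survive negotiation.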

thanks

On 5/29/12 5:17 PM, Cyril Scetbon wrote:
Hi,

I've installed hbase on the following configuration:

12 x (rest hbase + regionserver hbase + datanode hadoop)
2 x (zookeeper + hbase master)
1 x (zookeeper + hbase master + namenode hadoop)

OS used is ubuntu lucid (10.04)

The issue is that when I try to load data using the rest api, some hosts become unreachable even though I can ping them. I can no longer connect to them, and even monitoring tools stop working for a lapse of time. For example, I use SAR on each host, and you can see that between 7:10 and 7:35 pm the host does not write any information:

                CPU    %user  %nice  %system  %iowait  %steal  %idle
06:45:01 PM     all     0.18   0.00     0.37     3.61    0.25  95.58
06:45:01 PM       0     0.24   0.00     0.54     6.62    0.35  92.25
06:45:01 PM       1     0.12   0.00     0.20     0.61    0.15  98.92
06:50:02 PM     all     5.69   0.00     1.79     4.23    1.94  86.36
06:50:02 PM       0     5.68   0.00     3.00     7.91    2.21  81.21
06:50:02 PM       1     5.70   0.00     0.59     0.55    1.66  91.51
06:55:01 PM     all     0.68   0.00     0.14     1.62    0.23  97.33
06:55:01 PM       0     0.87   0.00     0.20     3.19    0.31  95.44
06:55:01 PM       1     0.49   0.00     0.08     0.05    0.15  99.22
06:58:36 PM     all     0.03   0.00     0.02     0.45    0.07  99.43
06:58:36 PM       0     0.01   0.00     0.02     0.40    0.13  99.43
06:58:36 PM       1     0.04   0.00     0.01     0.51    0.00  99.43
07:05:01 PM     all     0.03   0.00     0.00     0.10    0.07  99.80
07:05:01 PM       0     0.02   0.00     0.00     0.10    0.10  99.78
07:05:01 PM       1     0.04   0.00     0.01     0.09    0.03  99.83   <--- last measure before host becomes unreachable
07:40:07 PM     all    14.72   0.00    17.93     0.02   13.31  54.02   <--- first measure after host becomes reachable again
07:40:07 PM       0    29.43   0.00    35.87     0.00   26.57   8.13
07:40:07 PM       1     0.00   0.00     0.00     0.04    0.04  99.91
07:45:01 PM     all     0.55   0.00     0.25     0.04    0.27  98.89
07:45:01 PM       0     0.54   0.00     0.14     0.05    0.21  99.07
07:45:01 PM       1     0.55   0.00     0.36     0.04    0.33  98.72
07:50:01 PM     all     0.11   0.00     0.05     0.18    0.06  99.60
07:50:01 PM       0     0.12   0.00     0.06     0.13    0.09  99.60
07:50:01 PM       1     0.11   0.00     0.04     0.23    0.04  99.59
07:55:01 PM     all     0.00   0.00     0.01     0.05    0.07  99.88
07:55:01 PM       0     0.00   0.00     0.01     0.01    0.13  99.84
07:55:01 PM       1     0.00   0.00     0.00     0.08    0.00  99.91
08:05:01 PM     all     0.01   0.00     0.00     0.00    0.05  99.94
08:05:01 PM       0     0.00   0.00     0.00     0.00    0.08  99.91
08:05:01 PM       1     0.03   0.00     0.00     0.00    0.01  99.96

I suppose it's caused by a high load but I don't have any proof :( Is there a known bug about that? I had a similar issue with Cassandra that forced me to upgrade to a linux kernel > 3.0.
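In case it helps, this is roughly how I extract that window from the sar history (the saDD file name depends on the day of month, so adjust it):

```shell
# sysstat: -u = CPU stats, -f = daily binary file, -s/-e = time window
# sa31 is just an example file name for the 31st of the month
sar -u -f /var/log/sysstat/sa31 -s 18:45:00 -e 20:10:00
```

I could run the same with `sar -W` (swapping) over that window to see whether the box was swapping when it went silent.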

thanks.



--
Cyril SCETBON
