I still have region server nodes that crash :(
You can see the reasons from 2 of the region server nodes:
*node1 :*
2012-05-31 17:57:03,937 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region ise,ID-TST-100-1Rw6ue7L8iFMpWMqMmZYxDdoRLCIzI70e5kVMrgLrXW,1338486988233.aaaa5b0661020d1107eb31e982d71126.
2012-05-31 17:57:03,938 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 2 file(s) in cf=core, hasReferences=true, into hdfs://hb-zk1:54310/hbase/ise/aaaa5b0661020d1107eb31e982d71126/.tmp, seqid=995472, totalSize=256.5m
2012-05-31 17:58:24,860 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_989994740052943354_38499java.net.SocketTimeoutException: 72000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.16:45314 remote=/10.0.0.16:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3049)
2012-05-31 17:59:11,655 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x337a359bb980015, likely server has closed socket, closing socket connection and attempting reconnect
2012-05-31 17:59:11,687 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 bad datanode[0] 10.0.0.16:50010
2012-05-31 17:59:11,689 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_989994740052943354_38499 in pipeline 10.0.0.16:50010, 10.0.0.13:50010, 10.0.0.11:50010, 10.0.0.15:50010: bad datanode 10.0.0.16:50010
2012-05-31 17:59:11,688 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d6,60020,1338475673130, load=(requests=0, regions=13, usedHeap=434, maxHeap=4091): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hb-d6,60020,1338475673130 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:749)
*node2 :*
2012-05-31 18:26:44,339 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk1/10.0.0.50:2181
2012-05-31 18:26:44,340 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk1/10.0.0.50:2181, initiating session
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
2012-05-31 18:26:44,343 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
2012-05-31 18:26:44,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x237a35960010023 closed
2012-05-31 18:26:44,450 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
2012-05-31 18:26:44,451 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hb-zk2:2181,hb-zk1:2181,hb-zk3:2181 sessionTimeout=180000 watcher=hconnection
2012-05-31 18:26:44,459 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk3/10.0.1.50:2181
2012-05-31 18:26:44,490 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk3/10.0.1.50:2181, initiating session
2012-05-31 18:26:44,552 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted: java.io.InterruptedIOException: Aborting compaction of store core in region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. because it was interrupted.
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1008)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:789)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:810)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:750)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24. after 2mins, 12sec
2012-05-31 18:26:44,553 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2012-05-31 18:26:44,555 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed ise,ID-TST-100-P8xzKBeOhD3p9z9iRQTws5X454eyixFONbKmFSAKxNy,1338488671506.774822fbcfa2ab325b31c3c800d72f24.
2012-05-31 18:26:44,650 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hb-zk3/10.0.1.50:2181, sessionid = 0x337a359bb980024, negotiated timeout = 40000
2012-05-31 18:26:44,659 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
2012-05-31 18:26:44,659 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-05-31 18:26:45,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hb-zk2/10.0.0.51:2181
2012-05-31 18:26:45,009 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hb-zk2/10.0.0.51:2181, initiating session
2012-05-31 18:26:45,012 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x337a359bb980001 has expired, closing socket connection
2012-05-31 18:26:45,012 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hb-d1,60020,1338475437868, load=(requests=145, regions=17, usedHeap=499, maxHeap=4091): regionserver:60020-0x337a359bb980001 regionserver:60020-0x337a359bb980001 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:361)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:279)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
As I understand it, node 2 failed because of a long compaction, but I don't know how to fix that. As for node 1, I don't really understand the cause, let alone how to fix it :(
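The only concrete lead I have for node 1 is the 72000 ms SocketTimeoutException in the HDFS write pipeline right before the abort. If that is really the trigger, I was thinking of raising the HDFS socket timeouts, but this is only a guess on my part and the values below are just examples:

<!-- hdfs-site.xml (sketch, example values): give the DFS client and the
     datanodes more time before a slow pipeline member is marked bad -->
<property>
  <!-- read timeout used by the DFS client and the datanodes (60000 ms by default, I believe) -->
  <name>dfs.socket.timeout</name>
  <value>120000</value>
</property>
<property>
  <!-- matching write timeout on the datanode side -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>
</property>

I'm not sure whether the timeout is the real cause or just a symptom of the node being stuck, though.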
I'm using HBase 0.90 from the CDH3 packages (Cloudera). I also don't understand why a node that gets marked dead after an error can't come back afterwards ...
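Another thing I noticed is that the region servers ask ZooKeeper for sessionTimeout=180000, but the negotiated timeout is only 40000 ms, so I suppose the ZooKeeper servers cap it with their maxSessionTimeout. If the sessions expire because of GC pauses or I/O stalls longer than 40 s, maybe raising both sides would give the servers some headroom. Again, this is just a sketch of what I have in mind, not something I have validated:

<!-- hbase-site.xml (sketch): ask for a longer ZooKeeper session for the region servers -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<!-- the ZooKeeper servers also need to allow it in zoo.cfg, e.g.
     maxSessionTimeout=120000
     otherwise the requested value keeps being negotiated down (to 40000 ms here) -->

But even with a longer session I would still expect a server that was declared dead to be able to rejoin once it is responsive again, so maybe I'm missing something.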
thanks
On 5/29/12 5:17 PM, Cyril Scetbon wrote:
Hi,
I've installed HBase with the following configuration:
12 x (HBase REST + HBase regionserver + Hadoop datanode)
2 x (ZooKeeper + HBase master)
1 x (ZooKeeper + HBase master + Hadoop namenode)
The OS used is Ubuntu Lucid (10.04).
The issue is that when I try to load data using the REST API (an example of the kind of request I send is at the end of this message), some hosts become unreachable even though I can still ping them. I can no longer connect to them, and even the monitoring tools stop working for a period of time. For example, I run sar on each host, and you can see below that between 7:10 and 7:35 pm the host does not write any information:
              CPU     %user     %nice   %system   %iowait    %steal     %idle
06:45:01 PM   all      0.18      0.00      0.37      3.61      0.25     95.58
06:45:01 PM     0      0.24      0.00      0.54      6.62      0.35     92.25
06:45:01 PM     1      0.12      0.00      0.20      0.61      0.15     98.92
06:50:02 PM   all      5.69      0.00      1.79      4.23      1.94     86.36
06:50:02 PM     0      5.68      0.00      3.00      7.91      2.21     81.21
06:50:02 PM     1      5.70      0.00      0.59      0.55      1.66     91.51
06:55:01 PM   all      0.68      0.00      0.14      1.62      0.23     97.33
06:55:01 PM     0      0.87      0.00      0.20      3.19      0.31     95.44
06:55:01 PM     1      0.49      0.00      0.08      0.05      0.15     99.22
06:58:36 PM   all      0.03      0.00      0.02      0.45      0.07     99.43
06:58:36 PM     0      0.01      0.00      0.02      0.40      0.13     99.43
06:58:36 PM     1      0.04      0.00      0.01      0.51      0.00     99.43
07:05:01 PM   all      0.03      0.00      0.00      0.10      0.07     99.80
07:05:01 PM     0      0.02      0.00      0.00      0.10      0.10     99.78
07:05:01 PM     1      0.04      0.00      0.01      0.09      0.03     99.83  <--- last measure before the host becomes unreachable
07:40:07 PM   all     14.72      0.00     17.93      0.02     13.31     54.02  <--- first measure after the host becomes reachable again
07:40:07 PM     0     29.43      0.00     35.87      0.00     26.57      8.13
07:40:07 PM     1      0.00      0.00      0.00      0.04      0.04     99.91
07:45:01 PM   all      0.55      0.00      0.25      0.04      0.27     98.89
07:45:01 PM     0      0.54      0.00      0.14      0.05      0.21     99.07
07:45:01 PM     1      0.55      0.00      0.36      0.04      0.33     98.72
07:50:01 PM   all      0.11      0.00      0.05      0.18      0.06     99.60
07:50:01 PM     0      0.12      0.00      0.06      0.13      0.09     99.60
07:50:01 PM     1      0.11      0.00      0.04      0.23      0.04     99.59
07:55:01 PM   all      0.00      0.00      0.01      0.05      0.07     99.88
07:55:01 PM     0      0.00      0.00      0.01      0.01      0.13     99.84
07:55:01 PM     1      0.00      0.00      0.00      0.08      0.00     99.91
08:05:01 PM   all      0.01      0.00      0.00      0.00      0.05     99.94
08:05:01 PM     0      0.00      0.00      0.00      0.00      0.08     99.91
08:05:01 PM     1      0.03      0.00      0.00      0.00      0.01     99.96
I suppose it's caused by a high load, but I don't have any proof :( Is there a known bug about this? I had a similar issue with Cassandra that forced me to upgrade to a Linux kernel > 3.0.
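For reference, the load is done through the REST gateway with requests roughly like the one below, a few rows at a time. This is only an illustration: the column qualifier and the value are made up, and the row key, column and value are base64-encoded in the body as the REST interface expects.

PUT http://<regionserver>:8080/ise/ID-TST-100-example   (Content-Type: text/xml)

<CellSet>
  <Row key="SUQtVFNULTEwMC1leGFtcGxl">                       <!-- "ID-TST-100-example" -->
    <Cell column="Y29yZTpmaWVsZA==">c29tZSB2YWx1ZQ==</Cell>  <!-- core:field = "some value" -->
  </Row>
</CellSet>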
thanks.
--
Cyril SCETBON