Hi all,
I had an issue recently where a scan job I frequently run caught ConnectionLoss
and subsequently failed to recover.
The stack trace looks like this:
11/04/08 12:20:04 INFO zookeeper.ZooKeeper: Session: 0x12f2497b00d03d8 closed
11/04/08 12:20:04 WARN client.HConnectionManager$ClientZKWatcher: No longer connected to ZooKeeper, current state: Disconnected
11/04/08 12:20:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:21811
11/04/08 12:20:05 INFO zookeeper.ZooKeeper: Session: 0x12f2497b00d03d9 closed
11/04/08 12:20:06 INFO zookeeper.ZooKeeperWrapper: Reconnecting to zookeeper
11/04/08 12:20:06 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:21811 sessionTimeout=60000 watcher=org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper@51127a
11/04/08 12:20:06 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:21811
11/04/08 12:20:06 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
11/04/08 12:20:06 WARN zookeeper.ZooKeeperWrapper: Problem getting stats for /hbase/rs
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/rs
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.getRSDirectoryCount(ZooKeeperWrapper.java:754)
    at org.apache.hadoop.hbase.client.HTable.getCurrentNrHRS(HTable.java:173)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:147)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:102)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.prefetchRegionCache(HConnectionManager.java:732)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:783)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:677)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:650)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:470)
    at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1145)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:503)
    at com.adobe.hs.ets.dozer.afs.EtsAfsBuilder.getHBaseTimestamp(EtsAfsBuilder.java:215)
    at com.adobe.hs.ets.dozer.afs.EtsAfsBuilder.syncHour(EtsAfsBuilder.java:310)
    at com.adobe.hs.ets.dozer.afs.EtsAfsBuilder.go(EtsAfsBuilder.java:130)
    at BuildAfs.main(BuildAfs.java:43)
11/04/08 12:20:07 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:21811
11/04/08 12:20:07 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
11/04/08 12:20:09 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:21811
11/04/08 12:20:09 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
It then goes on to retry endlessly. Killing the spinning job and running it again worked fine, so I would prefer a crash over an endless retry loop.
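For what it's worth, my reading of the HBase docs is that the client-side retry behavior can be bounded in hbase-site.xml; I haven't verified that these property names apply to my 0.89 build, so treat this as a guess:

<!-- Guess: bound client retries so the job fails fast instead of spinning.
     Property names unverified against 0.89. -->
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value>
</property>
<property>
  <name>hbase.client.pause</name>
  <value>1000</value>
</property>

If that works, a ConnectionLoss that outlasts the retries would surface as an exception in my code rather than spinning forever.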
I'm not especially concerned about whatever caused the ConnectionLoss in the first place, but I am interested in some way to handle the ZK exceptions gracefully. For example, the call site in my code that leads to the exception is this:
Get get = new Get(Bytes.toBytes(level.rowKeyDateFormat.format(dts)));
Result result = timestampsTable.get(get);
I suppose this means that if I want to catch ConnectionLoss in my code, I have to wrap all my gets and puts in a catch block. Or maybe just the first one? It seems like HTable and friends could catch this exception in a more central location, maybe somewhere around this frame:
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.getRSDirectoryCount(ZooKeeperWrapper.java:754)
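In the meantime, the wrapping I'm imagining is something along these lines (just a sketch of mine, not an HBase API; BoundedRetry is a name I made up, and a real version would catch the specific ZooKeeper/HBase exception types rather than bare Exception):

```java
import java.util.concurrent.Callable;

public class BoundedRetry {
    // Retry a client call a bounded number of times, then rethrow the last
    // failure so the job crashes instead of retrying forever.
    // Assumes maxAttempts >= 1.
    public static <T> T call(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {  // real code: ConnectionLossException etc.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs * attempt);  // linear backoff
                }
            }
        }
        throw last;  // give up and let the job die
    }
}
```

At the call site that would look like `Result result = BoundedRetry.call(() -> timestampsTable.get(get), 3, 1000);` (lambda for brevity; on the Java of that era it would be an anonymous Callable).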
I'm running HBase 0.89.20100924+28. Will this issue go away if I upgrade to a
newer version?
Thanks,
Sandy