Hi All,

I was running a two-node HBase cluster with one ZooKeeper node and two region 
servers. I had also set up cluster replication to another single-node 
HBase/Hadoop cluster. Replication worked, and I left the cluster running over 
the weekend with no data to replicate.

Today I can see that ZooKeeper on the master cluster is dead. The region server 
that was running on the slave machine is also dead. The cluster I was 
replicating to is running fine.

My questions are:

1. Can ZooKeeper die because there was no replication traffic over the network 
for a long time?

2. How should I guard against this kind of failure? Would running 3-4 ZooKeeper 
nodes help?

3. If I run multiple ZooKeeper nodes, will the cluster keep running normally 
even if 2-3 of them are dead?

4. In my case, one of the two region servers is dead but the other is still 
working. If my ZooKeeper node had stayed up, would I have been able to access 
HBase properly?
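
For context on questions 2 and 3: my understanding is that a ZooKeeper ensemble 
stays available only while a strict majority of its servers is up. Below is a 
minimal Python sketch of that arithmetic, assuming the majority rule applies to 
my setup. The function names are my own and not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare the ensemble sizes mentioned in the questions above.
    for n in (1, 3, 4, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

If this is right, 3 nodes and 4 nodes both survive only one failure, and no 
ensemble of 3-4 nodes would survive 2-3 dead servers.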

Logs:
hbase-root-zookeeper-master.log:

2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: 
Accepted socket connection from /10.33.64.235:44706
2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: Client 
attempting to establish new session at /10.33.64.235:44706
2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: 
Established session 0x13271b6c4f1000c with negotiated timeout 180000 for client 
/10.33.64.235:44706
2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: 
EndOfStreamException: Unable to read additional data from client sessionid 
0x13271b6c4f1000c, likely client has closed socket
2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed 
socket connection for client /10.33.64.235:44706 which had sessionid 
0x13271b6c4f1000c
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: 
Expiring session 0x13271b6c4f1000c, timeout of 180000ms exceeded
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: 
Processed session termination for sessionid: 0x13271b6c4f1000c

hbase-root-regionserver-slave.log:

2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
listener on 60020: readAndProcess threw exception java.io.IOException: 
Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
       at sun.nio.ch.FileDispatcher.read0(Native Method)
       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
       at sun.nio.ch.IOUtil.read(IOUtil.java:175)
       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
2011-09-16 16:00:51,058 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication slave%3A60020.1316168146136 at 663246
2011-09-16 16:00:51,064 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
currentNbOperations:5003 and seenEntries:0 and size: 0
2011-09-16 16:00:51,064 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Going to report log #slave%3A60020.1316168146136 for position 663246 in 
hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
2011-09-16 16:00:51,066 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Removing 0 logs in the list: []
2011-09-16 16:00:51,066 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to 
replicate, sleeping 1000 times 2
2011-09-16 16:00:53,068 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication slave%3A60020.1316168146136 at 663246
..................................
2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 
0x13271b5395c0007 for server null, unexpected error, closing socket connection 
and attempting reconnect
java.net.ConnectException: Connection timed out
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-09-16 17:14:51,039 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
/hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server slave1/172.28.96.239:2181
2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to slave1/172.28.96.239:2181, initiating session
2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x13271b5395c0007 has expired, closing 
socket connection
2011-09-16 17:14:51,094 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
serverName=slave,60020,1316168145427, load=(requests=0, regions=6, usedHeap=29, 
maxHeap=996): connection to cluster: 1-0x13271b5395c0007 connection to cluster: 
1-0x13271b5395c0007 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
       at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
       at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
       at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-09-16 17:14:51,094 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, 
memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29, 
maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384, 
blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, 
blockCacheEvictedCount=0, blockCacheHitRatio=93, blockCacheHitCachingRatio=93
2011-09-16 17:14:51,094 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: connection to 
cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 
received expired from ZooKeeper, aborting
2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2011-09-16 17:14:51,114 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source 
exiting 1
2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 60020
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 0 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 2 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 1 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 0 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 2 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 9 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 3 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 8 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 6 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 4 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 5 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 7 on 60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 6 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 8 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 9 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 1 on 60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 3 on 60020: exiting
2011-09-16 17:14:52,478 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server listener on 60020
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 4 on 60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 5 on 60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server Responder
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 7 on 60020: exiting
2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped 
[email protected]:60030
2011-09-16 17:14:52,585 INFO 
org.apache.hadoop.hbase.regionserver.CompactSplitThread: 
regionserver60020.compactor exiting
2011-09-16 17:14:52,585 INFO 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: 
regionserver60020.cacheFlusher exiting
2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: 
LogRoller exiting.
2011-09-16 17:14:52,586 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: 
regionserver60020.majorCompactionChecker exiting
2011-09-16 17:14:52,587 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
regionserver60020.logSyncer interrupted while waiting for sync requests
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling 
compactions & flushes
2011-09-16 17:14:52,588 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
2011-09-16 17:14:52,588 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
regionserver60020.logSyncer exiting
2011-09-16 17:14:52,588 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
close of -ROOT-,,0.70236052
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
closing hlog writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling 
compactions & flushes
............................
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x13271b6c4f10003 closed
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x13271b6c4f10005 closed
2011-09-16 17:14:52,605 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing 
source 1 because: Region server is closing
2011-09-16 17:14:52,605 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2011-09-16 17:14:53,040 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Not 
transferring queue since we are shutting down
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Shutdown hook starting; hbase.shutdown.hook=true; 
fsShutdownHook=Thread[Thread-14,5,main]
2011-09-16 17:14:53,042 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Starting fs shutdown hook thread.
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Shutdown hook finished.
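
To pull the session-expiry and abort events out of logs like the ones above, a 
small script may help. This is only a sketch. It assumes each entry starts with 
a `yyyy-mm-dd hh:mm:ss,SSS` timestamp and that wrapped lines belong to the 
preceding entry. The function names and the keyword list are hypothetical 
choices of mine, not anything HBase or ZooKeeper provides.

```python
import re

# An entry begins with a timestamp such as "2011-09-16 17:14:51,094 ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def join_wrapped(lines):
    """Merge continuation lines (no leading timestamp) into the preceding entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def find_session_events(lines,
                        keywords=("Expiring session", "has expired", "ABORTING")):
    """Return the joined entries that contain any of the given keywords."""
    return [e for e in join_wrapped(lines) if any(k in e for k in keywords)]


if __name__ == "__main__":
    import sys
    # Usage: python scan_log.py hbase-root-regionserver-slave.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for entry in find_session_events(fh):
            print(entry)
```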

Please suggest.

Thanks
