I think this is just: https://issues.apache.org/jira/browse/HBASE-3130
J-D

On Sun, Sep 18, 2011 at 10:15 PM, Stuti Awasthi <[email protected]> wrote:
> Hi All,
>
> I was running a 2-node cluster with 1 ZooKeeper node and 2 region server nodes. I had also set up cluster replication with another single-node HBase-Hadoop cluster. Replication was successful and I left the cluster running over the weekend with no data to replicate.
>
> Today I can see that ZooKeeper is dead on the master cluster. One region server, which was running on the slave machine, is also dead. The cluster to which I was replicating is running fine.
>
> My queries are:
>
> 1. Can ZooKeeper die because there is no replication over the network for a long time?
> 2. How do I cater to these situations? Will running 3-4 ZooKeeper nodes help?
> 3. If I run multiple ZooKeeper nodes, will the cluster keep running normally even if 2-3 ZooKeeper nodes are dead?
> 4. In my case, out of 2 region servers, 1 is dead but 1 is still working. If my ZooKeeper node were running, would I be able to access HBase properly?
>
> Logs:
>
> hbase-root-zookeeper-master.log:
>
> 2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.33.64.235:44706
> 2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.33.64.235:44706
> 2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x13271b6c4f1000c with negotiated timeout 180000 for client /10.33.64.235:44706
> 2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x13271b6c4f1000c, likely client has closed socket
> 2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.33.64.235:44706 which had sessionid 0x13271b6c4f1000c
> 2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x13271b6c4f1000c, timeout of 180000ms exceeded
> 2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x13271b6c4f1000c
>
> hbase-root-regionserver-slave.log:
>
> 2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> 2011-09-16 16:00:51,058 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
> 2011-09-16 16:00:51,064 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:5003 and seenEntries:0 and size: 0
> 2011-09-16 16:00:51,064 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #slave%3A60020.1316168146136 for position 663246 in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
> 2011-09-16 16:00:51,066 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
> 2011-09-16 16:00:51,066 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 2
> 2011-09-16 16:00:53,068 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication slave%3A60020.1316168146136 at 663246
> ..................................
> 2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 0x13271b5395c0007 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection timed out
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2011-09-16 17:14:51,039 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: /hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
> 2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave1/172.28.96.239:2181
> 2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to slave1/172.28.96.239:2181, initiating session
> 2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x13271b5395c0007 has expired, closing socket connection
> 2011-09-16 17:14:51,094 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=slave,60020,1316168145427, load=(requests=0, regions=6, usedHeap=29, maxHeap=996): connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29, maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384, blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, blockCacheEvictedCount=0, blockCacheHitRatio=93, blockCacheHitCachingRatio=93
> 2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
> 2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:51,114 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source exiting 1
> 2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped [email protected]:60030
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
> 2011-09-16 17:14:52,587 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer interrupted while waiting for sync requests
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling compactions & flushes
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
> 2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of -ROOT-,,0.70236052
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling compactions & flushes
> ............................
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10003 closed
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10005 closed
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing source 1 because: Region server is closing
> 2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> 2011-09-16 17:14:53,040 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Not transferring queue since we are shutting down
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
> 2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> Please suggest.
>
> Thanks
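
On questions 2 and 3 in the quoted message: ZooKeeper only keeps serving while a strict majority of the configured ensemble is up, which is why odd ensemble sizes are the usual recommendation. The sketch below is plain arithmetic to illustrate that rule, not an HBase or ZooKeeper API; the function names quorum_size and tolerated_failures are made up for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of live servers that still forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Hypothetical ensemble sizes, chosen to match the ones asked about.
    for n in (1, 3, 4, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

By this arithmetic a 3-node or a 4-node ensemble survives one dead ZooKeeper server, a 5-node ensemble survives two, and the single-node setup described in the quoted message survives none.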
