I think the fix is mostly good. Chris is working on a test. This will be in 0.92, but it can probably be backported.
-- Lars

----- Original Message -----
From: Stuti Awasthi <[email protected]>
To: "[email protected]" <[email protected]>
Cc:
Sent: Monday, September 19, 2011 9:25 PM
Subject: RE: Unexpected shutdown of Zookeeper

Hi JD,

Thanks for your response. I was planning to use replication for my production/development servers but it seems like work is still going on this issue. I want to know that which version release is planned for this bug. Currently Im using Hbase 0.90.3

Some of my queries are :

1. Will running 3-4 zookeeper node helps in case of failure of 1-2 zookeeper node? Will the cluster keeps on running or it will be down ?

Thanks
-Stuti

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
Sent: Monday, September 19, 2011 11:04 PM
To: [email protected]
Subject: Re: Unexpected shutdown of Zookeeper

I think this is just: https://issues.apache.org/jira/browse/HBASE-3130

J-D

On Sun, Sep 18, 2011 at 10:15 PM, Stuti Awasthi <[email protected]> wrote:
> Hi All,
>
> I was running a 2 node cluster with 1 zookeeper node and 2 region server
> node. I had also setup cluster replication with another single node
> Hbase-Hadoop cluster. Replication was successful and I left the cluster
> running over the weekend with no data for replication.
>
> Today I can see that in Master cluster Zookeeper is dead. 1 region server
> which was running on slave machine is also dead. The cluster to which I was
> replicating is running fine.
>
> My queries are :
>
> 1. Can zookeeper be dead because there is no replication over the
> network for long time ?
>
> 2. How to cater to these situations ? Running 3-4 zookeeper node will
> help ?
>
> 3. If I run multiple Zookeeper node, then will the cluster keep on
> running normally even if 2-3 zookeeper are dead?
>
> 4. In my case, out of 2 region server, 1 is dead but 1 is still
> working, if my zookeeper node was running, will I able to access hbase
> properly.
>
> Logs :
> hbase-root-zookeeper-master.log :
>
> 2011-09-19 10:07:55,753 INFO
> org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection
> from /10.33.64.235:44706
> 2011-09-19 10:07:55,758 INFO
> org.apache.zookeeper.server.NIOServerCnxn: Client attempting to
> establish new session at /10.33.64.235:44706
> 2011-09-19 10:07:55,761 INFO
> org.apache.zookeeper.server.NIOServerCnxn: Established session
> 0x13271b6c4f1000c with negotiated timeout 180000 for client
> /10.33.64.235:44706
> 2011-09-19 10:10:48,318 WARN
> org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException:
> Unable to read additional data from client sessionid
> 0x13271b6c4f1000c, likely client has closed socket
> 2011-09-19 10:10:48,319 INFO
> org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection
> for client /10.33.64.235:44706 which had sessionid 0x13271b6c4f1000c
> 2011-09-19 10:12:57,002 INFO
> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
> 0x13271b6c4f1000c, timeout of 180000ms exceeded
> 2011-09-19 10:12:57,002 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x13271b6c4f1000c
>
> hbase-root-regionserver-slave.log:
>
> 2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC
> Server listener on 60020: readAndProcess threw exception
> java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> 2011-09-16 16:00:51,058 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening log for replication slave%3A60020.1316168146136 at 663246
> 2011-09-16 16:00:51,064 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> currentNbOperations:5003 and seenEntries:0 and size: 0
> 2011-09-16 16:00:51,064 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Going to report log #slave%3A60020.1316168146136 for position
> 663246 in
> hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
> 2011-09-16 16:00:51,066 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Removing 0 logs in the list: []
> 2011-09-16 16:00:51,066 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Nothing to replicate, sleeping 1000 times 2
> 2011-09-16 16:00:53,068 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication slave%3A60020.1316168146136 at 663246
> ..................................
> 2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session
> 0x13271b5395c0007 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection timed out
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2011-09-16 17:14:51,039 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> /hbase/rs/master,60020,1316167798366 znode expired, trying to
> lock it
> 2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server slave1/172.28.96.239:2181
> 2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to slave1/172.28.96.239:2181, initiating
> session
> 2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable
> to reconnect to ZooKeeper service, session 0x13271b5395c0007 has
> expired, closing socket connection
> 2011-09-16 17:14:51,094 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=slave,60020,1316168145427, load=(requests=0,
> regions=6, usedHeap=29, maxHeap=996): connection to cluster:
> 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2011-09-16 17:14:51,094 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0,
> memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29,
> maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384,
> blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2,
> blockCacheEvictedCount=0, blockCacheHitRatio=93,
> blockCacheHitCachingRatio=93
> 2011-09-16 17:14:51,094 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
> connection to cluster: 1-0x13271b5395c0007 connection to cluster:
> 1-0x13271b5395c0007 received expired from ZooKeeper, aborting
> 2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-09-16 17:14:51,114 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping server on 60020
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
> infoServer
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server listener on 60020
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server Responder
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped
> [email protected]:60030
> 2011-09-16 17:14:52,585 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2011-09-16 17:14:52,585 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver60020.cacheFlusher exiting
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller:
> LogRoller exiting.
> 2011-09-16 17:14:52,586 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
> regionserver60020.majorCompactionChecker exiting
> 2011-09-16 17:14:52,587 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing
> close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
> 2011-09-16 17:14:52,588 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer interrupted while waiting for sync
> requests
> 2011-09-16 17:14:52,588 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling
> compactions & flushes
> 2011-09-16 17:14:52,588 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing
> close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
> 2011-09-16 17:14:52,588 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing
> close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
> 2011-09-16 17:14:52,589 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer exiting
> 2011-09-16 17:14:52,588 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of -ROOT-,,0.70236052
> 2011-09-16 17:14:52,589 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
> 2011-09-16 17:14:52,589 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: disabling
> compactions & flushes ............................
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x13271b6c4f10003 closed
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x13271b6c4f10005 closed
> 2011-09-16 17:14:52,605 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Closing source 1 because: Region server is closing
> 2011-09-16 17:14:52,605 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2011-09-16 17:14:53,040 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Not transferring queue since we are shutting down
> 2011-09-16 17:14:53,042 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-14,5,main]
> 2011-09-16 17:14:53,042 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-09-16 17:14:53,042 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook
> thread.
> 2011-09-16 17:14:53,042 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> Please suggest.
>
> Thanks
