I think this is just:

https://issues.apache.org/jira/browse/HBASE-3130

J-D

On Sun, Sep 18, 2011 at 10:15 PM, Stuti Awasthi <[email protected]> wrote:
> Hi All,
>
> I was running a 2-node cluster with 1 ZooKeeper node and 2 region server 
> nodes. I had also set up cluster replication to another single-node 
> HBase-Hadoop cluster. Replication was successful, and I left the cluster 
> running over the weekend with no data to replicate.
>
> Today I can see that ZooKeeper on the master cluster is dead. The region 
> server that was running on the slave machine is also dead. The cluster to 
> which I was replicating is running fine.
>
> My questions are:
>
> 1. Can ZooKeeper die because there has been no replication traffic over the 
> network for a long time?
>
> 2. How should I handle these situations? Would running 3-4 ZooKeeper nodes 
> help?
>
> 3. If I run multiple ZooKeeper nodes, will the cluster keep running 
> normally even if 2-3 of them are dead?
>
> 4. In my case, 1 of the 2 region servers is dead but 1 is still working. 
> If my ZooKeeper node were running, would I be able to access HBase 
> properly?
>
> Logs :
> hbase-root-zookeeper-master.log :
>
> 2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: 
> Accepted socket connection from /10.33.64.235:44706
> 2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: 
> Client attempting to establish new session at /10.33.64.235:44706
> 2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: 
> Established session 0x13271b6c4f1000c with negotiated timeout 180000 for 
> client /10.33.64.235:44706
> 2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: 
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x13271b6c4f1000c, likely client has closed socket
> 2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: 
> Closed socket connection for client /10.33.64.235:44706 which had sessionid 
> 0x13271b6c4f1000c
> 2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: 
> Expiring session 0x13271b6c4f1000c, timeout of 180000ms exceeded
> 2011-09-19 10:12:57,002 INFO 
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session 
> termination for sessionid: 0x13271b6c4f1000c
>
> hbase-root-regionserver-slave.log:
>
> 2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
> listener on 60020: readAndProcess threw exception java.io.IOException: 
> Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>       at sun.nio.ch.FileDispatcher.read0(Native Method)
>       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>       at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> 2011-09-16 16:00:51,058 DEBUG 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening 
> log for replication slave%3A60020.1316168146136 at 663246
> 2011-09-16 16:00:51,064 DEBUG 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
> currentNbOperations:5003 and seenEntries:0 and size: 0
> 2011-09-16 16:00:51,064 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Going to report log #slave%3A60020.1316168146136 for position 663246 in 
> hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
> 2011-09-16 16:00:51,066 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Removing 0 logs in the list: []
> 2011-09-16 16:00:51,066 DEBUG 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing 
> to replicate, sleeping 1000 times 2
> 2011-09-16 16:00:53,068 DEBUG 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening 
> log for replication slave%3A60020.1316168146136 at 663246
> ..................................
> 2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x13271b5395c0007 for server null, unexpected error, closing socket 
> connection and attempting reconnect
> java.net.ConnectException: Connection timed out
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2011-09-16 17:14:51,039 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> /hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
> 2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server slave1/172.28.96.239:2181
> 2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to slave1/172.28.96.239:2181, initiating session
> 2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to 
> reconnect to ZooKeeper service, session 0x13271b5395c0007 has expired, 
> closing socket connection
> 2011-09-16 17:14:51,094 FATAL 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
> serverName=slave,60020,1316168145427, load=(requests=0, regions=6, 
> usedHeap=29, maxHeap=996): connection to cluster: 1-0x13271b5395c0007 
> connection to cluster: 1-0x13271b5395c0007 received expired from ZooKeeper, 
> aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
>       at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2011-09-16 17:14:51,094 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
> requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, 
> memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=29, 
> maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384, 
> blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, 
> blockCacheEvictedCount=0, blockCacheHitRatio=93, blockCacheHitCachingRatio=93
> 2011-09-16 17:14:51,094 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: connection to 
> cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 
> received expired from ZooKeeper, aborting
> 2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2011-09-16 17:14:51,114 DEBUG 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Source 
> exiting 1
> 2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping 
> server on 60020
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 1 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 0 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 2 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 9 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 3 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 8 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 6 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 4 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 5 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 7 on 60020: exiting
> 2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 6 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 8 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 9 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 1 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 3 on 60020: exiting
> 2011-09-16 17:14:52,478 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
> Server listener on 60020
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 4 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 5 on 60020: exiting
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
> Server Responder
> 2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC 
> Server handler 7 on 60020: exiting
> 2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped 
> [email protected]:60030
> 2011-09-16 17:14:52,585 INFO 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: 
> regionserver60020.compactor exiting
> 2011-09-16 17:14:52,585 INFO 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: 
> regionserver60020.cacheFlusher exiting
> 2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: 
> LogRoller exiting.
> 2011-09-16 17:14:52,586 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: 
> regionserver60020.majorCompactionChecker exiting
> 2011-09-16 17:14:52,587 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
> close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> regionserver60020.logSyncer interrupted while waiting for sync requests
> 2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
> Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.: disabling 
> compactions & flushes
> 2011-09-16 17:14:52,588 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
> close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
> 2011-09-16 17:14:52,588 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
> close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
> 2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
> regionserver60020.logSyncer exiting
> 2011-09-16 17:14:52,588 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing 
> close of -ROOT-,,0.70236052
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> closing hlog writer in 
> hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
> 2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
> Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.: 
> disabling compactions & flushes
> ............................
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x13271b6c4f10003 closed
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x13271b6c4f10005 closed
> 2011-09-16 17:14:52,605 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Closing 
> source 1 because: Region server is closing
> 2011-09-16 17:14:52,605 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> 2011-09-16 17:14:53,040 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Not transferring queue since we are shutting down
> 2011-09-16 17:14:53,042 INFO 
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; 
> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
> 2011-09-16 17:14:53,042 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> 2011-09-16 17:14:53,042 INFO 
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook 
> thread.
> 2011-09-16 17:14:53,042 INFO 
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>
> Please suggest.
>
> Thanks
