Dear Team,

In one of our HBase clusters, some replication queues were not removed properly, even though the corresponding peer ID has been removed and no longer appears in list_peers.
Because of this, region servers are restarting frequently in the cluster where the replication has to be written. I tried running hbase hbck -fixReplication, but it did not help.

The HBase version is 1.4.14.

Below are the exceptions from the Master and the RegionServer, respectively.

*Master Exception*

2023-11-18 13:01:30,815 ERROR [172.XX.XX.XX,16020,1700289063450_ChoreService_2] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:01:30,815 WARN [172.XX.XX.XX,,16020,1700289063450_ChoreService_2] cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
java.io.IOException: Failed to delete queue, replicator: 172.XX.XX.XX,,16020,1655822657566, queueId: 3
    at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.removeQueue(ReplicationZKNodeCleaner.java:160)
    at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.removeQueues(ReplicationZKNodeCleaner.java:197)
    at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

*RegionServer Exception*

2023-11-18 13:17:52,200 WARN [main-SendThread(10.XX.XX.XX:2171)] zookeeper.ClientCnxn: Session 0xXXXXXXX for server 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:17:52,300 WARN [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1685)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.moveQueueUsingMulti(ReplicationQueuesZKImpl.java:410)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueue(ReplicationQueuesZKImpl.java:257)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:700)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Please help me resolve this issue.

Regards,
Manimekalai K