Dear Team,

In one of our HBase clusters, some replication queues were not properly
removed, even though the corresponding peer ID has been removed and no
longer appears in list_peers.

As a result, region servers are restarting frequently in the cluster
where the replication has to be written.

I tried running hbase hbck -fixReplication, but it did not help.

The HBase version is 1.4.14.

Below are the exceptions from the Master and the RegionServer, respectively.
*Master Exception*

2023-11-18 13:01:30,815 ERROR [172.XX.XX.XX,16020,1700289063450_ChoreService_2] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:01:30,815 WARN  [172.XX.XX.XX,,16020,1700289063450_ChoreService_2] cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
java.io.IOException: Failed to delete queue, replicator: 172.XX.XX.XX,,16020,1655822657566, queueId: 3
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.removeQueue(ReplicationZKNodeCleaner.java:160)
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.removeQueues(ReplicationZKNodeCleaner.java:197)
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
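
For reference, the queue that the Master fails to delete can be mapped to a ZooKeeper path from the "Failed to delete queue" line above. The following is only a minimal sketch: it assumes the default znode layout (zookeeper.znode.parent = /hbase, with replication queues stored under replication/rs), which may not match this cluster's configuration, and the helper name queue_znode is hypothetical.

```python
import re

# Assumption: default HBase 1.x layout, i.e. queues live under
# <zookeeper.znode.parent>/replication/rs/<server name>/<queue id>.
DEFAULT_RS_ZNODE = "/hbase/replication/rs"


def queue_znode(log_line, base=DEFAULT_RS_ZNODE):
    """Build the znode path of the queue named in a 'Failed to delete queue' line."""
    match = re.search(r"replicator:\s*(\S+),\s*queueId:\s*(\S+)", log_line)
    if match is None:
        raise ValueError("no replicator/queueId found in log line")
    server, queue_id = match.groups()
    return f"{base}/{server}/{queue_id}"


# The log line from the Master exception, joined onto one line.
line = ("java.io.IOException: Failed to delete queue, replicator: "
        "172.XX.XX.XX,,16020,1655822657566, queueId: 3")
print(queue_znode(line))
```

The resulting path is where I would expect the leftover queue to be visible from a ZooKeeper client, if the defaults apply.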



*RegionServer Exception*

2023-11-18 13:17:52,200 WARN  [main-SendThread(10.XX.XX.XX:2171)] zookeeper.ClientCnxn: Session 0xXXXXXXX for server 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:17:52,300 WARN  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1685)
        at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.moveQueueUsingMulti(ReplicationQueuesZKImpl.java:410)
        at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueue(ReplicationQueuesZKImpl.java:257)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:700)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)



Please help me resolve this issue.


Regards,
Manimekalai K
