I guess the problem is that you exceeded the maximum size limit for a
ZooKeeper multi operation.

I searched the code base of branch-1; you could try setting
'hbase.zookeeper.useMulti' to false in your hbase-site.xml to disable
multi so the operation can succeed. But this may introduce
inconsistency, so you'd better find out why there are so many files
that need to be claimed or deleted, fix that problem, and then switch
hbase.zookeeper.useMulti back to true.
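
A minimal sketch of the hbase-site.xml change described above (the
property name matches branch-1; remember to switch it back to true
once the stale queues are cleaned up):

```xml
<!-- Temporary workaround: disable ZooKeeper multi so the cleanup is
     issued as individual ZK operations instead of one oversized
     multi transaction. Revert to true after fixing the root cause. -->
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>false</value>
</property>
```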

Also, the 1.4.x release line is already EOL; I suggest you upgrade to
the current stable release line, 2.5.x.

Thanks.

Manimekalai <k.manimeka...@gmail.com> wrote on Sat, Nov 18, 2023 at 20:21:
>
> Dear Team,
>
> In one of our HBase clusters, some of the replication queues have not
> been properly removed, even though the corresponding peerId has been
> removed from list_peers.
>
> Because of this, frequent region server restarts are occurring in the
> cluster that replication is written to.
>
> I have tried running hbase hbck -fixReplication, but it didn't work.
>
> The HBase version is 1.4.14.
>
> Below are the exceptions from the Master and RegionServer, respectively.
> *Master Exception*
>
> 2023-11-18 13:01:30,815 ERROR
> > [172.XX.XX.XX,16020,1700289063450_ChoreService_2]
> > zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> > 2023-11-18 13:01:30,815 WARN  
> > [172.XX.XX.XX,,16020,1700289063450_ChoreService_2]
> > cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
> > java.io.IOException: Failed to delete queue, replicator:
> > 172.XX.XX.XX,,16020,1655822657566, queueId: 3
> >         at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.
> > removeQueue(ReplicationZKNodeCleaner.java:160)
> >         at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.
> > removeQueues(ReplicationZKNodeCleaner.java:197)
> >         at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
> >         at
> > org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >         at
> > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> >         at
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> >         at
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> >         at
> > org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
> >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
>
>
>
> *RegionServer Exception*
>
> 2023-11-18 13:17:52,200 WARN  [main-SendThread(10.XX.XX.XX:2171)]
> > zookeeper.ClientCnxn: Session 0xXXXXXXX for server
> > 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection
> > and attempting reconnect
> > java.io.IOException: Broken pipe
> >         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> >         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> >         at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> >         at
> > org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> >         at
> > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >         at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
> > 2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0]
> > zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> > 2023-11-18 13:17:52,300 WARN  [ReplicationExecutor-0]
> > replication.ReplicationQueuesZKImpl: Got exception in
> > copyQueuesFromRSUsingMulti:
> > org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss
> >         at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> >         at
> > org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
> >         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
> >         at
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
> >         at
> > org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1685)
> >         at
> > org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.moveQueueUsingMulti(ReplicationQueuesZKImpl.java:410)
> >         at
> > org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueue(ReplicationQueuesZKImpl.java:257)
> >         at
> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:700)
> >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
>
>
>
> Please help to solve this issue.
>
>
> Regards,
> Manimekalai K
