[ https://issues.apache.org/jira/browse/ZOOKEEPER-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829769#comment-17829769 ]
Jimmy LIN edited comment on ZOOKEEPER-4444 at 3/22/24 7:36 AM: --------------------------------------------------------------- This issue happened in my environment, then zookeeper follower data was inconsistent with leader's. znode was deleted, but could be read from one follower. My zookeeper version is 3.7.1 2023-09-22 06:14:59,309 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):Learner@737][] - Learner received NEWLEADER message 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):QuorumPeer@921][] - Peer state changed: following - synchronization 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@491][] - Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@452][] - Configuring CommitProcessor with 28 worker threads. 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):FollowerRequestProcessor@59][] - Initialized FollowerRequestProcessor with zookeeper.follower.skipLearnerRequestToNextProcessor as false 2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):MBeanRegistry@110][] - Failed to register MBean InMemoryDataTree 2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):LearnerZooKeeperServer@104][] - Failed to register with JMX org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) at org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106) at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102) at org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:715) at org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:698) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1522) 2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-18:NIOServerCnxn@380][] - Close of session 0x0 ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) // 30065370041 2023-09-22 06:15:00,292 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370041 is <= 30065371040 for 5 2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-10:NIOServerCnxn@380][] - Close of session 0x0 ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370042 is <= 30065371040 for 1 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370043 is <= 30065371040 for 1 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370044 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370045 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370046 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370047 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370048 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370049 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370050 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370051 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370052 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370053 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370054 is <= 30065371040 for 1 was (Author: JIRAUSER304714): This issue happened in my environment, then zookeeper follower data was inconsistent with leader. znode was deleted, but could be read from one follower. 2023-09-22 06:14:59,309 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):Learner@737][] - Learner received NEWLEADER message 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):QuorumPeer@921][] - Peer state changed: following - synchronization 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@491][] - Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@452][] - Configuring CommitProcessor with 28 worker threads. 2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):FollowerRequestProcessor@59][] - Initialized FollowerRequestProcessor with zookeeper.follower.skipLearnerRequestToNextProcessor as false 2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):MBeanRegistry@110][] - Failed to register MBean InMemoryDataTree 2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):LearnerZooKeeperServer@104][] - Failed to register with JMX org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) at org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106) at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102) at org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:715) at org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:698) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1522) 2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-18:NIOServerCnxn@380][] - Close of session 0x0 ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) // 30065370041 2023-09-22 06:15:00,292 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370041 is <= 30065371040 for 5 2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-10:NIOServerCnxn@380][] - Close of session 0x0 ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370042 is <= 30065371040 for 1 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370043 is <= 30065371040 for 1 2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370044 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370045 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370046 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370047 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370048 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370049 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370050 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370051 is <= 30065371040 for 2 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370052 is <= 30065371040 for 5 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370053 is <= 30065371040 for 1 2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370054 is <= 30065371040 for 1 > Follower doesn't get synchronized after process restart > ------------------------------------------------------- > > Key: ZOOKEEPER-4444 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4444 > Project: ZooKeeper > Issue Type: Bug > Affects Versions: 3.6.3 > Reporter: Andreas Weber > Priority: Major > > Hi folks, I've got an issue with 3.6.3. > I can't provide a simple test, because it seems to depend on timing in a > cluster environment, but I tried to reduce the scenario as far as possible: > * zookeeper cluster with 5 nodes, all of them Followers (no Observers) > * start some parallel threads which do some writes in the cluster (e.g. > create/delete znodes) > * kill one of the zookeeper processes, let's say on node X (where node X is > not the Leader) > * restart zookeeper process on node X > * wait a few seconds > * kill zookeeper process on node X again > * restart zookeeper process on node X again > In some cases (every 3-4 runs) I see the following in the log of node X: > After first restart of node X: > {noformat} > WARN persistence.FileTxnLog - Current zxid 4294968525 is <= > 4294969524 for 15 > WARN persistence.FileTxnLog - Current zxid 4294968526 is <= > 4294969524 for 15 > WARN persistence.FileTxnLog - Current zxid 4294968527 is <= > 4294969524 for 15 > ... (this kind of WARN is repeated some hundred times) > WARN quorum.SendAckRequestProcessor - Closing connection to leader, > exception during packet send java.net.SocketException: Socket closed ... > ... (this kind of WARN is repeated some hundred times) > {noformat} > After second restart of node X: > {noformat} > ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > > 4294969147(next log) for type 2 > WARN server.DataTree - Message:Digests are not matching. > Value is Zxid. Value:4294969147 > ERROR server.DataTree - First digest mismatch on txn: > 360466402305310720,3870,4294969147,1639258399998,2 > , ... > , expected digest is 2,1365261838770 > , actual digest is 1098406565142, > ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > > 4294969148(next log) for type 2 > ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > > 4294969149(next log) for type 5 > ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > > 4294969150(next log) for type 2 > ... (this kind of ERROR is repeated some hundred times) > {noformat} > And afterwards (in the actual application), zookepeer on node X seems to have > a different view of the cluster state and doesn't get synchronized, at least > for a few hours. > This e.g. leads to phantom reads of znodes that were deleted a long time ago. > (The resulting behaviour looks a little bit similar as described in > ZOOKEEPER-3911.) > This does not happen with zookeeper 3.6.2 ! > (at least I can't reproduce it with this version) -- This message was sent by Atlassian Jira (v8.20.10#820010)