Andor,

Thank you for your response.
We are uncertain about the specific conditions that trigger the issue, which makes it difficult to predict whether it would occur on version 3.8.4. Since multiple critical modules, including Hadoop and Kafka, rely on ZooKeeper, any disruption could lead to data loss on our end. Identifying the exact root cause or the triggering scenario would help us either fix or mitigate the issue. I can provide additional logs if needed.

Could you please clarify the specific issue and confirm whether a fix is available in version 3.8.4?

On Mon, Jun 23, 2025 at 9:27 PM Andor Molnar <an...@apache.org> wrote:

> Hi Arjun,
>
> Could you please validate the same scenario with the latest stable version
> 3.8.4?
>
> Andor
>
>
> On Jun 23, 2025, at 00:09, arjun s v <arjun.cs...@gmail.com> wrote:
> >
> > Continuing on the ephemeral node issue,
> >
> > I observed that the learner sends ACKs for each packet it receives, but
> > there seems to be no verification on the leader's side to confirm these
> > ACKs against the packets sent.
> > Is there a configuration option that, when enabled, ensures all packet
> > ACKs, including COMMIT ACKs, are validated?
> > If packet loss is the reason for this issue, verifying all received ACKs
> > against the sent packets could help prevent such problems in the future.
> >
> > Please advise.
> >
> > On Thu, Jun 19, 2025 at 6:35 PM arjun s v <arjun.cs...@gmail.com> wrote:
> >
> >> Team,
> >>
> >> I'm investigating an issue where an ephemeral node in ZooKeeper was not
> >> properly managed after a server rejoined the ensemble. My setup uses
> >> ZooKeeper 3.9.3. Below is the timeline of events:
> >>
> >> Timeline:
> >>
> >> 1. An ephemeral node is created.
> >> 2. The node is synced across all servers in the ensemble.
> >> 3. Follower 'A' drops out of the ensemble due to a connectivity issue.
> >> 4. The client session associated with the ephemeral node then
> >> disconnects, deleting the ephemeral node across all active servers in
> >> the ensemble.
> >> 5. A new client session is initiated, creating another ephemeral node
> >> with the same path.
> >> 6. This new ephemeral node is synced across all active servers in the
> >> ensemble.
> >> 7. Follower 'A' rejoins the ensemble.
> >> 8. The leader syncs the latest commits to follower 'A'.
> >> 9. However, (Ephemeral Node).getEphemeralOwner() does not return the
> >> current session's session ID.
> >>
> >> I couldn't confirm whether an old ephemeral node had persisted, because
> >> the machine was restarted, which resolved the issue. Debug logs were not
> >> enabled, so no additional logs are available to confirm the root cause.
> >> I suspect packet loss during the rejoin may have contributed. Attached
> >> are the leader-to-follower sync logs.
> >>
> >> Could you please advise whether there are known issues with ephemeral
> >> node cleanup during server rejoins, or other scenarios to check? Is this
> >> likely due to packet loss or synchronization issues?
> >>
> >> Thanks in advance!
> >>
> >> Sync logs:
> >>
> >>> Follower:
> >>>
> >>> 11:29:25:853 org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
> >>> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
> >>>
> >>> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
> >>> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
> >>>
> >>> 11:29:25:859 org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
> >>> Successfully completed the authentication using SASL. server addr:
> >>> <DOMAIN/IPADDR>:2888, status: SUCCESS
> >>>
> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
> >>> Peer state changed: following - synchronization
> >>>
> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Getting a diff from the leader 0x240025c7cc
> >>>
> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Got zxid 0x240025c7cc expected 0x1
> >>>
> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Learner received NEWLEADER message
> >>>
> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
> >>> Peer state changed: following - synchronization - diff
> >>>
> >>> 11:29:25:870 org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
> >>> Dynamic reconfig is disabled, we don't store the last seen config.
> >>>
> >>> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> It took 2ms to persist and commit txns in packetsCommitted. 0 outstanding
> >>> txns left in packetsNotLogged
> >>>
> >>> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Set the current epoch to 37
> >>>
> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
> >>> Peer state changed: following - synchronization
> >>>
> >>> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Sent NEWLEADER ack to leader with zxid 2500000000
> >>>
> >>> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> >>> Learner received UPTODATE message
> >>>
> >>>
> >>> Leader:
> >>>
> >>> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
> >>> Follower sid: 2 : info : <ADDR>:2888:3888:participant
> >>>
> >>> 11:29:25:868 org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
> >>> On disk txn sync enabled with snapshotSizeFactor 0.33
> >>>
> >>> 11:29:25:868 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
> >>> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc
> >>> minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc
> >>> peerLastZxid=0x240025c7c3
> >>>
> >>> 11:29:25:869 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
> >>> Using committedLog for peer sid: 2
> >>>
> >>> 11:29:25:870 org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
> >>> Sending DIFF zxid=0x240025c7cc for peer sid: 2
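
For reference, below is a minimal sketch of how the ephemeral-owner check in step 9 of the quoted timeline could be repeated against each server individually, using only the standard ZooKeeper Java client (ZooKeeper.exists() and Stat.getEphemeralOwner()). The host names and the znode path are placeholders rather than values from this thread, and a production version would wait for the SyncConnected event before reading.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EphemeralOwnerCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder znode path and server addresses (adjust for your ensemble).
        String path = "/my-ephemeral-node";
        String[] servers = { "server-a:2181", "server-b:2181", "server-c:2181" };

        for (String server : servers) {
            // Connect to one server at a time so that server's local view is what we read.
            ZooKeeper zk = new ZooKeeper(server, 30000, event -> { });
            try {
                Stat stat = zk.exists(path, false);
                if (stat == null) {
                    System.out.println(server + ": node does not exist");
                } else {
                    // ephemeralOwner is the id of the session that created the node.
                    // Compare it (and czxid) across servers and against the
                    // ZooKeeper.getSessionId() of the client that currently owns
                    // the node; a mismatch on the rejoined follower would point
                    // to a stale copy of the ephemeral node.
                    System.out.println(server
                            + ": ephemeralOwner=0x" + Long.toHexString(stat.getEphemeralOwner())
                            + ", czxid=0x" + Long.toHexString(stat.getCzxid()));
                }
            } finally {
                zk.close();
            }
        }
    }
}

Querying each server through its own client connection matters here because ZooKeeper serves reads from the local data tree, so a diverged copy of the node on the rejoined follower would be visible there even while the leader and the other followers look correct.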