Team, I'm investigating an issue where an ephemeral node in ZooKeeper was not properly managed after a server rejoined the ensemble. My setup uses ZooKeeper 3.9.3. Below is the timeline of events:

Timeline:

1. An ephemeral node is created.
2. This is synced across all servers in the ensemble.
3. Follower 'A' goes out of the ensemble due to a connectivity issue.
4. Now the client session associated with the ephemeral node disconnects, deleting the ephemeral node across all active servers in the ensemble.
5. A new client session is initiated, creating another ephemeral node with the same path.
6. This new ephemeral node is synced across all active servers in the ensemble.
7. Follower 'A' rejoins the ensemble.
8. The leader syncs the latest commits to follower 'A'.
9. However, (Ephemeral Node).getEphemeralOwner() does not return the current session's session ID.

I couldn't confirm if an old ephemeral node persisted, as the machine was restarted, resolving the issue. Debug logs were not enabled, so no additional logs are available to confirm the root cause. I suspect packet loss during the rejoin may have contributed. Attached are the leader-to-follower sync logs.

Could you please advise if there are known issues with ephemeral node cleanup during server rejoins, or other scenarios to check? Is this likely due to packet loss or synchronization issues?

Thanks in advance!

Sync logs:

> Follower:
>
> 11:29:25:853 org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run
> Successfully connected to leader, using address: <DOMAIN/IPADDR>:2888
>
> 11:29:25:854 org.apache.zookeeper.util.SecurityUtils.createSaslClient
> QuorumLearner will use DIGEST-MD5 as SASL mechanism.
>
> 11:29:25:859 org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthLearner.checkAuthStatus
> Successfully completed the authentication using SASL. server addr: <DOMAIN/IPADDR>:2888, status: SUCCESS
>
> 11:29:25:864 org.apache.zookeeper.server.quorum.QuorumPeer.setZabState
> Peer state changed: following - synchronization
>
> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Getting a diff from the leader 0x240025c7cc
>
> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Got zxid 0x240025c7cc expected 0x1
>
> 11:29:25:869 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Learner received NEWLEADER message
>
> 11:29:25:869 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
> Peer state changed: following - synchronization - diff
>
> 11:29:25:870 org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier
> Dynamic reconfig is disabled, we don't store the last seen config.
>
> 11:29:25:871 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> It took 2ms to persist and commit txns in packetsCommitted. 0 outstanding txns left in packetsNotLogged
>
> 11:29:25:873 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Set the current epoch to 37
>
> 11:29:25:874 org.apache.zookeeper.server.quorum.QuorumPeer.setSyncMode
> Peer state changed: following - synchronization
>
> 11:29:25:874 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Sent NEWLEADER ack to leader with zxid 2500000000
>
> 11:29:25:879 org.apache.zookeeper.server.quorum.Learner.syncWithLeader
> Learner received UPTODATE message
>
> Leader:
>
> 11:29:25:864 org.apache.zookeeper.server.quorum.LearnerHandler.run
> Follower sid: 2 : info : <ADDR>:2888:3888:participant
>
> 11:29:25:868 org.apache.zookeeper.server.ZKDatabase.isTxnLogSyncEnabled
> On disk txn sync enabled with snapshotSizeFactor 0.33
>
> 11:29:25:868 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
> Synchronizing with Learner sid: 2 maxCommittedLog=0x240025c7cc minCommittedLog=0x240025c5d0 lastProcessedZxid=0x240025c7cc peerLastZxid=0x240025c7c3
>
> 11:29:25:869 org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower
> Using committedLog for peer sid: 2
>
> 11:29:25:870 org.apache.zookeeper.server.quorum.LearnerHandler.queueCommittedProposals
> Sending DIFF zxid=0x240025c7cc for peer sid: 2
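
In case it helps with reading the zxids in the sync logs, here is a small sketch that splits each one into its epoch (high 32 bits) and counter (low 32 bits). The class and helper names (`ZxidDecode`, `epochOf`, `counterOf`, `parseHex`, `withinCommittedLog`) are made up for illustration and are not ZooKeeper APIs. It assumes the zxid in the "Sent NEWLEADER ack" line is printed in hex without the `0x` prefix, and the range check is only a rough approximation of how the leader decides it can send a DIFF from its committed log.

```java
// Hypothetical helper, not part of ZooKeeper: decodes the zxids copied from the logs above.
final class ZxidDecode {
    // The epoch is stored in the high 32 bits of a zxid.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // The per-epoch transaction counter is stored in the low 32 bits.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Accepts hex with or without a leading "0x", as both forms appear in the logs.
    static long parseHex(String s) {
        String hex = s.startsWith("0x") ? s.substring(2) : s;
        return Long.parseUnsignedLong(hex, 16);
    }

    // Simplified assumption: a DIFF from the committed log is possible when the
    // peer's last zxid lies inside [minCommittedLog, maxCommittedLog].
    static boolean withinCommittedLog(long peerLast, long min, long max) {
        return peerLast >= min && peerLast <= max;
    }

    public static void main(String[] args) {
        long maxCommitted = parseHex("0x240025c7cc"); // leader maxCommittedLog
        long minCommitted = parseHex("0x240025c5d0"); // leader minCommittedLog
        long peerLast = parseHex("0x240025c7c3");     // follower 'A' peerLastZxid
        long newLeaderAck = parseHex("2500000000");   // zxid in the NEWLEADER ack line

        System.out.println("leader epoch = " + epochOf(maxCommitted));
        System.out.println("peer epoch = " + epochOf(peerLast));
        System.out.println("txns behind = " + (counterOf(maxCommitted) - counterOf(peerLast)));
        System.out.println("peer within committedLog = "
                + withinCommittedLog(peerLast, minCommitted, maxCommitted));
        System.out.println("NEWLEADER ack epoch = " + epochOf(newLeaderAck));
    }
}
```

With the values from the logs, the leader and follower 'A' are both in epoch 36, the follower is 9 transactions behind, and the NEWLEADER ack decodes to epoch 37 with counter 0, which matches the "Set the current epoch to 37" line.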