I've created a Jira ticket for this: https://issues.apache.org/jira/browse/ZOOKEEPER-2234

Thanks,
Adam

On 15 Jul 2015 16:07, Adam Milne-Smith <[email protected]> wrote:
>
> Whilst writing a patch for ZOOKEEPER-2141 (3.4.6 branch), we spotted an
> ephemeral node that had not been deleted despite its session having expired.
> Its ACL long did not exist in the ACL cache, so any operation against this
> node will fail.
>
> This could lead to things like Curator locks never being deleted (even after
> the timeout) and applications deadlocking.
>
> We inspected the code and are reasonably certain that there are no bugs in
> updating the in-memory data tree that could cause this. However, serialising
> the snapshot happens asynchronously and follows these 4 steps:
>
> - copy the sessions map
> - serialise the sessions map copy
> - serialise the ACL map (synchronised)
> - serialise the data tree (synchronised at the individual node level)
>
> We suspect the issue we are seeing is a new session and ephemeral node being
> created during the data tree serialisation, hence the corresponding session
> and ACL are missing from the snapshot but the node is present. This means the
> snapshot contains a partial transaction.
>
> If we were to deserialise from this snapshot then the in-memory data would be
> invalid. If one member of the quorum were to reboot and restore from this
> snapshot, it would contain this node where the other hosts had removed it. If
> this host were to become the leader and send its snapshot to other members of
> the quorum, those would have the invalid data too.
>
> As far as we can see, the only way to delete this node when this happens in
> production would be to perform manual surgery on the snapshot.
>
> Can anyone confirm that they agree this to be the case, or let us know if
> we've misunderstood something?
>
> Thanks,
> Adam
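
To make the suspected interleaving in the quoted message concrete, here is a minimal toy model in Java. It is only a sketch under stated assumptions: none of the class or method names (FuzzySnapshotSketch, Server, Snapshot, createSessionAndEphemeral, danglingNodes) are ZooKeeper classes or APIs, the paths and ids are made up, and every node is treated as ephemeral. The concurrent create is simulated deterministically by placing it just before the tree is written, rather than part-way through the tree serialisation on another thread, and nothing else a real server does on snapshot or restore is modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical and are not ZooKeeper classes.
final class FuzzySnapshotSketch {

    /** A data-tree node that refers to its ACL and owning session by id. */
    static final class Node {
        final long aclId;          // the "ACL long" pointing into the ACL map
        final long ephemeralOwner; // id of the session that owns the node
        Node(long aclId, long ephemeralOwner) {
            this.aclId = aclId;
            this.ephemeralOwner = ephemeralOwner;
        }
    }

    /** Live in-memory state of the toy server. */
    static final class Server {
        final Map<Long, Integer> sessions = new HashMap<>();   // session id -> timeout (ms)
        final Map<Long, String> acls = new HashMap<>();        // ACL long -> ACL description
        final Map<String, Node> tree = new LinkedHashMap<>();  // path -> node

        /** One logical transaction: new session, new ACL entry, new ephemeral node. */
        void createSessionAndEphemeral(long sessionId, long aclId, String path) {
            sessions.put(sessionId, 30_000);
            acls.put(aclId, "digest:user:cdrwa");
            tree.put(path, new Node(aclId, sessionId));
        }
    }

    /** What ends up in the serialised snapshot. */
    static final class Snapshot {
        final Map<Long, Integer> sessions = new HashMap<>();
        final Map<Long, String> acls = new HashMap<>();
        final Map<String, Node> nodes = new LinkedHashMap<>();

        /** Paths whose session or ACL long is absent from this snapshot. */
        List<String> danglingNodes() {
            List<String> out = new ArrayList<>();
            for (Map.Entry<String, Node> e : nodes.entrySet()) {
                Node n = e.getValue();
                if (!sessions.containsKey(n.ephemeralOwner) || !acls.containsKey(n.aclId)) {
                    out.add(e.getKey());
                }
            }
            return out;
        }
    }

    /** Runs the four steps with a create interleaved before the tree is written. */
    static Snapshot snapshotWithRacingCreate() {
        Server server = new Server();
        server.createSessionAndEphemeral(1L, 1L, "/locks/lock-0000");

        Snapshot snap = new Snapshot();
        Map<Long, Integer> sessionsCopy = new HashMap<>(server.sessions); // step 1: copy sessions
        snap.sessions.putAll(sessionsCopy);                               // step 2: serialise the copy
        snap.acls.putAll(server.acls);                                    // step 3: serialise ACL map

        // Interleaved write, standing in for a request handled on another thread.
        server.createSessionAndEphemeral(2L, 2L, "/locks/lock-0001");

        snap.nodes.putAll(server.tree);                                   // step 4: serialise data tree
        return snap;
    }

    public static void main(String[] args) {
        Snapshot snap = snapshotWithRacingCreate();
        System.out.println("dangling nodes: " + snap.danglingNodes());
    }
}
```

In this toy run the snapshot contains /locks/lock-0001 but neither session 2 nor ACL long 2, which is the "partial transaction" shape described above. Whether the real server can reach the same state is exactly the question being asked; the sketch only shows that the step ordering permits it in principle.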
