I've created a JIRA ticket here:
https://issues.apache.org/jira/browse/ZOOKEEPER-2234

Thanks,
Adam

On 15 Jul 2015 16:07, Adam Milne-Smith <[email protected]> wrote:
>
> Whilst writing a patch for ZOOKEEPER-2141 (3.4.6 branch), we spotted an 
> ephemeral node that had not been deleted despite its session having expired. 
> Its ACL reference (a long) was missing from the ACL cache, so any operation 
> against this node fails.
>
> This could leave things like Curator locks in place indefinitely (even after 
> the timeout) and deadlock applications.
>
> We inspected the code and are reasonably certain that there are no bugs in 
> updating the in-memory data tree that could cause this. However, serialising 
> the snapshot happens asynchronously and follows these four steps:
>
> - copy the sessions map
> - serialise the sessions map copy
> - serialise the ACL map (synchronised)
> - serialise the data tree (synchronised at the individual node level)
>
> We suspect that a new session and its ephemeral node were created while the 
> data tree was being serialised. The sessions map and ACL map had already been 
> written by then, so the snapshot is missing the session and ACL but contains 
> the node. In other words, the snapshot contains a partial transaction.
>
> If we were to deserialise from this snapshot, the in-memory data would be 
> invalid. If one member of the quorum were to reboot and restore from this 
> snapshot, it would contain this node, whereas the other hosts would have 
> removed it. If this host were then to become the leader and send its snapshot 
> to other members of the quorum, they would have the invalid data too.
>
> As far as we can see, the only way to delete this node when this happens in 
> production would be to perform manual surgery on the snapshot.
>
> Can anyone confirm that this is the case, or let us know if we've 
> misunderstood something?
>
> Thanks,
> Adam
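
To make the suspected interleaving in the quoted message concrete, here is a minimal sketch in Python. It is a toy model, not ZooKeeper's code: `ToyStore`, `Node`, `fuzzy_snapshot` and `dangling_nodes` are hypothetical names, the ACL strings are placeholders, and the concurrent write is simulated with a callback rather than a real thread. It only follows the four serialisation steps in the order listed above and does not model transaction log replay on restore.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    acl_id: int          # key into the ACL map (the "ACL long")
    owner_session: int   # session that owns this ephemeral node


@dataclass
class ToyStore:
    """Hypothetical in-memory state; not ZooKeeper's real classes."""
    sessions: dict = field(default_factory=dict)   # session id -> timeout (ms)
    acls: dict = field(default_factory=dict)       # acl id -> ACL description
    tree: dict = field(default_factory=dict)       # path -> Node

    def create_ephemeral(self, session_id, path, acl_id, acl):
        # One logical transaction that touches all three structures.
        self.sessions[session_id] = 30_000
        self.acls[acl_id] = acl
        self.tree[path] = Node(acl_id, session_id)


def fuzzy_snapshot(store, between_acls_and_tree=lambda: None):
    """Serialise in the order listed in the message.

    The callback stands in for a write that lands after the sessions and
    ACL maps have been written but before the data tree is.
    """
    sessions_copy = dict(store.sessions)   # step 1: copy the sessions map
    snap_sessions = dict(sessions_copy)    # step 2: serialise the copy
    snap_acls = dict(store.acls)           # step 3: serialise the ACL map
    between_acls_and_tree()                # simulated concurrent write
    snap_tree = dict(store.tree)           # step 4: serialise the data tree
    return snap_sessions, snap_acls, snap_tree


def dangling_nodes(sessions, acls, tree):
    """Paths whose owning session or ACL id is absent from the snapshot."""
    return sorted(
        path for path, node in tree.items()
        if node.owner_session not in sessions or node.acl_id not in acls
    )


store = ToyStore()
store.create_ephemeral(1, "/locks/a", 1, "world:anyone")

snapshot = fuzzy_snapshot(
    store,
    between_acls_and_tree=lambda: store.create_ephemeral(
        2, "/locks/b", 2, "digest:app"
    ),
)
print(dangling_nodes(*snapshot))  # ['/locks/b']
```

In this model the snapshot contains `/locks/b` but neither its session nor its ACL entry, which is the partial transaction described above. A check along the lines of `dangling_nodes` could in principle be run against a deserialised snapshot to spot such nodes, though how that maps onto the real snapshot format is not covered here.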
