[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758105#comment-17758105
 ] 

Damien Diederen commented on ZOOKEEPER-4689:
--------------------------------------------

Hi [~adamyi],

This is indeed a very critical issue, as the corruption can spread from member 
to member\!

I initially preferred solution 2 from the ticket description—the one which was 
tentatively implemented in https://github.com/apache/zookeeper/pull/1997—but 
given the difficulties encountered, and [~kezhuw]’s suggestion of never 
removing the ACL {{aclIndex}} is pointing to, I am also reconsidering. Are we 
missing something?

We would also like to add some kind of \(optional) "fsck" pass which sanity 
checks the tree before the service starts—to prevent this and other kinds of 
corruption from spreading—but that can be implemented in a followup ticket.


> Node may not accessible due the the inconsistent ACL reference map after SNAP 
> sync (again)
> ------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4689
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4689
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0, 3.7.0, 3.8.0
>            Reporter: Adam Yi
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> In Zookeeper, we do a "fuzzy snapshot". It means that we don't make a copy of 
> the DataTree or grab a lock or anything when serializing a DataTree. Instead, 
> we note down the zxid when we start serializing DataTree. We serialize the 
> DataTree while it's getting mutated and replay the \{transactions after 
> starting to take snapshot} after deserializing the DataTree. The idea is that 
> those transactions should be idempotent.
> Zookeeper also implements its own interned ACL. It keeps a [long -> ACL] map 
> and store the `long` in each node as nodes tend to share the same ACL.
> When serializing DataTree, we first serialize the ACL cache and then 
> serialize the nodes. It's possible that with the following sequence, a node 
> points to an invalid ACL entry:
> 1. Serialize ACL
> 2. Create node with new ACL
> 3. Serialize node
> ZOOKEEPER-3306 fixes this by making sure to insert the ACL to cache upon 
> calling `DataTree.createNode` when replaying transactions and when the node 
> already exists. However, we only insert it to the cache, but do not set the 
> interned ACL in the node to point to the new entry.
> It's possible that the longval we get for the ACL is inconsistent, even 
> though we follow the same zxid ordering of events. Specifically, we keep a 
> [aclIndex] pointing to the max entry that currently exists and increment that 
> whenever we need to intern a new ACL we've never seen before.
> With ZOOKEEPER-2214, we started to do reference counting in ACL cache and 
> remove no-longer used entries from the cache. 
> Say the current aclIndex is 10. If we create a node with ACL unseen before 
> and delete that node, aclIndex will increment to 11. However, when we 
> deserialize the tree, we'll set aclIndex to the max existent ACL cache entry, 
> so it's reverted back to 10. aclIndex inconsistency on its own is fine but it 
> causes problem to the ZOOKEEPER-3306 patch.
> Now if we follow the same scenario mentioned in ZOOKEEPER-3306:
>  # Leader creates ACL entry 11 and delete it due to node deletion
>  # Server A starts to have snap sync with leader
>  # After serializing the ACL map to Server A, there is a txn T1 to create a 
> node N1 with new ACL_1 which was not exist in ACL map
>  # On leader, after this txn, the ACL map will be 12 -> (ACL_1, COUNT: 1), 
> and data tree N1 -> 12
>  # On server A, it will be ACL map with max ID 10, and N1 -> 12 in fuzzy 
> snapshot
>  # When replaying the txn T1, it will add 11 -> (ACL_1, COUNT: 1) to the ACL 
> cache but the node N1 still points to 12.
> N1 still points to invalid ACL entry.
> There are two ways to fix this:
>  # Make aclIndex consistent upon re-deserialization (by either serializing it 
> in snapshot or paying special attention to decrement it when removing cache)
>  # Fix ZOOKEEPER-3306 patch so that we also override the ACL of node to new 
> key if previous entry does not exist in the ACL table.
>  
> I think solution 2 is nicer as aclIndex inconsistency itself is not a 
> problem. With solution 1, we're still implicitly depending on aclIndex 
> consistency and ordering of events. It's harder to reason about and it seems 
> more fragile than solution 1.
> I'm going to send a patch for solution 2 but please let me know if you 
> disagree and I'm happy to go with solution 1 instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to