[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andor Molnar reassigned ZOOKEEPER-3531:
---------------------------------------

    Assignee: Chang Lou

> Synchronization on ACLCache causes the cluster to hang when network/disk 
> issues happen during datatree serialization
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3531
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3531
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.2, 3.5.3, 3.5.4, 3.5.5
>            Reporter: Chang Lou
>            Assignee: Chang Lou
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>         Attachments: fix.patch, generator.py
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> During our ZooKeeper fault injection testing, we observed that sometimes the 
> ZK cluster could hang (requests time out, node status shows ok). After 
> inspecting the issue, we believe this is caused by I/O (serializing ACLCache) 
> inside a critical section. The bug is essentially similar to what is 
> described in ZooKeeper-2201.
> org.apache.zookeeper.server.DataTree#serialize calls aclCache.serialize 
> when serializing the datatree; however, 
> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize could get 
> stuck at OutputArchive.writeInt due to network/disk issues. This 
> can cause the system to experience hanging issues similar to ZooKeeper-2201 
> (any attempt to create/delete/modify a DataNode will cause the leader to 
> hang at the beginning of the request processor chain). The root cause is 
> lock contention between:
>  * org.apache.zookeeper.server.DataTree#serialize -> 
> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize 
>  * PrepRequestProcessor#getRecordForPath -> 
> org.apache.zookeeper.server.DataTree#getACL(org.apache.zookeeper.server.DataNode)
>  -> org.apache.zookeeper.server.ReferenceCountedACLCache#convertLong
> When the snapshot gets stuck in ACL serialization, it blocks all other 
> operations on the ReferenceCountedACLCache. Since getRecordForPath calls 
> ReferenceCountedACLCache#convertLong, any op triggering getRecordForPath will 
> cause the leader to hang at the beginning of the request processor chain:
> {code:java}
> org.apache.zookeeper.server.ReferenceCountedACLCache.convertLong(ReferenceCountedACLCache.java:87)
> org.apache.zookeeper.server.DataTree.getACL(DataTree.java:734)
>    - locked org.apache.zookeeper.server.DataNode@4a062b7d
> org.apache.zookeeper.server.ZKDatabase.aclForNode(ZKDatabase.java:371)
> org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:170)
>    - locked java.util.ArrayDeque@3f7394f7
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:417)
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:757)
> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:145)
> {code}
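> For illustration, a simplified sketch of the contention (method names mirror 
> ReferenceCountedACLCache, but the bodies are abbreviated and this is not the 
> actual ZooKeeper source): both serialize and convertLong synchronize on the 
> same cache instance, so a serialize call that stalls on I/O holds the monitor 
> and blocks every convertLong caller:
> {code:java}
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import org.apache.jute.OutputArchive;
> import org.apache.zookeeper.data.ACL;
>
> class AclCacheContentionSketch {
>     private final Map<Long, List<ACL>> longKeyMap = new HashMap<>();
>
>     // Snapshot path: DataTree#serialize -> aclCache.serialize.
>     // The cache monitor is held for the whole write, including any
>     // network/disk stall inside the OutputArchive calls.
>     synchronized void serialize(OutputArchive oa) throws IOException {
>         oa.writeInt(longKeyMap.size(), "map");  // may block on slow I/O
>         // ... writes every entry while still holding the lock ...
>     }
>
>     // Request path: PrepRequestProcessor#getRecordForPath ->
>     // DataTree#getACL -> aclCache.convertLong.
>     // Waits for the same monitor, so a stalled snapshot hangs the
>     // leader's request processor chain.
>     synchronized List<ACL> convertLong(Long longVal) {
>         return longKeyMap.get(longVal);
>     }
> }
> {code}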
> Similar to ZooKeeper-2201, the leader can still send out heartbeats, so no 
> leader election is triggered and the cluster will not recover until the 
> network/disk issue resolves.  
> Steps to reproduce this bug:
>  # start a cluster with 1 leader and n followers
>  # manually create some ACLs to enlarge the window for dumping ACLs, so the 
> snapshot is more likely to hang while serializing the ACLCache when a delay 
> happens (we wrote a script to generate such workloads; see attachments and 
> the sketch after this list)
>  # inject long network/disk write delays and run some benchmarks to trigger 
> snapshots
>  # once stuck, you should observe that new requests to the cluster fail.
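> For step 2, the attached generator.py produces such a workload; a roughly 
> equivalent Java sketch (connect string, paths, and counts are illustrative) 
> that creates znodes with distinct ACLs so the ACL cache keeps growing:
> {code:java}
> import java.util.Collections;
> import java.util.concurrent.CountDownLatch;
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.ACL;
> import org.apache.zookeeper.data.Id;
>
> public class AclWorkload {
>     public static void main(String[] args) throws Exception {
>         CountDownLatch connected = new CountDownLatch(1);
>         ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {
>             if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>                 connected.countDown();
>             }
>         });
>         connected.await();
>         for (int i = 0; i < 5000; i++) {
>             // Every znode gets a distinct ACL, so the ReferenceCountedACLCache
>             // keeps growing and aclCache.serialize takes longer during snapshots.
>             ACL acl = new ACL(ZooDefs.Perms.ALL,
>                               new Id("ip", "10.0." + (i / 255) + "." + (i % 255)));
>             zk.create("/acl-load-" + i, new byte[0],
>                       Collections.singletonList(acl), CreateMode.PERSISTENT);
>         }
>         zk.close();
>     }
> }
> {code}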
> Essentially, the core problem is that the OutputArchive write should not 
> happen inside the synchronized block. So a straightforward solution is to 
> move the writes out of the synchronized block: copy the cache contents 
> inside the synchronized block and perform the vulnerable network/disk writes 
> afterwards (sketched below). The patch for this solution is attached and 
> verified. A more systematic fix would be to replace the synchronized methods 
> in ReferenceCountedACLCache with a ConcurrentHashMap-based implementation. 
> We double-checked that the issue remains on the latest master branch 
> (68c21988d55c57e483370d3ee223c22da2d1bbcf). 
> Attachments: 1) a patch with the fix and a regression test, 2) a script to 
> generate workloads that fill the ACL cache.
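> For reference, a minimal sketch of the copy-then-write pattern (the attached 
> fix.patch is the verified change; the class and field names here are 
> illustrative):
> {code:java}
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import org.apache.jute.OutputArchive;
> import org.apache.zookeeper.data.ACL;
>
> class AclCacheCopyOnSerializeSketch {
>     private final Map<Long, List<ACL>> longKeyMap = new HashMap<>();
>
>     void serialize(OutputArchive oa) throws IOException {
>         Map<Long, List<ACL>> copy;
>         // Hold the lock only for the in-memory copy.
>         synchronized (this) {
>             copy = new HashMap<>(longKeyMap);
>         }
>         // Potentially slow network/disk writes happen outside the lock, so a
>         // stalled snapshot no longer blocks convertLong and the request path.
>         oa.writeInt(copy.size(), "map");
>         for (Map.Entry<Long, List<ACL>> entry : copy.entrySet()) {
>             oa.writeLong(entry.getKey(), "long");
>             oa.startVector(entry.getValue(), "acls");
>             for (ACL acl : entry.getValue()) {
>                 acl.serialize(oa, "acl");
>             }
>             oa.endVector(entry.getValue(), "acls");
>         }
>     }
> }
> {code}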



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
