chang lou created ZOOKEEPER-3531:
------------------------------------

             Summary: Synchronization on ACLCache causes cluster to hang when 
network/disk issues happen during datatree serialization
                 Key: ZOOKEEPER-3531
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3531
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.5, 3.5.4, 3.5.3, 3.5.2
            Reporter: chang lou
         Attachments: fix.patch, generator.py

During our ZooKeeper fault injection testing, we observed that the ZK cluster 
could sometimes hang (requests time out, node status shows ok). After inspecting 
the issue, we believe this is caused by doing I/O (serializing the ACLCache) 
inside a critical section. The bug is essentially similar to what is described in 
ZOOKEEPER-2201.

 

org.apache.zookeeper.server.DataTree#serialize calls aclCache.serialize when doing 
datatree serialization; however, 
org.apache.zookeeper.server.ReferenceCountedACLCache#serialize can get stuck at 
OutputArchive.writeInt due to network/disk issues. This causes the system to 
experience hanging issues similar to ZOOKEEPER-2201 (any attempt to 
create/delete/modify a DataNode will cause the leader to hang at the beginning of 
the request processor chain). The root cause is the lock contention between the 
following two paths (see the sketch after the list):

 
 * org.apache.zookeeper.server.DataTree#serialize -> 
org.apache.zookeeper.server.ReferenceCountedACLCache#serialize 
 * PrepRequestProcessor#getRecordForPath -> 
org.apache.zookeeper.server.DataTree#getACL(org.apache.zookeeper.server.DataNode)
 -> org.apache.zookeeper.server.ReferenceCountedACLCache#convertLong
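
For illustration, here is a minimal sketch of why these two paths block each other 
(simplified; not the actual ZooKeeper source): both serialize and convertLong are 
synchronized on the same ReferenceCountedACLCache instance, so a write that stalls 
inside serialize holds the monitor and blocks convertLong.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.jute.OutputArchive;
import org.apache.zookeeper.data.ACL;

// Simplified sketch of the contention, not the real ReferenceCountedACLCache.
public class AclCacheSketch {
    private final Map<Long, List<ACL>> longKeyMap = new HashMap<>();

    // Called from DataTree#serialize while taking a snapshot.
    public synchronized void serialize(OutputArchive oa) throws IOException {
        // May block for a long time on a slow disk or network...
        oa.writeInt(longKeyMap.size(), "map");
        // ... and keeps holding the monitor while writing each entry.
    }

    // Called from PrepRequestProcessor#getRecordForPath via DataTree#getACL.
    public synchronized List<ACL> convertLong(Long longVal) {
        // Blocked until serialize releases the monitor.
        return longKeyMap.get(longVal);
    }
}
{code}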

When the snapshot gets stuck in ACL serialization, it prevents all other 
operations on the ReferenceCountedACLCache. Since getRecordForPath calls 
ReferenceCountedACLCache#convertLong, any op triggering getRecordForPath will 
cause the leader to hang at the beginning of the request processor chain:

 

 
{code:java}
org.apache.zookeeper.server.ReferenceCountedACLCache.convertLong(ReferenceCountedACLCache.java:87)
org.apache.zookeeper.server.DataTree.getACL(DataTree.java:734)
   - locked org.apache.zookeeper.server.DataNode@4a062b7d
org.apache.zookeeper.server.ZKDatabase.aclForNode(ZKDatabase.java:371)
org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:170)
   - locked java.util.ArrayDeque@3f7394f7
org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:417)
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:757)
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:145)
{code}
 

Similar to ZOOKEEPER-2201, the leader can still send out heartbeats, so the 
cluster will not recover until the network/disk issue resolves.

 

Steps to reproduce this bug:
 # start a cluster with 1 leader and n followers
 # manually create some ACLs to enlarge the window for dumping ACLs, so the 
snapshot is more likely to hang at serializing the ACLCache when a delay happens 
(we wrote a script to generate such workloads, see attachments; a minimal sketch 
is shown after this list)
 # inject long network/disk write delays and run some benchmarks to trigger 
snapshots
 # once stuck, you should observe that new requests to the cluster fail.
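
For reference, a minimal sketch of such a workload using the standard ZooKeeper 
Java client (the attached generator.py is the script we actually used; the paths, 
counts, and digest strings below are arbitrary placeholders):

{code:java}
import java.util.Collections;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class AclWorkload {
    public static void main(String[] args) throws Exception {
        // Connect string, node count, and ACL ids are placeholders for a test cluster.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
        for (int i = 0; i < 10000; i++) {
            // Each znode gets a distinct digest ACL so the ReferenceCountedACLCache
            // grows and its serialization during a snapshot takes longer.
            ACL acl = new ACL(ZooDefs.Perms.ALL, new Id("digest", "user" + i + ":hash" + i));
            zk.create("/acl-test-" + i, new byte[0],
                      Collections.singletonList(acl), CreateMode.PERSISTENT);
        }
        zk.close();
    }
}
{code}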

 

Essentially, the core problem is that the OutputArchive write should not happen 
inside this synchronized block. A straightforward solution is to move the writes 
out of the synchronized block: make a copy of the cache contents inside the 
synchronized block and perform the vulnerable network/disk writes afterwards (a 
sketch of this idea follows). The patch for this solution is attached and 
verified. A more systematic fix would be to replace all synchronized methods in 
ReferenceCountedACLCache with a ConcurrentHashMap-based implementation.
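
A minimal sketch of the first approach (the attached fix.patch is the actual 
change; this is simplified and assumes the cache state is a longKeyMap from ACL id 
to ACL list):

{code:java}
// Sketch only: copy the cache contents while holding the lock, then perform the
// slow OutputArchive writes without holding the ReferenceCountedACLCache monitor.
public void serialize(OutputArchive oa) throws IOException {
    Map<Long, List<ACL>> clonedLongKeyMap;
    synchronized (this) {
        clonedLongKeyMap = new HashMap<>(longKeyMap);
    }
    oa.writeInt(clonedLongKeyMap.size(), "map");
    for (Map.Entry<Long, List<ACL>> entry : clonedLongKeyMap.entrySet()) {
        oa.writeLong(entry.getKey(), "long");
        List<ACL> aclList = entry.getValue();
        oa.startVector(aclList, "acls");
        for (ACL acl : aclList) {
            acl.serialize(oa, "acl");
        }
        oa.endVector(aclList, "acls");
    }
}
{code}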

 

We double-checked that the issue remains in the latest master branch 
(68c21988d55c57e483370d3ee223c22da2d1bbcf).

 

Attachments: 1) a patch with the fix and a regression test; 2) a script to 
generate workloads that fill the ACL cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
