Rohith commented on YARN-2946:

I updated the patch with the following fixes:
# All token storage is now handled synchronously via the state machine.
# Removed unnecessary synchronization from the method. This ensures the first point.
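
As a rough illustration of the two points above, here is a minimal sketch in which the token methods take no lock themselves and only hand an event to a single handler. It is not the code from the patch: the class TokenStoreSketch, its event types, and its method names are all hypothetical, and the real RMStateStore state machine is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the RMStateStore code from the patch.
class TokenStoreSketch {
    enum EventType { STORE_TOKEN, REMOVE_TOKEN }

    // A minimal event: which operation, on which token.
    static final class Event {
        final EventType type;
        final String tokenId;
        final long renewDate;

        Event(EventType type, String tokenId, long renewDate) {
            this.type = type;
            this.tokenId = tokenId;
            this.renewDate = renewDate;
        }
    }

    private final Map<String, Long> tokens = new HashMap<>();

    // No "synchronized" here: these methods only build an event and pass it on.
    void storeToken(String tokenId, long renewDate) {
        handle(new Event(EventType.STORE_TOKEN, tokenId, renewDate));
    }

    void removeToken(String tokenId) {
        handle(new Event(EventType.REMOVE_TOKEN, tokenId, 0L));
    }

    // The single entry point that mutates state, and the only place
    // where a lock is taken for writes.
    private synchronized void handle(Event event) {
        switch (event.type) {
            case STORE_TOKEN:
                tokens.put(event.tokenId, event.renewDate);
                break;
            case REMOVE_TOKEN:
                tokens.remove(event.tokenId);
                break;
        }
    }

    synchronized int size() {
        return tokens.size();
    }

    synchronized boolean contains(String tokenId) {
        return tokens.containsKey(tokenId);
    }
}
```

With one entry point for writes, callers never hold a lock of their own while the handler runs, which is the property the sketch is meant to show.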

To test, I deployed the patch in a cluster with JCarder integrated and ran the 
same scenario as in my earlier comment to check for deadlock cycles. JCarder 
did not identify any deadlock cycles.

Kindly review the patch.

> DeadLocks in RMStateStore<->ZKRMStateStore
> ------------------------------------------
>                 Key: YARN-2946
>                 URL: https://issues.apache.org/jira/browse/YARN-2946
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 
> 0002-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, 
> RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java
> Found one deadlock in ZKRMStateStore.
> # Initially, zkClient is null because of a ZooKeeper Disconnected event.
> # While ZKRMStateStore#runWithCheck() is in wait(zkSessionTimeout), waiting 
> for zkClient to re-establish the ZooKeeper connection via either a 
> SyncConnected or an Expired event, it is highly possible that another thread 
> obtains the lock on {{ZKRMStateStore.this}} from state machine transition 
> events. This causes a deadlock in ZKRMStateStore.
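
The Java behaviour behind the second step can be sketched in isolation: a thread that calls wait(timeout) inside a synchronized method releases the object's monitor, so another thread can enter a synchronized method on the same object in the meantime. The class below is a hypothetical stand-in, not ZKRMStateStore; the connected flag only plays the role of "zkClient is non-null", and the sketch shows the lock hand-over rather than the full deadlock cycle.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the store; not the actual ZKRMStateStore code.
class WaitReleasesMonitorDemo {
    private boolean connected = false;           // plays the role of "zkClient != null"
    private boolean enteredWhileWaiting = false; // set if B got the lock mid-wait

    // Thread A: waits on this object's monitor for a reconnect, up to timeoutMs.
    synchronized boolean awaitConnection(long timeoutMs, CountDownLatch holdingLock)
            throws InterruptedException {
        holdingLock.countDown(); // A now holds the monitor
        long deadline = System.currentTimeMillis() + timeoutMs;
        long remaining = timeoutMs;
        while (!connected && remaining > 0) {
            wait(remaining); // releases the monitor until notified or timed out
            remaining = deadline - System.currentTimeMillis();
        }
        return connected;
    }

    // Thread B: any other synchronized method on the same object.
    synchronized void transition() {
        enteredWhileWaiting = !connected; // true if A was still waiting
        connected = true;
        notifyAll();
    }

    // Returns true if B acquired the monitor while A was inside wait().
    static boolean run() {
        final WaitReleasesMonitorDemo store = new WaitReleasesMonitorDemo();
        final CountDownLatch holdingLock = new CountDownLatch(1);
        final boolean[] reconnected = new boolean[1];
        Thread a = new Thread(() -> {
            try {
                reconnected[0] = store.awaitConnection(10_000L, holdingLock);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        a.start();
        try {
            holdingLock.await(); // A is inside the synchronized method
            store.transition();  // blocks until A releases the monitor in wait()
            a.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return reconnected[0] && store.enteredWhileWaiting;
    }

    public static void main(String[] args) {
        System.out.println("other thread entered during wait: " + run());
    }
}
```

If the thread that takes the monitor during the wait also needs a second lock that the waiting thread's caller already holds, the two threads can block each other, which is the kind of cycle a tool such as JCarder reports.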

This message was sent by Atlassian JIRA
