Bence Kosztolnik created YARN-11656:
---------------------------------------

             Summary: RMStateStore event queue blocked
                 Key: YARN-11656
                 URL: https://issues.apache.org/jira/browse/YARN-11656
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
    Affects Versions: 3.4.1
            Reporter: Bence Kosztolnik
         Attachments: issue.png

I observed a YARN cluster that had both pending and available resources, yet 
cluster utilization usually stayed around 50%. The cluster was loaded with 200 
parallel PI example jobs (from hadoop-mapreduce-examples), configured with 20 
map and 20 reduce containers, on a 50-node cluster where each node had 8 
cores and plenty of memory (so CPU was the limiting resource).
Eventually I realized the RM had an IO bottleneck and needed 1 to 20 seconds 
to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events in parallel threads
- add metrics for the RMStateStore event queue so the problem is easy to 
identify if it occurs on a cluster
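
The two proposals above could be sketched roughly as follows. This is only an illustration and not the RMStateStore API: the class ParallelStoreDispatcher, its methods, and the per-key routing are hypothetical names and choices made for this example. It assumes that events belonging to the same key (for example, an application id) must still be persisted in order, so each key is hashed to one single-threaded worker, while a counter exposes how many events are queued or in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Hadoop YARN. */
final class ParallelStoreDispatcher<E> {
  // One single-threaded worker per slot keeps per-key ordering.
  private final List<ExecutorService> workers = new ArrayList<>();
  // Events that are queued or currently being persisted.
  private final AtomicInteger pending = new AtomicInteger();
  // Stand-in for the (slow) state store write.
  private final Consumer<E> persister;

  ParallelStoreDispatcher(int threads, Consumer<E> persister) {
    if (threads < 1) {
      throw new IllegalArgumentException("threads must be >= 1");
    }
    this.persister = persister;
    for (int i = 0; i < threads; i++) {
      workers.add(Executors.newSingleThreadExecutor());
    }
  }

  /** Events with the same key run in submission order on one worker. */
  void dispatch(String key, E event) {
    pending.incrementAndGet();
    int slot = Math.floorMod(key.hashCode(), workers.size());
    workers.get(slot).execute(() -> {
      try {
        persister.accept(event);
      } finally {
        pending.decrementAndGet();
      }
    });
  }

  /** Value a metrics system could export as a queue-size gauge. */
  int pendingEvents() {
    return pending.get();
  }

  /** Stops accepting events and waits for queued ones to be persisted. */
  boolean shutdownAndWait(long timeoutMillis) {
    workers.forEach(ExecutorService::shutdown);
    long deadline =
        System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    try {
      for (ExecutorService w : workers) {
        long remaining = Math.max(deadline - System.nanoTime(), 0);
        if (!w.awaitTermination(remaining, TimeUnit.NANOSECONDS)) {
          return false;
        }
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With a layout like this, one slow write delays only the events that hash to the same worker, and a steadily growing pendingEvents() value would point at the store as the bottleneck.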


{panel:title=Issue visible on UI2}
(screenshot, see attachment issue.png)
{panel}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
