[jira] [Comment Edited] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

chaosju (Jira) Wed, 07 Apr 2021 06:31:40 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316336#comment-17316336
 ]


chaosju edited comment on YARN-10450 at 4/7/21, 1:30 PM:
---------------------------------------------------------

Why adaptive heartbeat?
 * {color:#ff0000}Regular heartbeats can overload the RM.{color}
 * {color:#ff0000}If the RM is overloaded, things get worse over time as events 
queue up.{color}
 * Efficiency drops, because important events at the NM/AM must wait for the 
next heartbeat before the RM learns of their status.
 * Not every heartbeat from a node or AM is important. If a node is running 
full, its heartbeats are not useful for application scheduling.
 * The RM should be able to control the heartbeats sent to it.

How does adaptive heartbeat work?

1. Throttle heartbeat:
 * {color:#ff0000}The HB interval is based on scheduler load (LIGHT, NORMAL, 
BUSY, HEAVY).{color}
 * Statistics on the various scheduler events (processing time vs. wait time 
in queue) are collected.
 * The RM tells the NM and AM the next HB interval in order to throttle 
heartbeats.
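
A minimal sketch of the throttling idea, to make the mapping from load to interval concrete. The four load levels come from the list above; the wait-to-processing ratio thresholds, the interval values, and the function names are all assumptions for illustration, not values taken from YARN or from the referenced slides.

```python
from enum import Enum


class SchedulerLoad(Enum):
    LIGHT = 1
    NORMAL = 2
    BUSY = 3
    HEAVY = 4


# Hypothetical next-heartbeat intervals (milliseconds) per load level.
NEXT_HEARTBEAT_MS = {
    SchedulerLoad.LIGHT: 500,
    SchedulerLoad.NORMAL: 1000,
    SchedulerLoad.BUSY: 3000,
    SchedulerLoad.HEAVY: 10000,
}


def classify_load(avg_wait_ms: float, avg_processing_ms: float) -> SchedulerLoad:
    """Classify scheduler load from queue wait time vs. processing time.

    The ratio thresholds (1, 5, 20) are illustrative assumptions.
    """
    if avg_processing_ms <= 0:
        return SchedulerLoad.LIGHT
    ratio = avg_wait_ms / avg_processing_ms
    if ratio < 1:
        return SchedulerLoad.LIGHT
    if ratio < 5:
        return SchedulerLoad.NORMAL
    if ratio < 20:
        return SchedulerLoad.BUSY
    return SchedulerLoad.HEAVY


def next_heartbeat_interval_ms(avg_wait_ms: float, avg_processing_ms: float) -> int:
    """Interval the RM would hand back to the NM/AM in its heartbeat response."""
    return NEXT_HEARTBEAT_MS[classify_load(avg_wait_ms, avg_processing_ms)]
```

In this sketch the RM would compute the interval from its event statistics on each heartbeat response, so a backlog in the event queue lengthens the interval for all senders.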

2. Event-based heartbeat:
 * Send an out-of-band heartbeat for urgent requests, such as new resource 
requests or container completion, before the heartbeat interval indicated by 
the RM elapses.
 * The RM can notify the AM when containers have been allocated, so that the 
AM does not have to wait for the scheduled heartbeat to get resources.
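
The out-of-band heartbeat described above could look roughly like this on the sender side: the heartbeat thread sleeps until either the RM-indicated interval elapses or an urgent event arrives, whichever comes first. The class and method names are hypothetical and this is not how the NM or AM is actually implemented.

```python
import threading


class HeartbeatTrigger:
    """Wakes the heartbeat sender on timeout or on an urgent event."""

    def __init__(self) -> None:
        self._urgent = threading.Event()

    def notify_urgent(self) -> None:
        # Called when something should not wait for the next scheduled
        # heartbeat, e.g. a new resource request or a container completion.
        self._urgent.set()

    def wait_for_next(self, interval_s: float) -> str:
        # Event.wait returns True if the flag was set, False on timeout.
        triggered = self._urgent.wait(timeout=interval_s)
        # Simplification: an event arriving between wait() and clear()
        # is coalesced into the heartbeat that is about to be sent.
        self._urgent.clear()
        return "out_of_band" if triggered else "scheduled"
```

A sender loop would call `wait_for_next(interval)` with the interval from the last RM response and then send one heartbeat, regardless of which path woke it.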

 
Reference: https://www.slideshare.net/vsaxenavarun/venturing-into-large-hadoop-clusters

[~Jim_Brennan] 



> Add cpu and memory utilization per node and cluster-wide metrics
> ----------------------------------------------------------------
>
>                 Key: YARN-10450
>                 URL: https://issues.apache.org/jira/browse/YARN-10450
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.3.1
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Minor
>             Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
>         Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.
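
As a rough illustration of the per-node and cluster-wide figures described in the issue, the sketch below computes utilization percentages from per-node values. The field names are hypothetical and do not correspond to the actual YARN metric names, and weighting the cluster figure by node capacity is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class NodeUtilization:
    # Hypothetical fields standing in for what an NM reports to the RM.
    used_mem_mb: int
    total_mem_mb: int
    used_vcores: float
    total_vcores: int


def percent(used: float, total: float) -> float:
    """Utilization as a percentage; 0.0 when there is no capacity."""
    return 100.0 * used / total if total > 0 else 0.0


def cluster_utilization(nodes: Iterable[NodeUtilization]) -> Tuple[float, float]:
    """Return (memory %, cpu %) aggregated over all nodes.

    Sums usage and capacity first, so larger nodes weigh more than
    they would in a plain average of per-node percentages.
    """
    nodes = list(nodes)
    mem = percent(sum(n.used_mem_mb for n in nodes),
                  sum(n.total_mem_mb for n in nodes))
    cpu = percent(sum(n.used_vcores for n in nodes),
                  sum(n.total_vcores for n in nodes))
    return mem, cpu
```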


