[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072696#comment-14072696
 ] 

Wangda Tan commented on YARN-415:
---------------------------------

Hi Eric,
Thanks for updating your patch, I think now don't have major comments, 

*Following are some minor comments:*
1) RMAppAttemptImpl.java
1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you please 
revert them? Like
{code}
           RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition())
-          
+
{code}

1.2 getResourceUtilization:
{code}
+    if (rmApps != null) {
+      RMApp app = rmApps.get(attemptId.getApplicationId());
+      if (app != null) {
{code}
I think the two cannot happen, we don't need check null to avoid potential bug 
here

{code}
+          ApplicationResourceUsageReport appResUsageRpt =
{code}
It's better to name it appResUsageReport since rpt is not a common abbr of 
report.

2) RMContainerImpl.java
2.1 updateAttemptMetrics:
{code}
      if (rmApps != null) {
        RMApp rmApp = 
            rmApps.get(container.getApplicationAttemptId().getApplicationId());
        if (rmApp != null) {
{code}
Again, I think the two null check is unnecessary

3) SchedulerApplicationAttempt.java
3.1 Some rename suggestions: (Please let me know if you have better idea)
CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS
lastTime -> lastMemoryUtilizationUpdateTime
cachedMemorySeconds -> lastMemorySeconds
same for cachedVCore ...

4) AppBlock.java
Should we rename "Resource Seconds:" to "Resource Utilization" or something?

5) Test
5.1 I'm wondering if we need add a end to end test, since we changed 
RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt.
It can consist submit an application, launch several containers, and finish 
application. And it's better to make the launched application contains several 
application attempt.
While the application running, there're muliple containers running, and 
multiple containers finished. We can check if total resource utilization are 
expected.

*To your comments:*
1) 
bq. One thing I did notice when these values are cached is that there is a race 
where containers can get counted twice:
I think this can not be avoid, it should be a transient state and Jian He and I 
discussed about this long time before.
But apparently, 3 sec cache make it not only a transient state. I suggest you 
can make "lastTime" in SchedulerApplicationAttempt protected. And in 
FiCaSchedulerApp/FSSchedulerApp, when remove container from liveContainer (in 
completedContainer method). You can set lastTime to a negtive value like -1, 
and next time when trying to get accumulated resource utilization, it will 
recompute all container utilization.

2)
bq. I am a little reluctant to modify the type of 
SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems 
like something that could be done separately.
I think that will be fine :), because current getRunningResourceUtilization is 
called by getResourceUsageReport. And getResourceUsageReport is synchronized, 
no matter we changed liveContainers to concurrent map or not, we cannot solve 
the locking problem. 
I agree to enhance it in a separated JIRA in the future.

Thanks,
Wangda


> Capture memory utilization at the app-level for chargeback
> ----------------------------------------------------------
>
>                 Key: YARN-415
>                 URL: https://issues.apache.org/jira/browse/YARN-415
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>    Affects Versions: 0.23.6
>            Reporter: Kendall Thrapp
>            Assignee: Andrey Klochkov
>         Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to