[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072696#comment-14072696 ]
Wangda Tan commented on YARN-415: --------------------------------- Hi Eric, Thanks for updating your patch, I think now don't have major comments, *Following are some minor comments:* 1) RMAppAttemptImpl.java 1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you please revert them? Like {code} RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition()) - + {code} 1.2 getResourceUtilization: {code} + if (rmApps != null) { + RMApp app = rmApps.get(attemptId.getApplicationId()); + if (app != null) { {code} I think the two cannot happen, we don't need check null to avoid potential bug here {code} + ApplicationResourceUsageReport appResUsageRpt = {code} It's better to name it appResUsageReport since rpt is not a common abbr of report. 2) RMContainerImpl.java 2.1 updateAttemptMetrics: {code} if (rmApps != null) { RMApp rmApp = rmApps.get(container.getApplicationAttemptId().getApplicationId()); if (rmApp != null) { {code} Again, I think the two null check is unnecessary 3) SchedulerApplicationAttempt.java 3.1 Some rename suggestions: (Please let me know if you have better idea) CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS lastTime -> lastMemoryUtilizationUpdateTime cachedMemorySeconds -> lastMemorySeconds same for cachedVCore ... 4) AppBlock.java Should we rename "Resource Seconds:" to "Resource Utilization" or something? 5) Test 5.1 I'm wondering if we need add a end to end test, since we changed RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt. It can consist submit an application, launch several containers, and finish application. And it's better to make the launched application contains several application attempt. While the application running, there're muliple containers running, and multiple containers finished. We can check if total resource utilization are expected. *To your comments:* 1) bq. One thing I did notice when these values are cached is that there is a race where containers can get counted twice: I think this can not be avoid, it should be a transient state and Jian He and I discussed about this long time before. But apparently, 3 sec cache make it not only a transient state. I suggest you can make "lastTime" in SchedulerApplicationAttempt protected. And in FiCaSchedulerApp/FSSchedulerApp, when remove container from liveContainer (in completedContainer method). You can set lastTime to a negtive value like -1, and next time when trying to get accumulated resource utilization, it will recompute all container utilization. 2) bq. I am a little reluctant to modify the type of SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems like something that could be done separately. I think that will be fine :), because current getRunningResourceUtilization is called by getResourceUsageReport. And getResourceUsageReport is synchronized, no matter we changed liveContainers to concurrent map or not, we cannot solve the locking problem. I agree to enhance it in a separated JIRA in the future. Thanks, Wangda > Capture memory utilization at the app-level for chargeback > ---------------------------------------------------------- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager > Affects Versions: 0.23.6 > Reporter: Kendall Thrapp > Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)