[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731080#comment-14731080 ]

Andrey Klochkov commented on YARN-261:
--------------------------------------

[~rohithsharma], please feel free to reassign it to yourself. I tried to rebase, but the patch is old and rebasing is not straightforward.

> Ability to kill AM attempts
> ---------------------------
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: api
> Affects Versions: 2.0.3-alpha
> Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, YARN-261--n7.patch, YARN-261.patch
>
> It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs, the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov reassigned YARN-261:
------------------------------------

Assignee: (was: Andrey Klochkov)
[jira] [Assigned] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov reassigned YARN-445:
------------------------------------

Assignee: (was: Andrey Klochkov)

After I submitted the patch for this, alternative proposals and patches were discussed, so I'm unassigning myself from this JIRA.

> Ability to signal containers
> ----------------------------
>
> Key: YARN-445
> URL: https://issues.apache.org/jira/browse/YARN-445
> Project: Hadoop YARN
> Issue Type: Task
> Components: nodemanager
> Reporter: Jason Lowe
> Labels: BB2015-05-TBR
> Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png
>
> It would be nice if an ApplicationMaster could send signals to containers, such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21, the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However, that would not address other potential features, like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes.
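The jstack-on-demand use case above boils down to delivering SIGQUIT to the container's JVM, which makes HotSpot print a thread dump to stderr without terminating the process. A minimal sketch of that operation on the NodeManager host follows; this is illustrative only, not YARN code, and `signalProcess` is a hypothetical helper:

```java
// Illustrative sketch, not part of any YARN API: deliver a signal to a
// container's JVM by pid using kill(1). Sending "QUIT" makes a HotSpot JVM
// dump its threads to stderr without exiting; signal "0" merely checks that
// the process exists (useful for testing the plumbing safely).
public class SignalSketch {
    static int signalProcess(long pid, String signalName) throws Exception {
        Process kill = new ProcessBuilder("kill", "-" + signalName, Long.toString(pid))
                .inheritIO()
                .start();
        return kill.waitFor(); // 0 on success, non-zero if delivery failed
    }
}
```

An AM-facing version of this would go through an RM or NM RPC rather than shelling out, which is exactly the interface this issue asks for.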
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130368#comment-14130368 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

[~eepayne], congratulations, and thanks for this tremendous amount of persistence! :-)

> Capture aggregate memory allocation at the app-level for chargeback
> -------------------------------------------------------------------
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: Kendall Thrapp
> Assignee: Andrey Klochkov
> Fix For: 2.6.0
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it.
>
> (reserved ram for container 1 * lifetime of container 1) +
> (reserved ram for container 2 * lifetime of container 2) + ... +
> (reserved ram for container n * lifetime of container n)
>
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web Services REST API.
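The chargeback formula in the description can be sketched in a few lines. This is a hypothetical illustration of the arithmetic only (class and method names are made up, not the YARN implementation): aggregate MB-seconds is the sum, over all containers of the app, of reserved memory times lifetime.

```java
// Hypothetical sketch of the aggregate MB-seconds formula:
//   sum over containers of (reserved RAM in MB) * (lifetime in seconds).
// The memory charged is what was *reserved*, not what was actually used,
// since reserved memory was unavailable to everyone else.
public class MemorySeconds {
    static long aggregateMbSeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            total += reservedMb[i] * lifetimeSec[i]; // one term per container
        }
        return total;
    }
}
```

For example, two containers reserving 1024 MB for 60 s and 2048 MB for 30 s both contribute 61440 MB-seconds, for an app total of 122880 MB-seconds.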
[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Assignee: Eric Payne (was: Andrey Klochkov)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017225#comment-14017225 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

[~eepayne], thanks a lot for finishing this. Sorry I never had enough time for it. My biggest question, and it's where I got stuck when I tried to rebase the patch onto the latest trunk, is about the possibility of previous attempts being evicted from the Scheduler by subsequent attempts of the same app. Currently the Scheduler stores just the latest attempt (in {{SchedulerApplication.currentAttempt}}), and when the next attempt starts, the whole {{SchedulerApplication}} instance is replaced with a new one. I'm not sure whether this can in fact happen before the usage report for the previous attempt is retrieved by the RM, but from the code it seems possible. If so, we need to somehow keep previous attempts of an app until the reports for them are retrieved. I realize that with the current implementation this cannot be done just by storing multiple attempts instead of only the current one in {{SchedulerApplication}}, because the Scheduler creates a new {{SchedulerApplication}} instance for every attempt. What do you think?
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878308#comment-13878308 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

I'm updating the patch to make it applicable to the current trunk. Going slowly, but I hope to finish this week.
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-261:
---------------------------------

Attachment: YARN-261--n7.patch

Uploading a patch rebased after YARN-891 and with fixes according to Jason's comments.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Attachment: YARN-415--n10.patch

Updating the patch:
- got rid of the runningContainers map
- refactored SchedulerApplication by pulling some methods up from its descendants (allocate, containerCompleted, unreserve), making most of the fields private, and modifying containerCompleted to do resource usage tracking
- made the scheduler not evict an app immediately after the attempt finishes, but instead wait for a signal from RMAppAttemptImpl
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Attachment: YARN-415--n9.patch

Updated the patch, moving the tracking logic into the Scheduler:
- AppSchedulingInfo tracks resource usage. Existing methods are reused, and overall it seems more like the right place for this logic.
- When an app finishes and the Scheduler evicts it from its cache, it sends a new type of event (RMAppAttemptAppFinishedEvent) to the attempt, attaching the usage stats to the event.
- The RMAppAttemptImpl test is modified accordingly.
- A new test is added to verify resource tracking in AppSchedulingInfo.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805804#comment-13805804 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

This scheme has a downside: the stats would be incorrect between two events, 1) the Scheduler evicting the app from the cache and sending an event, and 2) RMAppAttemptImpl receiving the event and updating its internal stats. The only idea I have is to add a roundtrip, extending the scheme to:
1. When the app finishes, the Scheduler sends an RMAppAttemptAppFinishedEvent instance but does not yet evict the app from the cache.
2. RMAppAttemptImpl receives the event, updates its internal fields finalMemorySeconds and finalVcoreSeconds, and sends a new type of event back to the Scheduler, allowing it to evict the app.
3. The Scheduler gets the event and evicts the app.
Thoughts?
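The three-step handshake described above can be sketched as plain method calls. This is a synchronous toy model, not YARN code: in the real RM these steps would be asynchronous dispatcher events, and all names here (EvictionHandshake, appFinished) are illustrative. The point it demonstrates is the ordering guarantee: the scheduler's cache entry outlives the stats handoff, so no report can observe a gap between eviction and the attempt recording its final numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed handshake: the scheduler keeps the finished app
// cached until the attempt has recorded the final usage stats and acknowledged.
public class EvictionHandshake {
    final Map<String, Long> cache = new HashMap<>();      // appId -> MB-seconds (scheduler side)
    final Map<String, Long> finalStats = new HashMap<>(); // appId -> MB-seconds (attempt side)

    // Step 1: app finishes; send stats to the attempt but keep the cache entry.
    void appFinished(String appId) {
        attemptReceivesStats(appId, cache.get(appId));
    }

    // Step 2: the attempt records the final stats, then acknowledges.
    void attemptReceivesStats(String appId, long stats) {
        finalStats.put(appId, stats);
        ackEviction(appId);
    }

    // Step 3: only now may the scheduler evict the app.
    void ackEviction(String appId) {
        cache.remove(appId);
    }
}
```

At every point in this flow, the app's usage is readable from either `cache` or `finalStats`, which is the invariant the extra roundtrip buys.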
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802033#comment-13802033 ]

Andrey Klochkov commented on YARN-1183:
---------------------------------------

I don't think the concurrency level can make any difference here. The change was requested by Karthik.

> MiniYARNCluster shutdown takes several minutes intermittently
> -------------------------------------------------------------
>
> Key: YARN-1183
> URL: https://issues.apache.org/jira/browse/YARN-1183
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, sometimes M/R tests leave MRAppMaster Java processes living for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it affects test performance as the app masters eat resources.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-1183:
----------------------------------

Attachment: YARN-1183--n5.patch

Attaching an updated patch.
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802258#comment-13802258 ]

Andrey Klochkov commented on YARN-1183:
---------------------------------------

Jonathan, the issue occurred when I simply ran the tests for hadoop-mapreduce-client-jobclient and watched for zombie Java processes. It was much more visible when using parallel execution; see MAPREDUCE-4980. I observed it quite often under OS X (some of the tests did it on every run) and didn't see it on a Linux machine I had, and I had different JVMs there. I reproduced it later on an unmodified trunk and tracked it down to the MiniYARNCluster shutdown. I can't reproduce it on another MacBook I have now, but I think that's just due to the nature of the bug (a concurrency issue).
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-261:
---------------------------------

Attachment: YARN-261--n6.patch

Jason, makes sense. See the updated patch.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801286#comment-13801286 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

IMO it makes sense to move this tracking into the scheduler, and in particular SchedulerApplication looks like a good place for this logic. I'm wondering why SchedulerApplication has everything abstract, while its descendants share a lot of the same fields and code. Why isn't the common code placed into SchedulerApplication itself? If I'm not missing anything here, I'd move all that code, along with this resource usage tracking, into SchedulerApplication. Please comment.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801308#comment-13801308 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

Adding YARN-1335 as a dependency.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801332#comment-13801332 ] Andrey Klochkov commented on YARN-415: -- On second thought, the scheduler does not seem like a good place for these stats. It doesn't keep info on finished apps, so if the logic is placed in the scheduler the tracking data will be gone as soon as an app is done. At the same time, app attempts are kept in the RMContext until evicted, so usage stats can be pulled from there by an external system that handles persistence/reporting/etc.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799363#comment-13799363 ] Andrey Klochkov commented on YARN-415: -- Arun, the idea is to have the stats updated in real time while the app is running. Is there a way to get a list of running containers assigned to the app, with their start times, without tracking it explicitly?
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n7.patch Thanks Jason. Attaching a fixed patch.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n8.patch Adding changes in the REST API docs to the patch.
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n5.patch Jason, thanks for the review. All your points make sense to me. Attaching a patch with fixes.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795753#comment-13795753 ] Andrey Klochkov commented on YARN-445: -- Vinod, accepting a mapping of arbitrary commands is indeed the most powerful approach. However, it would also require a lot of changes in YARN, as well as additional complexity for app writers. At the same time, are we sure this flexibility is needed, and that it wouldn't be over-engineering and possibly an abstraction leak in the YARN framework? By the latter I mean that we would be giving app writers the ability to run arbitrary commands on any node at any point in time; is it within YARN's responsibilities to do that? I'm not a YARN expert, so I'm just asking. Anyway, the scope of what I have proposed with the patch is much smaller and solves the task stated in the initial description of this JIRA: troubleshooting timed-out containers by dumping jstack. This would be useful for many YARN users, so I thought it might make sense to implement it this way now and extend it in the future if there is demand. I agree that the way it is exposed in the API could be changed to a signal value in the stopContainers request instead of a separate call, which is indeed a bit confusing.
> Ability to signal containers
> ---
>
>                 Key: YARN-445
>                 URL: https://issues.apache.org/jira/browse/YARN-445
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Jason Lowe
>            Assignee: Andrey Klochkov
>         Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch
>
> It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes.
-- This message was sent by Atlassian JIRA (v6.1#6144)
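The API shape discussed in the comment above (a signal value carried in the stop request, rather than a separate call) can be sketched roughly as follows. All type and method names here are hypothetical illustrations, not the actual YARN-445 patch or YARN API.

```java
// Hypothetical sketch: a signal value carried in a container request,
// as discussed above. Illustrative names only, not the real YARN API.
public class SignalSketch {

    // A small set of signals an AM might want delivered to a container.
    enum ContainerSignal { QUIT, TERM, KILL }

    record SignalContainerRequest(String containerId, ContainerSignal signal) {}

    // The NM side would translate the enum into the platform signal.
    static String describe(SignalContainerRequest req) {
        return switch (req.signal()) {
            case QUIT -> "SIGQUIT -> " + req.containerId() + " (thread dump, container keeps running)";
            case TERM -> "SIGTERM -> " + req.containerId() + " (graceful stop)";
            case KILL -> "SIGKILL -> " + req.containerId() + " (forced stop)";
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new SignalContainerRequest("container_01", ContainerSignal.QUIT)));
    }
}
```

A single request type like this covers both the jstack-on-timeout case (QUIT before a later KILL) and plain stops, which is why folding the signal into the existing request was seen as less confusing than a separate call.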
[jira] [Resolved] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov resolved YARN-677. -- Resolution: Won't Fix
> Increase coverage to FairScheduler
> ---
>
>                 Key: YARN-677
>                 URL: https://issues.apache.org/jira/browse/YARN-677
>             Project: Hadoop YARN
>          Issue Type: Test
>    Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6
>            Reporter: Vadim Bondarev
>            Assignee: Andrey Klochkov
>         Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n4.patch Attaching a patch which implements Ctrl+C and uses it instead of Signal.TERM on Windows. I tested it only by manually invoking winutils.exe; I never managed to start Hadoop itself on Windows, although I tried hard. I don't think it makes much sense to split this into two patches for this JIRA and MAPREDUCE-5387. One problem I am not able to solve is the case when a batch script is used to start a container. Using console handlers in that case leads to the batch script waiting on "Terminate batch job? (Y/N)". Even if I know that a particular process in the Job Object is a batch script, I can't avoid sending the console event to it. This may be a problem in scenarios where QUIT/TERM signals are not followed later by KILL, and the process would not exit normally as it should. So the question is: is KILL used in all cases when containers are stopped? Please advise.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n6.patch With the 1st option it's not clear how to implement protection from leaks: there is no event which can be used to check for leaks in that case. At the same time, current YARN behavior does not support containers surviving after the AM has finished, so the 2nd option is acceptable. This may need to change once there is support for long-lived apps and attempts which stay alive after the AM is stopped. Attaching a patch which implements option #2 and adds a test for it.
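Option #2, finalizing usage when the attempt finishes, can be sketched as an accumulator that folds each completed container into a running total and freezes once the attempt is done. The names below are illustrative assumptions, not the classes from the attached patch.

```java
// Illustrative accumulator for per-attempt memory-seconds, finalized when
// the attempt finishes (option #2 above). Hypothetical names, not the patch.
public class AttemptUsageSketch {
    private long memoryMbSeconds;
    private boolean finished;

    // Called when a container completes: add reservedMb * lifetime seconds.
    public synchronized void containerFinished(long reservedMb, long lifetimeSeconds) {
        if (!finished) {
            memoryMbSeconds += reservedMb * lifetimeSeconds;
        }
    }

    // Called once when the attempt finishes; usage is frozen afterwards,
    // so late or duplicate container events cannot leak into the total.
    public synchronized void attemptFinished() {
        finished = true;
    }

    public synchronized long getMemoryMbSeconds() {
        return memoryMbSeconds;
    }

    public static void main(String[] args) {
        AttemptUsageSketch usage = new AttemptUsageSketch();
        usage.containerFinished(1024, 10); // 10240 MB-seconds
        usage.containerFinished(2048, 5);  // +10240 MB-seconds
        usage.attemptFinished();
        usage.containerFinished(512, 100); // late event after finish: ignored
        System.out.println(usage.getMemoryMbSeconds()); // 20480
    }
}
```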
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-trunk--n5.patch YARN-465-branch-2--n5.patch Ravi, the main method in WebAppProxyServer is blocking, and it installs an exception handler which calls System.exit, so it shouldn't be used in tests. Yes, it was a mistake that join() was removed without making the corresponding change in main(); I fixed that. You're also correct about the originalPort logging: I got rid of the confusing port variable and fixed the logging. Also, I made the patches for branch-2 and trunk as similar as possible. Attaching updated patches.
> fix coverage org.apache.hadoop.yarn.server.webproxy
> ---
>
>                 Key: YARN-465
>                 URL: https://issues.apache.org/jira/browse/YARN-465
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
>            Reporter: Aleksey Gorshkov
>            Assignee: Andrey Klochkov
>         Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch
>
> fix coverage org.apache.hadoop.yarn.server.webproxy
> patch YARN-465-trunk.patch for trunk
> patch YARN-465-branch-2.patch for branch-2
> patch YARN-465-branch-0.23.patch for branch-0.23
> There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands:
> mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
> touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788623#comment-13788623 ] Andrey Klochkov commented on YARN-465: -- Actually the only difference is in how HttpServer is instantiated, as in trunk the constructor was deprecated in favor of HttpServer.Builder. It seems this constructor is deprecated in branch-2 as well, so yes, let's apply the trunk patch to both branches. Sorry I didn't notice that earlier.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n5.patch Fixing the failed test.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786506#comment-13786506 ] Andrey Klochkov commented on YARN-445: -- The large diffs in the tests are not due to reformatting but because of refactoring needed to implement an additional test without lots of copy/paste.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786572#comment-13786572 ] Andrey Klochkov commented on YARN-445: -- Steve, the current implementation will send the signal to the java process started by bin/hbase, as it sends it to all processes in the job object, i.e. all processes of the main container process. It could be replaced with sending the signal to all processes in the group instead, and I think the behavior would be the same. BTW, I don't know how to do the opposite on Windows, i.e. how to avoid sending the signal to all processes of the container (so the behavior on Linux is different, as bin/hbase will receive the signal). I think this is fine as long as the difference is documented. In the case of hbase, the shell script can install a custom hook for SIGTERM and do whatever is needed (e.g. send SIGTERM to the java process it started). There is one caveat in Ctrl+Break handling when a batch file starts a java process: 1. The batch file starts the java process. 2. The user sends Ctrl+Break to all processes in the group (or job object); the java process prints a thread dump; the batch file doesn't react yet. 3. The java process completes successfully. 4. The batch file will not exit; it will print "Terminate batch job? (Y/N)" because it received the Ctrl+Break signal earlier. The only way I see to overcome this problem with batch file processes is to identify them somehow (by executable name?) when walking through the processes in the job object, and not send them the signal. Sending Ctrl+Break to batch file processes doesn't make sense anyway, as in newer Windows there's no way to disable or customize Ctrl+Break handling in batch files.
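The workaround proposed in the last paragraph, skipping batch-interpreter processes when walking the job object, can be sketched independently of the Win32 calls. The executable-name predicate below is the hypothetical part; a real implementation would query each PID's image name via the Win32 API before sending the console event.

```java
import java.util.List;
import java.util.Locale;

// Sketch of the filtering idea above: when delivering a console event to all
// processes in a job object, skip cmd.exe so batch scripts never receive
// Ctrl+Break (and never hang on "Terminate batch job? (Y/N)").
public class SignalFilterSketch {

    // Hypothetical predicate on the process image name.
    static boolean shouldSignal(String imageName) {
        String name = imageName.toLowerCase(Locale.ROOT);
        return !name.endsWith("cmd.exe");
    }

    public static void main(String[] args) {
        // Pretend these are the image names of the processes in the job object.
        List<String> jobProcesses = List.of(
            "C:\\Windows\\System32\\cmd.exe",
            "C:\\java\\bin\\java.exe");
        jobProcesses.stream()
            .filter(SignalFilterSketch::shouldSignal)
            .forEach(p -> System.out.println("signal " + p)); // only java.exe is signaled
    }
}
```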
[jira] [Assigned] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-1183: - Assignee: Andrey Klochkov
> MiniYARNCluster shutdown takes several minutes intermittently
> ---
>
>                 Key: YARN-1183
>                 URL: https://issues.apache.org/jira/browse/YARN-1183
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Andrey Klochkov
>            Assignee: Andrey Klochkov
>         Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it affects the performance of tests as app masters eat resources.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-trunk--n4.patch Ravi, this is not my patch, so please keep in mind I'm digging into this code just as you are. Alexey won't be available to make fixes, so I'm taking this on so the contribution isn't lost. 1-2. As I see it, the WebAppProxy.start() method is used in another test, so that should be the reason it's not part of the main method. The join method is removed as it's not used anymore. 3. I think it is meant to log port, not originalPort. The port variable is set in WebAppProxyForTest.start() to the actual port which the server binds to. 4. Indeed, core-default.xml is not needed. I'm replacing it with making this configuration in the code of the test itself. 5. It must be setName("proxy") as this is the name of the webapp under hadoop-yarn-common/src/main/resources/webapps; setting it to anything else would lead to a ClassNotFoundException. I made the message about the port number more detailed. 6. I added a check which verifies that the cookie is present in one case and absent in another. 7. Yes, I don't see why testWebAppProxyServer is needed in the presence of testWebAppProxyServlet. Removing. 8. testWebAppProxyServerMainMethod tests that the server starts successfully; the counter is used to wait for the server to start. Attaching the updated patch for trunk.
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-branch-2--n4.patch Attaching the updated patch for branch-2.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n4.patch Jason, thanks for the thorough review. Attaching the patch with fixes. I basically made all the fixes you proposed except the last one, about capturing the leak.
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n3.patch Attaching the patch that marks all new interfaces/methods as unstable. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785381#comment-13785381 ] Andrey Klochkov commented on YARN-465: -- The robot failed when testing the branch-2 patch against trunk; this is expected. fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-677: Assignee: Andrey Klochkov Increase coverage to FairScheduler -- Key: YARN-677 URL: https://issues.apache.org/jira/browse/YARN-677 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6 Reporter: Vadim Bondarev Assignee: Andrey Klochkov Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785407#comment-13785407 ] Andrey Klochkov commented on YARN-677: -- I looked at the difference in coverage before and after the patch. There are 2 test methods added: 1. testSchedulerHandleFailWithExternalEvents checks that FairScheduler.handle() throws RuntimeException when supplied with a wrong event type. The actual check is missing, so it seems the test will pass in any case. This is a very minor addition to the coverage. If we want to keep it, I can add the check and update the patch. 2. testAggregateCapacityTrackingWithPreemptionEnabled -- not sure about the intention. I see that it adds coverage to the FairScheduler.preemptTasksIfNecessary() method, but basically it just sleeps so that the method is invoked; preemption never happens and the test doesn't make any checks. I think we can skip this one. Should we keep #1? Increase coverage to FairScheduler -- Key: YARN-677 URL: https://issues.apache.org/jira/browse/YARN-677 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6 Reporter: Vadim Bondarev Assignee: Andrey Klochkov Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
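The assertion missing from point 1 could be added along these lines. Handler and UnsupportedEvent are hypothetical stand-ins for FairScheduler and the external test event, since the actual patch code is not shown here:

```java
// Sketch of the check missing from testSchedulerHandleFailWithExternalEvents:
// feed the handler an unsupported event type and require a RuntimeException.
// Handler and UnsupportedEvent are stand-ins, not the real FairScheduler code.
public class HandleFailCheck {
    interface Event {}
    static class UnsupportedEvent implements Event {}

    static class Handler {
        void handle(Event e) {
            // Mimics a scheduler falling through to an error for unknown types.
            throw new RuntimeException("Unknown event type: " + e.getClass().getName());
        }
    }

    /** True only if handle() actually threw, mirroring the check the test lacks. */
    static boolean throwsOnUnknownEvent() {
        try {
            new Handler().handle(new UnsupportedEvent());
            return false; // without this branch, the test would pass vacuously
        } catch (RuntimeException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOnUnknownEvent()); // true
    }
}
```

Without the failure branch after the handle() call, the test passes whether or not the exception is thrown, which is exactly the vacuous-pass problem described in the comment.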
[jira] [Assigned] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-465: Assignee: Andrey Klochkov (was: Aleksey Gorshkov) fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784199#comment-13784199 ] Andrey Klochkov commented on YARN-445: -- Bikas, on Windows the JVM prints a full thread dump on ctrl+break. I think ctrl+c may be emulated in the same way and used in place of TERM on Windows, via the same signalContainers API. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n2.patch Fixing javadoc warnings and the failed test. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784292#comment-13784292 ] Andrey Klochkov commented on YARN-445: -- As I understand it, this Findbugs warning should be ignored, as it's complaining about a valid type cast. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-branch-2--n3.patch YARN-465-trunk--n3.patch Attaching updated patches. setAccessible usage is removed. fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445.patch Attaching a patch that provides the simplest implementation: - winutils is extended with an additional routine that uses console control handlers to emulate ctrl+break on the container. For Java containers it roughly corresponds to the QUIT signal on Linux. - ContainerManagerProtocol is extended with a signalContainers() method which accepts a signal number to send. Currently the implementation accepts the QUIT signal (i.e. value 3) only and rejects the request otherwise. - TestContainerManager is extended accordingly and executed successfully under Windows, OSX and Linux. This provides a simple implementation that would allow troubleshooting containers without killing them, as the initial description of the feature states. If needed we may extract an additional Jira to extend this further by allowing an arbitrary map of commands to be provided in the submission context and then invoked through the NM API. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Attachments: YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. 
-- This message was sent by Atlassian JIRA (v6.1#6144)
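The signal validation described in that patch comment (only QUIT, i.e. signal number 3, is accepted; anything else is rejected) amounts to a check like the following. The class is a hypothetical stand-in for illustration, not the actual ContainerManagerProtocol or winutils code:

```java
// Sketch of the validation described in the YARN-445 patch comment:
// signalContainers() accepts only SIGQUIT (signal number 3) for now and
// rejects any other signal. SignalValidator is an illustrative name only.
public class SignalValidator {
    static final int SIGQUIT = 3; // triggers a JVM thread dump without killing it

    /** Mirrors the described behavior: only QUIT is allowed for now. */
    static boolean isSupported(int signal) {
        return signal == SIGQUIT;
    }

    public static void main(String[] args) {
        System.out.println(isSupported(3)); // true
        System.out.println(isSupported(9)); // false: SIGKILL is rejected
    }
}
```

Restricting the API to SIGQUIT keeps the first patch minimal while leaving room for the arbitrary-command extension floated at the end of the comment.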
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773419#comment-13773419 ] Andrey Klochkov commented on YARN-415: -- The proposed implementation uses events fired by the scheduler to track resource usage, so we start ticking as soon as a container is allocated by the scheduler and stop doing that when the container is completed and the scheduler gets the resources back. Hence, in case a container fails to start for some reason, we'll stop ticking as soon as the RM gets this reported. As for the gap between when a container actually finishes and when the RM gets the report, we don't manage it, i.e. the client will be charged until the RM gets the report. Start time and finish time are both computed by the scheduler, i.e. it's on the RM side. Not sure about rounding off - can you point me to the code which does that? I think we just use what's provided in the ApplicationSubmissionContext, i.e. it shouldn't be rounded off. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415.patch The patch exposes MB-seconds and CPU-seconds through CLI, REST API and UI. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n3.patch Updating the patch with fixes in tests. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n2.patch Updating the patch with fixes of findbugs warnings on multithreaded correctness Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769772#comment-13769772 ] Andrey Klochkov commented on YARN-261: -- On a closer look it is indeed possible to reuse existing events instead of introducing new logic. Will simplify the patch. Xuan, thanks for the suggestion. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-415: Assignee: Andrey Klochkov Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n4.patch This patch implements this with fewer changes in state machines -- it doesn't introduce new transitions in the AppEvent state machine. It still introduces a new event type and adds transitions to the App state machine. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770160#comment-13770160 ] Andrey Klochkov commented on YARN-261: -- Seems that the reported test failures are all caused by java.lang.OutOfMemoryError: unable to create new native thread, shouldn't be relevant to the changes in the patch. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261.patch Attaching a patch implementing application restart feature. Effectively, it's what is described here: a current attempt is killed and a new one is started. It'll work only if there are attempts left and recovery is possible - exactly the same conditions which are used when deciding whether to start a new attempt after a failure in the previous one. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n2.patch Updated the patch to prevent the exception in the log when processing RMAppEventType.ATTEMPT_KILLED. As for failing an attempt instead of adding additional logic -- well, the patch itself consists of: 1) CLI modifications 2) adding the API call and lots of boilerplate needed for that 3) replicating AttemptFailedTransaction into a modified version named AppRestartedTransaction (with different diagnostics) 4) modifying state machines. The 4th part is needed precisely to avoid firing RMAppEventType.ATTEMPT_KILLED. I missed that part in the previous patch. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261--n2.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
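The fourth item, routing a user-initiated restart through its own transition instead of the failure path, can be illustrated with a minimal toy state machine. This is not YARN's actual StateMachineFactory code; the states, events, and diagnostics strings here are invented for illustration.

```java
// Toy state machine illustrating the idea: a user-initiated restart event
// takes its own transition (with its own diagnostics) rather than the
// attempt-failure transition. Not actual YARN code.
public final class ToyAppStateMachine {
    public enum State { RUNNING, RESTARTING, FAILED }
    public enum Event { ATTEMPT_FAILED, APP_RESTARTED }

    private State state = State.RUNNING;
    private String diagnostics = "";

    public State handle(Event event) {
        switch (event) {
            case ATTEMPT_FAILED:
                // Failure path: failure diagnostics, failed state.
                diagnostics = "AM attempt failed";
                state = State.FAILED;
                break;
            case APP_RESTARTED:
                // Restart path: separate transition with separate
                // diagnostics; no failure-style event is fired.
                diagnostics = "AM attempt killed at the user's request";
                state = State.RESTARTING;
                break;
        }
        return state;
    }

    public String getDiagnostics() { return diagnostics; }

    public static void main(String[] args) {
        ToyAppStateMachine sm = new ToyAppStateMachine();
        sm.handle(Event.APP_RESTARTED);
        System.out.println(sm.getDiagnostics());
    }
}
```

The point of replicating the transition rather than reusing it is that the two paths can diverge freely later (different diagnostics now, possibly different retry accounting later) without touching the failure path.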
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n3.patch

Fixing Javadoc warning.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n2.patch

Attaching an updated patch. Renamed the wait method and changed the way it gets notified when app masters register/unregister, so that ApplicationAttemptId is now used as the key.

> MiniYARNCluster shutdown takes several minutes intermittently
> -
>
> Key: YARN-1183
> URL: https://issues.apache.org/jira/browse/YARN-1183
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Andrey Klochkov
> Attachments: YARN-1183--n2.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, M/R tests sometimes leave MRAppMaster Java processes alive for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and the app master then keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it hurts test performance because the lingering app masters consume resources.
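The notification scheme described above can be sketched as a small monitor that tracks live app masters by attempt id and lets shutdown block until all of them have unregistered. This is a hedged sketch, not the actual MiniYARNCluster code; a plain String stands in for ApplicationAttemptId.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the idea in the patch: track registered app masters keyed by
// attempt id, and let cluster shutdown wait until every one of them has
// unregistered. Not the actual MiniYARNCluster implementation.
public final class AppMasterTracker {
    // Attempt ids of app masters that registered but have not yet unregistered.
    private final Set<String> liveAttempts = new HashSet<>();

    public synchronized void registered(String attemptId) {
        liveAttempts.add(attemptId);
    }

    public synchronized void unregistered(String attemptId) {
        liveAttempts.remove(attemptId);
        notifyAll();  // wake any thread blocked in waitForAllFinished()
    }

    /**
     * Block until all registered app masters have unregistered or the
     * timeout elapses. Returns true if all finished in time.
     */
    public synchronized boolean waitForAllFinished(long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!liveAttempts.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;  // timed out; some app masters are still live
            }
            wait(remaining);
        }
        return true;
    }
}
```

Keying on the attempt id rather than, say, the tracking URL matters because the URL can change over an attempt's lifetime, while the attempt id is stable; the guarded-wait loop also tolerates spurious wakeups.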
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766237#comment-13766237 ] Andrey Klochkov commented on YARN-1183: ---

bq. MiniYARNCluster is used by several tests. This might bite us if and when we run tests in parallel.

The concurrency level won't make any difference even then. BTW, I'm actually running the MR tests in parallel now; that's when this issue with the cluster shutdown misbehaving becomes more evident. Thanks for catching the problem with the synchronized block, fixing it.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n4.patch
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n3.patch

Attaching an updated patch.
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765735#comment-13765735 ] Andrey Klochkov commented on YARN-1183: ---

Yep, you're right about trackingUrl being able to change. Please disregard this part.
[jira] [Created] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
Andrey Klochkov created YARN-1183: - Summary: MiniYARNCluster shutdown takes several minutes intermittently Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183.patch

Attaching a patch which modifies MiniYARNCluster so that it waits until all app masters are reported as finished.