[jira] [Commented] (YARN-261) Ability to kill AM attempts

2015-09-04 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731080#comment-14731080
 ] 

Andrey Klochkov commented on YARN-261:
--

[~rohithsharma], please feel free to reassign to yourself. I tried to rebase 
but the patch is old and rebasing is not straightforward.

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-261) Ability to kill AM attempts

2015-09-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-261:


Assignee: (was: Andrey Klochkov)

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-445) Ability to signal containers

2015-05-06 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-445:


Assignee: (was: Andrey Klochkov)

After I submitted the patch for this, alternative proposals and patches were 
discussed, so I'm unassigning myself from this JIRA.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager
Reporter: Jason Lowe
  Labels: BB2015-05-TBR
 Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, 
 YARN-445--n3.patch, YARN-445--n4.patch, 
 YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-11 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130368#comment-14130368
 ] 

Andrey Klochkov commented on YARN-415:
--

[~eepayne], congratulations and thanks for this tremendous amount of 
persistence! :-)

 Capture aggregate memory allocation at the app-level for chargeback
 ---

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Fix For: 2.6.0

 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
 YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
 YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
 YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
 YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
 YARN-415.201409040036.txt, YARN-415.201409092204.txt, 
 YARN-415.201409102216.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
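
The cost above is just a sum of reserved-memory-times-lifetime terms; a minimal, 
self-contained sketch of that aggregation (hypothetical names, not the YARN-415 
implementation) could look like this:

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical aggregator for the MB-seconds metric described in the issue.
public class AppUsageAggregator {

    /** Reserved memory and lifetime of one container. */
    static final class ContainerUsage {
        final long reservedMb;      // memory reserved for the container, in MB
        final long lifetimeMillis;  // time between allocation and release

        ContainerUsage(long reservedMb, long lifetimeMillis) {
            this.reservedMb = reservedMb;
            this.lifetimeMillis = lifetimeMillis;
        }
    }

    /** Sum of (reserved RAM * lifetime) over all containers, in MB-seconds. */
    static long memorySeconds(List<ContainerUsage> containers) {
        long mbSeconds = 0;
        for (ContainerUsage c : containers) {
            mbSeconds += c.reservedMb * (c.lifetimeMillis / 1000);
        }
        return mbSeconds;
    }

    public static void main(String[] args) {
        // Two containers: 1024 MB for 60 s and 2048 MB for 30 s -> 122880 MB-seconds.
        List<ContainerUsage> usage = Arrays.asList(
            new ContainerUsage(1024, 60_000),
            new ContainerUsage(2048, 30_000));
        System.out.println(memorySeconds(usage) + " MB-seconds");
    }
}
{code}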



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-11 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-
Assignee: Eric Payne  (was: Andrey Klochkov)

 Capture aggregate memory allocation at the app-level for chargeback
 ---

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Kendall Thrapp
Assignee: Eric Payne
 Fix For: 2.6.0

 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
 YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
 YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
 YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
 YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
 YARN-415.201409040036.txt, YARN-415.201409092204.txt, 
 YARN-415.201409102216.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-06-03 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017225#comment-14017225
 ] 

Andrey Klochkov commented on YARN-415:
--

[~eepayne], thanks a lot for finishing this. Sorry I never had enough time 
for it. My biggest question - and that's where I got stuck when I tried to 
rebase the patch on the latest trunk - is about the possibility of previous 
attempts being evicted from the Scheduler by subsequent attempts of the same 
app. Currently the Scheduler stores just the latest attempt (in 
{{SchedulerApplication.currentAttempt}}), and when the next attempt starts, the 
whole {{SchedulerApplication}} instance is replaced with a new one. I'm not sure 
whether this can actually happen before the RM retrieves the usage report for 
the previous attempt, but from the code it seems possible. If so, we need to 
somehow keep previous attempts of an app until their reports are retrieved. I 
realize that with the current implementation this cannot be done just by storing 
multiple attempts instead of only the current one in {{SchedulerApplication}}, 
because the Scheduler creates a new {{SchedulerApplication}} instance for every 
attempt. What do you think?
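
A rough, hypothetical sketch of the "keep previous attempts until their usage 
reports are retrieved" idea discussed above (all names are illustrative, not the 
actual Scheduler code):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical store that retains one usage record per attempt, so a newer
// attempt of the same app cannot evict an older attempt's numbers before the
// RM has pulled them.
public class AttemptUsageStore {

    static final class AttemptUsage {
        long memorySeconds;
        long vcoreSeconds;
    }

    private final Map<String, AttemptUsage> attempts = new HashMap<>();

    synchronized void attemptStarted(String attemptId) {
        attempts.putIfAbsent(attemptId, new AttemptUsage());
    }

    synchronized void addUsage(String attemptId, long memSec, long vcoreSec) {
        AttemptUsage usage = attempts.get(attemptId);
        if (usage != null) {
            usage.memorySeconds += memSec;
            usage.vcoreSeconds += vcoreSec;
        }
    }

    // Called only once the report has been consumed; this is the point where
    // the attempt's record is allowed to disappear.
    synchronized AttemptUsage retrieveAndEvict(String attemptId) {
        return attempts.remove(attemptId);
    }

    public static void main(String[] args) {
        AttemptUsageStore store = new AttemptUsageStore();
        store.attemptStarted("appattempt_1");
        store.addUsage("appattempt_1", 61440, 60);
        store.attemptStarted("appattempt_2"); // a new attempt does not evict the old one
        System.out.println(store.retrieveAndEvict("appattempt_1").memorySeconds);
    }
}
{code}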

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-01-21 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878308#comment-13878308
 ] 

Andrey Klochkov commented on YARN-415:
--

I'm updating the patch to make it applicable to the current trunk. It's going 
slowly, but I hope to finish this week.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-11-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n7.patch

Uploading a patch rebased after YARN-891 and with fixes according to Jason's 
comments.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
 YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
 YARN-261--n7.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-26 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n10.patch

Updating the patch:
- got rid of the runningContainers map
- refactored SchedulerApplication by pulling some of the methods up from the 
descendants (allocate, containerCompleted, unreserve), making most of the 
fields private, and modifying containerCompleted to do the resource usage tracking
- made the scheduler not evict the app immediately after the attempt is finished, 
but wait for a signal from RMAppAttemptImpl

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-25 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n9.patch

Updated the patch, moving the tracking logic into the Scheduler:
- AppSchedulingInfo tracks resource usage. Existing methods are reused, and 
overall it seems like the right place for this logic.
- When an app is finished and the Scheduler evicts it from its cache, it sends a 
new type of event (RMAppAttemptAppFinishedEvent) to the attempt, attaching the 
usage stats to the event (a rough sketch of such an event follows below).
- The RMAppAttemptImpl test is modified accordingly.
- A new test is added to verify the resource tracking in AppSchedulingInfo.
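
A rough illustration of the kind of stats-carrying event described in the second 
bullet (illustrative class and field names only, not the patch's actual 
RMAppAttemptAppFinishedEvent):

{code:java}
// Hypothetical event sent from the scheduler to the finished attempt, carrying
// the accumulated usage so the attempt can expose it after the scheduler has
// dropped its own record of the app.
public class AttemptFinishedUsageEvent {

    private final String appAttemptId;
    private final long memorySeconds;  // accumulated MB-seconds for the attempt
    private final long vcoreSeconds;   // accumulated vcore-seconds for the attempt

    public AttemptFinishedUsageEvent(String appAttemptId,
                                     long memorySeconds,
                                     long vcoreSeconds) {
        this.appAttemptId = appAttemptId;
        this.memorySeconds = memorySeconds;
        this.vcoreSeconds = vcoreSeconds;
    }

    public String getAppAttemptId() { return appAttemptId; }
    public long getMemorySeconds() { return memorySeconds; }
    public long getVcoreSeconds() { return vcoreSeconds; }

    public static void main(String[] args) {
        AttemptFinishedUsageEvent event =
            new AttemptFinishedUsageEvent("appattempt_1", 122880, 120);
        System.out.println(event.getAppAttemptId() + ": "
            + event.getMemorySeconds() + " MB-seconds, "
            + event.getVcoreSeconds() + " vcore-seconds");
    }
}
{code}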

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-25 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805804#comment-13805804
 ] 

Andrey Klochkov commented on YARN-415:
--

This scheme has a downside: the stats would be incorrect between two events, 
1) the Scheduler evicting the app from the cache and sending an event, and 2) 
RMAppAttemptImpl receiving the event and updating its internal stats. The only 
idea I have is to add an additional round trip, extending the scheme to:
1. When the app is finished, the Scheduler sends an RMAppAttemptAppFinishedEvent 
instance but does not evict the app from the cache yet.
2. RMAppAttemptImpl receives the event, updates its internal fields 
finalMemorySeconds and finalVcoreSeconds, and sends a new type of event to the 
Scheduler allowing it to evict the app.
3. The Scheduler gets the event and evicts the app.

Thoughts?
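
A minimal sketch of that extra round trip, assuming hypothetical event names 
(the real YARN event types and dispatcher wiring are not shown):

{code:java}
// Hypothetical handshake: the scheduler keeps the app cached until the attempt
// acknowledges that it has stored the final usage numbers, so there is no window
// in which the stats are unavailable.
public class UsageHandoffSketch {

    enum EventType {
        ATTEMPT_APP_FINISHED,   // scheduler -> attempt, carries the final usage stats
        USAGE_STORED_EVICT_OK   // attempt -> scheduler, permits eviction of the app
    }

    // 1. App finishes: scheduler sends ATTEMPT_APP_FINISHED but keeps the app cached.
    // 2. RMAppAttemptImpl stores finalMemorySeconds / finalVcoreSeconds and replies
    //    with USAGE_STORED_EVICT_OK.
    // 3. Only on receiving the ack does the scheduler evict the app from its cache.
    public static void main(String[] args) {
        for (EventType step : EventType.values()) {
            System.out.println("handshake event: " + step);
        }
    }
}
{code}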

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-10-22 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802033#comment-13802033
 ] 

Andrey Klochkov commented on YARN-1183:
---

I don't think the concurrency level can make any difference here. The change was 
requested by Karthik.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
Assignee: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, 
 YARN-1183--n4.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-10-22 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-1183:
--

Attachment: YARN-1183--n5.patch

Attaching an updated patch.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
Assignee: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, 
 YARN-1183--n4.patch, YARN-1183--n5.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-10-22 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802258#comment-13802258
 ] 

Andrey Klochkov commented on YARN-1183:
---

Jonathan, the issue occurred when I simply ran the tests for 
hadoop-mapreduce-client-jobclient and watched for zombie Java processes. It was 
much more visible when using parallel execution, see MAPREDUCE-4980. I observed 
it quite often under OS X (some of the tests did it on every run) and didn't 
see it on a Linux machine I had, and I had different JVMs there. I reproduced 
it later on an unmodified trunk and tracked it down to the MiniYARNCluster 
shutdown. I can't reproduce it on another MacBook I have now, but I think that's 
just due to the nature of the bug (a concurrency issue).

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
Assignee: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, 
 YARN-1183--n4.patch, YARN-1183--n5.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-10-21 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n6.patch

Jason, makes sense. See the updated patch.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
 YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-21 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801286#comment-13801286
 ] 

Andrey Klochkov commented on YARN-415:
--

IMO it makes sense to move this tracking into the scheduler, and in particular 
SchedulerApplication looks like a good place for this logic. I'm wondering why 
SchedulerApplication has everything abstract, while its descendants have a lot 
of the same fields and code. Why isn't the common code placed into 
SchedulerApplication itself? If I'm not missing anything here, I'd move all that 
code and this resource usage tracking straight into SchedulerApplication. 
Please comment.
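
A rough sketch of the kind of base-class refactoring suggested here, with 
hypothetical, simplified names (the real SchedulerApplication hierarchy is more 
involved than this):

{code:java}
// Hypothetical base class: the usage counters and their accumulation live here
// once, so each scheduler's descendant does not duplicate the fields and code.
public abstract class SchedulerApplicationSketch {

    private long memorySeconds;
    private long vcoreSeconds;

    // Common accounting, done identically for every scheduler.
    protected final void recordContainerCompleted(long reservedMb,
                                                  int reservedVcores,
                                                  long lifetimeMillis) {
        long seconds = lifetimeMillis / 1000;
        memorySeconds += reservedMb * seconds;
        vcoreSeconds += (long) reservedVcores * seconds;
    }

    public final long getMemorySeconds() { return memorySeconds; }
    public final long getVcoreSeconds() { return vcoreSeconds; }

    // Scheduler-specific behavior stays abstract in the descendants.
    protected abstract void onContainerCompleted(String containerId);

    public static void main(String[] args) {
        SchedulerApplicationSketch app = new SchedulerApplicationSketch() {
            @Override
            protected void onContainerCompleted(String containerId) { }
        };
        app.recordContainerCompleted(1024, 1, 60_000); // 1 GB, 1 vcore, 60 seconds
        System.out.println(app.getMemorySeconds() + " MB-seconds, "
            + app.getVcoreSeconds() + " vcore-seconds");
    }
}
{code}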

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-21 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801308#comment-13801308
 ] 

Andrey Klochkov commented on YARN-415:
--

Adding YARN-1335 as a dependency.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-21 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801332#comment-13801332
 ] 

Andrey Klochkov commented on YARN-415:
--

On second thought, the scheduler does not seem like a good place for these 
stats. It doesn't keep info on finished apps, so the tracking data would be gone 
as soon as an app is done if the logic were placed in the scheduler. At the same 
time, app attempts are kept in the RMContext until evicted, so usage stats can 
be pulled out of there by an external system that has 
persistence/reporting/etc.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-18 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799363#comment-13799363
 ] 

Andrey Klochkov commented on YARN-415:
--

Arun, the idea is to have the stats updated in real time while the app is 
running. Is there a way to get the list of running containers assigned to the 
app, with their start times, without tracking it explicitly?

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-17 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n7.patch

Thanks Jason. Attaching a fixed patch.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-17 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n8.patch

Adding changes in REST API docs to the patch.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, 
 YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-10-15 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n5.patch

Jason, thanks for the review. All your points make sense to me. Attaching a 
patch with the fixes.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
 YARN-261--n4.patch, YARN-261--n5.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-445) Ability to signal containers

2013-10-15 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795753#comment-13795753
 ] 

Andrey Klochkov commented on YARN-445:
--

Vinod,
Accepting a mapping of arbitrary commands is indeed the most powerful approach. 
However, it would also require lots of changes in YARN, as well as additional 
complexity for app writers. At the same time, are we sure that this flexibility 
is needed, and that it wouldn't be over-engineering and probably an abstraction 
leak in the YARN framework? By the latter I mean that we would give app writers 
the ability to run arbitrary commands on any node at any point in time - but is 
that within YARN's responsibilities? I'm not a YARN expert, so I'm just 
asking.

Anyway, the scope of what I have proposed with the patch is much smaller and 
solves the task stated in the initial description of this JIRA - troubleshooting 
timed-out containers by dumping jstack. This would be useful for many YARN 
users, so I thought it may make sense to implement it this way now and extend it 
in the future if there is demand. I agree that the way it is exposed in the API 
could be changed to a signal value in the stopContainers request instead of a 
separate call, which is indeed a bit confusing.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-445--n2.patch, YARN-445--n3.patch, 
 YARN-445--n4.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (YARN-677) Increase coverage to FairScheduler

2013-10-15 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov resolved YARN-677.
--

Resolution: Won't Fix

 Increase coverage to FairScheduler
 --

 Key: YARN-677
 URL: https://issues.apache.org/jira/browse/YARN-677
 Project: Hadoop YARN
  Issue Type: Test
Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6
Reporter: Vadim Bondarev
Assignee: Andrey Klochkov
 Attachments: HADOOP-4536-branch-2-a.patch, 
 HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, 
 HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, 
 HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, 
 HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, 
 HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-445) Ability to signal containers

2013-10-14 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-445:
-

Attachment: YARN-445--n4.patch

Attaching a patch which implements Ctrl+C and uses it instead of Signal.TERM on 
Windows. I tested it only by manually invoking winutils.exe; I never managed to 
start Hadoop itself on Windows, although I tried hard.

I don't think it makes much sense to split this into two patches for this JIRA 
and MAPREDUCE-5387.

One problem I am not able to solve is the case when a batch script is used to 
start a container. Using console handlers in that case leads to the batch script 
waiting on the "Terminate batch job? Y/N" prompt. Even if I know that a 
particular process in the Job Object is a batch script, I can't avoid sending 
the console event to it. This may be a problem in scenarios where QUIT/TERM 
signals are not followed later by KILL, and the process would not exit normally 
as it should. So the question is: is KILL used in all cases when containers are 
stopped? Please advise.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445--n3.patch, 
 YARN-445--n4.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-10 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n6.patch

With the 1st option it's not clear how to implement protection from leaks; 
there's no event which can be used to check for leaks in that case. At the same 
time, current YARN behavior does not support containers surviving after the AM 
is finished, so the 2nd option is acceptable. This may need to change when there 
is support for long-lived apps and attempts which stay alive after the AM is 
stopped.

Attaching a patch which implements option #2 and adds a test for it.


 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-07 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-465:
-

Attachment: YARN-465-trunk--n5.patch
YARN-465-branch-2--n5.patch

Ravi,
The main method in WebAppProxyServer is blocking and installs an exception 
handler which calls System.exit, so it shouldn't be used in tests. Yes, it was a 
mistake that join() was removed without making the corresponding change in 
main() - I fixed that.

Yes, you're correct about the originalPort logging. I got rid of this confusing 
port variable and fixed logging.

Also, I made patches for branch-2 and trunk as similar as possible.

Attaching updated patches.

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Andrey Klochkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, 
 YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, 
 YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run the following commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-07 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788623#comment-13788623
 ] 

Andrey Klochkov commented on YARN-465:
--

Actually the only difference is in how HttpServer is instantiated, as in trunk 
the constructor was deprecated in favor of HttpServer.Builder. It seems this 
constructor is deprecated in branch-2 as well, so yes, let's apply the trunk 
patch to both branches. Sorry I didn't notice that earlier.

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Andrey Klochkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, 
 YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, 
 YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run the following commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-07 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n5.patch

Fixing the failed test.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415--n5.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-445) Ability to signal containers

2013-10-04 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786506#comment-13786506
 ] 

Andrey Klochkov commented on YARN-445:
--

The large diffs in the tests are not due to reformatting but to the
refactoring needed to implement an additional test without lots of copy/paste.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-445) Ability to signal containers

2013-10-04 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786572#comment-13786572
 ] 

Andrey Klochkov commented on YARN-445:
--

Steve, the current implementation will send the signal to the java process
started with bin/hbase, as it sends the signal to all processes in the job
object, i.e. all processes of the main container process. It could be replaced
with sending the signal to all processes in the group instead, and I think the
behavior would be the same.

BTW, I don't know how to do the opposite on Windows, i.e. how to avoid sending
the signal to all processes of the container (so the behavior on Linux is
different, as there bin/hbase will receive the signal). I think this is fine
as long as the difference is documented. In the case of hbase, the shell
script can install a custom hook for SIGTERM and do whatever is needed in that
case (e.g. send SIGTERM to the java process it started).

There is one caveat in ctrl+break handling when a batch file starts a java
process:
1. The batch file starts the java process.
2. The user sends ctrl+break to all processes in the group (or job object).
The java process prints a thread dump; the batch file doesn't react yet.
3. The java process completes successfully.
4. The batch file will not exit; it will print "Terminate batch job? (Y/N)"
because it received the ctrl+break signal earlier.

The only way I see to overcome this problem with batch file processes is to
identify them somehow (by executable name?) when walking through the processes
in the job object, and not send them the signal. Sending ctrl+break to batch
file processes doesn't make sense anyway, as in newer Windows there is no way
to disable or customize ctrl+break handling in batch files.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-10-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-1183:
-

Assignee: Andrey Klochkov

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
Assignee: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, 
 YARN-1183--n4.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-465:
-

Attachment: YARN-465-trunk--n4.patch

Ravi, this is not my patch, so please keep in mind that I'm digging into this
code just as you are. Aleksey won't be available to make fixes, so I'm taking
this on so the contribution isn't lost.

1-2. As far as I can see, the WebAppProxy.start() method is used in another
test, so that should be the reason it's not part of the main method. The join
method is removed as it's no longer used.
3. I think it is meant to log port, not originalPort. The port variable is set
in WebAppProxyForTest.start() to the actual port the server binds to.
4. Indeed, core-default.xml is not needed. I'm replacing it by making this
configuration in the code of the test itself.
5. It must be setName("proxy") as this is the name of the webapp under
hadoop-yarn-common/src/main/resources/webapps. Setting it to anything else
would lead to a ClassNotFoundException. I made the message about the port
number more detailed.
6. I added a check which verifies that the cookie is present in one case and
absent in the other.
7. Yes, I don't see why testWebAppProxyServer is needed in the presence of
testWebAppProxyServlet. Removing it.
8. The test testWebAppProxyServerMainMethod verifies that the server starts
successfully. The counter is used to wait for the server to start (a rough
sketch of that wait pattern follows below).

Attaching the updated patch for trunk
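
Regarding point 8, here is a minimal, self-contained sketch of the
wait-for-startup idea; it is purely illustrative, uses a latch in place of
the test's actual counter, and is not code from the patch:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a latch plays the role of the "counter" mentioned in
// point 8, letting the caller block until the server thread reports startup.
public class ServerStartWaitSketch {

    public static void main(String[] args) throws InterruptedException {
        final CountDownLatch started = new CountDownLatch(1);

        Thread server = new Thread(new Runnable() {
            @Override
            public void run() {
                // ... bind sockets and start the web app here ...
                started.countDown(); // signal that startup is done
                // ... serve requests ...
            }
        });
        server.setDaemon(true);
        server.start();

        // Block here until startup is signalled, or give up after 10 seconds.
        if (!started.await(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("server did not start in time");
        }
        System.out.println("server is up, assertions can run now");
    }
}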

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Andrey Klochkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run these commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-465:
-

Attachment: YARN-465-branch-2--n4.patch

Attaching the updated patch for branch-2

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Andrey Klochkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, 
 YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, 
 YARN-465-trunk--n4.patch, YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run these commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-10-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n4.patch

Jason, thanks for the thorough review. Attaching the patch with fixes. I 
basically made all the fixes you're proposing except the last one about 
capturing the leak.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, 
 YARN-415--n4.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-445) Ability to signal containers

2013-10-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-445:
-

Attachment: YARN-445--n3.patch

Attaching the patch that marks all new interfaces/methods as unstable.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-03 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785381#comment-13785381
 ] 

Andrey Klochkov commented on YARN-465:
--

The robot failed when testing the branch-2 patch against trunk; this is
expected.

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Aleksey Gorshkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run these commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-677) Increase coverage to FairScheduler

2013-10-03 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-677:


Assignee: Andrey Klochkov

 Increase coverage to FairScheduler
 --

 Key: YARN-677
 URL: https://issues.apache.org/jira/browse/YARN-677
 Project: Hadoop YARN
  Issue Type: Test
Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6
Reporter: Vadim Bondarev
Assignee: Andrey Klochkov
 Attachments: HADOOP-4536-branch-2-a.patch, 
 HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, 
 HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, 
 HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, 
 HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, 
 HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-677) Increase coverage to FairScheduler

2013-10-03 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785407#comment-13785407
 ] 

Andrey Klochkov commented on YARN-677:
--

I looked at the difference in coverage before and after the patch. There are 2
test methods added:
1. testSchedulerHandleFailWithExternalEvents checks that FairScheduler.handle()
throws a RuntimeException when supplied with a wrong event type. The actual
check is missing, though, so it seems the test will pass in any case. This is
a very minor addition to the coverage. If we want to keep it, I can add the
check (sketched below) and update the patch.
2. testAggregateCapacityTrackingWithPreemptionEnabled -- I'm not sure about the
intention. I see that it adds coverage to the
FairScheduler.preemptTasksIfNecessary() method, but basically it just sleeps so
that the method gets invoked, preemption never happens, and the test does not
make any checks. I think we can skip this one.

Should we keep #1?
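
For reference, a minimal JUnit sketch of the kind of assertion that #1 is
missing; the handle method here is only a stand-in for the real
FairScheduler.handle(), so treat it as an illustration, not test code from the
patch:

import static org.junit.Assert.fail;

import org.junit.Test;

// Hypothetical sketch: "handle" stands in for FairScheduler.handle(); the
// point is the fail()/catch pattern that makes the test actually assert that
// an exception is thrown for an unsupported event type.
public class SchedulerHandleFailureSketch {

    private static void handle(String eventType) {
        // Stand-in behavior: reject unknown event types.
        throw new RuntimeException("Unknown event type: " + eventType);
    }

    @Test
    public void rejectsUnknownEventType() {
        try {
            handle("SOME_UNSUPPORTED_EVENT");
            fail("handle() should have thrown for an unsupported event type");
        } catch (RuntimeException expected) {
            // Expected: the scheduler must not silently ignore unknown events.
        }
    }
}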

 Increase coverage to FairScheduler
 --

 Key: YARN-677
 URL: https://issues.apache.org/jira/browse/YARN-677
 Project: Hadoop YARN
  Issue Type: Test
Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6
Reporter: Vadim Bondarev
Assignee: Andrey Klochkov
 Attachments: HADOOP-4536-branch-2-a.patch, 
 HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, 
 HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, 
 HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, 
 HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, 
 HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-03 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-465:


Assignee: Andrey Klochkov  (was: Aleksey Gorshkov)

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Andrey Klochkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run these commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-445) Ability to signal containers

2013-10-02 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784199#comment-13784199
 ] 

Andrey Klochkov commented on YARN-445:
--

Bikas, on Windows the JVM prints a full thread dump on ctrl+break. I think
ctrl+c may be emulated in the same way and used in place of TERM on Windows,
via the same signalContainers API.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-445) Ability to signal containers

2013-10-02 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-445:
-

Attachment: YARN-445--n2.patch

Fixing javadoc warnings and the failed test.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-445) Ability to signal containers

2013-10-02 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784292#comment-13784292
 ] 

Andrey Klochkov commented on YARN-445:
--

As I understand it, this Findbugs warning should be ignored, as it's
complaining about a valid type cast.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
 Attachments: YARN-445--n2.patch, YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy

2013-10-02 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-465:
-

Attachment: YARN-465-branch-2--n3.patch
YARN-465-trunk--n3.patch

Attaching updated patches. setAccessible usage is removed.

 fix coverage  org.apache.hadoop.yarn.server.webproxy
 

 Key: YARN-465
 URL: https://issues.apache.org/jira/browse/YARN-465
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
Reporter: Aleksey Gorshkov
Assignee: Aleksey Gorshkov
 Attachments: YARN-465-branch-0.23-a.patch, 
 YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, 
 YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, 
 YARN-465-trunk--n3.patch, YARN-465-trunk.patch


 fix coverage  org.apache.hadoop.yarn.server.webproxy
 patch YARN-465-trunk.patch for trunk
 patch YARN-465-branch-2.patch for branch-2
 patch YARN-465-branch-0.23.patch for branch-0.23
 There is an issue in branch-0.23: the patch does not create the .keep file.
 To fix it, run these commands:
 mkdir 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
 touch 
 yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
  



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-445) Ability to signal containers

2013-10-01 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-445:
-

Attachment: YARN-445.patch

Attaching a patch that provides the simplest implementation:
- winutils is extended with an additional routine that uses console control
handlers to emulate ctrl+break on the container. For Java containers this
roughly corresponds to the QUIT signal on Linux.
- ContainerManagerProtocol is extended with a signalContainers() method which
accepts a signal number to send. Currently the implementation accepts only the
QUIT signal (i.e. value 3) and rejects the request otherwise (a toy sketch of
that check follows below).
- TestContainerManager is extended accordingly and executed successfully under
Windows, OSX and Linux.

This provides a simple implementation that would allow troubleshooting
containers without killing them, as the initial description of the feature
states. If needed, we may file an additional JIRA to extend this further by
allowing an arbitrary map of commands to be provided in the submission context
and then invoked through the NM API.
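
As a toy illustration of the "accept QUIT only" behavior described in the
second bullet above -- the constant, class name, and exception type are
assumptions made for the sketch, not the patch's actual API:

// Hypothetical sketch: mirrors the described behavior of accepting only the
// QUIT signal (value 3) and rejecting everything else. Not the patch's code.
public class SignalValidationSketch {

    private static final int SIGQUIT = 3;

    static void validateSignal(int signal) {
        if (signal != SIGQUIT) {
            throw new IllegalArgumentException(
                "Only QUIT (" + SIGQUIT + ") is supported, got: " + signal);
        }
    }

    public static void main(String[] args) {
        validateSignal(3); // accepted
        try {
            validateSignal(9); // rejected, as described above
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}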

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
 Attachments: YARN-445.patch


 It would be nice if an ApplicationMaster could send signals to containers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-09-20 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773419#comment-13773419
 ] 

Andrey Klochkov commented on YARN-415:
--

The proposed implementation uses events fired by the scheduler to track
resource usage, so we start ticking as soon as a container is allocated by the
scheduler and stop doing that when the container is completed and the
scheduler gets the resources back. Hence, in case a container fails to start
for some reason, we'll stop ticking as soon as the RM gets this reported. As
for the gap between when a container actually finishes and when the RM gets
the report, we don't manage it, i.e. the client will be charged until the RM
gets the report. Start time and finish time are both computed by the
scheduler, i.e. on the RM side.

Not sure about rounding off - can you point me to the code that does that? I
think we just use what's provided in the ApplicationSubmissionContext, i.e. it
shouldn't be rounded off.
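
To make the charging formula from the issue description concrete, here is a
minimal arithmetic sketch; the class and method names are made up for
illustration and are not part of the patch:

// Illustrative only: computes the MB-seconds charge described in this issue.
public class MemorySecondsExample {

    // reservedMb[i]  = memory reserved for container i, in MB
    // lifetimeSec[i] = lifetime of container i, in seconds
    static long memorySeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            total += reservedMb[i] * lifetimeSec[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Two 1024 MB containers that ran for 600 s and 300 s:
        // 1024*600 + 1024*300 = 921600 MB-seconds
        long[] mb = {1024, 1024};
        long[] sec = {600, 300};
        System.out.println(memorySeconds(mb, sec));
    }
}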

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-09-19 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415.patch

The patch exposes MB-seconds and CPU-seconds through CLI, REST API and UI. 

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-09-19 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n3.patch

Updating the patch with fixes in tests.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-09-19 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-415:
-

Attachment: YARN-415--n2.patch

Updating the patch with fixes for Findbugs warnings on multithreaded
correctness.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n2.patch, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-261) Ability to kill AM attempts

2013-09-17 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769772#comment-13769772
 ] 

Andrey Klochkov commented on YARN-261:
--

On a closer look it is indeed possible to reuse existing events instead of 
introducing new logic. Will simplify the patch. Xuan, thanks for the suggestion.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-415) Capture memory utilization at the app-level for chargeback

2013-09-17 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-415:


Assignee: Andrey Klochkov

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov

 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-09-17 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n4.patch

This patch implements the feature with fewer changes in the state machines --
it doesn't introduce new transitions in the AppEvent state machine. It still
introduces a new event type and adds transitions to the App state machine.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
 YARN-261--n4.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-261) Ability to kill AM attempts

2013-09-17 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770160#comment-13770160
 ] 

Andrey Klochkov commented on YARN-261:
--

It seems that the reported test failures are all caused by
java.lang.OutOfMemoryError: unable to create new native thread, so they
shouldn't be related to the changes in the patch.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
 YARN-261--n4.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-09-16 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261.patch

Attaching a patch implementing the application restart feature. Effectively,
it's what is described here: the current attempt is killed and a new one is
started. It'll work only if there are attempts left and recovery is possible -
exactly the same conditions that are used when deciding whether to start a new
attempt after a failure of the previous one.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
 Attachments: YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-09-16 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n2.patch

Updated the patch to prevent the exception in the log when processing 
RMAppEventType.ATTEMPT_KILLED. 

On failing an attempt instead of adding new logic - well, the patch itself
consists of: 1) CLI modifications, 2) adding the API call and the boilerplate
needed for that, 3) replicating AttemptFailedTransaction into a modified
version named AppRestartedTransaction (with different diagnostics), and 4)
modifying the state machines. The 4th part is needed precisely to avoid firing
RMAppEventType.ATTEMPT_KILLED. I missed that part in the previous patch.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
 Attachments: YARN-261--n2.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-261) Ability to kill AM attempts

2013-09-16 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-261:
-

Attachment: YARN-261--n3.patch

Fixing Javadoc warning.

 Ability to kill AM attempts
 ---

 Key: YARN-261
 URL: https://issues.apache.org/jira/browse/YARN-261
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api
Affects Versions: 2.0.3-alpha
Reporter: Jason Lowe
 Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261.patch


 It would be nice if clients could ask for an AM attempt to be killed.  This 
 is analogous to the task attempt kill support provided by MapReduce.
 This feature would be useful in a scenario where AM retries are enabled, the 
 AM supports recovery, and a particular AM attempt is stuck.  Currently if 
 this occurs the user's only recourse is to kill the entire application, 
 requiring them to resubmit a new application and potentially breaking 
 downstream dependent jobs if it's part of a bigger workflow.  Killing the 
 attempt would allow a new attempt to be started by the RM without killing the 
 entire application, and if the AM supports recovery it could potentially save 
 a lot of work.  It could also be useful in workflow scenarios where the 
 failure of the entire application kills the workflow, but the ability to kill 
 an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-13 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-1183:
--

Attachment: YARN-1183--n2.patch

Attaching an updated patch. I updated the name of the wait method and changed
the way it gets notified when app masters register/unregister, so now
ApplicationAttemptId is used as the key.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-13 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766237#comment-13766237
 ] 

Andrey Klochkov commented on YARN-1183:
---

bq. MiniYARNCluster is used by several tests. This might bite us if and when we 
run tests parallely.
The concurrency level won't make any difference even then. BTW, I'm actually
running MR tests in parallel now; that's when this issue with the cluster
shutdown working incorrectly becomes more evident.

Thanks for catching the issue with the synchronized block; fixing it.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-13 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-1183:
--

Attachment: YARN-1183--n4.patch

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, 
 YARN-1183--n4.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-13 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-1183:
--

Attachment: YARN-1183--n3.patch

Attaching an updated patch

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-12 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765735#comment-13765735
 ] 

Andrey Klochkov commented on YARN-1183:
---

Yep, you're right about trackingUrl being able to change. Please disregard this 
part.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-11 Thread Andrey Klochkov (JIRA)
Andrey Klochkov created YARN-1183:
-

 Summary: MiniYARNCluster shutdown takes several minutes 
intermittently
 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov


As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
processes living for several minutes after successful completion of the 
corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
logic which leads to this. Sometimes RM stops before an app master sends its 
last report, and then the app master keeps retrying for 6 minutes. In some 
cases it leads to failures in subsequent tests, and it affects performance of 
tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-11 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov updated YARN-1183:
--

Attachment: YARN-1183.patch

Attaching a patch which modifies MiniYARNCluster so that it waits until all
app masters are reported as finished.
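
A minimal sketch of that shutdown-barrier idea; the class name, the String
keys, and the polling loop are assumptions made for illustration -- the actual
patch hooks into the RM's notifications and uses YARN's own types:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tracks running app masters and lets shutdown wait
// until all of them have unregistered, with a timeout as a safety net.
public class AppMasterTracker {

    private final Set<String> runningAttempts =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    public void attemptRegistered(String attemptId) {
        runningAttempts.add(attemptId);
    }

    public void attemptUnregistered(String attemptId) {
        runningAttempts.remove(attemptId);
    }

    // Block until no app masters are left running, or the timeout expires.
    public boolean waitForAllAppMastersToFinish(long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!runningAttempts.isEmpty()
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(100); // poll until the set drains or we time out
        }
        return runningAttempts.isEmpty();
    }
}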

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira