[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731080#comment-14731080 ]

Andrey Klochkov commented on YARN-261:
--------------------------------------

[~rohithsharma], please feel free to reassign it to yourself. I tried to rebase, but the patch is old and rebasing is not straightforward.

> Ability to kill AM attempts
> ---------------------------
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: api
> Affects Versions: 2.0.3-alpha
> Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, YARN-261--n7.patch, YARN-261.patch
>
> It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs, the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov reassigned YARN-261:
------------------------------------

Assignee: (was: Andrey Klochkov)
[jira] [Assigned] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov reassigned YARN-445:
------------------------------------

Assignee: (was: Andrey Klochkov)

After I submitted the patch for this, alternative proposals and patches were discussed, so I'm unassigning myself from this JIRA.

> Ability to signal containers
> ----------------------------
>
> Key: YARN-445
> URL: https://issues.apache.org/jira/browse/YARN-445
> Project: Hadoop YARN
> Issue Type: Task
> Components: nodemanager
> Reporter: Jason Lowe
> Labels: BB2015-05-TBR
> Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png
>
> It would be nice if an ApplicationMaster could send signals to containers, such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21, the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However, that would not address other potential features, like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes.
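The jstack-on-demand use case above boils down to delivering SIGQUIT to the container's JVM, which makes HotSpot print a thread dump to stderr without terminating the process. A minimal sketch of that operation on the NodeManager host follows; this is illustrative only, not YARN code, and `signalProcess` is a hypothetical helper:

```java
// Illustrative sketch, not part of any YARN API: deliver a signal to a
// container's JVM by pid using kill(1). Sending "QUIT" makes a HotSpot JVM
// dump its threads to stderr without exiting; signal "0" merely checks that
// the process exists (useful for testing the plumbing safely).
public class SignalSketch {
    static int signalProcess(long pid, String signalName) throws Exception {
        Process kill = new ProcessBuilder("kill", "-" + signalName, Long.toString(pid))
                .inheritIO()
                .start();
        return kill.waitFor(); // 0 on success, non-zero if delivery failed
    }
}
```

An AM-facing version of this would go through an RM or NM RPC rather than shelling out, which is exactly the interface this issue asks for.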
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130368#comment-14130368 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

[~eepayne], congratulations, and thanks for this tremendous amount of persistence! :-)

> Capture aggregate memory allocation at the app-level for chargeback
> -------------------------------------------------------------------
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: Kendall Thrapp
> Assignee: Andrey Klochkov
> Fix For: 2.6.0
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it.
>
> (reserved ram for container 1 * lifetime of container 1) +
> (reserved ram for container 2 * lifetime of container 2) + ... +
> (reserved ram for container n * lifetime of container n)
>
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web Services REST API.
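The chargeback formula in the description can be sketched in a few lines. This is a hypothetical illustration of the arithmetic only (class and method names are made up, not the YARN implementation): aggregate MB-seconds is the sum, over all containers of the app, of reserved memory times lifetime.

```java
// Hypothetical sketch of the aggregate MB-seconds formula:
//   sum over containers of (reserved RAM in MB) * (lifetime in seconds).
// The memory charged is what was *reserved*, not what was actually used,
// since reserved memory was unavailable to everyone else.
public class MemorySeconds {
    static long aggregateMbSeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            total += reservedMb[i] * lifetimeSec[i]; // one term per container
        }
        return total;
    }
}
```

For example, two containers reserving 1024 MB for 60 s and 2048 MB for 30 s both contribute 61440 MB-seconds, for an app total of 122880 MB-seconds.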
[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Assignee: Eric Payne (was: Andrey Klochkov)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017225#comment-14017225 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

[~eepayne], thanks a lot for finishing this. Sorry I never had enough time for it. My biggest question, and it's where I got stuck when I tried to rebase the patch onto the latest trunk, is about the possibility of previous attempts being evicted from the Scheduler by subsequent attempts of the same app. Currently the Scheduler stores just the latest attempt (in {{SchedulerApplication.currentAttempt}}), and when the next attempt starts, the whole {{SchedulerApplication}} instance is replaced with a new one. I'm not sure whether this can in fact happen before the usage report for the previous attempt is retrieved by the RM, but from the code it seems possible. If so, we need to somehow keep previous attempts of an app until the reports for them are retrieved. I realize that with the current implementation this cannot be done just by storing multiple attempts instead of only the current one in {{SchedulerApplication}}, because the Scheduler creates a new {{SchedulerApplication}} instance for every attempt. What do you think?
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878308#comment-13878308 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

I'm updating the patch to make it applicable to the current trunk. Going slowly, but I hope to finish this week.
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-261:
---------------------------------

Attachment: YARN-261--n7.patch

Uploading a patch rebased after YARN-891 and with fixes according to Jason's comments.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Attachment: YARN-415--n10.patch

Updating the patch:
- got rid of the runningContainers map
- refactored SchedulerApplication by pulling some methods up from its descendants (allocate, containerCompleted, unreserve), making most of the fields private, and modifying containerCompleted to do resource usage tracking
- made the scheduler not evict an app immediately after the attempt finishes, but instead wait for a signal from RMAppAttemptImpl
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-415:
---------------------------------

Attachment: YARN-415--n9.patch

Updated the patch, moving the tracking logic into the Scheduler:
- AppSchedulingInfo tracks resource usage. Existing methods are reused, and overall it seems more like the right place for this logic.
- When an app finishes and the Scheduler evicts it from its cache, it sends a new type of event (RMAppAttemptAppFinishedEvent) to the attempt, attaching the usage stats to the event.
- The RMAppAttemptImpl test is modified accordingly.
- A new test is added to verify resource tracking in AppSchedulingInfo.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805804#comment-13805804 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

This scheme has a downside: the stats would be incorrect between two events, 1) the Scheduler evicting the app from the cache and sending an event, and 2) RMAppAttemptImpl receiving the event and updating its internal stats. The only idea I have is to add a roundtrip, extending the scheme to:
1. When the app finishes, the Scheduler sends an RMAppAttemptAppFinishedEvent instance but does not yet evict the app from the cache.
2. RMAppAttemptImpl receives the event, updates its internal fields finalMemorySeconds and finalVcoreSeconds, and sends a new type of event back to the Scheduler, allowing it to evict the app.
3. The Scheduler gets the event and evicts the app.
Thoughts?
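The three-step handshake described above can be sketched as plain method calls. This is a synchronous toy model, not YARN code: in the real RM these steps would be asynchronous dispatcher events, and all names here (EvictionHandshake, appFinished) are illustrative. The point it demonstrates is the ordering guarantee: the scheduler's cache entry outlives the stats handoff, so no report can observe a gap between eviction and the attempt recording its final numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed handshake: the scheduler keeps the finished app
// cached until the attempt has recorded the final usage stats and acknowledged.
public class EvictionHandshake {
    final Map<String, Long> cache = new HashMap<>();      // appId -> MB-seconds (scheduler side)
    final Map<String, Long> finalStats = new HashMap<>(); // appId -> MB-seconds (attempt side)

    // Step 1: app finishes; send stats to the attempt but keep the cache entry.
    void appFinished(String appId) {
        attemptReceivesStats(appId, cache.get(appId));
    }

    // Step 2: the attempt records the final stats, then acknowledges.
    void attemptReceivesStats(String appId, long stats) {
        finalStats.put(appId, stats);
        ackEviction(appId);
    }

    // Step 3: only now may the scheduler evict the app.
    void ackEviction(String appId) {
        cache.remove(appId);
    }
}
```

At every point in this flow, the app's usage is readable from either `cache` or `finalStats`, which is the invariant the extra roundtrip buys.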
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802033#comment-13802033 ]

Andrey Klochkov commented on YARN-1183:
---------------------------------------

I don't think the concurrency level can make any difference here. The change was requested by Karthik.

> MiniYARNCluster shutdown takes several minutes intermittently
> -------------------------------------------------------------
>
> Key: YARN-1183
> URL: https://issues.apache.org/jira/browse/YARN-1183
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, sometimes M/R tests leave MRAppMaster Java processes living for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it affects test performance as the app masters eat resources.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-1183:
----------------------------------

Attachment: YARN-1183--n5.patch

Attaching an updated patch.
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802258#comment-13802258 ]

Andrey Klochkov commented on YARN-1183:
---------------------------------------

Jonathan, the issue occurred when I simply ran the tests for hadoop-mapreduce-client-jobclient and watched for zombie Java processes. It was much more visible when using parallel execution; see MAPREDUCE-4980. I observed it quite often under OS X (some of the tests did it on every run) and didn't see it on a Linux machine I had, and I had different JVMs there. I reproduced it later on an unmodified trunk and tracked it down to the MiniYARNCluster shutdown. I can't reproduce it on another MacBook I have now, but I think that's just due to the nature of the bug (a concurrency issue).
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Klochkov updated YARN-261:
---------------------------------

Attachment: YARN-261--n6.patch

Jason, makes sense. See the updated patch.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801286#comment-13801286 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

IMO it makes sense to move this tracking into the scheduler, and in particular SchedulerApplication looks like a good place for this logic. I'm wondering why SchedulerApplication has everything abstract, while its descendants share a lot of the same fields and code. Why isn't the common code placed into SchedulerApplication itself? If I'm not missing anything here, I'd move all that code, along with this resource usage tracking, into SchedulerApplication. Please comment.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801308#comment-13801308 ]

Andrey Klochkov commented on YARN-415:
--------------------------------------

Adding YARN-1335 as a dependency.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801332#comment-13801332 ] Andrey Klochkov commented on YARN-415: -- On second thought, the scheduler does not seem like a good place for these stats. It doesn't keep info on finished apps, so if the logic is placed in the scheduler the tracking data will be gone as soon as an app is done. At the same time, app attempts are kept in the RMContext until evicted, so usage stats can be pulled from there by an external system that handles persistence/reporting/etc.
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799363#comment-13799363 ] Andrey Klochkov commented on YARN-415: -- Arun, the idea is to have the stats updated in real time while the app is running. Is there a way to get a list of running containers assigned to the app, with their start times, without tracking it explicitly?
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n7.patch Thanks Jason. Attaching a fixed patch.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n8.patch Adding changes in the REST API docs to the patch.
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n5.patch Jason, thanks for the review. All your points make sense to me. Attaching a patch with fixes.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795753#comment-13795753 ] Andrey Klochkov commented on YARN-445: -- Vinod, accepting a mapping of arbitrary commands is indeed the most powerful approach. However, it would also require a lot of changes in YARN, as well as additional complexity for app writers. At the same time, are we sure this flexibility is needed, and that it wouldn't be over-engineering and possibly an abstraction leak in the YARN framework? By the latter I mean that we would be giving app writers the ability to run arbitrary commands on any node at any point in time; is it within YARN's responsibilities to do that? I'm not a YARN expert, so I'm just asking. Anyway, the scope of what I have proposed with the patch is much smaller and solves the task stated in the initial description of this JIRA: troubleshooting timed-out containers by dumping jstack. This would be useful for many YARN users, so I thought it might make sense to implement it this way now and extend it in the future if there is demand. I agree that the way it is exposed in the API could be changed to a signal value in the stopContainers request instead of a separate call, which is indeed a bit confusing.
> Ability to signal containers
> ---
>
>                 Key: YARN-445
>                 URL: https://issues.apache.org/jira/browse/YARN-445
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Jason Lowe
>            Assignee: Andrey Klochkov
>         Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445.patch
>
> It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes.
-- This message was sent by Atlassian JIRA (v6.1#6144)
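The API shape discussed in the comment above (a signal value carried in the stop request, rather than a separate call) can be sketched roughly as follows. All type and method names here are hypothetical illustrations, not the actual YARN-445 patch or YARN API.

```java
// Hypothetical sketch: a signal value carried in a container request,
// as discussed above. Illustrative names only, not the real YARN API.
public class SignalSketch {

    // A small set of signals an AM might want delivered to a container.
    enum ContainerSignal { QUIT, TERM, KILL }

    record SignalContainerRequest(String containerId, ContainerSignal signal) {}

    // The NM side would translate the enum into the platform signal.
    static String describe(SignalContainerRequest req) {
        return switch (req.signal()) {
            case QUIT -> "SIGQUIT -> " + req.containerId() + " (thread dump, container keeps running)";
            case TERM -> "SIGTERM -> " + req.containerId() + " (graceful stop)";
            case KILL -> "SIGKILL -> " + req.containerId() + " (forced stop)";
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new SignalContainerRequest("container_01", ContainerSignal.QUIT)));
    }
}
```

A single request type like this covers both the jstack-on-timeout case (QUIT before a later KILL) and plain stops, which is why folding the signal into the existing request was seen as less confusing than a separate call.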
[jira] [Resolved] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov resolved YARN-677. -- Resolution: Won't Fix
> Increase coverage to FairScheduler
> ---
>
>                 Key: YARN-677
>                 URL: https://issues.apache.org/jira/browse/YARN-677
>             Project: Hadoop YARN
>          Issue Type: Test
>    Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6
>            Reporter: Vadim Bondarev
>            Assignee: Andrey Klochkov
>         Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n4.patch Attaching a patch which implements Ctrl+C and uses it instead of Signal.TERM on Windows. I tested it only by manually invoking winutils.exe; I never managed to start Hadoop itself on Windows, although I tried hard. I don't think it makes much sense to split this into two patches for this JIRA and MAPREDUCE-5387. One problem I am not able to solve is the case when a batch script is used to start a container. Using console handlers in that case leads to the batch script waiting on "Terminate batch job? (Y/N)". Even if I know that a particular process in the Job Object is a batch script, I can't avoid sending the console event to it. This may be a problem in scenarios where QUIT/TERM signals are not followed later by KILL, and the process would not exit normally as it should. So the question is: is KILL used in all cases when containers are stopped? Please advise.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n6.patch With the 1st option it's not clear how to implement protection from leaks: there is no event which can be used to check for leaks in that case. At the same time, current YARN behavior does not support containers surviving after the AM has finished, so the 2nd option is acceptable. This may need to change once there is support for long-lived apps and attempts which stay alive after the AM is stopped. Attaching a patch which implements option #2 and adds a test for it.
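Option #2, finalizing usage when the attempt finishes, can be sketched as an accumulator that folds each completed container into a running total and freezes once the attempt is done. The names below are illustrative assumptions, not the classes from the attached patch.

```java
// Illustrative accumulator for per-attempt memory-seconds, finalized when
// the attempt finishes (option #2 above). Hypothetical names, not the patch.
public class AttemptUsageSketch {
    private long memoryMbSeconds;
    private boolean finished;

    // Called when a container completes: add reservedMb * lifetime seconds.
    public synchronized void containerFinished(long reservedMb, long lifetimeSeconds) {
        if (!finished) {
            memoryMbSeconds += reservedMb * lifetimeSeconds;
        }
    }

    // Called once when the attempt finishes; usage is frozen afterwards,
    // so late or duplicate container events cannot leak into the total.
    public synchronized void attemptFinished() {
        finished = true;
    }

    public synchronized long getMemoryMbSeconds() {
        return memoryMbSeconds;
    }

    public static void main(String[] args) {
        AttemptUsageSketch usage = new AttemptUsageSketch();
        usage.containerFinished(1024, 10); // 10240 MB-seconds
        usage.containerFinished(2048, 5);  // +10240 MB-seconds
        usage.attemptFinished();
        usage.containerFinished(512, 100); // late event after finish: ignored
        System.out.println(usage.getMemoryMbSeconds()); // 20480
    }
}
```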
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-trunk--n5.patch YARN-465-branch-2--n5.patch Ravi, the main method in WebAppProxyServer is blocking, and it installs an exception handler which calls System.exit, so it shouldn't be used in tests. Yes, it was a mistake that join() was removed without making the corresponding change in main(); I fixed that. You're also correct about the originalPort logging: I got rid of the confusing port variable and fixed the logging. Also, I made the patches for branch-2 and trunk as similar as possible. Attaching updated patches.
> fix coverage org.apache.hadoop.yarn.server.webproxy
> ---
>
>                 Key: YARN-465
>                 URL: https://issues.apache.org/jira/browse/YARN-465
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha
>            Reporter: Aleksey Gorshkov
>            Assignee: Andrey Klochkov
>         Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch
>
> fix coverage org.apache.hadoop.yarn.server.webproxy
> patch YARN-465-trunk.patch for trunk
> patch YARN-465-branch-2.patch for branch-2
> patch YARN-465-branch-0.23.patch for branch-0.23
> There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands:
> mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy
> touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788623#comment-13788623 ] Andrey Klochkov commented on YARN-465: -- Actually the only difference is in how HttpServer is instantiated, as in trunk the constructor was deprecated in favor of HttpServer.Builder. It seems this constructor is deprecated in branch-2 as well, so yes, let's apply the trunk patch to both branches. Sorry I didn't notice that earlier.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n5.patch Fixing the failed test.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786506#comment-13786506 ] Andrey Klochkov commented on YARN-445: -- The large diffs in the tests are not due to reformatting but because of refactoring needed to implement an additional test without lots of copy/paste.
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786572#comment-13786572 ] Andrey Klochkov commented on YARN-445: -- Steve, the current implementation will send the signal to the java process started by bin/hbase, as it sends it to all processes in the job object, i.e. all processes of the main container process. It could be replaced with sending the signal to all processes in the group instead, and I think the behavior would be the same. BTW, I don't know how to do the opposite on Windows, i.e. how to avoid sending the signal to all processes of the container (so the behavior on Linux is different, as bin/hbase will receive the signal). I think this is fine as long as the difference is documented. In the case of hbase, the shell script can install a custom hook for SIGTERM and do whatever is needed (e.g. send SIGTERM to the java process it started). There is one caveat in Ctrl+Break handling when a batch file starts a java process: 1. The batch file starts the java process. 2. The user sends Ctrl+Break to all processes in the group (or job object); the java process prints a thread dump; the batch file doesn't react yet. 3. The java process completes successfully. 4. The batch file will not exit; it will print "Terminate batch job? (Y/N)" because it received the Ctrl+Break signal earlier. The only way I see to overcome this problem with batch file processes is to identify them somehow (by executable name?) when walking through the processes in the job object, and not send them the signal. Sending Ctrl+Break to batch file processes doesn't make sense anyway, as in newer Windows there's no way to disable or customize Ctrl+Break handling in batch files.
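The workaround proposed in the last paragraph, skipping batch-interpreter processes when walking the job object, can be sketched independently of the Win32 calls. The executable-name predicate below is the hypothetical part; a real implementation would query each PID's image name via the Win32 API before sending the console event.

```java
import java.util.List;
import java.util.Locale;

// Sketch of the filtering idea above: when delivering a console event to all
// processes in a job object, skip cmd.exe so batch scripts never receive
// Ctrl+Break (and never hang on "Terminate batch job? (Y/N)").
public class SignalFilterSketch {

    // Hypothetical predicate on the process image name.
    static boolean shouldSignal(String imageName) {
        String name = imageName.toLowerCase(Locale.ROOT);
        return !name.endsWith("cmd.exe");
    }

    public static void main(String[] args) {
        // Pretend these are the image names of the processes in the job object.
        List<String> jobProcesses = List.of(
            "C:\\Windows\\System32\\cmd.exe",
            "C:\\java\\bin\\java.exe");
        jobProcesses.stream()
            .filter(SignalFilterSketch::shouldSignal)
            .forEach(p -> System.out.println("signal " + p)); // only java.exe is signaled
    }
}
```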
[jira] [Assigned] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-1183: - Assignee: Andrey Klochkov
> MiniYARNCluster shutdown takes several minutes intermittently
> ---
>
>                 Key: YARN-1183
>                 URL: https://issues.apache.org/jira/browse/YARN-1183
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Andrey Klochkov
>            Assignee: Andrey Klochkov
>         Attachments: YARN-1183--n2.patch, YARN-1183--n3.patch, YARN-1183--n4.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it affects the performance of tests as app masters eat resources.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-trunk--n4.patch Ravi, this is not my patch, so please keep in mind I'm digging into this code just as you are. Alexey won't be available to make fixes, so I'm taking this on so the contribution isn't lost. 1-2. As I see it, the WebAppProxy.start() method is used in another test, so that should be the reason it's not part of the main method. The join method is removed as it's not used anymore. 3. I think it is meant to log port, not originalPort. The port variable is set in WebAppProxyForTest.start() to the actual port which the server binds to. 4. Indeed, core-default.xml is not needed. I'm replacing it with making this configuration in the code of the test itself. 5. It must be setName("proxy") as this is the name of the webapp under hadoop-yarn-common/src/main/resources/webapps; setting it to anything else would lead to a ClassNotFoundException. I made the message about the port number more detailed. 6. I added a check which verifies that the cookie is present in one case and absent in another. 7. Yes, I don't see why testWebAppProxyServer is needed in the presence of testWebAppProxyServlet. Removing. 8. testWebAppProxyServerMainMethod tests that the server starts successfully; the counter is used to wait for the server to start. Attaching the updated patch for trunk.
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-branch-2--n4.patch Attaching the updated patch for branch-2.
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n4.patch Jason, thanks for the thorough review. Attaching the patch with fixes. I basically made all the fixes you proposed except the last one, about capturing the leak.
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n3.patch Attaching the patch that marks all new interfaces/methods as unstable. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445--n3.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785381#comment-13785381 ] Andrey Klochkov commented on YARN-465: -- The robot failed when testing the branch-2 patch against trunk; this is expected. fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-677: Assignee: Andrey Klochkov Increase coverage to FairScheduler -- Key: YARN-677 URL: https://issues.apache.org/jira/browse/YARN-677 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6 Reporter: Vadim Bondarev Assignee: Andrey Klochkov Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-677) Increase coverage to FairScheduler
[ https://issues.apache.org/jira/browse/YARN-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785407#comment-13785407 ] Andrey Klochkov commented on YARN-677: -- I looked at the difference in coverage before and after the patch. There are 2 test methods added: 1. testSchedulerHandleFailWithExternalEvents checks that FairScheduler.handle() throws RuntimeException when supplied with a wrong event type. The actual check is missing, so it seems the test will pass in any case. This is a very minor addition to the coverage. If we want to keep it, I can add the check and update the patch. 2. testAggregateCapacityTrackingWithPreemptionEnabled -- not sure about the intention. I see that it adds coverage to the FairScheduler.preemptTasksIfNecessary() method, but basically it just sleeps so that the method is invoked; preemption never happens and the test doesn't make any checks. I think we can skip this one. Should we keep #1? Increase coverage to FairScheduler -- Key: YARN-677 URL: https://issues.apache.org/jira/browse/YARN-677 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.6 Reporter: Vadim Bondarev Assignee: Andrey Klochkov Attachments: HADOOP-4536-branch-2-a.patch, HADOOP-4536-branch-2c.patch, HADOOP-4536-trunk-a.patch, HADOOP-4536-trunk-c.patch, HDFS-4536-branch-2--N7.patch, HDFS-4536-branch-2--N8.patch, HDFS-4536-branch-2-N9.patch, HDFS-4536-trunk--N6.patch, HDFS-4536-trunk--N7.patch, HDFS-4536-trunk--N8.patch, HDFS-4536-trunk-N9.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
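The assertion missing from point 1 could be added along these lines. Handler and UnsupportedEvent are hypothetical stand-ins for FairScheduler and the external test event, since the actual patch code is not shown here:

```java
// Sketch of the check missing from testSchedulerHandleFailWithExternalEvents:
// feed the handler an unsupported event type and require a RuntimeException.
// Handler and UnsupportedEvent are stand-ins, not the real FairScheduler code.
public class HandleFailCheck {
    interface Event {}
    static class UnsupportedEvent implements Event {}

    static class Handler {
        void handle(Event e) {
            // Mimics a scheduler falling through to an error for unknown types.
            throw new RuntimeException("Unknown event type: " + e.getClass().getName());
        }
    }

    /** True only if handle() actually threw, mirroring the check the test lacks. */
    static boolean throwsOnUnknownEvent() {
        try {
            new Handler().handle(new UnsupportedEvent());
            return false; // without this branch, the test would pass vacuously
        } catch (RuntimeException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOnUnknownEvent()); // true
    }
}
```

Without the failure branch after the handle() call, the test passes whether or not the exception is thrown, which is exactly the vacuous-pass problem described in the comment.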
[jira] [Assigned] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-465: Assignee: Andrey Klochkov (was: Aleksey Gorshkov) fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784199#comment-13784199 ] Andrey Klochkov commented on YARN-445: -- Bikas, on Windows the JVM prints a full thread dump on ctrl+break. I think ctrl+c may be emulated in the same way and used in place of TERM on Windows, via the same signalContainers API. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445--n2.patch Fixing javadoc warnings and the failed test. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784292#comment-13784292 ] Andrey Klochkov commented on YARN-445: -- As I understand it, this Findbugs warning should be ignored, as it's complaining about a valid type cast. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Jason Lowe Attachments: YARN-445--n2.patch, YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-465: - Attachment: YARN-465-branch-2--n3.patch YARN-465-trunk--n3.patch Attaching updated patches. setAccessible usage is removed. fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Aleksey Gorshkov Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy patch YARN-465-trunk.patch for trunk patch YARN-465-branch-2.patch for branch-2 patch YARN-465-branch-0.23.patch for branch-0.23 There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-445: - Attachment: YARN-445.patch Attaching a patch that provides the simplest implementation: - winutils is extended with an additional routine that uses console control handlers to emulate ctrl+break on the container. For Java containers it roughly corresponds to the QUIT signal on Linux. - ContainerManagerProtocol is extended with a signalContainers() method which accepts a signal number to send. Currently the implementation accepts the QUIT signal (i.e. value 3) only and rejects the request otherwise. - TestContainerManager is extended accordingly and executed successfully under Windows, OSX and Linux. This provides a simple implementation that would allow troubleshooting containers without killing them, as the initial description of the feature states. If needed we may extract an additional Jira to extend this further by allowing an arbitrary map of commands to be provided in the submission context and then invoked through the NM API. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Attachments: YARN-445.patch It would be nice if an ApplicationMaster could send signals to containers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. 
-- This message was sent by Atlassian JIRA (v6.1#6144)
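The signal validation described in that patch comment (only QUIT, i.e. signal number 3, is accepted; anything else is rejected) amounts to a check like the following. The class is a hypothetical stand-in for illustration, not the actual ContainerManagerProtocol or winutils code:

```java
// Sketch of the validation described in the YARN-445 patch comment:
// signalContainers() accepts only SIGQUIT (signal number 3) for now and
// rejects any other signal. SignalValidator is an illustrative name only.
public class SignalValidator {
    static final int SIGQUIT = 3; // triggers a JVM thread dump without killing it

    /** Mirrors the described behavior: only QUIT is allowed for now. */
    static boolean isSupported(int signal) {
        return signal == SIGQUIT;
    }

    public static void main(String[] args) {
        System.out.println(isSupported(3)); // true
        System.out.println(isSupported(9)); // false: SIGKILL is rejected
    }
}
```

Restricting the API to SIGQUIT keeps the first patch minimal while leaving room for the arbitrary-command extension floated at the end of the comment.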
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773419#comment-13773419 ] Andrey Klochkov commented on YARN-415: -- The proposed implementation uses events fired by the scheduler to track resource usage, so we start ticking as soon as a container is allocated by the scheduler and stop doing that when the container is completed and the scheduler gets the resources back. Hence, in case a container fails to start for some reason, we'll stop ticking as soon as the RM gets this reported. As for the gap between when a container actually finishes and when the RM gets the report, we don't manage it, i.e. the client will be charged until the RM gets the report. Start time and finish time are both computed by the scheduler, i.e. it's on the RM side. Not sure about rounding off - can you point me to the code which does that? I think we just use what's provided in the ApplicationSubmissionContext, i.e. it shouldn't be rounded off. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415.patch The patch exposes MB-seconds and CPU-seconds through CLI, REST API and UI. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n3.patch Updating the patch with fixes in tests. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Attachment: YARN-415--n2.patch Updating the patch with fixes of findbugs warnings on multithreaded correctness Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769772#comment-13769772 ] Andrey Klochkov commented on YARN-261: -- On a closer look it is indeed possible to reuse existing events instead of introducing new logic. Will simplify the patch. Xuan, thanks for the suggestion. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov reassigned YARN-415: Assignee: Andrey Klochkov Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n4.patch This patch implements this with fewer changes in state machines -- it doesn't introduce new transitions in the AppEvent state machine. It still introduces a new event type and adds transitions to the App state machine. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770160#comment-13770160 ] Andrey Klochkov commented on YARN-261: -- Seems that the reported test failures are all caused by java.lang.OutOfMemoryError: unable to create new native thread, shouldn't be relevant to the changes in the patch. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: YARN-261--n2.patch, YARN-261--n3.patch, YARN-261--n4.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261.patch Attaching a patch implementing application restart feature. Effectively, it's what is described here: a current attempt is killed and a new one is started. It'll work only if there are attempts left and recovery is possible - exactly the same conditions which are used when deciding whether to start a new attempt after a failure in the previous one. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n2.patch Updated the patch to prevent the exception in the log when processing RMAppEventType.ATTEMPT_KILLED. As for failing an attempt instead of adding additional logic -- well, the patch itself consists of: 1) CLI modifications 2) adding the API call and lots of boilerplate needed for that 3) replicating AttemptFailedTransaction into a modified version named AppRestartedTransaction (with different diagnostics) 4) modifying state machines. The 4th part is needed precisely to avoid firing RMAppEventType.ATTEMPT_KILLED. I missed that part in the previous patch. Ability to kill AM attempts --- Key: YARN-261 URL: https://issues.apache.org/jira/browse/YARN-261 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.0.3-alpha Reporter: Jason Lowe Attachments: YARN-261--n2.patch, YARN-261.patch It would be nice if clients could ask for an AM attempt to be killed. This is analogous to the task attempt kill support provided by MapReduce. This feature would be useful in a scenario where AM retries are enabled, the AM supports recovery, and a particular AM attempt is stuck. Currently if this occurs the user's only recourse is to kill the entire application, requiring them to resubmit a new application and potentially breaking downstream dependent jobs if it's part of a bigger workflow. Killing the attempt would allow a new attempt to be started by the RM without killing the entire application, and if the AM supports recovery it could potentially save a lot of work. It could also be useful in workflow scenarios where the failure of the entire application kills the workflow, but the ability to kill an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
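The fourth item, routing a user-initiated restart through its own transition instead of the failure path, can be illustrated with a minimal toy state machine. This is not YARN's actual StateMachineFactory code; the states, events, and diagnostics strings here are invented for illustration.

```java
// Toy state machine illustrating the idea: a user-initiated restart event
// takes its own transition (with its own diagnostics) rather than the
// attempt-failure transition. Not actual YARN code.
public final class ToyAppStateMachine {
    public enum State { RUNNING, RESTARTING, FAILED }
    public enum Event { ATTEMPT_FAILED, APP_RESTARTED }

    private State state = State.RUNNING;
    private String diagnostics = "";

    public State handle(Event event) {
        switch (event) {
            case ATTEMPT_FAILED:
                // Failure path: failure diagnostics, failed state.
                diagnostics = "AM attempt failed";
                state = State.FAILED;
                break;
            case APP_RESTARTED:
                // Restart path: separate transition with separate
                // diagnostics; no failure-style event is fired.
                diagnostics = "AM attempt killed at the user's request";
                state = State.RESTARTING;
                break;
        }
        return state;
    }

    public String getDiagnostics() { return diagnostics; }

    public static void main(String[] args) {
        ToyAppStateMachine sm = new ToyAppStateMachine();
        sm.handle(Event.APP_RESTARTED);
        System.out.println(sm.getDiagnostics());
    }
}
```

The point of replicating the transition rather than reusing it is that the two paths can diverge freely later (different diagnostics now, possibly different retry accounting later) without touching the failure path.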
[jira] [Updated] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-261: - Attachment: YARN-261--n3.patch

Fixing Javadoc warning.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n2.patch

Attaching an updated patch. Renamed the wait method and changed the way it gets notified when app masters register/unregister, so that ApplicationAttemptId is now used as the key.

> MiniYARNCluster shutdown takes several minutes intermittently
> -
>
> Key: YARN-1183
> URL: https://issues.apache.org/jira/browse/YARN-1183
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Andrey Klochkov
> Attachments: YARN-1183--n2.patch, YARN-1183.patch
>
> As described in MAPREDUCE-5501, M/R tests sometimes leave MRAppMaster Java processes alive for several minutes after successful completion of the corresponding test. A concurrency issue in the MiniYARNCluster shutdown logic leads to this: sometimes the RM stops before an app master sends its last report, and the app master then keeps retrying for 6 minutes. In some cases this leads to failures in subsequent tests, and it hurts test performance because the lingering app masters consume resources.
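The notification scheme described above can be sketched as a small monitor that tracks live app masters by attempt id and lets shutdown block until all of them have unregistered. This is a hedged sketch, not the actual MiniYARNCluster code; a plain String stands in for ApplicationAttemptId.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the idea in the patch: track registered app masters keyed by
// attempt id, and let cluster shutdown wait until every one of them has
// unregistered. Not the actual MiniYARNCluster implementation.
public final class AppMasterTracker {
    // Attempt ids of app masters that registered but have not yet unregistered.
    private final Set<String> liveAttempts = new HashSet<>();

    public synchronized void registered(String attemptId) {
        liveAttempts.add(attemptId);
    }

    public synchronized void unregistered(String attemptId) {
        liveAttempts.remove(attemptId);
        notifyAll();  // wake any thread blocked in waitForAllFinished()
    }

    /**
     * Block until all registered app masters have unregistered or the
     * timeout elapses. Returns true if all finished in time.
     */
    public synchronized boolean waitForAllFinished(long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!liveAttempts.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;  // timed out; some app masters are still live
            }
            wait(remaining);
        }
        return true;
    }
}
```

Keying on the attempt id rather than, say, the tracking URL matters because the URL can change over an attempt's lifetime, while the attempt id is stable; the guarded-wait loop also tolerates spurious wakeups.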
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766237#comment-13766237 ] Andrey Klochkov commented on YARN-1183: ---

bq. MiniYARNCluster is used by several tests. This might bite us if and when we run tests in parallel.

The concurrency level won't make any difference even then. BTW, I'm actually running the MR tests in parallel now; that's when this issue with the cluster shutdown misbehaving becomes more evident. Thanks for catching the problem with the synchronized block, fixing it.
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n4.patch
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183--n3.patch

Attaching an updated patch.
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765735#comment-13765735 ] Andrey Klochkov commented on YARN-1183: ---

Yep, you're right about trackingUrl being able to change. Please disregard this part.
[jira] [Created] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
Andrey Klochkov created YARN-1183: - Summary: MiniYARNCluster shutdown takes several minutes intermittently Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov
[jira] [Updated] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-1183: -- Attachment: YARN-1183.patch

Attaching a patch which modifies MiniYARNCluster so that it waits until all app masters are reported as finished.