[jira] [Updated] (YARN-2934) Improve handling of container's stderr

2015-11-08 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2934:

Attachment: YARN-2934.v1.004.patch

Thanks for the comments [~nijel],
bq. What if this give multiple error files ?
Currently I am working under the assumption that the logs need to be fetched 
only from stderr, so there is little possibility of multiple files; but if the 
app is submitted with wrongly configured file names, then the first file 
matching the pattern is picked.

The other comment about the error message has been corrected. I also discussed 
with [~rohithsharma], and we felt that a regex is better because multiple 
patterns can be specified in a single expression, which can help in a future 
multiple-apps scenario.
Uploading a patch with the *Regex Pattern* approach.
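
To illustrate the approach described above (a sketch only, not the patch code; the regex value, class, and method names here are hypothetical), picking the first log file whose name matches a single configurable regex could look like this:

{code:java}
// Minimal sketch (not the patch itself): pick the first log file whose name
// matches a configurable regex, e.g. "stderr.*|error\\.log" (hypothetical value).
import java.io.File;
import java.util.regex.Pattern;

public class ErrorFilePicker {
  public static File findFirstMatching(File containerLogDir, String regex) {
    Pattern pattern = Pattern.compile(regex);
    File[] files = containerLogDir.listFiles();
    if (files == null) {
      return null; // directory missing or unreadable
    }
    for (File f : files) {
      if (f.isFile() && pattern.matcher(f.getName()).matches()) {
        return f; // first match wins, as described above
      }
    }
    return null;
  }
}
{code}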

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.





[jira] [Commented] (YARN-4241) Typo in yarn-default.xml

2015-11-08 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996107#comment-14996107
 ] 

Tsuyoshi Ozawa commented on YARN-4241:
--

[~anth...@cloudera.com] the patch no longer applies cleanly. Could you rebase 
it against trunk?

> Typo in yarn-default.xml
> 
>
> Key: YARN-4241
> URL: https://issues.apache.org/jira/browse/YARN-4241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, yarn
>Reporter: Anthony Rojas
>Assignee: Anthony Rojas
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-4241.002.patch, YARN-4241.patch, YARN-4241.patch.1
>
>
> Typo in description section of yarn-default.xml, under the properties:
> yarn.nodemanager.disk-health-checker.min-healthy-disks
> yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
> yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
> yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage
> The reference to yarn-nodemanager.local-dirs should be 
> yarn.nodemanager.local-dirs





[jira] [Updated] (YARN-4241) Typo in yarn-default.xml

2015-11-08 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-4241:
-
Description: 
Typo in description section of yarn-default.xml, under the properties:

yarn.nodemanager.disk-health-checker.min-healthy-disks
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage

The reference to yarn-nodemanager.local-dirs should be 
yarn.nodemanager.local-dirs


  was:
Typo in description section of yarn-default.xml, under the properties:

yarn.nodemanager.disk-health-checker.min-healthy-disks
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb

The reference to yarn-nodemanager.local-dirs should be 
yarn.nodemanager.local-dirs



> Typo in yarn-default.xml
> 
>
> Key: YARN-4241
> URL: https://issues.apache.org/jira/browse/YARN-4241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, yarn
>Reporter: Anthony Rojas
>Assignee: Anthony Rojas
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-4241.002.patch, YARN-4241.patch, YARN-4241.patch.1
>
>
> Typo in description section of yarn-default.xml, under the properties:
> yarn.nodemanager.disk-health-checker.min-healthy-disks
> yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
> yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
> yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage
> The reference to yarn-nodemanager.local-dirs should be 
> yarn.nodemanager.local-dirs





[jira] [Commented] (YARN-4320) TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-08 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996084#comment-14996084
 ] 

Tsuyoshi Ozawa commented on YARN-4320:
--

Thanks for committing this to branch-2.6, Sangjin!

> TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4320
> URL: https://issues.apache.org/jira/browse/YARN-4320
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Fix For: 3.0.0, 2.8.0, 2.7.2, 2.6.3
>
> Attachments: YARN-4320.01.patch
>
>
> {noformat}
> Running org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.256 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler
> testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler)
>   Time elapsed: 35.764 sec  <<< ERROR!
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
>   at com.sun.jersey.api.client.Client.handle(Client.java:648)
>   at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
>   at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
>   at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:474)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:320)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:320)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:305)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1015)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:586)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.handleEvent(TestJobHistoryEventHandler.java:719)
>   at 
> org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:507)
> {noformat}
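
As a side note on the failure mode (an illustration of the general pitfall, not the committed fix): a test that assumes the default timeline port 8188 breaks once the mini cluster binds elsewhere, so the bound address should be read from the cluster's configuration instead. A rough sketch, assuming MiniYARNCluster publishes the actual address under the standard key:

{code:java}
// Hypothetical sketch, not the committed patch: instead of assuming the
// default port 8188, read the address the mini cluster actually bound to.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class TimelineAddressLookup {
  public static String timelineWebAppAddress(MiniYARNCluster cluster) {
    Configuration conf = cluster.getConfig();
    // Assumption: the mini cluster publishes its real bound address here.
    return conf.get(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS);
  }
}
{code}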





[jira] [Commented] (YARN-4324) AM hang more than 10 min was kill by RM

2015-11-08 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996066#comment-14996066
 ] 

Rohith Sharma K S commented on YARN-4324:
-

When the AM is launched and registered, the RM expects it to send heartbeats in 
a timely manner. If the RM does not receive a heartbeat from the AM for a 
certain time (10 minutes by default), the RM kills the AM. This is expected 
behavior.

In your scenario, try to find which expiry occurred: AM heartbeat expiry or 
container expiry. You will find this information in the ResourceManager log. 
Was the NodeManager restarted? If yes, and NM recovery is not enabled, then 
this is expected behavior.
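
For reference on the 10-minute default mentioned above, a minimal sketch (using the standard YarnConfiguration constants; the surrounding class is only illustrative) of reading the expiry interval the RM applies:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmExpiryCheck {
  public static void main(String[] args) {
    // The RM-side AM liveness expiry; 600000 ms (10 minutes) unless overridden
    // via yarn.am.liveness-monitor.expiry-interval-ms in yarn-site.xml.
    Configuration conf = new YarnConfiguration();
    long amExpiryMs = conf.getLong(
        YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS);
    System.out.println("AM heartbeat expiry (ms): " + amExpiryMs);
  }
}
{code}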

> AM hang more than 10 min was kill by RM
> ---
>
> Key: YARN-4324
> URL: https://issues.apache.org/jira/browse/YARN-4324
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: tangshangwen
>
> this is my logs
> 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865
> 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING   
> 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1446203652278_135526_m_001777_1 TaskAttempt Transitioned from UNASSIGNED to KILLED
> 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing 
> the event EventType: JOB_COMMIT  
> 2015-11-02 01:24:15,851 INFO [Thread-1] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a 
> signal. Signaling RMCommunicator and JobHistoryEventHandler.
> 2015-11-02 01:24:15,851 INFO [Thread-1] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator 
> notified that iSignalled is: true
> 2015-11-02 01:24:15,851 INFO [Thread-1] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator 
> isAMLastRetry: true
> the Hive map progress reached 100%, then dropped back to map 0%, and the job failed!





[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-08 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Attachment: 3946WebImages.zip
YARN-3946.v1.002.patch

Thanks for the quick feedback [~wangda]
bq. AM launch diagnostics should have an intial value after added to scheduler: 
...
I initially thought of adding this message, but the problem is that 
{{LeafQueue.activateApplications}} is called immediately by the 
CapacityScheduler in {{addApplicationAttempt}}, so the message would be 
replaced very quickly and the initial message would not be helpful. I have, 
however, ensured that the related details are captured. Thoughts?

bq. Not caused by your patch, isWaitingForAMContainer checks if master container 
created, you may also need to check if application is in recover state or not. 
Because AM could contact to RM before AM container recovered by RM.
I am not sure I understood this correctly:
# ??AM could contact to RM before AM container recovered by RM?? I failed to 
understand the impact of this. All the required information is restored from 
the RMStateStore ({{RMAppAttemptImpl.recover(RMState)}} sets the master 
container from the store), so after the services are started the AM heartbeat 
may well arrive earlier than the NM heartbeat, but what impact could that have? 
Correct me if my understanding is wrong!
# ??check if application is in recover state or not?? I am not sure how to do 
this, if it is required; I went through RMAppAttemptImpl and RMAppImpl, and 
there are no such methods or internal state that could expose this. Maybe I am 
missing something here.

bq. Suggest to add to REST API / web UI together with this patch if changes are 
not complex.
The earlier implementation also captured this as part of 
attempt.getDiagnostics, so it will be available in all the interfaces.

The other comments have been handled, and I have attached the web images.

[~steve_l],
bq. I'd like to see this in application reports, so that client-side 
applications can display the details
This has been taken care of in this patch.
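
As a rough illustration of the client-side consumption [~steve_l] asks for (standard YarnClient API; the surrounding class and argument handling are hypothetical), the diagnostics can be read straight from the application report:

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ShowAppDiagnostics {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // args[0] is an application id, e.g. "application_1446203652278_135526"
      ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
      ApplicationReport report = client.getApplicationReport(appId);
      // With this change, the ACCEPTED-state reason should surface here too.
      System.out.println(report.getYarnApplicationState()
          + ": " + report.getDiagnostics());
    } finally {
      client.stop();
    }
  }
}
{code}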


> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, 
> YARN-3946.v1.002.patch, YARN3946_attemptDiagnistic message.png
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.





[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-11-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995855#comment-14995855
 ] 

Hadoop QA commented on YARN-3980:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 17s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 6s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 15s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 15s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 57s 
{color} | {color:red} Patch generated 7 new checkstyle issues in root (total 
was 402, now 408). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 1s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 62m 51s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 49s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 1s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
23s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 160m 49s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem |

[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-11-08 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v6.patch

Fixed the javadoc style issue.
The checkstyle errors look pedantic to me; let me know if you want me to fix 
any, but they don't seem very reasonable.
The failed unit tests do not appear to be related.

> Plumb resource-utilization info in node heartbeat through to the scheduler
> --
>
> Key: YARN-3980
> URL: https://issues.apache.org/jira/browse/YARN-3980
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Karthik Kambatla
>Assignee: Inigo Goiri
> Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, 
> YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch, 
> YARN-3980-v5.patch, YARN-3980-v6.patch
>
>
> YARN-1012 and YARN-3534 collect resource utilization information for all 
> containers and the node respectively and send it to the RM on node heartbeat. 
> We should plumb it through to the scheduler so the scheduler can make use of 
> it. 
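
For context on what is being plumbed through, a small sketch of the per-node utilization record carried on the heartbeat (ResourceUtilization is the existing YARN record; the values and the wrapper class are made up for illustration):

{code:java}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

public class UtilizationExample {
  public static void main(String[] args) {
    // Illustrative values only: physical memory (MB), virtual memory (MB),
    // and CPU as a fraction of the node's vcores.
    ResourceUtilization nodeUtil = ResourceUtilization.newInstance(4096, 8192, 0.75f);
    System.out.println("pmem=" + nodeUtil.getPhysicalMemory()
        + "MB vmem=" + nodeUtil.getVirtualMemory()
        + "MB cpu=" + nodeUtil.getCPU());
  }
}
{code}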


