date:20150721

[jira] [Updated] (YARN-3852) Add docker container support to container-executor

2015-07-21 Thread Abin Shahab (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abin Shahab updated YARN-3852:
--
Attachment: YARN-3852-2.patch

> Add docker container support to container-executor 
> ---
>
> Key: YARN-3852
> URL: https://issues.apache.org/jira/browse/YARN-3852
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Sidharta Seethana
>Assignee: Abin Shahab
> Attachments: YARN-3852-1.patch, YARN-3852-2.patch, YARN-3852.patch
>
>
> For security reasons, we need to ensure that access to the docker daemon and 
> the ability to run docker containers is restricted to privileged users ( i.e 
> users running applications should not have direct access to docker). In order 
> to ensure the node manager can run docker commands, we need to add docker 
> support to the container-executor binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins

2015-07-21 Thread Brahma Reddy Battula (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636369#comment-14636369
 ] 

Brahma Reddy Battula commented on YARN-3528:


Yes, I was in leave for these days..Do you have any other comments apart from 
[~varun_saxena].

> Tests with 12345 as hard-coded port break jenkins
> -
>
> Key: YARN-3528
> URL: https://issues.apache.org/jira/browse/YARN-3528
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
>Priority: Blocker
>  Labels: test
> Attachments: YARN-3528-002.patch, YARN-3528.patch
>
>
> A lot of the YARN tests have hard-coded the port 12345 for their services to 
> come up on.
> This makes it impossible to have scheduled or precommit tests to run 
> consistently on the ASF jenkins hosts. Instead the tests fail regularly and 
> appear to get ignored completely.
> A quick grep of "12345" shows up many places in the test suite where this 
> practise has developed.
> * All {{BaseContainerManagerTest}} subclasses
> * {{TestNodeManagerShutdown}}
> * {{TestContainerManager}}
> + others
> This needs to be addressed through portscanning and dynamic port allocation. 
> Please can someone do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-451) Add more metrics to RM page

2015-07-21 Thread Joep Rottinghuis (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636358#comment-14636358
 ] 

Joep Rottinghuis commented on YARN-451:
---

Just for the record, at Twitter we've been running with YARN-2417 in production 
and are finding it very useful in clusters of many thousands of nodes with tens 
of thousands of jobs in a day.

> Add more metrics to RM page
> ---
>
> Key: YARN-451
> URL: https://issues.apache.org/jira/browse/YARN-451
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Sangjin Lee
> Attachments: in_progress_2x.png, yarn-451-trunk-20130916.1.patch
>
>
> ResourceManager webUI shows list of RUNNING applications, but it does not 
> tell which applications are requesting more resource compared to others. With 
> cluster running hundreds of applications at once it would be useful to have 
> some kind of metric to show high-resource usage applications vs low-resource 
> usage ones. At the minimum showing number of containers is good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3952) Fix new findbugs warnings in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3952:
---
Description: 
{noformat}









{noformat}

  was:
{noformat}






{noformat}


> Fix new findbugs warnings in resourcemanager in YARN-2928 branch
> 
>
> Key: YARN-3952
> URL: https://issues.apache.org/jira/browse/YARN-3952
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher'>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='79'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='76'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='73'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='67'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='70'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='82'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='85'/>
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3250) Support admin cli interface in for Application Priority

2015-07-21 Thread Sunil G (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3250:
--
Summary: Support admin cli interface in for Application Priority  (was: 
Support admin cli interface in Application Priority Manager (server side))

> Support admin cli interface in for Application Priority
> ---
>
> Key: YARN-3250
> URL: https://issues.apache.org/jira/browse/YARN-3250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
>
> Current Application Priority Manager supports only configuration via file. 
> To support runtime configurations for admin cli and REST, a common management 
> interface has to be added which can be shared with NodeLabelsManager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3176) In Fair Scheduler, child queue should inherit maxApp from its parent

2015-07-21 Thread Joep Rottinghuis (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636244#comment-14636244
 ] 

Joep Rottinghuis commented on YARN-3176:


For the record, we're running with this patch in production at Twitter.

> In Fair Scheduler, child queue should inherit maxApp from its parent
> 
>
> Key: YARN-3176
> URL: https://issues.apache.org/jira/browse/YARN-3176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: YARN-3176.v1.patch, YARN-3176.v2.patch
>
>
> if the child queue does not have a maxRunningApp limit, it will use the 
> queueMaxAppsDefault. This behavior is not quite right, since 
> queueMaxAppsDefault is normally a small number, whereas some parent queues do 
> have maxRunningApp set to be more than the default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-445) Ability to signal containers

2015-07-21 Thread Joep Rottinghuis (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636221#comment-14636221
 ] 

Joep Rottinghuis commented on YARN-445:
---

Can we rekindle this discussion? We've had folks ask how we're letting users 
debug their own containers at Twitter and the answer is that we're running with 
the patch supplied by Ming.

Giving the users a mechanism to jstack is absolutely awesome. In fact we're 
using a capability in our JVM that lets user do a perf record/perf report right 
from a link on the UI using the very same mechanism.

> Ability to signal containers
> 
>
> Key: YARN-445
> URL: https://issues.apache.org/jira/browse/YARN-445
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Reporter: Jason Lowe
>  Labels: BB2015-05-TBR
> Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, 
> YARN-445--n3.patch, YARN-445--n4.patch, 
> YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png
>
>
> It would be nice if an ApplicationMaster could send signals to contaniers 
> such as SIGQUIT, SIGUSR1, etc.
> For example, in order to replicate the jstack-on-task-timeout feature 
> implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
> interface for sending SIGQUIT to a container.  For that specific feature we 
> could implement it as an additional field in the StopContainerRequest.  
> However that would not address other potential features like the ability for 
> an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
> latter feature would be a very useful debugging tool for users who do not 
> have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3952) Fix new findbugs warning in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3952:
---
Issue Type: Sub-task  (was: Bug)
Parent: YARN-2928

> Fix new findbugs warning in resourcemanager in YARN-2928 branch
> ---
>
> Key: YARN-3952
> URL: https://issues.apache.org/jira/browse/YARN-3952
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher'>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='79'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='76'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='73'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='67'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='70'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='82'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='85'/>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3952) Fix new findbugs warnings in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3952:
---
Summary: Fix new findbugs warnings in resourcemanager in YARN-2928 branch  
(was: Fix new findbugs warning in resourcemanager in YARN-2928 branch)

> Fix new findbugs warnings in resourcemanager in YARN-2928 branch
> 
>
> Key: YARN-3952
> URL: https://issues.apache.org/jira/browse/YARN-3952
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher'>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='79'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='76'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='73'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='67'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='70'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='82'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='85'/>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3952) Fix new findbugs warning in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3952:
---
Description: 
{noformat}






{noformat}

  was:
{noformat}




> Fix new findbugs warning in resourcemanager in YARN-2928 branch
> ---
>
> Key: YARN-3952
> URL: https://issues.apache.org/jira/browse/YARN-3952
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher'>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='79'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='76'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='73'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='67'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='70'/>
>  message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='82'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='85'/>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3952) Fix new findbugs warning in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3952:
---
Description: 
{noformat}



> Fix new findbugs warning in resourcemanager in YARN-2928 branch
> ---
>
> Key: YARN-3952
> URL: https://issues.apache.org/jira/browse/YARN-3952
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher'>  type='BC_UNCONFIRMED_CAST' priority='Normal' category='STYLE' 
> message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='79'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='76'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='73'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='67'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent
>  in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='70'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerCreatedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='82'/> category='STYLE' message='Unchecked/unconfirmed cast from 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.ContainerFinishedEvent 
> in 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent)'
>  lineNumber='85'/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (YARN-3952) Fix new findbugs warning in resourcemanager in YARN-2928 branch

2015-07-21 Thread Varun Saxena (JIRA)

Varun Saxena created YARN-3952:
--

 Summary: Fix new findbugs warning in resourcemanager in YARN-2928 
branch
 Key: YARN-3952
 URL: https://issues.apache.org/jira/browse/YARN-3952
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: YARN-2928
Reporter: Varun Saxena
Assignee: Varun Saxena






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3874) Optimize and synchronize FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636204#comment-14636204
 ] 

Varun Saxena commented on YARN-3874:


Test failures are unrelated.
They are due to a recent commit which was reverted in main branch.
That revert has not yet come in YARN-2928 branch.

Findbugs in timelineservice are related. Primarily related to default encoding. 
Will fix.
The ones in resourcemanager are not related. Will raise an issue for it.

> Optimize and synchronize FS Reader and Writer Implementations
> -
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch, 
> YARN-3874-YARN-2928.02.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3903) Disable preemption at Queue level for Fair Scheduler

2015-07-21 Thread He Tianyi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636173#comment-14636173
 ] 

He Tianyi commented on YARN-3903:
-

The latter one.
I encountered the requirement that needs to prevent particular jobs from being 
preempted.
This can either be done at queue-level or job-level. IMHO queue level have the 
advantage of transparency over job-level.

> Disable preemption at Queue level for Fair Scheduler
> 
>
> Key: YARN-3903
> URL: https://issues.apache.org/jira/browse/YARN-3903
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0
> Environment: 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt2-1~bpo70+1 
> (2014-12-08) x86_64
>Reporter: He Tianyi
>Priority: Trivial
> Attachments: YARN-3093.1.patch, YARN-3093.2.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> YARN-2056 supports disabling preemption at queue level for CapacityScheduler.
> As for fair scheduler, we recently encountered the same need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell

2015-07-21 Thread Allen Wittenauer (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636134#comment-14636134
 ] 

Allen Wittenauer commented on YARN-3950:


This needs to be either YARN_SHELL_ID or HADOOP_SHELL_ID.  Having a 'naked' 
SHELL_ID pollutes the shell environment space.

> Add unique SHELL_ID environment variable to DistributedShell
> 
>
> Key: YARN-3950
> URL: https://issues.apache.org/jira/browse/YARN-3950
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications/distributed-shell
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-3950.001.patch
>
>
> As discussed in [this 
> comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027],
>  it would be useful to have a monotonically increasing and independent ID of 
> some kind that is unique per shell in the distributed shell program.
> We can do that by adding a SHELL_ID env var.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636122#comment-14636122
 ] 

Hadoop QA commented on YARN-3950:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 39s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 46s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 41s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 22s | The applied patch generated  2 
new checkstyle issues (total was 46, now 48). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 21s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 42s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m 57s | Tests passed in 
hadoop-yarn-applications-distributedshell. |
| | |  43m 27s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746452/YARN-3950.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 31f1171 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8602/artifact/patchprocess/diffcheckstylehadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8602/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8602/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8602/console |


This message was automatically generated.

> Add unique SHELL_ID environment variable to DistributedShell
> 
>
> Key: YARN-3950
> URL: https://issues.apache.org/jira/browse/YARN-3950
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications/distributed-shell
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-3950.001.patch
>
>
> As discussed in [this 
> comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027],
>  it would be useful to have a monotonically increasing and independent ID of 
> some kind that is unique per shell in the distributed shell program.
> We can do that by adding a SHELL_ID env var.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3874) Optimize and synchronize FS Reader and Writer Implementations

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636107#comment-14636107
 ] 

Hadoop QA commented on YARN-3874:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m 21s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   7m 49s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 48s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 13s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m 21s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   3m 58s | The patch appears to introduce 
13 new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 112m 28s | Tests passed in 
hadoop-mapreduce-client-jobclient. |
| {color:green}+1{color} | yarn tests |  14m  3s | Tests passed in 
hadoop-yarn-applications-distributedshell. |
| {color:red}-1{color} | yarn tests |  51m 42s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 30s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | | 221m 50s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
| FindBugs | module:hadoop-yarn-server-timelineservice |
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
|   | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates |
|   | hadoop.yarn.server.resourcemanager.TestApplicationCleanup |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
|   | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746423/YARN-3874-YARN-2928.02.patch
 |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | YARN-2928 / eb1932d |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-timelineservice.html
 |
| hadoop-mapreduce-client-jobclient test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8601/console |


This message was automatically generated.

> Optimize and synchronize FS Reader and Writer Implementations
> -
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch, 
> YARN-3874-YARN-2928.02.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.

[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-07-21 Thread Naganarasimha G R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636089#comment-14636089
 ] 

Naganarasimha G R commented on YARN-3045:
-

Test case failures and white space issue is not related to this patch. White 
space issue will try to rectify it with other review comments if any and for 
test case failure have raised YARN-3941, So either [~djp] or [~sjlee0] can 
further review this jira

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

yarn-issues@hadoop.apache.org

2015-07-21 Thread Naganarasimha G R (JIRA)

Naganarasimha G R created YARN-3951:
---

 Summary: Test case failures in TestLogAggregationService, 
TestResourceLocalizationService &TestContainer
 Key: YARN-3951
 URL: https://issues.apache.org/jira/browse/YARN-3951
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


Found some test case failures  in YARN-3045 build which were not related to 
YARN-3045 patch
TestContainer.testKillOnLocalizedWhenContainerNotLaunched
{quote}
java.lang.AssertionError: expected: but 
was:
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.TestContainer.testKillOnLocalizedWhenContainerNotLaunched(TestContainer.java:413)
{quote}

TestResourceLocalizationService.testLocalizationHeartbeat
{quote}
Wanted but not invoked:
eventHandler.handle(

);
-> at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testLocalizationHeartbeat(TestResourceLocalizationService.java:900)
Actually, there were zero interactions with this mock.
{quote}

TestResourceLocalizationService.testPublicResourceAddResourceExceptions
{quote}
java.lang.AssertionError: expected null, but was:<\{ \{ 
file:/local/PRIVATE/ef9783a7514fda92, 2411, FILE, null 
\},pending,\[(container_314159265358979_0003_01_42)\],2661055154305048,DOWNLOADING}>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotNull(Assert.java:664)
at org.junit.Assert.assertNull(Assert.java:646)
at org.junit.Assert.assertNull(Assert.java:656)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testPublicResourceAddResourceExceptions(TestResourceLocalizationService.java:1366)
{quote}

TestLogAggregationService.testLogAggregationCreateDirsFailsWithoutKillingNM
{quote}
org.mortbay.util.MultiException: Multiple exceptions
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.checkEvents(TestLogAggregationService.java:1046)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLogAggregationCreateDirsFailsWithoutKillingNM(TestLogAggregationService.java:736)
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell

2015-07-21 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-3950:

Attachment: YARN-3950.001.patch

> Add unique SHELL_ID environment variable to DistributedShell
> 
>
> Key: YARN-3950
> URL: https://issues.apache.org/jira/browse/YARN-3950
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications/distributed-shell
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-3950.001.patch
>
>
> As discussed in [this 
> comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027],
>  it would be useful to have a monotonically increasing and independent ID of 
> some kind that is unique per shell in the distributed shell program.
> We can do that by adding a SHELL_ID env var.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell

2015-07-21 Thread Robert Kanter (JIRA)

Robert Kanter created YARN-3950:
---

 Summary: Add unique SHELL_ID environment variable to 
DistributedShell
 Key: YARN-3950
 URL: https://issues.apache.org/jira/browse/YARN-3950
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


As discussed in [this 
comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027],
 it would be useful to have a monotonically increasing and independent ID of 
some kind that is unique per shell in the distributed shell program.

We can do that by adding a SHELL_ID env var.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles

2015-07-21 Thread Varun Vasudev (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636068#comment-14636068
 ] 

Varun Vasudev commented on YARN-3926:
-

Thanks for the feedback [~kasha]! I'm fine with changing the config variables 
nomenclature to the one you suggested. I just wanted to clarify that simply 
using the same config file won't avoid the questions you raised(specifically 2 
and 3).  The one way we can avoid (2) and (3) is have versions of the resources 
configs but I think that's a little complex. We could mitigate the issue by 
building some tools to verify that a proposed config file would work with the 
existing RM/NM. I'm open to suggestions.

With regards to node labels, I had initial conversations with [~leftnoteasy] 
but I haven't thought through the model in enough detail. My initial thinking 
is that we would modify the ResourceMapEntry to add a string/list of strings 
which can be used to specify node labels.

> Extend the YARN resource model for easier resource-type management and 
> profiles
> ---
>
> Key: YARN-3926
> URL: https://issues.apache.org/jira/browse/YARN-3926
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: Proposal for modifying resource model and profiles.pdf
>
>
> Currently, there are efforts to add support for various resource-types such 
> as disk(YARN-2139), network(YARN-2140), and  HDFS bandwidth(YARN-2681). These 
> efforts all aim to add support for a new resource type and are fairly 
> involved efforts. In addition, once support is added, it becomes harder for 
> users to specify the resources they need. All existing jobs have to be 
> modified, or have to use the minimum allocation.
> This ticket is a proposal to extend the YARN resource model to a more 
> flexible model which makes it easier to support additional resource-types. It 
> also considers the related aspect of “resource profiles” which allow users to 
> easily specify the various resources they need for any given container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Li Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636058#comment-14636058
 ] 

Li Lu commented on YARN-3908:
-

Hi [~sjlee0], I think the code you posted here belongs to timeline v1 
(o.a.h.yarn.api.records.timeline.*), but the v2 version is in 
o.a.h.yarn.api.records.timelineservice.*. TimelineEvent in v2, modified in 
YARN-3836, does use id for all related tasks. We're no longer using event info 
for equality check in that version. 

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Sangjin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636057#comment-14636057
 ] 

Sangjin Lee commented on YARN-3908:
---

Sorry my bad. I mistakenly pulled up the v.1 of {{TimelineEvent}}. Our version 
uses only the id and the timestamp for equality:

{code:title=TimelineEvent.java|borderStyle=solid}
  @Override
  public int hashCode() {
int result = (int) (timestamp ^ (timestamp >>> 32));
result = 31 * result + id.hashCode();
return result;
  }

  @Override
  public boolean equals(Object o) {
if (this == o)
  return true;
if (!(o instanceof TimelineEvent))
  return false;

TimelineEvent event = (TimelineEvent) o;

if (timestamp != event.timestamp)
  return false;
if (!id.equals(event.id)) {
  return false;
}
return true;
  }

  @Override
  public int compareTo(TimelineEvent other) {
if (timestamp > other.timestamp) {
  return -1;
} else if (timestamp < other.timestamp) {
  return 1;
} else {
  return id.compareTo(other.id);
}
  }
{code}

So that answers my first question. Sorry for the confusion! Only the second 
question remains...

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3949) ensure timely flush of timeline writes

2015-07-21 Thread Sangjin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636056#comment-14636056
 ] 

Sangjin Lee commented on YARN-3949:
---

For background, HBase's {{BufferedMutatorImpl}} does flush writes to region 
servers when the write buffer becomes full per configuration. In a high 
throughput situation, normally writes should appear on the storage in a timely 
manner. However, it still remains the case there is no hard guarantee that the 
data will be available on the storage by "X seconds/minutes". This problem will 
be more pronounced if the writer is mostly idle.

Here is one proposal.

First, we would introduce {{flush()}} on the {{TimelineWriter}} interface. 
Users of {{TimelineWriter}} would call {{flush()}} to be able to force flushing 
writes to the backend storage. Implementations of {{TimelineWriter}} would 
implement {{flush()}} as appropriate for the respective storage. In case of 
HBase, it would result in {{BufferedMutator.flush()}} for all tables.

Second, we would implement periodic invocation of {{TimelineWriter.flush()}} in 
the layer that calls {{TimelineWriter}}, where the frequency of flush is 
configurable. For example, in the timeline collector we could have a background 
thread that calls {{TimelineWriter.flush()}} regularly. The {{flush()}} method 
may also be called for critical writes such as lifecycle events. In those 
cases, the timeline collector code could call {{TimelineWriter.write()}} 
followed by {{TimelineWriter.flush()}} before returning to the caller.

Let me know what you think of the proposal. Thanks!

> ensure timely flush of timeline writes
> --
>
> Key: YARN-3949
> URL: https://issues.apache.org/jira/browse/YARN-3949
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>
> Currently flushing of timeline writes is not really handled. For example, 
> {{HBaseTimelineWriterImpl}} relies on HBase's {{BufferedMutator}} to batch 
> and write puts asynchronously. However, {{BufferedMutator}} may not flush 
> them to HBase unless the internal buffer fills up.
> We do need a flush functionality first to ensure that data are written in a 
> reasonably timely manner, and to be able to ensure some critical writes are 
> done synchronously (e.g. key lifecycle events).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Sangjin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636050#comment-14636050
 ] 

Sangjin Lee commented on YARN-3908:
---

Hmm, then isn't this incorrect?

{code:title=TimelineEvent.java|borderStyle=solid}
  @Override
  public int compareTo(TimelineEvent other) {
if (timestamp > other.timestamp) {
  return -1;
} else if (timestamp < other.timestamp) {
  return 1;
} else {
  return eventType.compareTo(other.eventType);
}
  }

  @Override
  public boolean equals(Object o) {
if (this == o)
  return true;
if (o == null || getClass() != o.getClass())
  return false;

TimelineEvent event = (TimelineEvent) o;

if (timestamp != event.timestamp)
  return false;
if (!eventType.equals(event.eventType))
  return false;
if (eventInfo != null ? !eventInfo.equals(event.eventInfo) :
event.eventInfo != null)
  return false;

return true;
  }

  @Override
  public int hashCode() {
int result = (int) (timestamp ^ (timestamp >>> 32));
result = 31 * result + eventType.hashCode();
result = 31 * result + (eventInfo != null ? eventInfo.hashCode() : 0);
return result;
  }
{code}

First of all, id is not even used. Instead type is used. Also, event info is 
part of the equality semantics.

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Li Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636009#comment-14636009
 ] 

Li Lu commented on YARN-3908:
-

Hi [~sjlee0], I don't think we're still using event type in new TimelineEvent 
v2. However, the behavior you mentioned is quite consistent with the v1 
TimelineEvent. Could you please double check this? Thanks! 

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Sangjin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635996#comment-14635996
 ] 

Sangjin Lee commented on YARN-3908:
---

[~jrottinghuis], [~vrushalic], and I had offline chats, and we feel that we may 
need to revisit how we store events.

Currently (with this patch) we store the event with the column name 
"e!eventId?infoKey" and the column value being the info value. The event 
timestamp is stored as the cell timestamp. We're realizing that this may not be 
a correct way to store events.

I'm basing this on the 
[discussion|https://issues.apache.org/jira/browse/YARN-3836?focusedCommentId=14619729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14619729]
 we had when we talked about the equality and identity semantics of 
{{TimelineEvent}}. Namely, the id *and* the timestamp form the identity of a 
{{TimelineEvent}}. Then I think storing the timestamp in the HBase cell 
timestamp does not work.

Some questions for you, [~zjshen] and [~gtCarrera9].

(1) *What defines the identity of a {{TimelineEvent}}?*
Is it the event id + timestamp? How about the event type? If you look at the 
{{equals()}} and the {{hashCode()}} implementations of {{TimelineEvent}}, it 
uses the timestamp, the event type, and even the info as a whole, but the id is 
not used for equality. How does that square with the stated intent that the 
event id and the timestamp form the identity?

(2) *What would be the access pattern* for {{TimelineEvents}}?*
Is pretty much the only access pattern "give me all the events that belong to 
this entity"?

Also specifically, would you ever query for an event with the id *and* the 
timestamp? It is not reasonable for readers to be able to provide the event 
timestamp for queries, right?

Would you also query for just the event id? What other access patterns need to 
be supported?

Clarifying those things would help us correctly implement the schema. Thanks!

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing

2015-07-21 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635991#comment-14635991
 ] 

Wangda Tan commented on YARN-1645:
--

+1 to latest patch, thanks [~mding].

> ContainerManager implementation to support container resizing
> -
>
> Key: YARN-1645
> URL: https://issues.apache.org/jira/browse/YARN-1645
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-1645-YARN-1197.3.patch, 
> YARN-1645-YARN-1197.4.patch, YARN-1645-YARN-1197.5.patch, YARN-1645.1.patch, 
> YARN-1645.2.patch, yarn-1645.1.patch
>
>
> Implementation of ContainerManager for container resize, including:
> 1) ContainerManager resize logic 
> 2) Relevant test cases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer resolved YARN-2921.

Resolution: Fixed

> Fix MockRM/MockAM#waitForState sleep too long
> -
>
> Key: YARN-2921
> URL: https://issues.apache.org/jira/browse/YARN-2921
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Tsuyoshi Ozawa
> Fix For: 2.8.0
>
> Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
> YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
> YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
> YARN-2921.008.patch
>
>
> MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
> second). This leads to slow tests and sometimes failures if the 
> App/AppAttempt moves to another state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Reopened] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer reopened YARN-2921:


> Fix MockRM/MockAM#waitForState sleep too long
> -
>
> Key: YARN-2921
> URL: https://issues.apache.org/jira/browse/YARN-2921
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Karthik Kambatla
>Assignee: Tsuyoshi Ozawa
> Fix For: 2.8.0
>
> Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
> YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
> YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
> YARN-2921.008.patch
>
>
> MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
> second). This leads to slow tests and sometimes failures if the 
> App/AppAttempt moves to another state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Assignee: Junping Du  (was: Allen Wittenauer)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Reporter: Allen Wittenauer  (was: Junping Du)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3641:
---
Reporter: Junping Du  (was: Allen Wittenauer)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Allen Wittenauer
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer reassigned YARN-3641:
--

Assignee: Allen Wittenauer  (was: Junping Du)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Allen Wittenauer
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing

2015-07-21 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635954#comment-14635954
 ] 

Jian He commented on YARN-1645:
---

looks good, +1

> ContainerManager implementation to support container resizing
> -
>
> Key: YARN-1645
> URL: https://issues.apache.org/jira/browse/YARN-1645
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-1645-YARN-1197.3.patch, 
> YARN-1645-YARN-1197.4.patch, YARN-1645-YARN-1197.5.patch, YARN-1645.1.patch, 
> YARN-1645.2.patch, yarn-1645.1.patch
>
>
> Implementation of ContainerManager for container resize, including:
> 1) ContainerManager resize logic 
> 2) Relevant test cases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (YARN-3299) Synchronize RM and Generic History Service Web-UIs

2015-07-21 Thread Xuan Gong (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong resolved YARN-3299.
-
Resolution: Fixed

> Synchronize RM and Generic History Service Web-UIs
> --
>
> Key: YARN-3299
> URL: https://issues.apache.org/jira/browse/YARN-3299
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager, webapp, yarn
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>
> After YARN-1809, we are using the same protocol to fetch the information and 
> display in their webUI. RM webUI will use ApplicationClientProtocol, and 
> Generic History Service web ui will use ApplicationHistoryProtocol. Both of 
> them extend the same protocol. 
> Also, we have common appblock/attemptblock/containerblock shared by both RM 
> webUI and ATS webUI.
> But we are still missing some information, such as outstanding resource 
> requests, preemption metrics, etc.
> This ticket will be used as parent ticket to track all the remaining issues 
> for RM webUI and ATS webUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3299) Synchronize RM and Generic History Service Web-UIs

2015-07-21 Thread Xuan Gong (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635940#comment-14635940
 ] 

Xuan Gong commented on YARN-3299:
-

Resolving this umbrella JIRA. And as new requirements/bugs come in, we can open 
new tickets.
* Will leave the open sub-tasks as they are.
* No fix-version as this was done across releases.

> Synchronize RM and Generic History Service Web-UIs
> --
>
> Key: YARN-3299
> URL: https://issues.apache.org/jira/browse/YARN-3299
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager, webapp, yarn
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>
> After YARN-1809, we are using the same protocol to fetch the information and 
> display in their webUI. RM webUI will use ApplicationClientProtocol, and 
> Generic History Service web ui will use ApplicationHistoryProtocol. Both of 
> them extend the same protocol. 
> Also, we have common appblock/attemptblock/containerblock shared by both RM 
> webUI and ATS webUI.
> But we are still missing some information, such as outstanding resource 
> requests, preemption metrics, etc.
> This ticket will be used as parent ticket to track all the remaining issues 
> for RM webUI and ATS webUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635927#comment-14635927
 ] 

Hudson commented on YARN-3878:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8197 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8197/])
YARN-3878. AsyncDispatcher can hang while stopping if it is configured for 
draining events on stop. Contributed by Varun Saxena (jianhe: rev 
393fe71771e3ac6bc0efe59d9aaf19d3576411b3)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/TestAsyncDispatcher.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java


> AsyncDispatcher can hang while stopping if it is configured for draining 
> events on stop
> ---
>
> Key: YARN-3878
> URL: https://issues.apache.org/jira/browse/YARN-3878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3878.01.patch, YARN-3878.02.patch, 
> YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, 
> YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, 
> YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h
>
>
> The sequence of events is as under :
> # RM is stopped while putting a RMStateStore Event to RMStateStore's 
> AsyncDispatcher. This leads to an Interrupted Exception being thrown.
> # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
> {{serviceStop}}, we will check if all events have been drained and wait for 
> event queue to drain(as RM State Store dispatcher is configured for queue to 
> drain on stop). 
> # This condition never becomes true and AsyncDispatcher keeps on waiting 
> incessantly for dispatcher event queue to drain till JVM exits.
> *Initial exception while posting RM State store event to queue*
> {noformat}
> 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
> (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
> STOPPED
> 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
> {noformat}
> *JStack of AsyncDispatcher hanging on stop*
> {noformat}
> "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e 
> waiting on condition [0x7fb9654e9000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Nativ

[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-21 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635897#comment-14635897
 ] 

Jian He commented on YARN-3878:
---

thanks ! committing this.

> AsyncDispatcher can hang while stopping if it is configured for draining 
> events on stop
> ---
>
> Key: YARN-3878
> URL: https://issues.apache.org/jira/browse/YARN-3878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3878.01.patch, YARN-3878.02.patch, 
> YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, 
> YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, 
> YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h
>
>
> The sequence of events is as under :
> # RM is stopped while putting a RMStateStore Event to RMStateStore's 
> AsyncDispatcher. This leads to an Interrupted Exception being thrown.
> # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
> {{serviceStop}}, we will check if all events have been drained and wait for 
> event queue to drain(as RM State Store dispatcher is configured for queue to 
> drain on stop). 
> # This condition never becomes true and AsyncDispatcher keeps on waiting 
> incessantly for dispatcher event queue to drain till JVM exits.
> *Initial exception while posting RM State store event to queue*
> {noformat}
> 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
> (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
> STOPPED
> 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
> {noformat}
> *JStack of AsyncDispatcher hanging on stop*
> {noformat}
> "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e 
> waiting on condition [0x7fb9654e9000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000700b79250> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
> at java.lang.Thread.run(Thread.java:744)
> "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
> [0x

[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-21 Thread Anubhav Dhoot (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635892#comment-14635892
 ] 

Anubhav Dhoot commented on YARN-3878:
-

Agree this is ok to ignore

> AsyncDispatcher can hang while stopping if it is configured for draining 
> events on stop
> ---
>
> Key: YARN-3878
> URL: https://issues.apache.org/jira/browse/YARN-3878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3878.01.patch, YARN-3878.02.patch, 
> YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, 
> YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, 
> YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h
>
>
> The sequence of events is as under :
> # RM is stopped while putting a RMStateStore Event to RMStateStore's 
> AsyncDispatcher. This leads to an Interrupted Exception being thrown.
> # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
> {{serviceStop}}, we will check if all events have been drained and wait for 
> event queue to drain(as RM State Store dispatcher is configured for queue to 
> drain on stop). 
> # This condition never becomes true and AsyncDispatcher keeps on waiting 
> incessantly for dispatcher event queue to drain till JVM exits.
> *Initial exception while posting RM State store event to queue*
> {noformat}
> 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
> (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
> STOPPED
> 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
> {noformat}
> *JStack of AsyncDispatcher hanging on stop*
> {noformat}
> "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e 
> waiting on condition [0x7fb9654e9000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000700b79250> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
> at java.lang.Thread.run(Thread.java:744)
> "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait

[jira] [Commented] (YARN-3852) Add docker container support to container-executor

2015-07-21 Thread Varun Vasudev (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635879#comment-14635879
 ] 

Varun Vasudev commented on YARN-3852:
-

Thanks for the latest patch [~ashahab]. Patch looks good to me, just a couple 
of minor changes -
# In container-executor.c and container-executor.h
{code}
-int check_dir(const char* npath, mode_t st_mode, mode_t desired, int 
finalComponent) {
+int check_dir(char* npath, mode_t st_mode, mode_t desired, int finalComponent) 
{
{code}
and
{code}
-int check_dir(const char* npath, mode_t st_mode, mode_t desired,
+int check_dir(char* npath, mode_t st_mode, mode_t desired,
int finalComponent);

-int create_validate_dir(const char* npath, mode_t perm, const char* path,
+int create_validate_dir(char* npath, mode_t perm, char* path,
int finalComponent);
{code}
You've removed the const-ness of npath.
# In container-executor.c
{code}
+int create_script_paths(const char *work_dir,
+  const char *script_name, const char *cred_file,
+ char** script_file_dest, char** cred_file_dest,
+ int* container_file_source, int* cred_file_source ) {
{code}

The rest of the patch looks good to me.

> Add docker container support to container-executor 
> ---
>
> Key: YARN-3852
> URL: https://issues.apache.org/jira/browse/YARN-3852
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Sidharta Seethana
>Assignee: Abin Shahab
> Attachments: YARN-3852-1.patch, YARN-3852.patch
>
>
> For security reasons, we need to ensure that access to the docker daemon and 
> the ability to run docker containers is restricted to privileged users ( i.e 
> users running applications should not have direct access to docker). In order 
> to ensure the node manager can run docker commands, we need to add docker 
> support to the container-executor binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2015-07-21 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635876#comment-14635876
 ] 

Junping Du commented on YARN-2019:
--

If so, I think we should at least differentiate RM and NM policies - user could 
be conservative to RM state store failure but be aggressive to NM state store 
failure. May be using "yarn.resourcemanager.fail-fast" here? Then we can use 
"yarn.nodemanager.fail-fast" later and may for other daemons (timeline service, 
etc.).

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Jian He
>Priority: Critical
>  Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if any abnormal happens in ZKRMStateStore, it will throw a fetal 
> exception to crash RM down. As shown in YARN-1924, it could due to RM HA 
> internal bug itself, but not fatal exception. We should retrospect some 
> decision here as HA feature is designed to protect key component but not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3852) Add docker container support to container-executor

2015-07-21 Thread Abin Shahab (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635872#comment-14635872
 ] 

Abin Shahab commented on YARN-3852:
---

The test failures are unrelated to the container-executor.c changes.

> Add docker container support to container-executor 
> ---
>
> Key: YARN-3852
> URL: https://issues.apache.org/jira/browse/YARN-3852
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Sidharta Seethana
>Assignee: Abin Shahab
> Attachments: YARN-3852-1.patch, YARN-3852.patch
>
>
> For security reasons, we need to ensure that access to the docker daemon and 
> the ability to run docker containers is restricted to privileged users ( i.e 
> users running applications should not have direct access to docker). In order 
> to ensure the node manager can run docker commands, we need to add docker 
> support to the container-executor binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Optimize and synchronize FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Summary: Optimize and synchronize FS Reader and Writer Implementations  
(was: Combine FS Reader and Writer Implementations)

> Optimize and synchronize FS Reader and Writer Implementations
> -
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch, 
> YARN-3874-YARN-2928.02.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Optimize and synchronize FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Attachment: YARN-3874-YARN-2928.02.patch

> Optimize and synchronize FS Reader and Writer Implementations
> -
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch, 
> YARN-3874-YARN-2928.02.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-07-21 Thread Tsuyoshi Ozawa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635761#comment-14635761
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

The test result is as follows:
{quote}
-1 overall.  

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated 48 warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version ) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.
{quote}
A javadoc warning looks not related to the patch.

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, 
> YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, 
> YARN-3798-branch-2.7.006.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop

[jira] [Created] (YARN-3949) ensure timely flush of timeline writes

2015-07-21 Thread Sangjin Lee (JIRA)

Sangjin Lee created YARN-3949:
-

 Summary: ensure timely flush of timeline writes
 Key: YARN-3949
 URL: https://issues.apache.org/jira/browse/YARN-3949
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Sangjin Lee


Currently flushing of timeline writes is not really handled. For example, 
{{HBaseTimelineWriterImpl}} relies on HBase's {{BufferedMutator}} to batch and 
write puts asynchronously. However, {{BufferedMutator}} may not flush them to 
HBase unless the internal buffer fills up.

We do need a flush functionality first to ensure that data are written in a 
reasonably timely manner, and to be able to ensure some critical writes are 
done synchronously (e.g. key lifecycle events).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-21 Thread Sangjin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635754#comment-14635754
 ] 

Sangjin Lee commented on YARN-3908:
---

I filed YARN-3949 to address the need for timely flush of writes.

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635727#comment-14635727
 ] 

Bibin A Chundatt commented on YARN-3932:


[~leftnoteasy] checkstyle issue is already existing one and test failure are 
unrelated  to this patch.

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
> 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
> 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg
>
>
> Application Resource Report shown wrong when node Label is used.
> 1.Submit application with NodeLabel
> 2.Check RM UI for resources used 
> Allocated CPU VCores and Allocated Memory MB is always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3460) Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635721#comment-14635721
 ] 

Hadoop QA commented on YARN-3460:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 54s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   8m 30s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 25s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 24s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 24s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 49s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m  0s | Tests passed in 
hadoop-yarn-registry. |
| | |  40m 25s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12731250/YARN-3460-3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5137b38 |
| hadoop-yarn-registry test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8600/artifact/patchprocess/testrun_hadoop-yarn-registry.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8600/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8600/console |


This message was automatically generated.

> Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM
> 
>
> Key: YARN-3460
> URL: https://issues.apache.org/jira/browse/YARN-3460
> Project: Hadoop YARN
>  Issue Type: Test
>Affects Versions: 3.0.0, 2.6.0
> Environment: $ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T11:37:52-06:00)
> Maven home: /opt/apache-maven-3.2.1
> Java version: 1.7.0, vendor: IBM Corporation
> Java home: /usr/lib/jvm/ibm-java-ppc64le-71/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.10.0-229.ael7b.ppc64le", arch: "ppc64le", 
> family: "unix"
>Reporter: pascal oliva
> Attachments: HADOOP-11810-1.patch, YARN-3460-1.patch, 
> YARN-3460-2.patch, YARN-3460-3.patch
>
>
> TestSecureRMRegistryOperations failed with JBM IBM JAVA
> mvn test -X 
> -Dtest=org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations
> ModuleTotal Failure Error Skipped
> -
> hadoop-yarn-registry 12  0   12 0
> -
>  Total  12  0   12 0
> With 
> javax.security.auth.login.LoginException: Bad JAAS configuration: 
> unrecognized option: isInitiator
> and 
> Bad JAAS configuration: unrecognized option: storeKey



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635701#comment-14635701
 ] 

Hadoop QA commented on YARN-3932:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  23m 16s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   9m 52s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 54s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 55s | The applied patch generated  1 
new checkstyle issues (total was 186, now 186). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 22s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 23s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  50m 21s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 100m  3s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter |
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746389/0006-YARN-3932.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cf74772 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8599/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8599/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8599/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8599/console |


This message was automatically generated.

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
> 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
> 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg
>
>
> Application Resource Report shown wrong when node Label is used.
> 1.Submit application with NodeLabel
> 2.Check RM UI for resources used 
> Allocated CPU VCores and Allocated Memory MB is always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3460) Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM

2015-07-21 Thread Allen Wittenauer (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635624#comment-14635624
 ] 

Allen Wittenauer commented on YARN-3460:


{code}
+Method methodInitialize =
+kerb5LoginObject.getClass().getMethod("initialize", Subject.class, 
CallbackHandler.class, Map.class, Map.class);
+methodInitialize.invoke(kerb5LoginObject, subject, null, new 
HashMap(), options );
{code}

There's a tab here.  Would you mind removing it?

If no one else has any other comments, I'll be committing this by the end of 
the day. 

P.S.: downloading the IBM JDK is an exercise in frustration.

> Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM
> 
>
> Key: YARN-3460
> URL: https://issues.apache.org/jira/browse/YARN-3460
> Project: Hadoop YARN
>  Issue Type: Test
>Affects Versions: 3.0.0, 2.6.0
> Environment: $ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T11:37:52-06:00)
> Maven home: /opt/apache-maven-3.2.1
> Java version: 1.7.0, vendor: IBM Corporation
> Java home: /usr/lib/jvm/ibm-java-ppc64le-71/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.10.0-229.ael7b.ppc64le", arch: "ppc64le", 
> family: "unix"
>Reporter: pascal oliva
> Attachments: HADOOP-11810-1.patch, YARN-3460-1.patch, 
> YARN-3460-2.patch, YARN-3460-3.patch
>
>
> TestSecureRMRegistryOperations failed with JBM IBM JAVA
> mvn test -X 
> -Dtest=org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations
> ModuleTotal Failure Error Skipped
> -
> hadoop-yarn-registry 12  0   12 0
> -
>  Total  12  0   12 0
> With 
> javax.security.auth.login.LoginException: Bad JAAS configuration: 
> unrecognized option: isInitiator
> and 
> Bad JAAS configuration: unrecognized option: storeKey



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635590#comment-14635590
 ] 

Hadoop QA commented on YARN-433:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  1s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 40s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 35s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 47s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 19s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  51m 51s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  89m 37s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746382/YARN-433.4.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3b7ffc4 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8597/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8597/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8597/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8597/console |


This message was automatically generated.

> When RM is catching up with node updates then it should not expire acquired 
> containers
> --
>
> Key: YARN-433
> URL: https://issues.apache.org/jira/browse/YARN-433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, 
> YARN-433.4.patch
>
>
> RM expires containers that are not launched within some time of being 
> allocated. The default is 10mins. When an RM is not keeping up with node 
> updates then it may not be aware of new launched containers. If the expire 
> thread fires for such containers then the RM can expire them even though they 
> may have launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3925) ContainerLogsUtils#getContainerLogFile fails to read container log files from full disks.

2015-07-21 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635546#comment-14635546
 ] 

Jason Lowe commented on YARN-3925:
--

Nice catch, [~zxu]!

I think the patch will work, but it is a bit convoluted to use a new dir 
allocator and a new config property as a roundabout way to call 
LocalDirAllocator.AllocatorPerContext.getLocalPathToRead which is not a very 
complicated function.  LocalDirsHandlerService already has the list of Paths to 
check, so if we just refactored the getLocalPathToRead functionality into a 
reusable Path utility function to locate a subpath given a list of top-level 
paths to search it would be very straightforward.

> ContainerLogsUtils#getContainerLogFile fails to read container log files from 
> full disks.
> -
>
> Key: YARN-3925
> URL: https://issues.apache.org/jira/browse/YARN-3925
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3925.000.patch
>
>
> ContainerLogsUtils#getContainerLogFile fails to read files from full disks.
> {{getContainerLogFile}} depends on 
> {{LocalDirsHandlerService#getLogPathToRead}} to get the log file, but 
> {{LocalDirsHandlerService#getLogPathToRead}} calls 
> {{logDirsAllocator.getLocalPathToRead}} and {{logDirsAllocator}} uses 
> configuration {{YarnConfiguration.NM_LOG_DIRS}}, which will be updated to not 
> include full disks in {{LocalDirsHandlerService#checkDirs}}:
> {code}
> Configuration conf = getConfig();
> List localDirs = getLocalDirs();
> conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
> localDirs.toArray(new String[localDirs.size()]));
> List logDirs = getLogDirs();
> conf.setStrings(YarnConfiguration.NM_LOG_DIRS,
>   logDirs.toArray(new String[logDirs.size()]));
> {code}
> ContainerLogsUtils#getContainerLogFile is used by NMWebServices#getLogs and 
> ContainerLogsPage.ContainersLogsBlock#render to read the log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode

2015-07-21 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635534#comment-14635534
 ] 

Bibin A Chundatt commented on YARN-3838:


Hi [~xgong] any comments on this 

> Rest API failing when ip configured in RM address in secure https mode
> --
>
> Key: YARN-3838
> URL: https://issues.apache.org/jira/browse/YARN-3838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 
> 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch
>
>
> Steps to reproduce
> ===
> 1.Configure hadoop.http.authentication.kerberos.principal as below
> {code:xml}
>   
> hadoop.http.authentication.kerberos.principal
> HTTP/_h...@hadoop.com
>   
> {code}
> 2. In RM web address also configure IP 
> 3. Startup RM 
> Call Rest API for RM  {{ curl -i -k  --insecure --negotiate -u : https IP 
> /ws/v1/cluster/info"}}
> *Actual*
> Rest API  failing
> {code}
> 2015-06-16 19:03:49,845 DEBUG 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
> Authentication exception: GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos credentails)
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos credentails)
>   at 
> org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
>   at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
>   at 
> org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins

2015-07-21 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635531#comment-14635531
 ] 

Robert Kanter commented on YARN-3528:
-

[~brahmareddy], are you still working on this?

> Tests with 12345 as hard-coded port break jenkins
> -
>
> Key: YARN-3528
> URL: https://issues.apache.org/jira/browse/YARN-3528
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
>Priority: Blocker
>  Labels: test
> Attachments: YARN-3528-002.patch, YARN-3528.patch
>
>
> A lot of the YARN tests have hard-coded the port 12345 for their services to 
> come up on.
> This makes it impossible to have scheduled or precommit tests to run 
> consistently on the ASF jenkins hosts. Instead the tests fail regularly and 
> appear to get ignored completely.
> A quick grep of "12345" shows up many places in the test suite where this 
> practise has developed.
> * All {{BaseContainerManagerTest}} subclasses
> * {{TestNodeManagerShutdown}}
> * {{TestContainerManager}}
> + others
> This needs to be addressed through portscanning and dynamic port allocation. 
> Please can someone do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635512#comment-14635512
 ] 

Varun Saxena commented on YARN-3874:


Many are not related as well.

> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics

2015-07-21 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635513#comment-14635513
 ] 

Junping Du commented on YARN-3816:
--

Thanks [~sjlee0] for review and comments!
bq. If I understand correctly, this patch basically does a time integral of a 
given metric, or "the area under the curve" for the metric as a function of 
time. For example, if the underlying metric is a container CPU usage, the 
"aggregated" metric according to TimelineMetric.aggregateTo() would be a 
cumulative CPU usage over time for that container (in the units of CPU-millis).
That's correct. As a poc patch for app aggregation, we only pick up some metric 
to aggregated in some way to demonstrate overall end-to-end flow. I understand 
there could be more important aggregated metrics there and I will try to add 
more in following patches.

bq. While this is certainly a useful number to keep track of, this was not the 
app-level aggregation I had in mind. IMO, the app-level aggregation (or any 
aggregation for that matter) is all about rolling metrics up from child 
entities to the parent entity. I would have thought that it would be the first 
thing we want to get to. It looks, however, as though that aggregation is not 
done in this patch. I don't see any code that rolls up values from containers 
to the application. Are you planning to introduce that soon?
Yes. I should add that part in poc v2 patch for taking a "snapshot" for 
resource consumption on an application. Previous area value is also kept for 
different purpose (resource billing/charge, etc.).

bq. This type of time integral works only if the underlying metric is a gauge. 
For example, for any counter-like metric (e.g. HDFS bytes read) which is 
cumulative in nature, the time integral does not make sense. We will need to 
introduce another type dimension to the metrics that signifies whether it is a 
counter or a gauge, but this is just to note that the time integral works only 
for gauges.
I agree that we should differentiate counter with gauge. For the previous one, 
we are more focus on its cumulative property while the later one is more focus 
on "snapshot". However, in practice, there are cases that some aggregated 
metrics has both properties, like "area" value here - we do need its cumulative 
values and also could be interested in getting values within a given time 
interval. Isn't it?

bq. Also, this is pretty similar to what we talked about during the offline 
meeting as "average/max" for gauges, except that it's not divided over time. We 
discussed that we want to introduce time averages and maxes for gauges (see 
"time average & max" in 
https://issues.apache.org/jira/secure/attachment/12743390/aggregation-design-discussion.pdf).
 Are we thinking of replacing that with this?
No. Nothing get changed on the design since our last discussions. The average 
and max is also important but I just haven't get bandwidth to add in poc stage 
as adding existing things could be more straight-forward. I will add it later.

bq. In the specific case of container CPU usage, it seems to me that emitting 
the actual CPU time millis directly would be a far easier and more accurate way 
to capture this info. I believe it's readily available, and it would be a 
counter-like metric instead of a gauge. Therefore the time integral doesn't 
apply (as it already is one). But all you need to do at the app-level 
aggregation for it is just to sum it up. I recognize that this time integral 
would be useful for other things, but just wanted to point that out.
Thanks for pointing that out. I agree this is more precisely and will update 
this in following patch.

> [Aggregation] App-level Aggregation for YARN system metrics
> ---
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, config

[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-07-21 Thread zhihai xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635499#comment-14635499
 ] 

zhihai xu commented on YARN-3591:
-

+1 for [~jlowe]'s comment. Yes, It fixes some problems we have today without 
creating new ones.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> It happens when a resource is localised on the disk, after localising that 
> disk has gone bad. NM keeps paths for localised resources in memory.  At the 
> time of resource request isResourcePresent(rsrc) will be called which calls 
> file.exists() on the localised path.
> In some cases when disk has gone bad, inodes are stilled cached and 
> file.exists() returns true. But at the time of reading, file will not open.
> Note: file.exists() actually calls stat64 natively which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at-least 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0006-YARN-3932.patch

Hi [~leftnoteasy] attaching patch again to retrigger CI and for review

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
> 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
> 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg
>
>
> Application Resource Report shown wrong when node Label is used.
> 1.Submit application with NodeLabel
> 2.Check RM UI for resources used 
> Allocated CPU VCores and Allocated Memory MB is always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635483#comment-14635483
 ] 

Hadoop QA commented on YARN-3798:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746388/YARN-3798-branch-2.7.006.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3b7ffc4 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8598/console |


This message was automatically generated.

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, 
> YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, 
> YARN-3798-branch-2.7.006.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> o

[jira] [Commented] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635482#comment-14635482
 ] 

Varun Saxena commented on YARN-3874:


Test failures are related. Will fix


> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-07-21 Thread Tsuyoshi Ozawa (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-3798:
-
Attachment: YARN-3798-branch-2.7.006.patch

[~zxu] thank you for the comment. Attaching a patch to address your comment. 

1. Using rc == Code.OK.intValue() instead of rc == 0.
2. Calling Thread.currentThread().interrupt(); to restore the interrupted 
status after catching InterruptedException from syncInternal.

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, 
> YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, 
> YARN-3798-branch-2.7.006.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> 2015-06-09 10:09:44,887 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
> updating appAttempt: appattempt_1433764310492_7152_01
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zo

[jira] [Commented] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635473#comment-14635473
 ] 

Hadoop QA commented on YARN-3874:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  18m 36s | Pre-patch YARN-2928 has 7 
extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  1s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   8m  0s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 55s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 53s | The applied patch generated  9 
new checkstyle issues (total was 6, now 13). |
| {color:green}+1{color} | whitespace |   0m 18s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   3m 55s | The patch appears to introduce 6 
new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | mapreduce tests | 109m 58s | Tests failed in 
hadoop-mapreduce-client-jobclient. |
| {color:red}-1{color} | yarn tests |   6m 54s | Tests failed in 
hadoop-yarn-applications-distributedshell. |
| {color:red}-1{color} | yarn tests |  13m 37s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:red}-1{color} | yarn tests |   1m 23s | Tests failed in 
hadoop-yarn-server-timelineservice. |
| | | 177m  7s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-timelineservice |
| Failed unit tests | hadoop.mapred.TestMRTimelineEventHandling |
|   | hadoop.yarn.applications.distributedshell.TestDistributedShell |
|   | hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens |
|   | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens |
|   | hadoop.yarn.server.resourcemanager.TestRMRestart |
|   | hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl |
|   | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens |
|   | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | 
hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
|   | hadoop.yarn.server.resourcemanager.TestApplicationCleanup |
|   | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions |
|   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched |
|   | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCResponseId |
|   | hadoop.yarn.server.resourcemanager.TestRMHA |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesHttpStaticUserPermissions
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs |
|   | hadoop.yarn.server.resourcemanager.rmapp.attempt.TestAMLivelinessMonitor |
|   | hadoop.yarn.server.resourcemanager.TestApplicationMasterService |
|   | hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels |
|   | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens |
|   | hadoop.yarn.server.resourcemanager.TestRMAdminService |
|   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
|   | hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel
 |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
|   | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
|   | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppAttempt |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
 |
|   | hadoop.yarn.server.resourcemanager.TestRMProxyUsersConf |
|   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior
 |
|   | hadoop.yarn.server.resourcemanager.TestApplicationACLs |
|   | hadoop.yarn.server.resourcemanager.scheduler.TestAbstra

[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler

2015-07-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635427#comment-14635427
 ] 

Hudson commented on YARN-2003:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8193 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8193/])
YARN-2003. Support for Application priority : Changes in RM and Capacity 
Scheduler. (Sunil G via wangda) (wangda: rev 
c39ca541f498712133890961598bbff50d89d68b)
* 
hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplication.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/Queue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAddedSchedulerEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoComparator.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationPriority.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/SchedulableEntity.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hado

[jira] [Commented] (YARN-3915) scmadmin help message correction

2015-07-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635429#comment-14635429
 ] 

Hudson commented on YARN-3915:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8193 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8193/])
YARN-3915. scmadmin help message correction  (Bibin A Chundatt via aw) (aw: rev 
da2d1ac4bc0bf0812b9a2a1ffbb7748113cdaf6d)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/SCMAdmin.java


> scmadmin help message correction 
> -
>
> Key: YARN-3915
> URL: https://issues.apache.org/jira/browse/YARN-3915
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: 0001-YARN-3915.patch
>
>
> Help message for scmadmin
> *Actual*  {{hadoop scmadmin}} *expected*  {{yarn scmadmin}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits

2015-07-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635428#comment-14635428
 ] 

Hudson commented on YARN-3261:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8193 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8193/])
YARN-3261. rewrite resourcemanager restart doc to remove roadmap bits (Gururaj 
Shetty via aw) (aw: rev 3b7ffc4f3f0ffb0fa6c324da6d88803f5b233832)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRestart.md
* hadoop-yarn-project/CHANGES.txt


> rewrite resourcemanager restart doc to remove roadmap bits 
> ---
>
> Key: YARN-3261
> URL: https://issues.apache.org/jira/browse/YARN-3261
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Gururaj Shetty
> Fix For: 3.0.0
>
> Attachments: YARN-3261.01.patch
>
>
> Another mixture of roadmap and instruction manual that seems to be ever 
> present in a lot of the recently written documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (YARN-3948) Display Application Priority in RM Web UI

2015-07-21 Thread Sunil G (JIRA)

Sunil G created YARN-3948:
-

 Summary: Display Application Priority in RM Web UI
 Key: YARN-3948
 URL: https://issues.apache.org/jira/browse/YARN-3948
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: webapp
Affects Versions: 2.7.1
Reporter: Sunil G
Assignee: Sunil G


Application Priority can be displayed in RM Web UI Application page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3941) Proportional Preemption policy should try to avoid sending duplicate PREEMPT_CONTAINER event to scheduler

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635421#comment-14635421
 ] 

Hadoop QA commented on YARN-3941:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 56s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 43s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 33s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 48s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 21s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  51m 29s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  89m 14s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746368/0001-YARN-3941.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 29cf887b |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8596/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8596/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8596/console |


This message was automatically generated.

> Proportional Preemption policy should try to avoid sending duplicate 
> PREEMPT_CONTAINER event to scheduler
> -
>
> Key: YARN-3941
> URL: https://issues.apache.org/jira/browse/YARN-3941
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3941.patch
>
>
> Currently ProportionalCPP tries to send multiple PREEMPT_CONTAINER events to 
> scheduler during every cycle of preemption check till the container is either 
> forcefully killed or preempted by AM. 
> This can be throttled from ProportionalPreemptionPolicy to avoid excess of 
> events to scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3944) Connection refused to nodemanagers are retried at multiple levels

2015-07-21 Thread Siqi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-3944:
--
Priority: Critical  (was: Blocker)

> Connection refused to nodemanagers are retried at multiple levels
> -
>
> Key: YARN-3944
> URL: https://issues.apache.org/jira/browse/YARN-3944
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Siqi Li
>Assignee: Siqi Li
>Priority: Critical
> Attachments: YARN-3944.v1.patch
>
>
> This is related to YARN-3238. When NM is down, ipc client will get 
> ConnectException.
> Caused by: java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
> However, retry happens at two layers(ipc retry 40 times and serverProxy 
> retrying 91 times), this could end up with ~1 hour retry interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler

2015-07-21 Thread Sunil G (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635417#comment-14635417
 ] 

Sunil G commented on YARN-2003:
---

Thank you very much [~jianhe] and [~leftnoteasy] for support.

> Support for Application priority : Changes in RM and Capacity Scheduler
> ---
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 
> 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 
> 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 
> 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 
> 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 
> 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 
> 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 
> 0021-YARN-2003.patch, 0022-YARN-2003.patch, 0023-YARN-2003.patch, 
> 0024-YARN-2003.patch
>
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> Submission Context and store.
> Later this can be used by Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-3261:
---
Issue Type: Improvement  (was: Bug)

> rewrite resourcemanager restart doc to remove roadmap bits 
> ---
>
> Key: YARN-3261
> URL: https://issues.apache.org/jira/browse/YARN-3261
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Gururaj Shetty
> Attachments: YARN-3261.01.patch
>
>
> Another mixture of roadmap and instruction manual that seems to be ever 
> present in a lot of the recently written documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-21 Thread Xuan Gong (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-433:
---
Attachment: YARN-433.4.patch

[~zxu] Thanks for the comments.
Upload a new patch to address all the latest comments

> When RM is catching up with node updates then it should not expire acquired 
> containers
> --
>
> Key: YARN-433
> URL: https://issues.apache.org/jira/browse/YARN-433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, 
> YARN-433.4.patch
>
>
> RM expires containers that are not launched within some time of being 
> allocated. The default is 10mins. When an RM is not keeping up with node 
> updates then it may not be aware of new launched containers. If the expire 
> thread fires for such containers then the RM can expire them even though they 
> may have launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635380#comment-14635380
 ] 

Allen Wittenauer commented on YARN-3641:


I can't see how to change this from 'Pending Closed' to 'Fixed'. :(

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer reassigned YARN-3641:
--

Assignee: Junping Du  (was: Allen Wittenauer)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-07-21 Thread Allen Wittenauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer reassigned YARN-3641:
--

Assignee: Allen Wittenauer  (was: Junping Du)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
> in stopping NM's sub-services.
> ---
>
> Key: YARN-3641
> URL: https://issues.apache.org/jira/browse/YARN-3641
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, rolling upgrade
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Allen Wittenauer
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3641.patch
>
>
> If NM' services not get stopped properly, we cannot start NM with enabling NM 
> restart with work preserving. The exception is as following:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
> /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
> temporarily unavailable
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
> Resource temporarily unavailable
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   ... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
> (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NodeManager at 
> c6403.ambari.apache.org/192.168.64.103
> /
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
> if (isStopping.getAndSet(true)) {
>   return;
> }
> super.serviceStop();
> stopRecoveryStore();
> DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see we stop all NM registered services (NodeStatusUpdater, 
> LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
> services get stopped with exception could cause stopRecoveryStore() get 
> skipped which means levelDB store is not get closed. So next time NM start, 
> it will get failed with exception above. 
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3903) Disable preemption at Queue level for Fair Scheduler

2015-07-21 Thread Karthik Kambatla (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635332#comment-14635332
 ] 

Karthik Kambatla commented on YARN-3903:


I would like to understand the request better. Is the request to avoid a 
specific queue preempt resources from other queues? Or, is it to avoid 
preempting resources from a specific queue? 


> Disable preemption at Queue level for Fair Scheduler
> 
>
> Key: YARN-3903
> URL: https://issues.apache.org/jira/browse/YARN-3903
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0
> Environment: 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt2-1~bpo70+1 
> (2014-12-08) x86_64
>Reporter: He Tianyi
>Assignee: Karthik Kambatla
>Priority: Trivial
> Attachments: YARN-3093.1.patch, YARN-3093.2.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> YARN-2056 supports disabling preemption at queue level for CapacityScheduler.
> As for fair scheduler, we recently encountered the same need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3903) Disable preemption at Queue level for Fair Scheduler

2015-07-21 Thread Karthik Kambatla (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3903:
---
Fix Version/s: (was: 3.0.0)

> Disable preemption at Queue level for Fair Scheduler
> 
>
> Key: YARN-3903
> URL: https://issues.apache.org/jira/browse/YARN-3903
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0
> Environment: 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt2-1~bpo70+1 
> (2014-12-08) x86_64
>Reporter: He Tianyi
>Priority: Trivial
> Attachments: YARN-3093.1.patch, YARN-3093.2.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> YARN-2056 supports disabling preemption at queue level for CapacityScheduler.
> As for fair scheduler, we recently encountered the same need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3903) Disable preemption at Queue level for Fair Scheduler

2015-07-21 Thread Karthik Kambatla (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3903:
---
Assignee: (was: Karthik Kambatla)

> Disable preemption at Queue level for Fair Scheduler
> 
>
> Key: YARN-3903
> URL: https://issues.apache.org/jira/browse/YARN-3903
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0
> Environment: 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt2-1~bpo70+1 
> (2014-12-08) x86_64
>Reporter: He Tianyi
>Priority: Trivial
> Attachments: YARN-3093.1.patch, YARN-3093.2.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> YARN-2056 supports disabling preemption at queue level for CapacityScheduler.
> As for fair scheduler, we recently encountered the same need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-07-21 Thread Karthik Kambatla (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635326#comment-14635326
 ] 

Karthik Kambatla commented on YARN-3943:


Marking it critical, since some users see very poor performance when the disk 
utilization hovers around the max-disk-utilization mark. 

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-07-21 Thread Karthik Kambatla (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3943:
---
Priority: Critical  (was: Major)

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Sunil G (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635306#comment-14635306
 ] 

Sunil G commented on YARN-3947:
---

Hi [~aanand001c]

I have a doubt here.
If I configure {{yarn.nodemanager.hostname}} to a short_name, and assume my 
short name has a mapping in /etc/hosts or other DNS conf to its FQDN. Then I 
think I can use like {{short_name:port}}.
Here port will be mandatory as we can run multiple node managers in same node. 
Thoughts?

> Add support for short host names in yarn decommissioning process
> 
>
> Key: YARN-3947
> URL: https://issues.apache.org/jira/browse/YARN-3947
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Amit Anand
>Priority: Minor
>
> When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
> doesn't like short host names for the nodes to be decommissioned in 
> {{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present 
> to be able to successfully decommission a node. The decommissioning behavior 
> in {{HDFS}} is different as it can take short host names. 
> Below are the details of what I am seeing:
> My {{yarn.exlcude}} has short name for the host name:
> {code}
> bcpc-vm1
> {code}
> Running:
> {code}
> sudo -u yarn yarn rmadmin -refreshNodes
> {code}
> shows following entries in the log file:
> {code}
> 2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
> resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
> 2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
> the includes file to 
> 2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
> the excludes file to /etc/hadoop/conf/yarn.exclude
> 2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: 
> Refreshing hosts (include/exclude) list
> 2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
> bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
> 2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
> bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
> 2015-07-21 11:14:18,803 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
> {code}
> And the node is not decommissioned. 
> When I add the {{FQDN}} for the host name the decommissioning works 
> successfully and I see following in the RM logs:
> {code}
> 2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
> resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
> 2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
> the includes file to 
> 2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
> the excludes file to /etc/hadoop/conf/yarn.exclude
> 2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: 
> Refreshing hosts (include/exclude) list
> 2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
> bcpc-vm1.example.com to the list of excluded hosts from 
> /etc/hadoop/conf/yarn.exclude
> 2015-07-21 11:14:43,456 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
> 2015-07-21 11:14:44,198 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
> bcpc-vm1.example.com:35197
> 2015-07-21 11:14:44,198 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
> 2015-07-21 11:14:44,199 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
> 2015-07-21 11:14:44,199 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Removed node bcpc-vm1.example.com:35197 cluster capacity:  vCores:96>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2015-07-21 Thread Karthik Kambatla (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635301#comment-14635301
 ] 

Karthik Kambatla commented on YARN-2019:


We could have two fail-fast configs - one for daemon and one for app/container. 
If we could do with general fail-fast configs, we should try and avoid adding 
component-specific configs; otherwise, we ll end up making configuring Yarn 
even harder. 

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Jian He
>Priority: Critical
>  Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if any abnormal happens in ZKRMStateStore, it will throw a fetal 
> exception to crash RM down. As shown in YARN-1924, it could due to RM HA 
> internal bug itself, but not fatal exception. We should retrospect some 
> decision here as HA feature is designed to protect key component but not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-07-21 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635292#comment-14635292
 ] 

Jason Lowe commented on YARN-3591:
--

Sorry for the delay, as I was on vacation and am still working through the 
backlog.  An incremental improvement where we try to avoid using 
bad/non-existent resources for future containers but still fail to cleanup old 
resources on bad disks sounds fine to me.  IIUC it fixes some problems we have 
today without creating new ones.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> It happens when a resource is localised on the disk, after localising that 
> disk has gone bad. NM keeps paths for localised resources in memory.  At the 
> time of resource request isResourcePresent(rsrc) will be called which calls 
> file.exists() on the localised path.
> In some cases when disk has gone bad, inodes are stilled cached and 
> file.exists() returns true. But at the time of reading, file will not open.
> Note: file.exists() actually calls stat64 natively which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at-least 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Amit Anand (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Anand updated YARN-3947:
-
Description: 
When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
doesn't like short host names for the nodes to be decommissioned in 
{{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present to 
be able to successfully decommission a node. The decommissioning behavior in 
{{HDFS}} is different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
{code}
bcpc-vm1
{code}

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

And the node is not decommissioned. 

When I add the {{FQDN}} for the host name the decommissioning works 
successfully and I see following in the RM logs:

{code}
2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1.example.com to the list of excluded hosts from 
/etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
bcpc-vm1.example.com:35197
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Removed node bcpc-vm1.example.com:35197 cluster capacity: 
{code}



  was:
When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
doesn't like short host names for the nodes to be decommissioned in 
{{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present to 
be able to successfully decommission a node. The decommissioning behavior in 
{{HDFS}} is different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
{code}
bcpc-vm1
{code}

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=Admi

[jira] [Updated] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Amit Anand (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Anand updated YARN-3947:
-
Description: 
When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
doesn't like short host names for the nodes to be decommissioned in 
{{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present to 
be able to successfully decommission a node. The decommissioning behavior in 
{{HDFS}} is different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

And the node is not decommissioned. 

When I add the {{FQDN}} for the host name the decommissioning works 
successfully and I see following in the RM logs:

{code}
2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf.LAB-A/yarn-site.xml
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1.example.com to the list of excluded hosts from 
/etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
bcpc-vm1.example.com:35197
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Removed node bcpc-vm1.example.com:35197 cluster capacity: 
{code}



  was:
When running {yarn decommissioning} the {{yarn rmadmin -refreshNodes}} doesn't 
like short host names for the nodes to be decommissioned in {yarn.exclude} 
file. It requires {{FQDN}} for the host name to be present to be able to 
successfully decommission a node. The decommissioning behavior in {{HDFS}} is 
different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCES

[jira] [Updated] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Amit Anand (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Anand updated YARN-3947:
-
Description: 
When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
doesn't like short host names for the nodes to be decommissioned in 
{{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present to 
be able to successfully decommission a node. The decommissioning behavior in 
{{HDFS}} is different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
{code}
bcpc-vm1
{code}

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

And the node is not decommissioned. 

When I add the {{FQDN}} for the host name the decommissioning works 
successfully and I see following in the RM logs:

{code}
2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf.LAB-A/yarn-site.xml
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1.example.com to the list of excluded hosts from 
/etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
bcpc-vm1.example.com:35197
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Removed node bcpc-vm1.example.com:35197 cluster capacity: 
{code}



  was:
When running {{yarn decommissioning}} the {{yarn rmadmin -refreshNodes}} 
doesn't like short host names for the nodes to be decommissioned in 
{{yarn.exclude}} file. It requires {{FQDN}} for the host name to be present to 
be able to successfully decommission a node. The decommissioning behavior in 
{{HDFS}} is different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService

[jira] [Updated] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Amit Anand (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Anand updated YARN-3947:
-
Description: 
When running {yarn decommissioning} the {{yarn rmadmin -refreshNodes}} doesn't 
like short host names for the nodes to be decommissioned in {yarn.exclude} 
file. It requires {{FQDN}} for the host name to be present to be able to 
successfully decommission a node. The decommissioning behavior in {{HDFS}} is 
different as it can take short host names. 

Below are the details of what I am seeing:

My {{yarn.exlcude}} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

And the node is not decommissioned. 

When I add the {{FQDN}} for the host name the decommissioning works 
successfully and I see following in the RM logs:

{code}
2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf.LAB-A/yarn-site.xml
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1.example.com to the list of excluded hosts from 
/etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
bcpc-vm1.example.com:35197
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Removed node bcpc-vm1.example.com:35197 cluster capacity: 
{code}



  was:
When running {yarn decommissioning} the {yarn rmadmin -refreshNodes} doesn't 
like short host names for the nodes to be decommissioned in {yarn.exclude} 
file. It requires {FQDN} for the host name to be present to be able to 
successfully decommission a node. The decommissioning behavior in {HDFS} is 
different as it can take short host names. 

Below are the details of what I am seeing:

My {yarn.exlcude} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

An

[jira] [Created] (YARN-3947) Add support for short host names in yarn decommissioning process

2015-07-21 Thread Amit Anand (JIRA)

Amit Anand created YARN-3947:


 Summary: Add support for short host names in yarn decommissioning 
process
 Key: YARN-3947
 URL: https://issues.apache.org/jira/browse/YARN-3947
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Amit Anand
Priority: Minor


When running {yarn decommissioning} the {yarn rmadmin -refreshNodes} doesn't 
like short host names for the nodes to be decommissioned in {yarn.exclude} 
file. It requires {FQDN} for the host name to be present to be able to 
successfully decommission a node. The decommissioning behavior in {HDFS} is 
different as it can take short host names. 

Below are the details of what I am seeing:

My {yarn.exlcude} has short name for the host name:
bcpc-vm1

Running:
{code}
sudo -u yarn yarn rmadmin -refreshNodes
{code}

shows following entries in the log file:
{code}
2015-07-21 11:14:18,795 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf/yarn-site.xml
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:18,802 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1 to the list of excluded hosts from /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:18,803 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.0.100.12 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
{code}

And the node is not decommissioned. 

When I add the {FQDN} for the host name the decommissioning works successfully 
and I see following in the RM logs:

{code}
2015-07-21 11:14:43,453 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/etc/hadoop/conf.LAB-A/yarn-site.xml
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the includes file to 
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Setting 
the excludes file to /etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2015-07-21 11:14:43,456 INFO org.apache.hadoop.util.HostsFileReader: Adding 
bcpc-vm1.example.com to the list of excluded hosts from 
/etc/hadoop/conf/yarn.exclude
2015-07-21 11:14:43,456 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.100.0.11 OPERATION=refreshNodes  TARGET=AdminService RESULT=SUCCESS
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
Disallowed NodeManager nodeId: bcpc-vm1.example.com:35197 hostname: 
bcpc-vm1.example.com:35197
2015-07-21 11:14:44,198 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node bcpc-vm1.example.com:35197 as it is now DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
bcpc-vm1.example.com:35197 Node Transitioned from RUNNING to DECOMMISSIONED
2015-07-21 11:14:44,199 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Removed node bcpc-vm1.example.com:35197 cluster capacity: 
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated

2015-07-21 Thread Nathan Roberts (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635246#comment-14635246
 ] 

Nathan Roberts commented on YARN-3945:
--

My feeling is the documentation on minimum-user-limit-percent needs a rewrite. 
It makes it sound like minimum-user-limit-percent caps the amount of resource 
to say 50% if there are 2 applications submitted to the queue. This isn't the 
case (afaik). My understanding is that it tries to guarantee all active  
applications this percentage of a queue's capacity (configured or current, 
whichever is larger). Note: an active application is one that is currently 
requesting resources, a running application that has all the resources it 
needs, is NOT active.  If one application stops asking for additional 
resources, the other applications can certainly go higher than the 50%. 
user-limit-factor is what determines the absolute maximum capacity a user can 
consume within a queue. 

Basically, minimum-user-limit percent defines how fair the queue is. The lower 
the value, the sooner the queue will try to spread resources evenly across all 
users in the queue. The higher the value, the more fifo it behaves. 

> maxApplicationsPerUser is wrongly calculated
> 
>
> Key: YARN-3945
> URL: https://issues.apache.org/jira/browse/YARN-3945
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>
> maxApplicationsPerUser is currently calculated based on the formula
> {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * 
> userLimitFactor)}} but description of userlimit is 
> {quote}
> Each queue enforces a limit on the percentage of resources allocated to a 
> user at any given time, if there is demand for resources. The user limit can 
> vary between a minimum and maximum value.{color:red} The the former (the 
> minimum value) is set to this property value {color} and the latter (the 
> maximum value) depends on the number of users who have submitted 
> applications. For e.g., suppose the value of this property is 25. If two 
> users have submitted applications to a queue, no single user can use more 
> than 50% of the queue resources. If a third user submits an application, no 
> single user can use more than 33% of the queue resources. With 4 or more 
> users, no user can use more than 25% of the queues resources. A value of 100 
> implies no user limits are imposed. The default is 100. Value is specified as 
> a integer.
> {quote}
> configuration related to minimum limit should not be made used in a formula 
> to calculate max applications for a user



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3941) Proportional Preemption policy should try to avoid sending duplicate PREEMPT_CONTAINER event to scheduler

2015-07-21 Thread Sunil G (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3941:
--
Attachment: 0001-YARN-3941.patch

Hi [~leftnoteasy]
Uploading an initial version of patch.

*testExpireKill* already handles the verification of this change. Multiple 
PREEMPT_CONTAINER event will not be raised for same container for each interval 
check of ProportionalCPP.

> Proportional Preemption policy should try to avoid sending duplicate 
> PREEMPT_CONTAINER event to scheduler
> -
>
> Key: YARN-3941
> URL: https://issues.apache.org/jira/browse/YARN-3941
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3941.patch
>
>
> Currently ProportionalCPP tries to send multiple PREEMPT_CONTAINER events to 
> scheduler during every cycle of preemption check till the container is either 
> forcefully killed or preempted by AM. 
> This can be throttled from ProportionalPreemptionPolicy to avoid excess of 
> events to scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state.

2015-07-21 Thread Sumit Nigam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635202#comment-14635202
 ] 

Sumit Nigam commented on YARN-3946:
---

Hi [~varun_saxena] - 
Yes, the idea is not to only debug the issue (which you rightly mentioned, 
Admin can). I am currently on 2.6.0 and will try 2.7.0 when I can, for sure.

There are too many reasons to be able to correlate as to what may have happened 
- AM level, resource level, queue level, possibly a combination of these, etc. 
A programmatic API is also useful to apply corrective measures - say, I can 
program to submit my app to a whole new queue altogether, etc. after I notice 
it is queue level capacity issue or try reserving container, etc - all 
programatically!

Another important use case is that of attempting to submit the app (say, 
through own AM) and after a period of remaining in ACCEPTED state, reporting 
back automatically as to why the state remains so. A REST API is extremely 
useful in such a case. With this, it would be possible to to even ascertain 
when a job moves to ACCEPTED state from RUNNING state itself (RM restart, AM 
crash + restart). Again, this currently requires looking through logs / UI to 
ascertain what happened. In esp big clusters, this is indeed non-trivial.

I'd agree with Nagannarasimha that we should be able to know that without 
administrative understanding of the same. Plus, I am not working on this.


> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3814) REST API implementation for getting raw entities in TimelineReader

2015-07-21 Thread Varun Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635181#comment-14635181
 ] 

Varun Saxena commented on YARN-3814:


bq. For example, characters such as ":" are reserved characters in URL that are 
not directly allowed in queries or other parts of the URL. They should always 
be properly encoded (e.g. "%3A").
[~sjlee0], I think this tends to depend on whether the class used to construct 
URL at client is RFC-2396 compliant or not.
I am for instance in my unit tests able to send both ":" and "%3A". And both 
are interpreted as ":" at server side.
We anyways were using ":" even in ATSv1 so behavior would be same for current 
users.

> REST API implementation for getting raw entities in TimelineReader
> --
>
> Key: YARN-3814
> URL: https://issues.apache.org/jira/browse/YARN-3814
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3814-YARN-2928.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-07-21 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635152#comment-14635152
 ] 

Hadoop QA commented on YARN-3045:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 56s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 7 new or modified test files. |
| {color:green}+1{color} | javac |   7m 59s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 50s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 33s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  3s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 26s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 41s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 58s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   8m  8s | Tests passed in 
hadoop-yarn-applications-distributedshell. |
| {color:red}-1{color} | yarn tests |   6m  6s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  53m  7s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
 |
|   | 
hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
 |
|   | hadoop.yarn.server.nodemanager.containermanager.container.TestContainer |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746328/YARN-3045-YARN-2928.006.patch
 |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | YARN-2928 / eb1932d |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8594/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8594/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8594/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8594/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8594/console |


This message was automatically generated.

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, 
> YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635115#comment-14635115
 ] 

Varun Saxena commented on YARN-3874:


[~sjlee0], [~zjshen], kindly review.

Following has been done :
# Both FS reader and writer implementations have been made consistent with each 
other. Classes have not yet been combined as no final decision was taken on it 
in YARN-3051. We can decide if we will combine the classes or not.
# Moved some of the code common to reader and writer impls into 
{{TimelineStorageUtils}}
# Writer impl will now write app flow mapping file.
# As [~zjshen] said in YARN-3051, used {{FileSystem}} class so that it can be 
used for HDFS as well.
# Entity files would now have created time in the file name. This will be used 
to limit the entities returned and do filtering based on created time.
# Modified times as of now, wont be part of file name as we will have to 
consistently change filenames if we do that. If required, that can be included 
as well. We can use rename operation. Thoughts ?
# Writer would choose the file based on entity ID if created time is not given 
in request. If created time is 0 or negative and file for an entity id is being 
written for the first time, an error will be sent back.
# App flow mapping will be cached in reader and writer on start. And entries 
will be added in reader whenever mapping has to be queried from file


> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Attachment: YARN-3874-YARN-2928.01.patch

> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Attachment: (was: YARN-3874-YARN-2928.01.patch)

> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Attachment: YARN-3874-YARN-2928.01.patch

> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3874-YARN-2928.01.patch
>
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-3874) Combine FS Reader and Writer Implementations

2015-07-21 Thread Varun Saxena (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3874:
---
Attachment: (was: YARN-3874-YARN-2928.01.patch)

> Combine FS Reader and Writer Implementations
> 
>
> Key: YARN-3874
> URL: https://issues.apache.org/jira/browse/YARN-3874
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> Combine FS Reader and Writer Implementations and make them consistent with 
> each other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

1 2 >

1 - 100 of 113 matches

Mail list logo