[jira] [Commented] (YARN-3645) ResourceManager can't start success if attribute value of "aclSubmitApps" is null in fair-scheduler.xml
[ https://issues.apache.org/jira/browse/YARN-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547601#comment-14547601 ] Mohammad Shahid Khan commented on YARN-3645: Loading with an invalid node configuration is not feasible. But instead of throwing the NullPointerException, we can throw an *AllocationConfigurationException* with a proper message so that the reason for the failure can be identified easily.
{code}
if ("aclAdministerApps".equals(field.getTagName())) {
  Text aclText = (Text) field.getFirstChild();
  if (aclText == null) {
    throw new AllocationConfigurationException(
        "Invalid admin ACL configuration in allocation file");
  }
  acls.put(QueueACL.ADMINISTER_QUEUE, new AccessControlList(aclText.getData()));
}
{code}
> ResourceManager can't start success if attribute value of "aclSubmitApps" is > null in fair-scheduler.xml > > > Key: YARN-3645 > URL: https://issues.apache.org/jira/browse/YARN-3645 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.5.2 >Reporter: zhoulinlin > > The "aclSubmitApps" is configured in fair-scheduler.xml like below: > > > > The resourcemanager log: > 2015-05-14 12:59:48,623 INFO org.apache.hadoop.service.AbstractService: > Service ResourceManager failed in state INITED; cause: > org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed > to initialize FairScheduler > org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed > to initialize FairScheduler > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:493) > at > 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:920) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:240) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159) > Caused by: java.io.IOException: Failed to initialize FairScheduler > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1301) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1318) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > ... 7 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:458) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:337) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1299) > ... 
9 more > 2015-05-14 12:59:48,623 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning > to standby state > 2015-05-14 12:59:48,623 INFO > com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory: plugin > transitionToStandbyIn > 2015-05-14 12:59:48,623 WARN org.apache.hadoop.service.AbstractService: When > stopping the service ResourceManager : java.lang.NullPointerException > java.lang.NullPointerException > at > com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory.transitionToStandbyIn(YarnPlatformPluginProxyFactory.java:71) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:997) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1058) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) > at > org.apache.hadoop.yarn.server.resourcemana
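The null-check fix proposed in the comment above can be exercised with a minimal, self-contained sketch. The class `AclNullCheckDemo`, its `readAcl` helper, and the local `AllocationConfigurationException` stand-in are hypothetical names for illustration (the real exception class lives in the FairScheduler code); only JDK DOM classes are used:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class AclNullCheckDemo {
    // Hypothetical stand-in for Hadoop's AllocationConfigurationException.
    static class AllocationConfigurationException extends Exception {
        AllocationConfigurationException(String msg) { super(msg); }
    }

    // Return the ACL text of an element, rejecting empty tags with a clear
    // error instead of letting a NullPointerException escape.
    static String readAcl(Element field) throws AllocationConfigurationException {
        Text aclText = (Text) field.getFirstChild();
        if (aclText == null) {
            throw new AllocationConfigurationException(
                "Invalid ACL configuration in allocation file: tag <"
                + field.getTagName() + "> has no value");
        }
        return aclText.getData();
    }

    public static void main(String[] args) throws Exception {
        // An empty <aclAdministerApps/> tag reproduces the reported scenario.
        String xml = "<queue><aclAdministerApps/></queue>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element acl = (Element) doc.getElementsByTagName("aclAdministerApps").item(0);
        try {
            readAcl(acl);
            System.out.println("no exception");
        } catch (AllocationConfigurationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With this check, an empty ACL tag fails allocation-file loading with an actionable message rather than the NullPointerException from the stack trace above.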
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547562#comment-14547562 ] Naganarasimha G R commented on YARN-2729: - Thanks [~vinodkv] for replying, bq. I think once we start marking this script-based provider feature as public, the expected output from the script will automatically become a public interface unless we explicitly say no. We should start thinking about this now to avoid uncertainty in the future? True, it's better to think about it now for both the script-based (YARN-2729) and config-based (YARN-2923) providers if we are making them public. My initial thought is {code} NODE_LABELS=|[,Labels] where Label Type = Partition, Constraint and default if not specified can be Partition. {code} Going further, I think distributed labels will be more suitable for constraints/attributes [YARN-3409], so we can think of having {{Constraint}} as the default too. Also, we need not specify whether a partition is exclusive or non-exclusive; it is not significant from the NM side, as the exclusivity of partition labels is already specified when they are added to the cluster labels set in the RM. One more suggestion: in the NodeLabel object, can we think of having an enum instead of isExclusive? The enum could have ExclusivePartition, NonExclusivePartition, (Constraint in future) and so on. bq. Isn't AbstractNodeLabelsProvider a good place to do these steps? Well, AbstractNodeLabelsProvider is currently applicable only to the whitelisted providers (config and script), and its purpose was to remove duplicate code related to the TimerTask and related configs. So is your suggestion to expose AbstractNodeLabelsProvider as a public interface? Or can we think of having an intermediate manager class with configurations for the timer requirement, and leave the NodeLabelsProvider interface as is? 
> Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
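The enum suggestion and the script-output format discussed above can be sketched in plain Java. `NodeLabelType`, `parseLabels`, and the `NODE_LABELS=label[,label]` line shape are assumptions based on the proposal in this comment, not a finalized public interface:

```java
import java.util.ArrayList;
import java.util.List;

public class NodeLabelTypeDemo {
    // Hypothetical enum replacing the boolean isExclusive flag, as suggested
    // in the comment; more types (e.g. CONSTRAINT variants) could follow.
    enum NodeLabelType { EXCLUSIVE_PARTITION, NON_EXCLUSIVE_PARTITION, CONSTRAINT }

    // Sketch of parsing one line of script output such as:
    //   NODE_LABELS=gpu,ssd
    // where the label type, if unspecified, would default to a partition.
    static List<String> parseLabels(String line) {
        String prefix = "NODE_LABELS=";
        if (!line.startsWith(prefix)) {
            throw new IllegalArgumentException("not a NODE_LABELS line: " + line);
        }
        List<String> labels = new ArrayList<>();
        for (String l : line.substring(prefix.length()).split(",")) {
            if (!l.trim().isEmpty()) {
                labels.add(l.trim());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parseLabels("NODE_LABELS=gpu,ssd")); // [gpu, ssd]
    }
}
```

An enum keeps the wire/config format open to future label kinds without adding one boolean per distinction.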
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547544#comment-14547544 ] Raju Bairishetti commented on YARN-3644: We can have a new config like NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES. Based on this config value, the NM takes a decision on shutdown; this way we can honour the existing behaviour as well. I will provide a patch shortly. I am not able to assign the issue to myself. Can anyone help me with assigning? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
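The config-gated shutdown decision proposed above might look like the following sketch. The property key and the `shouldShutDown` helper are hypothetical, chosen only to mirror the suggested NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES config; the default preserves today's behaviour of shutting down:

```java
import java.util.Properties;

public class NmShutdownDecision {
    // Hypothetical config key, not an actual YARN property: when true, the
    // NM stays alive after exhausting RM connection retries instead of
    // shutting itself down.
    static final String KEEP_ALIVE_KEY =
        "yarn.nodemanager.alive-on-rm-connection-failures";

    static boolean shouldShutDown(Properties conf) {
        boolean keepAlive =
            Boolean.parseBoolean(conf.getProperty(KEEP_ALIVE_KEY, "false"));
        // Default false keeps the existing behaviour: shut down on failure.
        return !keepAlive;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldShutDown(conf)); // true: existing behaviour
        conf.setProperty(KEEP_ALIVE_KEY, "true");
        System.out.println(shouldShutDown(conf)); // false: NM stays alive
    }
}
```

Keying the decision off a config means large clusters can opt out of mass NM shutdowns during long RM maintenance windows without changing the default.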
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547537#comment-14547537 ] Raju Bairishetti commented on YARN-3646: [~vinodkv] I will provide a patch shortly. I am not able to assign myself. Can anyone help me in assigning myself? > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547529#comment-14547529 ] Raju Bairishetti commented on YARN-3646: bq. Setting RetryPolicies.RETRY_FOREVER for exceptionToPolicyMap as default policy is not sufficient, but also RetryPolicies.RetryForever.shouldRetry() should check for Connect exceptions and handle it. Otherwise shouldRetry always return RetryAction.RETRY action. Do we need to catch the exception in shouldRetry if we have a separate exceptionToPolicyMap which contains only the connection-exception entry (like exceptionToPolicyMap.put(connectionException, FOREVER policy))? It seems we do not even require exceptionToPolicyMap for the FOREVER policy if we catch the exception in the shouldRetry method. Thoughts? > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
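The behaviour discussed in this thread, retrying forever only for connection failures, can be illustrated with a simplified, self-contained sketch. This is not Hadoop's actual `RetryPolicies`/`RMProxy` code; `shouldRetry` here is a stand-in for the proposed exception-aware check:

```java
import java.net.ConnectException;

public class RetryPolicySketch {
    // Simplified sketch of the proposal: retry forever ONLY for connection
    // failures; fail fast for everything else (e.g. an
    // ApplicationNotFoundException, modeled here by a generic exception).
    static boolean shouldRetry(Exception e) {
        return e instanceof ConnectException;
    }

    public static void main(String[] args) {
        // Connection failure: keep retrying (RM may come back).
        System.out.println(shouldRetry(new ConnectException("RM down"))); // true
        // Application-level error: retrying can never succeed, so give up.
        System.out.println(shouldRetry(new IllegalStateException("app not found"))); // false
    }
}
```

The point of the sketch: whether done via an exceptionToPolicyMap entry or inside shouldRetry itself, the FOREVER policy must distinguish transport failures from application-level errors, or clients hang exactly as described in this issue.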
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547506#comment-14547506 ] Srikanth Sundarrajan commented on YARN-3644: [~vinodkv], YARN-3644 is independent of this. In our setup we ran into this before we ran into YARN-3646. The NM gives up trying after about 30-odd minutes (default settings) before *attempting* to shut itself down. Is there an issue if this wait time is much (infinitely) longer (for both HA & non-HA setups)? An orthogonal issue is that when the NM attempts to shut itself down, it doesn't actually go down and lingers around for days without actually accepting any containers, unless restarted (will file another issue for this). > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3651: -- Assignee: (was: Jian He) > Tracking url in ApplicationCLI wrong for running application > > > Key: YARN-3651 > URL: https://issues.apache.org/jira/browse/YARN-3651 > Project: Hadoop YARN > Issue Type: Bug > Components: applications, resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Priority: Minor > > Application URL in Application CLI wrong > Steps to reproduce > == > 1. Start HA setup insecure mode > 2.Configure HTTPS_ONLY > 3.Submit application to cluster > 4.Execute command ./yarn application -list > 5.Observer tracking URL shown > {code} > 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History > server at /:45034 > Total number of applications (application-types: [] and states: [SUBMITTED, > ACCEPTED, RUNNING]):1 > Application-Id --- Tracking-URL > application_1431672734347_0003 *http://host-10-19-92-117:13013* > {code} > *Expected* > https://:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3651: - Assignee: Jian He > Tracking url in ApplicationCLI wrong for running application > > > Key: YARN-3651 > URL: https://issues.apache.org/jira/browse/YARN-3651 > Project: Hadoop YARN > Issue Type: Bug > Components: applications, resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Jian He >Priority: Minor > > Application URL in Application CLI wrong > Steps to reproduce > == > 1. Start HA setup insecure mode > 2.Configure HTTPS_ONLY > 3.Submit application to cluster > 4.Execute command ./yarn application -list > 5.Observer tracking URL shown > {code} > 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History > server at /:45034 > Total number of applications (application-types: [] and states: [SUBMITTED, > ACCEPTED, RUNNING]):1 > Application-Id --- Tracking-URL > application_1431672734347_0003 *http://host-10-19-92-117:13013* > {code} > *Expected* > https://:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547496#comment-14547496 ] sandflee commented on YARN-3668: I don't want the service to be terminated if the AM goes down; YARN will also restart the AM until it is launched successfully. Through external means we could detect this situation and replace the AM jar with a new one. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547494#comment-14547494 ] Vinod Kumar Vavilapalli commented on YARN-3668: --- So you don't want the service to be terminated even if the ApplicationMaster goes down and will never get launched again? > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547492#comment-14547492 ] Vinod Kumar Vavilapalli commented on YARN-3644: --- Actually, for all the above cases, we want NMs to just continue for a while without losing any work and finally give up after some time. The only difference between a HA vs non-HA setup is that in HA setup NMs will just wait many times over trying each of the RMs. Getting into the business of detecting and acting on partitions is best left up to admins/tools. > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547489#comment-14547489 ] Vinod Kumar Vavilapalli commented on YARN-3644: --- bq. In large clusters, if RM is down for maintenance for longer period, all the NMs shuts themselves down, requiring additional work to bring up the NMs. bq. Right now, NM shuts down itself only in case of connection failures. NM ignores all other kinds of exceptions and errors while sending heartbeats. This path usually shouldn't happen at all as the RMProxy layer is supposed to retry _enough_, except perhaps for the bug at YARN-3646. We eventually want to give up if the retry layer itself gives up. Given that, is this JIRA simply a dup of YARN-3646? /cc [~jianhe] [~xgong] > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3646: -- Target Version/s: 2.8.0, 2.7.1 [~raju.bairishetti], would you like to provide a patch? /cc [~xgong], [~jianhe] who wrote most of this code. Targeting 2.7.1/2.8.0, but more likely one is 2.8.0. Can see if we can get it into earlier releases too depending on their schedule. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547480#comment-14547480 ] Vinod Kumar Vavilapalli commented on YARN-3480: --- bq. I think we need to have a lower limit on the failure-validity interval to avoid situations like this. Filed YARN-3669. > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3669) Attempt-failures validity interval should have a global admin configurable lower limit
Vinod Kumar Vavilapalli created YARN-3669: - Summary: Attempt-failures validity interval should have a global admin configurable lower limit Key: YARN-3669 URL: https://issues.apache.org/jira/browse/YARN-3669 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Found this while reviewing YARN-3480. bq. When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. I think we need to have a lower limit on the failure-validity interval to avoid situations like this. Having this will avoid pardoning too many failures in too short a duration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
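The proposed lower limit would effectively clamp the per-application validity interval to an admin-configured floor. The sketch below is hypothetical (`effectiveInterval` and its parameters are not actual YARN APIs or configuration keys), but shows the intended semantics:

```java
public class ValidityIntervalClamp {
    // Clamp an application's requested attempt-failures validity interval to
    // a cluster-wide admin floor, so a tiny per-app value cannot pardon
    // too many failures in too short a duration.
    static long effectiveInterval(long appRequestedMs, long adminFloorMs) {
        return Math.max(appRequestedMs, adminFloorMs);
    }

    public static void main(String[] args) {
        // App asks for 1s, admin floor is 60s: the floor wins.
        System.out.println(effectiveInterval(1000L, 60000L)); // 60000
        // App asks for 5min, above the floor: the app's value is kept.
        System.out.println(effectiveInterval(300000L, 60000L)); // 300000
    }
}
```

A global floor also bounds how many failed attempts the RMStateStore can accumulate, which is the recovery-time concern raised in YARN-3480.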
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547478#comment-14547478 ] Weiwei Yang commented on YARN-3526: --- Thanks [~xgong] > ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster > - > > Key: YARN-3526 > URL: https://issues.apache.org/jira/browse/YARN-3526 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.6.0 > Environment: Red Hat Enterprise Linux Server 6.4 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > Labels: BB2015-05-TBR > Fix For: 2.7.1 > > Attachments: YARN-3526.001.patch, YARN-3526.002.patch > > > On a QJM HA cluster, view RM web UI to track job status, it shows > This is standby RM. Redirecting to the current active RM: > http://:8088/proxy/application_1427338037905_0008/mapreduce > it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547477#comment-14547477 ] Weiwei Yang commented on YARN-3526: --- Thanks [~xgong] > ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster > - > > Key: YARN-3526 > URL: https://issues.apache.org/jira/browse/YARN-3526 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.6.0 > Environment: Red Hat Enterprise Linux Server 6.4 >Reporter: Weiwei Yang >Assignee: Weiwei Yang > Labels: BB2015-05-TBR > Fix For: 2.7.1 > > Attachments: YARN-3526.001.patch, YARN-3526.002.patch > > > On a QJM HA cluster, view RM web UI to track job status, it shows > This is standby RM. Redirecting to the current active RM: > http://:8088/proxy/application_1427338037905_0008/mapreduce > it refreshes every 3 sec but never going to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547473#comment-14547473 ] Vinod Kumar Vavilapalli commented on YARN-3480: --- bq. we might need keep failed attempts those are in validity window, so it is the minimum number of attempts that we should keep. So when apps specify how much they want the platform to remember, we need consider it as another minimum number of attempts that we should keep. What I proposed is a global limit on attempts-to-remember that can be overridden to a lower value by individual apps if needed. So, yes, like you are saying, this global limit should usually be such that RM can _at least_ remember attempts that can happen in all apps' one failure-validity-interval. bq. It makes recovery more fast, and does not lose any attempts' history. However it will makes recovery process a little more complicated. The former method(removing attempts) is more concise, and just likes logrotate, if we could accept the absence of some attempts' history information, I would prefer it. Without doing this, we will unnecessarily be forcing apps to lose history simply because the platform cannot recover quickly enough. Thinking more, how about we only have (limits + asynchronous recovery) for services, once YARN-1039 goes in? Non-service apps anyway are not expected to have a lot of app-attempts. 
> Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
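The "logrotate"-style pruning discussed in this thread can be sketched as a bounded history: keep at most N attempt records and evict the oldest when a new one arrives. This is an illustrative model only (the class name is hypothetical); in the real RM the eviction would also have to remove the record from the RMStateStore.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: retain only the newest maxStoredAttempts attempt ids,
// dropping the oldest on overflow, like logrotate for attempt history.
public class AttemptHistory {
    private final int maxStoredAttempts;
    private final Deque<String> attemptIds = new ArrayDeque<>();

    public AttemptHistory(int maxStoredAttempts) {
        this.maxStoredAttempts = maxStoredAttempts;
    }

    /** Records a new attempt; returns the evicted (oldest) attempt id, or null. */
    public String recordAttempt(String attemptId) {
        String evicted = null;
        if (attemptIds.size() == maxStoredAttempts) {
            evicted = attemptIds.removeFirst(); // oldest attempt's history is lost
        }
        attemptIds.addLast(attemptId);
        return evicted;
    }

    public int size() {
        return attemptIds.size();
    }
}
```

The trade-off debated above is visible here: recovery only ever replays at most `maxStoredAttempts` records, at the cost of losing the evicted attempts' history.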
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547468#comment-14547468 ] Vinod Kumar Vavilapalli commented on YARN-2729: --- bq. I think the format expected from the command should be more structured. Specifically as we expect more per-label attributes in line with YARN-3565. bq. So IMHO if there is plan to make this interface public & stable then would be better do these changes now itself if not it would better done after requirement for constraint labels, so that more clarity on structure would be there? Tan, Wangda and you can share your opinion on this, based on it will do the modifications. I think once we start marking this script-based provider feature as public, the expected output from the script will automatically become a public interface unless we explicitly say no. We should start thinking about this now to avoid uncertainty in the future? bq. These needs to be done irrespective of the label provider (system or user's) hence kept it in NodeStatusUpdaterImpl , but if req to be moved out then we need to bring in some intermediate manager(/helper/delegator) class between NodeStatusUpdaterImpl and NodeLabelsProvider. Isn't AbstractNodeLabelsProvider a good place to do these steps? 
> Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547461#comment-14547461 ] Xianyin Xin commented on YARN-3547: --- Agree, [~leftnoteasy]. Now we have YARN-3547.004.patch using {{SchedulerApplicationAttempt.getAppAttemptResourceUsage().getPending()}} and YARN-3547.005.patch using {{getDemand() - getResourceUsage()}}. > FairScheduler: Apps that have no resource demand should not participate > scheduling > -- > > Key: YARN-3547 > URL: https://issues.apache.org/jira/browse/YARN-3547 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Xianyin Xin >Assignee: Xianyin Xin > Attachments: YARN-3547.001.patch, YARN-3547.002.patch, > YARN-3547.003.patch, YARN-3547.004.patch, YARN-3547.005.patch > > > At present, all of the 'running' apps participate the scheduling process, > however, most of them may have no resource demand on a production cluster, as > the app's status is running other than waiting for resource at the most of > the app's lifetime. It's not a wise way we sort all the 'running' apps and > try to fulfill them, especially on a large-scale cluster which has heavy > scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
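The two patch variants compared above differ only in where "pending" comes from: a tracked pending counter versus computing demand minus usage. A minimal sketch of the second approach, filtering out no-demand apps before the sort-and-assign pass (Resource is reduced to a single long here for illustration; all names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: skip apps with zero pending demand before scheduling.
public class PendingDemandFilter {
    static final class App {
        final String id;
        final long demand; // total resource the app wants
        final long usage;  // resource already allocated to it
        App(String id, long demand, long usage) {
            this.id = id; this.demand = demand; this.usage = usage;
        }
        // Mirrors the getDemand() - getResourceUsage() approach in the thread.
        long pending() {
            return Math.max(0, demand - usage);
        }
    }

    /** Keeps only apps that still want resources; the rest need not be sorted. */
    static List<App> appsWithDemand(List<App> runningApps) {
        return runningApps.stream()
            .filter(a -> a.pending() > 0)
            .collect(Collectors.toList());
    }

    static int demoCount() {
        return appsWithDemand(Arrays.asList(
            new App("app1", 8_192, 8_192), // fully satisfied: skipped
            new App("app2", 8_192, 2_048), // still pending 6144: kept
            new App("app3", 0, 0)          // no demand at all: skipped
        )).size();
    }
}
```

On a large cluster where most running apps are idle, this filter shrinks the candidate set before the comparatively expensive sorting step.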
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547456#comment-14547456 ] Vinod Kumar Vavilapalli commented on YARN-3561: --- I see you filed HADOOP-11989. Assuming _that_ is the root-cause, we can close this as a duplicate. > Non-AM Containers continue to run even after AM is stopped > -- > > Key: YARN-3561 > URL: https://issues.apache.org/jira/browse/YARN-3561 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.0 > Environment: debian 7 >Reporter: Gour Saha >Priority: Critical > Attachments: app0001.zip, application_1431771946377_0001.zip > > > Non-AM containers continue to run even after application is stopped. This > occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a > Hadoop 2.6 deployment. > Following are the NM logs from 2 different nodes: > *host-07* - where Slider AM was running > *host-03* - where Storm NIMBUS container was running. > *Note:* The logs are partial, starting with the time when the relevant Slider > AM and NIMBUS containers were allocated, till the time when the Slider AM was > stopped. Also, the large number of "Memory usage" log lines were removed > keeping only a few starts and ends of every segment. 
> *NM log from host-07 where Slider AM container was running:* > {noformat} > 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for > container_1428575950531_0020_02_01 > 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - > Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for > container_1428575950531_0021_01_01 by user yarn > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new > application reference for app application_1428575950531_0021 > 2015-04-29 00:41:10,323 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from NEW to INITING > 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger > (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 > OPERATION=Start Container Request TARGET=ContainerManageImpl > RESULT=SUCCESS APPID=application_1428575950531_0021 > CONTAINERID=container_1428575950531_0021_01_01 > 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService > (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root > Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: > [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple > users. > 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as > -1. The log rolling mornitoring interval is disabled. The logs will be > aggregated after this application is finished. 
> 2015-04-29 00:41:10,351 INFO application.Application > (ApplicationImpl.java:transition(304)) - Adding > container_1428575950531_0021_01_01 to application > application_1428575950531_0021 > 2015-04-29 00:41:10,352 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from INITING to RUNNING > 2015-04-29 00:41:10,356 INFO container.Container > (ContainerImpl.java:handle(999)) - Container > container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING > 2015-04-29 00:41:10,357 INFO containermanager.AuxServices > (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId > application_1428575950531_0021 > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedRes
[jira] [Commented] (YARN-3652) A SchedulerMetrics may be needed for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547435#comment-14547435 ] Xianyin Xin commented on YARN-3652: --- Thanks [~vinodkv], that's very helpful. > A SchedulerMetrics may be needed for evaluating the scheduler's performance > - > > Key: YARN-3652 > URL: https://issues.apache.org/jira/browse/YARN-3652 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Reporter: Xianyin Xin > > As discussed in YARN-3630, a {{SchedulerMetrics}} may be needed for evaluating > the scheduler's performance. The performance indexes include #events waiting > to be handled by the scheduler, the throughput, the scheduling delay and/or > other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
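A minimal sketch of the indicators the description lists: a pending-event gauge, a handled-event counter (for throughput), and a cumulative handling delay. The class and method names are illustrative, not an actual YARN metrics class.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a SchedulerMetrics: tracks events waiting in the
// dispatcher queue, events handled, and total handling delay, from which a
// mean scheduling delay can be derived.
public class SchedulerMetricsSketch {
    private final AtomicLong pendingEvents = new AtomicLong();
    private final AtomicLong handledEvents = new AtomicLong();
    private final AtomicLong totalDelayMs = new AtomicLong();

    public void eventQueued() {
        pendingEvents.incrementAndGet();
    }

    public void eventHandled(long delayMs) {
        pendingEvents.decrementAndGet();
        handledEvents.incrementAndGet();
        totalDelayMs.addAndGet(delayMs);
    }

    /** #events still waiting to be handled by the scheduler. */
    public long getPendingEvents() {
        return pendingEvents.get();
    }

    /** Mean scheduling delay per handled event, in milliseconds. */
    public double getAvgDelayMs() {
        long n = handledEvents.get();
        return n == 0 ? 0.0 : (double) totalDelayMs.get() / n;
    }
}
```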
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547434#comment-14547434 ] sandflee commented on YARN-3668: Seems not enough; if the AM crashes on launch because of an AM bug, the application will eventually fail. I think that is a problem of the AM, not the application, and YARN should handle it. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547424#comment-14547424 ] Xuan Gong commented on YARN-3668: - bq. If am crashed and reaches am max fail times, applications are killed. If we set am max fail times to a big one or unlimit am max fail times, RM may have too many AppAttempt to store in memory and RMStateStore, Aren't YARN-611 and YARN-614 enough to cover the cases you described? > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2729: Attachment: YARN-2729.20150517-1.patch Hi [~wangda] # Rebased the patch on top of YARN-3565 # Moved the common code that was earlier here to YARN-2923, as that JIRA will go in first > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, > YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, > YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, > YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch, > YARN-2729.20150517-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547411#comment-14547411 ] Naganarasimha G R commented on YARN-3565: - Thanks [~aw] for looking into it. > NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object > instead of String > - > > Key: YARN-3565 > URL: https://issues.apache.org/jira/browse/YARN-3565 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R >Priority: Blocker > Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, > YARN-3565.20150516-1.patch > > > Now NM HB/Register uses Set, it will be hard to add new fields if we > want to support specifying NodeLabel type such as exclusivity/constraints, > etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547316#comment-14547316 ] Allen Wittenauer commented on YARN-3565: bq. I think currently white space is getting calculated on the diff output rather just the modified lines only (diff has some lines before and after the modifications). That's not how it works. But I'll look to see if there is an off-by-one error here. > NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object > instead of String > - > > Key: YARN-3565 > URL: https://issues.apache.org/jira/browse/YARN-3565 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R >Priority: Blocker > Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, > YARN-3565.20150516-1.patch > > > Now NM HB/Register uses Set, it will be hard to add new fields if we > want to support specifying NodeLabel type such as exclusivity/constraints, > etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547304#comment-14547304 ] Li Lu commented on YARN-3051: - Hi [~varun_saxena], I think the new patch name pattern should be, YARN-3051-YARN-2928.***.patch. Would you please try that again? Thanks! > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, > YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547299#comment-14547299 ] Hadoop QA commented on YARN-3051: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732621/YARN-3051.wip.02.YARN-2928.patch | | Optional Tests | shellcheck javadoc javac unit findbugs checkstyle | | git revision | trunk / cab0dad | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7963/console | This message was automatically generated. > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, > YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3133) Move NodeHealthStatus and associated protobuf to hadoop common
[ https://issues.apache.org/jira/browse/YARN-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3133: --- Description: Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs to use it. (was: Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs to use it,) > Move NodeHealthStatus and associated protobuf to hadoop common > -- > > Key: YARN-3133 > URL: https://issues.apache.org/jira/browse/YARN-3133 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Varun Saxena >Assignee: Varun Saxena > > Move NodeHealthStatus and associated protobuf to hadoop common as HDFS needs > to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository
[ https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3339: --- Assignee: Ravindra Kumar Naik > TestDockerContainerExecutor should pull a single image and not the entire > centos repository > --- > > Key: YARN-3339 > URL: https://issues.apache.org/jira/browse/YARN-3339 > Project: Hadoop YARN > Issue Type: Test > Components: test >Affects Versions: 2.6.0 > Environment: Linux >Reporter: Ravindra Kumar Naik >Assignee: Ravindra Kumar Naik >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3339-branch-2.6.0.001.patch, > YARN-3339-trunk.001.patch > > > TestDockerContainerExecutor test pulls the entire centos repository which is > time consuming. > Pulling a specific image (e.g. centos7) will be sufficient to run the test > successfully and will save time -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Target Version/s: 2.8.0 Affects Version/s: 2.7.0 > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: (was: YARN-3560.patch) > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: YARN-3560.patch Please review the attached patch. > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3560: --- Attachment: YARN-3560.patch Please review the attached patch > Not able to navigate to the cluster from tracking url (proxy) generated after > submission of job > --- > > Key: YARN-3560 > URL: https://issues.apache.org/jira/browse/YARN-3560 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anushri >Priority: Minor > Attachments: YARN-3560.patch > > > a standalone web proxy server is enabled in the cluster > when a job is submitted the url generated contains proxy > track this url > in the web page , if we try to navigate to the cluster links [about. > applications, or scheduler] it gets redirected to some default port instead > of actual RM web port configured > as such it throws "webpage not available" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547262#comment-14547262 ] Chackaravarthy commented on YARN-3561: -- Improper kill command construction (specific to this environment) was the issue. I tested by changing the Shell.java class to construct the kill command as follows (including the two hyphens): {noformat} kill -signalNo -- - {noformat} It works fine with this change on debian 7. > Non-AM Containers continue to run even after AM is stopped > -- > > Key: YARN-3561 > URL: https://issues.apache.org/jira/browse/YARN-3561 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.0 > Environment: debian 7 >Reporter: Gour Saha >Priority: Critical > Attachments: app0001.zip, application_1431771946377_0001.zip > > > Non-AM containers continue to run even after application is stopped. This > occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a > Hadoop 2.6 deployment. > Following are the NM logs from 2 different nodes: > *host-07* - where Slider AM was running > *host-03* - where Storm NIMBUS container was running. > *Note:* The logs are partial, starting with the time when the relevant Slider > AM and NIMBUS containers were allocated, till the time when the Slider AM was > stopped. Also, the large number of "Memory usage" log lines were removed > keeping only a few starts and ends of every segment. 
> *NM log from host-07 where Slider AM container was running:* > {noformat} > 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for > container_1428575950531_0020_02_01 > 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - > Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for > container_1428575950531_0021_01_01 by user yarn > 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new > application reference for app application_1428575950531_0021 > 2015-04-29 00:41:10,323 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from NEW to INITING > 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger > (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 > OPERATION=Start Container Request TARGET=ContainerManageImpl > RESULT=SUCCESS APPID=application_1428575950531_0021 > CONTAINERID=container_1428575950531_0021_01_01 > 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService > (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root > Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: > [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple > users. > 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as > -1. The log rolling mornitoring interval is disabled. The logs will be > aggregated after this application is finished. 
> 2015-04-29 00:41:10,351 INFO application.Application > (ApplicationImpl.java:transition(304)) - Adding > container_1428575950531_0021_01_01 to application > application_1428575950531_0021 > 2015-04-29 00:41:10,352 INFO application.Application > (ApplicationImpl.java:handle(464)) - Application > application_1428575950531_0021 transitioned from INITING to RUNNING > 2015-04-29 00:41:10,356 INFO container.Container > (ContainerImpl.java:handle(999)) - Container > container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING > 2015-04-29 00:41:10,357 INFO containermanager.AuxServices > (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId > application_1428575950531_0021 > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar > transitioned from INIT to DOWNLOADING > 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > hdfs://zsexp/user/yarn/.slider/cluster/storm1/tm
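The kill command fix described in the comment above (the two hyphens) can be sketched as follows. The `--` marks the end of options, so a negative argument such as `-<pid>` (a process-group id) is not misparsed as a signal flag. The class and method names are illustrative; this only mirrors what such a change in Shell.java would build, it is not the actual Hadoop code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the corrected kill command construction: the "--"
// end-of-options marker lets the negative process-group argument through.
public class KillCommandBuilder {
    /** Builds e.g. ["kill", "-15", "--", "-12345"] to signal process group 12345. */
    static List<String> buildKillProcessGroup(int signal, long pgid) {
        return Arrays.asList("kill", "-" + signal, "--", "-" + pgid);
    }
}
```

Without the `--`, some shells and kill implementations treat `-12345` as an (invalid) option, so the signal never reaches the container's process group and its non-AM children keep running.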
[jira] [Updated] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2923: Attachment: YARN-2923.20150517-1.patch
Hi [~wangda], attaching a WIP patch (it may need some more static checks after the jenkins run):
# rebased the patch on top of YARN-3565
# moved the common code that was earlier in YARN-2729 here, as this jira will go in first
# corrected most of [~vinodkv]'s comments on YARN-2729; still pending are:
* I think the format expected from the command should be more structured, specifically as we expect more per-label attributes in line with YARN-3565.
* Not caused by your patch but worth fixing here: NodeStatusUpdaterImpl shouldn't worry about invalid label-set, previous-valid-labels, and label validation. You should move all that functionality into NodeLabelsProvider.
* Can you add the documentation for setting this up too?
For these I wanted to discuss with you before working on them.
> Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of Distributed Node Labels configuration we need to support Node > labels to be configured in Yarn-site.xml. And on modification of Node Labels > configuration in yarn-site.xml, NM should be able to get modified Node labels > from this NodeLabelsprovider service without NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547168#comment-14547168 ] sandflee commented on YARN-3668: If the AM crashes and reaches the AM max fail count, the application is killed. If we set the AM max fail count to a large value or make it unlimited, the RM may have too many AppAttempts to store in memory and in the RMStateStore; YARN-3480 could resolve this problem by storing only a limited number of AppAttempts. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
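sandflee's point about unbounded AM retries bloating the RMStateStore can be sketched with a simple bounded history: keep only the most recent N attempt records and evict the oldest on overflow. This is an illustration of the idea behind YARN-3480, not its actual implementation; the class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a fixed-capacity attempt history so that
// unlimited AM retries cannot grow RM state without bound.
class BoundedAttemptHistory {
    private final int maxKept;
    private final Deque<String> attemptIds = new ArrayDeque<>();

    BoundedAttemptHistory(int maxKept) {
        this.maxKept = maxKept;
    }

    /** Record a new attempt, evicting the oldest once over the cap. */
    synchronized void add(String attemptId) {
        attemptIds.addLast(attemptId);
        while (attemptIds.size() > maxKept) {
            // oldest attempt record dropped from the (in-memory) store
            attemptIds.removeFirst();
        }
    }

    synchronized int size() {
        return attemptIds.size();
    }
}
```

With a cap of, say, 3, recording any number of attempts never retains more than the 3 most recent records, which is the memory/state-store bound the comment is after.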
[jira] [Commented] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
[ https://issues.apache.org/jira/browse/YARN-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547165#comment-14547165 ] sandflee commented on YARN-3668: If all RMs crash, all running containers will be killed; YARN-3644 discusses this. > Long run service shouldn't be killed even if Yarn crashed > - > > Key: YARN-3668 > URL: https://issues.apache.org/jira/browse/YARN-3668 > Project: Hadoop YARN > Issue Type: Wish >Reporter: sandflee > > For long running service, it shouldn't be killed even if all yarn component > crashed, with RM work preserving and NM restart, yarn could take over > applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3668) Long run service shouldn't be killed even if Yarn crashed
sandflee created YARN-3668: -- Summary: Long run service shouldn't be killed even if Yarn crashed Key: YARN-3668 URL: https://issues.apache.org/jira/browse/YARN-3668 Project: Hadoop YARN Issue Type: Wish Reporter: sandflee For long running service, it shouldn't be killed even if all yarn component crashed, with RM work preserving and NM restart, yarn could take over applications again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547159#comment-14547159 ] sandflee commented on YARN-3644: In our cluster we also face this problem. I'd like to work on this if possible; more comments are welcome!
> Node manager shuts down if unable to connect with RM
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
> } catch (ConnectException e) {
>   //catch and throw the exception if tried MAX wait time to connect RM
>   dispatcher.getEventHandler().handle(
>       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>   throw new YarnRuntimeException(e);
> {code}
> In large clusters, if RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
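The alternative the issue implies, retrying with capped backoff instead of shutting the NM down on the first exhausted connect attempt, can be sketched as below. This is a hedged illustration only, with hypothetical helper names; it is not the RMProxy retry implementation, and the constants are placeholders for what would really come from configuration.

```java
import java.io.IOException;

// Hypothetical sketch: retry a connect with capped exponential backoff
// instead of immediately triggering NodeManagerEventType.SHUTDOWN.
class ConnectRetry {
    interface Connector {
        void connect() throws IOException;
    }

    static void connectWithBackoff(Connector c, int maxAttempts,
                                   long baseMillis) throws Exception {
        long delay = baseMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                c.connect();
                return;                           // connected; resume heartbeats
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;                      // give up only after the cap
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, 30_000L);  // cap backoff at 30s
            }
        }
    }
}
```

A cap of -1 (infinite) would reproduce the side effect the reporter notes, so a large finite cap plus backoff is the middle ground this sketch assumes.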
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547155#comment-14547155 ] sandflee commented on YARN-3644: [~raju.bairishetti] thanks for your reply. If RM HA is not enabled, we can fix it like this. But with RM HA, there are some conditions to consider:
1. Both RM A and RM B reset the connection: the RMs seem to be in trouble, so the NM keeps containers alive.
2. Both RM A and RM B hit a socket timeout: the NM seems to be network-partitioned from the RMs, or all RM machines crashed (any way to distinguish them?), so the NM kills all containers.
3. One RM resets the connection and the other hits a socket timeout: this is difficult to handle, since we know nothing about the active RM; both RMs may have crashed, or just the active RM may be network-partitioned.
I suggest the backup RM also respond and tell the NM "I'm the backup RM". Case 3 then becomes:
3.1. One RM resets the connection and the other hits a socket timeout: the RMs seem to be in trouble, so just keep containers alive.
3.2. One RM is the backup and the other hits a socket timeout: the NM seems to be network-partitioned from the active RM, so kill all containers.
> Node manager shuts down if unable to connect with RM
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
> } catch (ConnectException e) {
>   //catch and throw the exception if tried MAX wait time to connect RM
>   dispatcher.getEventHandler().handle(
>       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>   throw new YarnRuntimeException(e);
> {code}
> In large clusters, if RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).
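The decision table sandflee describes (connection reset vs. socket timeout from each RM, plus an explicit "I'm the backup RM" response) can be sketched as a small function. This is a hypothetical illustration of the proposed policy only, not NodeManager code; the names are invented for the sketch.

```java
// Hypothetical sketch of the proposed NM policy: probe both RMs and
// decide whether to keep containers alive based on the two outcomes.
class RmProbeDecision {
    enum Probe { CONNECTION_RESET, SOCKET_TIMEOUT, STANDBY_RESPONSE }

    /** true = keep containers running; false = kill them. */
    static boolean keepContainers(Probe rmA, Probe rmB) {
        // Case 2: both timed out -> NM is likely partitioned from the RMs
        // (or all RM machines crashed), so kill all containers.
        if (rmA == Probe.SOCKET_TIMEOUT && rmB == Probe.SOCKET_TIMEOUT) {
            return false;
        }
        // Case 3.2: a standby answered while the other timed out -> the NM
        // can reach the network, so the active RM is unreachable; kill.
        if ((rmA == Probe.STANDBY_RESPONSE && rmB == Probe.SOCKET_TIMEOUT)
                || (rmB == Probe.STANDBY_RESPONSE && rmA == Probe.SOCKET_TIMEOUT)) {
            return false;
        }
        // Cases 1 and 3.1: any connection reset means a reachable host with
        // no RM process -> the RMs themselves are in trouble; keep alive.
        return true;
    }
}
```

The ambiguity in the original case 3 disappears only because of the assumed standby response; without it, reset-plus-timeout stays in the "keep alive" branch, as 3.1 suggests.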
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547152#comment-14547152 ] Hadoop QA commented on YARN-126:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 5s | The applied patch generated 15 new checkstyle issues (total was 42, now 56). |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 15 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 40s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:red}-1{color} | common tests | 22m 45s | Tests failed in hadoop-common. |
| | | | 59m 40s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.util.TestGenericOptionsParser |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12733377/YARN-126.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cab0dad |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/diffcheckstylehadoop-common.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/whitespace.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7962/artifact/patchprocess/testrun_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7962/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7962/console |
This message was automatically generated.
> yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.002.patch, YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rémy SAISSY updated YARN-126: - Attachment: YARN-126.002.patch > yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.002.patch, YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rémy SAISSY updated YARN-126: - Attachment: (was: YARN-126.002.patch) > yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt resolved YARN-3651. Resolution: Won't Fix
[~devraj.jaiman], thank you for looking into this. Closing the issue as Won't Fix since this behavior is intentional.
> Tracking url in ApplicationCLI wrong for running application
>
> Key: YARN-3651
> URL: https://issues.apache.org/jira/browse/YARN-3651
> Project: Hadoop YARN
> Issue Type: Bug
> Components: applications, resourcemanager
> Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
> Reporter: Bibin A Chundatt
> Priority: Minor
>
> Application URL in Application CLI wrong
> Steps to reproduce
> ==
> 1. Start HA setup in insecure mode
> 2. Configure HTTPS_ONLY
> 3. Submit application to cluster
> 4. Execute command ./yarn application -list
> 5. Observe the tracking URL shown
> {code}
> 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /:45034
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
> Application-Id --- Tracking-URL
> application_1431672734347_0003 *http://host-10-19-92-117:13013*
> {code}
> *Expected*
> https://:64323/proxy/application_1431672734347_0003 /
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547113#comment-14547113 ] Hadoop QA commented on YARN-3044:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 26s | Pre-patch YARN-2928 compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 52s | The applied patch generated 1 new checkstyle issues (total was 241, now 242). |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 40s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 42s | The patch built with eclipse:eclipse. |
| {color:red}-1{color} | findbugs | 4m 13s | The patch appears to introduce 7 new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-server-common. |
| {color:green}+1{color} | yarn tests | 52m 56s | Tests passed in hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests | 0m 55s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | | 97m 9s | |
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 79] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.AppAttemptRegisteredEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 76] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationACLsUpdatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 73] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationCreatedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:[line 67] |
| | Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsEvent to org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent) At AbstractTimelineServicePublisher.java:org.apache.hadoop.yarn.server.resourcemanager.metrics.ApplicationFinishedEvent in org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractTimelineServicePublisher.handle(SystemMetricsEvent
[jira] [Updated] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3044: Attachment: YARN-3044-YARN-2928.008.patch
Hi [~zjshen], uploading a patch with the following corrections, please review:
1. RMContainerEntity has been removed; instead, a ContainerEntity with a new event is published
2. Removed the duplicated code by having an abstract class for TimelineServicePublisher
3. Removed the code for Application Config (as per Zhijie's suggestion)
4. Renamed yarn.system-metrics-publisher.rm.publish.container-metrics -> yarn.rm.system-metrics-publisher.emit-container-events
5. Corrected ??Methods/inner classes in SystemMetricsPublisher don't need to be changed to "public"??
> [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Labels: BB2015-05-TBR > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, > YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)