[jira] [Assigned] (YARN-8118) Better utilize gracefully decommissioning node managers

2018-11-21 Thread Ravi Prakash (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-8118:
--

Assignee: Karthik Palaniappan

> Better utilize gracefully decommissioning node managers
> ---
>
> Key: YARN-8118
> URL: https://issues.apache.org/jira/browse/YARN-8118
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.8.2
> Environment: * Google Compute Engine (Dataproc)
>  * Java 8
>  * Hadoop 2.8.2 using client-mode graceful decommissioning
>Reporter: Karthik Palaniappan
>Assignee: Karthik Palaniappan
>Priority: Major
> Attachments: YARN-8118-branch-2.001.patch
>
>
> Proposal design doc with background + details (please comment directly on 
> doc): 
> [https://docs.google.com/document/d/1hF2Bod_m7rPgSXlunbWGn1cYi3-L61KvQhPlY9Jk9Hk/edit#heading=h.ab4ufqsj47b7]
> tl;dr Right now, DECOMMISSIONING nodes must wait for in-progress applications 
> to complete before shutting down, but they cannot run new containers from 
> those in-progress applications. This is wasteful, particularly in 
> environments where you are billed by resource usage (e.g. EC2).
> Proposal: YARN should schedule containers from in-progress applications on 
> DECOMMISSIONING nodes, but should still avoid scheduling containers from new 
> applications. That will make in-progress applications complete faster and let 
> nodes decommission faster. Overall, this should be cheaper.
> I have a working patch without unit tests that's surprisingly just a few real 
> lines of code (patch 001). If folks are happy with the proposal, I'll write 
> unit tests and also write a patch targeted at trunk.
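
A minimal, self-contained sketch of the proposed policy (hypothetical stand-in
types, not the attached patch; it assumes "in-progress" simply means an
application that is already running):

{code:java}
// Hedged sketch of the YARN-8118 proposal, not the actual patch: only
// applications that are already running may receive new containers on a
// DECOMMISSIONING node; newly submitted applications may not.
enum NodeState { RUNNING, DECOMMISSIONING }
enum AppState { SUBMITTED, RUNNING }

final class DecommissioningPolicy {
  /** True if the scheduler may place a container for 'app' on 'node'. */
  static boolean mayScheduleOn(NodeState node, AppState app) {
    if (node == NodeState.DECOMMISSIONING) {
      // Serving in-progress applications lets them finish sooner, so the
      // node drains (and stops costing money) sooner.
      return app == AppState.RUNNING;
    }
    return true; // nodes that are not decommissioning are unaffected
  }
}
{code}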






[jira] [Updated] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2018-05-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-5762:
---
Target Version/s:   (was: 2.10.0)

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}
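
One way the summarization could work, sketched under the assumption that a
counter plus one periodic summary line is acceptable (the attached patch may
take a different approach; {{AppNotFoundSummarizer}} and its names are
illustrative):

{code:java}
// Hedged sketch: count ApplicationNotFoundExceptions and emit a single
// summary line per interval instead of a full stack trace for every call.
import java.util.concurrent.atomic.AtomicLong;

final class AppNotFoundSummarizer {
  private static final long INTERVAL_MS = 60_000L;
  private final AtomicLong suppressed = new AtomicLong();
  private volatile long lastFlushMs = System.currentTimeMillis();

  void record(String applicationId) {
    suppressed.incrementAndGet();
    long now = System.currentTimeMillis();
    if (now - lastFlushMs >= INTERVAL_MS) {
      lastFlushMs = now;
      // In the RM this would go through the class logger rather than stdout.
      System.out.printf(
          "Suppressed %d ApplicationNotFoundExceptions in the last %d s (latest: %s)%n",
          suppressed.getAndSet(0), INTERVAL_MS / 1000, applicationId);
    }
  }
}
{code}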






[jira] [Updated] (YARN-6378) Negative usedResources memory in CapacityScheduler

2018-05-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Target Version/s:   (was: 2.8.3)

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Priority: Major
>
> Courtesy of Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}
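
For intuition only (this is not the diagnosed root cause), an illustration of
how a running total can go negative if one container's resources are
subtracted twice, e.g. on a duplicate release event:

{code:java}
// Hypothetical double release: values are made up for illustration.
long usedMemoryMb = 4096;    // assumed current queue usage
long containerMb  = 3072;    // assumed container size
usedMemoryMb -= containerMb; // legitimate release -> 1024
usedMemoryMb -= containerMb; // erroneous second release -> -2048
{code}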






[jira] [Commented] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2018-05-18 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480254#comment-16480254
 ] 

Ravi Prakash commented on YARN-5762:


Sorry for the extremely late reply, Arun. I'm afraid I won't find the 
cycles to work on this in the next few months. 

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}






[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2018-05-18 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480253#comment-16480253
 ] 

Ravi Prakash commented on YARN-6378:


I'm afraid I won't find the cycles to work on this in the next few 
months. 

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Priority: Major
>
> Courtesy of Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}






[jira] [Assigned] (YARN-6378) Negative usedResources memory in CapacityScheduler

2018-05-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-6378:
--

Assignee: (was: Ravi Prakash)

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Priority: Major
>
> Courtesy of Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}






[jira] [Assigned] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2018-05-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-5762:
--

Assignee: (was: Ravi Prakash)

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}






[jira] [Comment Edited] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-13 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240899#comment-16240899
 ] 

Ravi Prakash edited comment on YARN-7450 at 11/13/17 4:55 PM:
--

{code}
2017-10-29 02:30:30,260 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
Error when publishing entity [YARN_APPLICATION,application_1507181091525_3046]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: Login 
failure for @ from keytab 
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:483)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:332)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:329)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:329)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:314)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:452)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationCreatedEvent(SystemMetricsPublisher.java:265)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:220)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:469)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:464)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Login failure for @ 
from keytab 
at 
org.apache.hadoop.security.UserGroupInformation.reloginFromKeytab(UserGroupInformation.java:1109)
at 
org.apache.hadoop.security.UserGroupInformation.checkTGTAndReloginFromKeytab(UserGroupInformation.java:1042)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:500)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 23 more
Caused by: javax.security.auth.login.LoginException: Generic error (description 
in e-text) (60) - LOOKING_UP_CLIENT
at 
com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
at 
com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at 
javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at 
javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
...
{code}

[jira] [Updated] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-7450:
---
Description: 
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent issue that failed the Kerberos relogin 
from keytab. However, I'm assuming this was *not* retried, because I only saw 
one instance of this stack trace. I propose that this operation should have 
been retried.

It seems this caused events at the ResourceManager to queue up until it 
eventually stopped responding to even basic {{yarn application -list}} commands.

  was:
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.

It seems this caused events at the ResourceManager to queue up until it 
eventually stopped responding to even basic {{yarn application -list}} commands.


> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs for the TimelineClientImpl not being able to relogin from keytab.
> I'm guessing there was an intermittent issue that failed the Kerberos relogin 
> from keytab. However, I'm assuming this was *not* retried, because I only saw 
> one instance of this stack trace. I propose that this operation should have 
> been retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.






[jira] [Commented] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-13 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249816#comment-16249816
 ] 

Ravi Prakash commented on YARN-7450:


This is more complicated than I thought. 
[UserGroupInformation.reloginFromKeytab()|https://github.com/apache/hadoop/blob/975a57a6886e81e412bea35bf597beccc807a66f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1321]
 *clears* away the existing credentials, so a temporary Kerberos failure that 
might have resolved itself if the login were retried after a delay is instead 
exhibited immediately.

Hi [~daryn]! Am I reading this right? Is it practical to fix this behavior?
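
For reference, a self-contained sketch of the kind of retry originally
proposed (assumed semantics, not existing TimelineClient or UGI code; the
observation above about credentials being cleared is why this is trickier
than it looks):

{code:java}
// Hedged sketch: wrap the keytab relogin in a bounded retry with linear
// backoff so one intermittent KDC error does not immediately fail the post.
import java.io.IOException;

final class ReloginRetry {
  interface Relogin { void run() throws IOException; }

  static void runWithRetries(Relogin relogin, int maxAttempts, long backoffMs)
      throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        relogin.run(); // e.g. a call that relogins from the keytab
        return;
      } catch (IOException e) {
        last = e;                          // intermittent Kerberos/KDC failure
        Thread.sleep(backoffMs * attempt); // back off before the next attempt
      }
    }
    throw last; // give up after maxAttempts
  }
}
{code}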



> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs for the TimelineClientImpl not being able to relogin from keytab.
> I'm guessing there was an intermittent network issue that failed the Kerberos 
> relogin from keytab. However, I'm assuming this was *not* retried, because I 
> only saw one instance of this stack trace. I propose that this operation 
> should have been retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.






[jira] [Updated] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-7450:
---
Description: 
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.

It seems this caused events at the ResourceManager to queue up until it 
eventually stopped responding to even basic {{yarn application -list}} commands.

  was:
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.

It seems this caused events at the ResourceManager to queue up until it 
eventually stopped responding to even basic {{yarn application -list}} commands.


> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs for the TimelineClientImpl not being able to relogin from keytab.
> I'm guessing there was an intermittent network issue that failed the Kerberos 
> relogin from keytab. However, I'm assuming this was *not* retried, because I 
> only saw one instance of this stack trace. I propose that this operation 
> should have been retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.






[jira] [Comment Edited] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240899#comment-16240899
 ] 

Ravi Prakash edited comment on YARN-7450 at 11/6/17 9:20 PM:
-

{code}
2017-10-29 02:30:30,260 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
Error when publishing entity [YARN_APPLICATION,application_1507181091525_3046]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: Login 
failure for @ from keytab 
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:483)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:332)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:329)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:329)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:314)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:452)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationCreatedEvent(SystemMetricsPublisher.java:265)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:220)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:469)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:464)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Login failure for @ 
from keytab 
at 
org.apache.hadoop.security.UserGroupInformation.reloginFromKeytab(UserGroupInformation.java:1109)
at 
org.apache.hadoop.security.UserGroupInformation.checkTGTAndReloginFromKeytab(UserGroupInformation.java:1042)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:500)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 23 more
Caused by: javax.security.auth.login.LoginException: Generic error (description 
in e-text) (60) - LOOKING_UP_CLIENT
at 
com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
at 
com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at 
javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at 
javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
...
{code}

[jira] [Updated] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-06 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-7450:
---
Description: 
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.

It seems this caused events at the ResourceManager to queue up until it 
eventually stopped responding to even basic {{yarn application -list}} commands.

  was:
We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.


> ATS Client should retry on intermittent Kerberos issues.
> 
>
> Key: YARN-7450
> URL: https://issues.apache.org/jira/browse/YARN-7450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Affects Versions: 2.7.3
> Environment: Hadoop-2.7.3
>Reporter: Ravi Prakash
>
> We saw a stack trace (posted in the first comment) in the ResourceManager 
> logs for the TimelineClientImpl not being able to relogin from keytab.
> I'm guessing there was an intermittent network issue that failed the Kerberos 
> relogin from keytab. However, I'm assuming this was *not* retried, because I 
> only saw one instance of this stack trace. I propose that this operation 
> should have been retried.
> It seems this caused events at the ResourceManager to queue up until it 
> eventually stopped responding to even basic {{yarn application -list}} commands.






[jira] [Commented] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240899#comment-16240899
 ] 

Ravi Prakash commented on YARN-7450:


{code}
2017-10-29 02:30:30,260 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
Error when publishing entity [YARN_APPLICATION,application_1507181091525_3046]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: Login 
failure for @ from keytab 
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:483)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:332)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:329)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:329)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:314)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:452)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationCreatedEvent(SystemMetricsPublisher.java:265)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:220)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:469)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:464)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Login failure for @ 
from keytab 
at 
org.apache.hadoop.security.UserGroupInformation.reloginFromKeytab(UserGroupInformation.java:1109)
at 
org.apache.hadoop.security.UserGroupInformation.checkTGTAndReloginFromKeytab(UserGroupInformation.java:1042)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:500)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 23 more
Caused by: javax.security.auth.login.LoginException: Generic error (description 
in e-text) (60) - LOOKING_UP_CLIENT
at 
com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
at 
com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at 
javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at 
javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
...
{code}


[jira] [Created] (YARN-7450) ATS Client should retry on intermittent Kerberos issues.

2017-11-06 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-7450:
--

 Summary: ATS Client should retry on intermittent Kerberos issues.
 Key: YARN-7450
 URL: https://issues.apache.org/jira/browse/YARN-7450
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: ATSv2
Affects Versions: 2.7.3
 Environment: Hadoop-2.7.3
Reporter: Ravi Prakash


We saw a stack trace (posted in the first comment) in the ResourceManager logs 
for the TimelineClientImpl not being able to relogin from keytab.

I'm guessing there was an intermittent network issue that failed the Kerberos 
relogin from keytab. However, I'm assuming this was *not* retried, because I 
only saw one instance of this stack trace. I propose that this operation 
should have been retried.






[jira] [Resolved] (YARN-7283) Nodemanager can't start

2017-10-03 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-7283.

Resolution: Invalid

Hi Nguyen! JIRA is for reporting bugs. Please use the user mailing list for 
these questions.

Please set {{yarn.nodemanager.resource.memory-mb}} and 
{{yarn.nodemanager.resource.cpu-vcores}} in your yarn-site.xml.
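
For example, in yarn-site.xml (the values below are illustrative; size them to 
what the host can actually spare):

{code:xml}
<!-- Illustrative values only; set these to the machine's real capacity. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
{code}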

> Nodemanager can't start
> ---
>
> Key: YARN-7283
> URL: https://issues.apache.org/jira/browse/YARN-7283
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.4
>Reporter: Nguyen Xuan Tinh
>
> I installed Hadoop in pseudo-distributed mode following 
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html.
> Then, after running start-all.sh, jps showed only these daemons (no NodeManager):
> 26177 SecondaryNameNode
> 26355 ResourceManager
> 12211 Jps
> 25814 NameNode
> 25976 DataNode
> so I looked at the NodeManager log:
> {code:java}
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved 
> SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, 
> Message from ResourceManager: NodeManager from  ubuntu doesn't satisfy 
> minimum allocations, Sending SHUTDOWN signal to the NodeManager.
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
>   ... 6 more
> 2017-10-03 08:47:49,883 INFO org.apache.hadoop.service.AbstractService: 
> Service NodeManager failed in state STARTED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN 
> signal from Resourcemanager ,Registration of NodeManager failed, Message from 
> ResourceManager: NodeManager from  ubuntu doesn't satisfy minimum 
> allocations, Sending SHUTDOWN signal to the NodeManager.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN 
> signal from Resourcemanager ,Registration of NodeManager failed, Message from 
> ResourceManager: NodeManager from  ubuntu doesn't satisfy minimum 
> allocations, Sending SHUTDOWN signal to the NodeManager.
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:203)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:496)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved 
> SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, 
> Message from ResourceManager: NodeManager from  ubuntu doesn't satisfy 
> minimum allocations, Sending SHUTDOWN signal to the NodeManager.
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
>   ... 6 more
> {code}
> How can I resolve this? Please help me, and thanks for reading.






[jira] [Updated] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2017-10-03 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-5762:
---
Target Version/s: 2.10.0  (was: 2.9.0)

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}






[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-08-08 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119164#comment-16119164
 ] 

Ravi Prakash commented on YARN-6054:


Sure Junping! I took the liberty of cherry-picking the change into branch-2.8 
and branch-2.8.2 after running the unit test. Internally at my company we had 
already backported this (we were hitting this issue on Hadoop-2.7.3) and have 
been running it without problems. Thanks for the suggestion of merging into 2.8.2.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2, 2.8.2
>
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch, 
> YARN-6054.03.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelines
> erver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/lev
> eldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However, I'd posit that 
> the TimelineServer should degrade gracefully instead of failing to start 
> at all.
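
One possible shape for that graceful degradation, sketched against the
leveldbjni API (the committed patch may differ; {{TimelineStoreOpener}} is an
illustrative name):

{code:java}
// Hedged sketch: try to open the store; on corruption, attempt a LevelDB
// repair and reopen instead of aborting ApplicationHistoryServer startup.
import java.io.File;
import java.io.IOException;
import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

final class TimelineStoreOpener {
  static DB openWithRepair(File path, Options options) throws IOException {
    try {
      return JniDBFactory.factory.open(path, options);
    } catch (IOException corruption) {
      // NativeDB.DBException extends IOException; salvage what remains.
      JniDBFactory.factory.repair(path, options);
      return JniDBFactory.factory.open(path, options);
    }
  }
}
{code}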






[jira] [Updated] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-08-08 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6054:
---
Fix Version/s: 2.8.2

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2, 2.8.2
>
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch, 
> YARN-6054.03.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelines
> erver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/lev
> eldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-08-08 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6054:
---
Target Version/s: 2.9.0, 2.8.2  (was: 2.9.0)

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-alpha2, 2.8.2
>
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch, 
> YARN-6054.03.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelines
> erver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/lev
> eldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6889) The throughput of timeline server is too small

2017-07-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103517#comment-16103517
 ] 

Ravi Prakash commented on YARN-6889:


Also, it would be good if you could publish your methodology and results for 
the tests you ran.
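
For anyone reproducing this, a minimal sketch of one way to drive and measure such a test is below. It assumes the v1 TimelineClient REST API; the thread count, duration, and entity shape are illustrative assumptions, not the reporter's actual harness.
{code}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class TimelinePutBenchmark {
  public static void main(String[] args) throws Exception {
    final int threads = 8;                  // assumption: tune per test
    final long deadline = System.currentTimeMillis() + 60_000L;
    final AtomicLong puts = new AtomicLong();

    final TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration()); // reads yarn-site.xml on classpath
    client.start();

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        while (System.currentTimeMillis() < deadline) {
          TimelineEntity e = new TimelineEntity();
          e.setEntityType("BENCH");         // assumption: arbitrary type
          e.setEntityId("entity-" + puts.get() + "-"
              + Thread.currentThread().getId());
          e.setStartTime(System.currentTimeMillis());
          try {
            client.putEntities(e);          // one blocking REST put
            puts.incrementAndGet();
          } catch (IOException | YarnException ex) {
            throw new RuntimeException(ex);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(2, TimeUnit.MINUTES);
    System.out.println("entities/sec ~ " + puts.get() / 60);
    client.stop();
  }
}
{code}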

> The throughput of timeline server is too small
> --
>
> Key: YARN-6889
> URL: https://issues.apache.org/jira/browse/YARN-6889
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: YunFan Zhou
>Priority: Critical
>
> In a recent large-scale stress test of a single timeline server, I found 
> that the throughput of the timeline server is too small.
> I set up multiple servers, each running multiple processes. Each process 
> ran multiple threads, and I sent a different data size from each thread.
> Although I used different load levels and different scenarios, the timeline 
> server's processing capacity was roughly the same each time: it can handle 
> about 70 messages per second.
> That can't meet our requirements; we should improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6889) The throughput of timeline server is too small

2017-07-27 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-6889.

Resolution: Duplicate

> The throughput of timeline server is too small
> --
>
> Key: YARN-6889
> URL: https://issues.apache.org/jira/browse/YARN-6889
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: YunFan Zhou
>Priority: Critical
>
> In a recent large-scale stress test of a single timeline server, I found 
> that the throughput of the timeline server is too small.
> I set up multiple servers, each running multiple processes. Each process 
> ran multiple threads, and I sent a different data size from each thread.
> Although I used different load levels and different scenarios, the timeline 
> server's processing capacity was roughly the same each time: it can handle 
> about 70 messages per second.
> That can't meet our requirements; we should improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6889) The throughput of timeline server is too small

2017-07-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103515#comment-16103515
 ] 

Ravi Prakash commented on YARN-6889:


Which version are you using? Are you aware of 
https://issues.apache.org/jira/browse/YARN-2928 and 
https://issues.apache.org/jira/browse/YARN-5355 ?

I'm closing this issue as Duplicate. Please re-open if you meant something 
other than the above two.

> The throughput of timeline server is too small
> --
>
> Key: YARN-6889
> URL: https://issues.apache.org/jira/browse/YARN-6889
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: YunFan Zhou
>Priority: Critical
>
> In a recent large-scale stress test of a single timeline server, I found 
> that the throughput of the timeline server is too small.
> I set up multiple servers, each running multiple processes. Each process 
> ran multiple threads, and I sent a different data size from each thread.
> Although I used different load levels and different scenarios, the timeline 
> server's processing capacity was roughly the same each time: it can handle 
> about 70 messages per second.
> That can't meet our requirements; we should improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Closed] (YARN-6854) many job failed if NM couldn't detect disk error

2017-07-21 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash closed YARN-6854.
--

> many job failed if NM couldn't detect disk error
> 
>
> Key: YARN-6854
> URL: https://issues.apache.org/jira/browse/YARN-6854
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Priority: Critical
>
> checkDiskHealthy is enabled, but it couldn't detect this error, leading 
> containers to fail; new containers assigned to this node then failed again. 
> The disk error seems to be a filesystem error: all I/O operations (such as 
> ls) failed on $localdir/usercache/userFoo, with no effect on any other 
> directory.
> Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6854) many job failed if NM couldn't detect disk error

2017-07-21 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-6854.

Resolution: Not A Problem

Please send your queries to the mailing list. JIRA is for tracking confirmed 
issues.

Please fix your NodeManager health-check scripts. These are configured using: 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#External_Health_Script
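
For reference, the relevant wiring in yarn-site.xml looks roughly like the snippet below. The script path and interval are assumptions for illustration; the NodeManager marks the node unhealthy when the configured script prints a line beginning with {{ERROR}}.
{code}
<!-- yarn-site.xml: external health-script hook (values are examples) -->
<property>
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/etc/hadoop/conf/nm-health-check.sh</value>
</property>
<property>
  <!-- how often the script runs; 10 minutes here is an assumption -->
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>
{code}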

> many job failed if NM couldn't detect disk error
> 
>
> Key: YARN-6854
> URL: https://issues.apache.org/jira/browse/YARN-6854
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Priority: Critical
>
> checkDiskHealthy is enabled, but it couldn't detect this error, leading 
> containers to fail; new containers assigned to this node then failed again. 
> The disk error seems to be a filesystem error: all I/O operations (such as 
> ls) failed on $localdir/usercache/userFoo, with no effect on any other 
> directory.
> Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-07-10 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Target Version/s: 2.8.2  (was: 2.8.1)

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-07-10 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081355#comment-16081355
 ] 

Ravi Prakash edited comment on YARN-6378 at 7/10/17 11:43 PM:
--

Hi Filipp! Thanks for your report. Is the occurrence of the first negative 
usedResources correlated with applications being moved between queues in your 
case too? You can check this easily from the ResourceManager logs.


was (Author: raviprak):
Hi Filipp! Thanks for your report. Is the occurrence of the first negative 
usedResources correlated with applications being moved between queues in your 
case too?

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-07-10 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16081355#comment-16081355
 ] 

Ravi Prakash commented on YARN-6378:


Hi Filipp! Thanks for your report. Is the occurrence of the first negative 
usedResources correlated with applications being moved between queues in your 
case too?

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-07-10 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Affects Version/s: (was: 2.7.2)
   2.6.0

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-05-17 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015172#comment-16015172
 ] 

Ravi Prakash commented on YARN-6378:


Hi powerinf! It is possible. Do you have the CapacityScheduler? If you have the 
FairScheduler, YARN-3933 may be relevant.

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986114#comment-15986114
 ] 

Ravi Prakash commented on YARN-6378:


The occurrence of these negative usedResources is very strongly correlated with 
applications being moved from one queue to another. e.g. on one cluster which 
was started on March 11, usedResources wasn't negative until somebody moved an 
application from one queue to the afflicted queue on April 7th. Since then, the 
queue shows negative usedResources.

This might actually be a race condition. It seems that 
[LeafQueue.detachContainer|https://github.com/apache/hadoop/blob/28eb2aabebd15c15a357d86e23ca407d3c85211c/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java#L1890]
 neglects to lock the LeafQueue object. In comparison, the same bookkeeping 
when a container is completed is done only after acquiring a lock on the 
LeafQueue object in 
[LeafQueue.completedContainer|https://github.com/apache/hadoop/blob/28eb2aabebd15c15a357d86e23ca407d3c85211c/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java#L1538].
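
To make the suspected race concrete, here is a toy standalone model. It is not the real LeafQueue API; the class and method names are invented for illustration. The point is only that every path decrementing the queue's usage must hold the same lock as the path incrementing it, or interleaved updates can drive the counter negative.
{code}
/**
 * Toy model of the suspected LeafQueue race. All names are hypothetical;
 * every read-modify-write of the shared usage counter must happen under
 * one common lock.
 */
public class QueueUsageModel {
  private long usedMemoryMb = 0;

  // Mirrors the allocation path: increments usage under the lock.
  public synchronized void onContainerAllocated(long mb) {
    usedMemoryMb += mb;
  }

  // Mirrors completedContainer: already decrements under the lock.
  public synchronized void onContainerCompleted(long mb) {
    usedMemoryMb -= mb;
  }

  // Mirrors detachContainer: the comment above observes that this path did
  // NOT take the queue lock; declaring it synchronized closes the window
  // in which an unlocked decrement interleaves with other updates.
  public synchronized void onContainerDetached(long mb) {
    usedMemoryMb -= mb;
  }

  public synchronized long getUsedMemoryMb() {
    return usedMemoryMb;
  }
}
{code}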

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-27 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Description: 
Courtesy Thomas Nystrand, we found that on two of our clusters configured with 
the CapacityScheduler, usedResources occasionally becomes negative. 

e.g.
{code}
2017-03-15 11:10:09,449 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignedContainer application attempt=appattempt_1487222361993_17177_01 
container=Container: [ContainerId: container_1487222361993_17177_01_14, 
NodeId: :27249, NodeHttpAddress: :8042, Resource: 
, Priority: 2, Token: null, ] queue=: 
capacity=0.2, absoluteCapacity=0.2, usedResources=, 
usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
numContainers=3 clusterResource= type=RACK_LOCAL
{code}

  was:
Courtesy Thomas Nystrand, we found that on one of our clusters configured with 
the CapacityScheduler, usedResources occasionally becomes negative. 

e.g.
{code}
2017-03-15 11:10:09,449 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignedContainer application attempt=appattempt_1487222361993_17177_01 
container=Container: [ContainerId: container_1487222361993_17177_01_14, 
NodeId: :27249, NodeHttpAddress: :8042, Resource: 
, Priority: 2, Token: null, ] queue=: 
capacity=0.2, absoluteCapacity=0.2, usedResources=, 
usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
numContainers=3 clusterResource= type=RACK_LOCAL
{code}


> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on two of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Comment: was deleted

(was: From what I can tell, there's an app, 
{{application_1487222361993_12379}}, which was moved first from interactive to 
the production queue, and then from the production queue to the etl queue. 
This was a massive application, so I'm not sure whether the discrepancy in 
accounting is an artifact of the application being moved twice, or of it being 
a massive app and some race condition being triggered. Or if this 
application's shenanigans were at all involved ;-))

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on one of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-13 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968398#comment-15968398
 ] 

Ravi Prakash commented on YARN-6378:


From what I can tell, there's an app, {{application_1487222361993_12379}}, 
which was moved first from interactive to the production queue, and then from 
the production queue to the etl queue. This was a massive application, so I'm 
not sure whether the discrepancy in accounting is an artifact of the 
application being moved twice, or of it being a massive app and some race 
condition being triggered. Or if this application's shenanigans were at all 
involved ;-)

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on one of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6378:
---
Comment: was deleted

(was: I downloaded the RM logs (thanks again, DP team) on dogfood. The RM for 
firstdata was restarted on 02-16. The first time since then that there were 
negative resources was on 03-01.
{code}
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Re-sorting completed queue: root.etl stats: etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, 
usedCapacity=0.011363636, absoluteUsedCapacity=0.0022727272, numApps=1, 
numContainers=1
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1487222361993_12379_01 released container 
container_1487222361993_12379_01_61 on node: host: 
203-35.as1.altiscale.com:26469 #containers=9 available= used= with event: RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Null container completed...
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1487222361993_12379_01_68 Container Transitioned from RUNNING to 
RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 Completed container: container_1487222361993_12379_01_68 in state: 
RELEASED event:RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released 
container container_1487222361993_12379_01_68 of capacity  on host 203-03.as1.altiscale.com:27249, which currently has 7 
containers,  used and  
available, release resources=true
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: etl 
used= numContainers=0 user=vijayasarathyparanthaman 
user-resources=
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
completedContainer container=Container: [ContainerId: 
container_1487222361993_12379_01_68, NodeId: 
203-03.as1.altiscale.com:27249, NodeHttpAddress: 203-03.as1.altiscale.com:8042, 
Resource: , Priority: 2, Token: Token { kind: 
ContainerToken, service: 10.247.57.232:27249 }, ] queue=etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=
{code}

At 12:53, usedResources are 0,0 on etl
{code}
2017-03-01 12:53:17,934 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
completedContainer container=Container: [ContainerId: 
container_1487222361993_12294_01_01, NodeId: 
202-33.as1.altiscale.com:33675, NodeHttpAddress: 202-33.as1.altiscale.com:8042, 
Resource: , Priority: 0, Token: Token { kind: 
ContainerToken, service: 10.247.57.237:33675 }, ] queue=etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=
2017-03-01 12:53:17,934 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Re-sorting completed queue: root.etl stats: etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0
{code}
Something happens between 12:53 and 13:35. Going to investigate.)

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on one of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> 

[jira] [Commented] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-13 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968360#comment-15968360
 ] 

Ravi Prakash commented on YARN-6378:


I downloaded the RM logs (thanks again, DP team) on dogfood. The RM for 
firstdata was restarted on 02-16. The first time since then that there were 
negative resources was on 03-01.
{code}
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Re-sorting completed queue: root.etl stats: etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, 
usedCapacity=0.011363636, absoluteUsedCapacity=0.0022727272, numApps=1, 
numContainers=1
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1487222361993_12379_01 released container 
container_1487222361993_12379_01_61 on node: host: 
203-35.as1.altiscale.com:26469 #containers=9 available= used= with event: RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Null container completed...
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1487222361993_12379_01_68 Container Transitioned from RUNNING to 
RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 Completed container: container_1487222361993_12379_01_68 in state: 
RELEASED event:RELEASED
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released 
container container_1487222361993_12379_01_68 of capacity  on host 203-03.as1.altiscale.com:27249, which currently has 7 
containers,  used and  
available, release resources=true
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: etl 
used= numContainers=0 user=vijayasarathyparanthaman 
user-resources=
2017-03-01 13:35:20,813 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
completedContainer container=Container: [ContainerId: 
container_1487222361993_12379_01_68, NodeId: 
203-03.as1.altiscale.com:27249, NodeHttpAddress: 203-03.as1.altiscale.com:8042, 
Resource: , Priority: 2, Token: Token { kind: 
ContainerToken, service: 10.247.57.232:27249 }, ] queue=etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=
{code}

At 12:53, usedResources are 0,0 on etl
{code}
2017-03-01 12:53:17,934 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
completedContainer container=Container: [ContainerId: 
container_1487222361993_12294_01_01, NodeId: 
202-33.as1.altiscale.com:33675, NodeHttpAddress: 202-33.as1.altiscale.com:8042, 
Resource: , Priority: 0, Token: Token { kind: 
ContainerToken, service: 10.247.57.237:33675 }, ] queue=etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=
2017-03-01 12:53:17,934 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Re-sorting completed queue: root.etl stats: etl: capacity=0.2, 
absoluteCapacity=0.2, usedResources=, usedCapacity=0.0, 
absoluteUsedCapacity=0.0, numApps=1, numContainers=0
{code}
Something happens between 12:53 and 13:35. Going to investigate.

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on one of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> 

[jira] [Assigned] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-04-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-6378:
--

Assignee: Ravi Prakash

> Negative usedResources memory in CapacityScheduler
> --
>
> Key: YARN-6378
> URL: https://issues.apache.org/jira/browse/YARN-6378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>
> Courtesy Thomas Nystrand, we found that on one of our clusters configured 
> with the CapacityScheduler, usedResources occasionally becomes negative. 
> e.g.
> {code}
> 2017-03-15 11:10:09,449 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1487222361993_17177_01 
> container=Container: [ContainerId: container_1487222361993_17177_01_14, 
> NodeId: :27249, NodeHttpAddress: :8042, Resource: 
> , Priority: 2, Token: null, ] queue=: 
> capacity=0.2, absoluteCapacity=0.2, usedResources=, 
> usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
> numContainers=3 clusterResource= type=RACK_LOCAL
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6378) Negative usedResources memory in CapacityScheduler

2017-03-22 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-6378:
--

 Summary: Negative usedResources memory in CapacityScheduler
 Key: YARN-6378
 URL: https://issues.apache.org/jira/browse/YARN-6378
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, resourcemanager
Affects Versions: 2.7.2
Reporter: Ravi Prakash


Courtesy Thomas Nystrand, we found that on one of our clusters configured with 
the CapacityScheduler, usedResources occasionally becomes negative. 

e.g.
{code}
2017-03-15 11:10:09,449 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignedContainer application attempt=appattempt_1487222361993_17177_01 
container=Container: [ContainerId: container_1487222361993_17177_01_14, 
NodeId: :27249, NodeHttpAddress: :8042, Resource: 
, Priority: 2, Token: null, ] queue=: 
capacity=0.2, absoluteCapacity=0.2, usedResources=, 
usedCapacity=0.03409091, absoluteUsedCapacity=0.006818182, numApps=1, 
numContainers=3 clusterResource= type=RACK_LOCAL
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities

2017-03-08 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902243#comment-15902243
 ] 

Ravi Prakash commented on YARN-3448:


Thanks for the awesome design and fix, Jon! I've opened YARN-6311 to add 
documentation for this store.

> Add Rolling Time To Lives Level DB Plugin Capabilities
> --
>
> Key: YARN-3448
> URL: https://issues.apache.org/jira/browse/YARN-3448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: YARN-3448.10.patch, YARN-3448.12.patch, 
> YARN-3448.13.patch, YARN-3448.14.patch, YARN-3448.15.patch, 
> YARN-3448.16.patch, YARN-3448.17.patch, YARN-3448.1.patch, YARN-3448.2.patch, 
> YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, 
> YARN-3448.8.patch, YARN-3448.9.patch
>
>
> For large applications, the majority of the time in LeveldbTimelineStore is 
> spent deleting old entities one record at a time. An exclusive write lock is 
> held during the entire deletion phase, which in practice can be hours. If we 
> are willing to relax some of the consistency constraints, other 
> performance-enhancing techniques can be employed to maximize throughput and 
> minimize locking time.
> Split the 5 sections of the leveldb database (domain, owner, start time, 
> entity, index) into 5 separate databases. This allows each database to 
> maximize read-cache effectiveness based on its unique usage patterns. With 5 
> separate databases, each lookup is much faster. It can also help with I/O to 
> put the entity and index databases on separate disks.
> Rolling DBs for the entity and index DBs. 99.9% of the data is in these two 
> sections, in at least a 4:1 ratio (index to entity) for tez. We can replace 
> per-record DB removal with file-system removal if we create a rolling set of 
> databases that age out and can be removed efficiently. To do this we must 
> place a constraint that an entity's events always go into its correct 
> rolling DB instance based on start time. This allows us to stitch the data 
> back together while reading, with artificial paging.
> Relax the synchronous-write constraints. If we are willing to accept losing 
> some records that were not flushed by the operating system during a crash, 
> we can use async writes, which can be much faster.
> Prefer sequential writes. Sequential writes can be several times faster than 
> random writes. Spend some small effort arranging the writes in a way that 
> trends towards sequential rather than random write performance.
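
As a rough illustration of the two headline ideas, rolling per-period databases chosen by entity start time and relaxed synchronous writes, here is a minimal sketch using the leveldbjni API. The bucket size, directory layout, and key format are assumptions for illustration, not RollingLevelDBTimelineStore's actual scheme.
{code}
import java.io.File;
import java.io.IOException;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteOptions;

public class RollingDbSketch {
  // Assumption: roll a new DB every hour; the real roll period differs.
  private static final long ROLL_PERIOD_MS = 60L * 60 * 1000;

  /** Pick the rolling DB instance for an entity by its start time. */
  static File dbPathFor(File baseDir, long entityStartTime) {
    long bucket = entityStartTime / ROLL_PERIOD_MS;
    return new File(baseDir, "entity-db." + bucket);
  }

  public static void main(String[] args) throws IOException {
    File base = new File("/tmp/rolling-timeline"); // assumption: scratch dir
    long now = System.currentTimeMillis();
    DB db = JniDBFactory.factory.open(dbPathFor(base, now),
        new Options().createIfMissing(true));
    // Async write: accept possible loss of the last few records on a crash
    // in exchange for much higher write throughput.
    db.put("entity!42".getBytes("UTF-8"), "event-payload".getBytes("UTF-8"),
        new WriteOptions().sync(false));
    db.close();
    // Aging out a whole bucket is then a cheap directory delete instead of
    // millions of per-record deletes under an exclusive write lock.
  }
}
{code}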



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6311) We should write documentation for RollingLevelDBTimelineStore

2017-03-08 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-6311:
--

 Summary: We should write documentation for 
RollingLevelDBTimelineStore
 Key: YARN-6311
 URL: https://issues.apache.org/jira/browse/YARN-6311
 Project: Hadoop YARN
  Issue Type: Wish
Affects Versions: 3.0.0-alpha2
Reporter: Ravi Prakash
Priority: Minor


YARN-3448 added the RollingLevelDBTimelineStore to deal with problems in 
LevelDBTimelineStore. We should add documentation for it in TimelineServer.md.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5764) NUMA awareness support for launching containers

2017-01-11 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817623#comment-15817623
 ] 

Ravi Prakash edited comment on YARN-5764 at 1/11/17 8:21 AM:
-

Hi Devaraj! Thanks for all your work. Do you have any benchmark results that 
would illustrate the kind of performance gains that could potentially be 
realised with this patch? It'd be good if others had an opportunity to test it 
on their hardware and setup.


was (Author: raviprak):
Hi Devaraj! Thanks for all your work. Do you have any benchmark results that 
would illustrate the kind of performance gains that could potentially be 
realised with this patch?

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
> Attachments: NUMA Awareness for YARN Containers.pdf, 
> YARN-5764-v0.patch, YARN-5764-v1.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non-SMP systems. YARN containers, on 
> launch, will be pinned to a specific NUMA node, and all subsequent memory 
> allocations will be served by the same node, reducing remote memory 
> accesses. The current default behavior is to spread memory across all NUMA 
> nodes.
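
One common way to achieve this kind of pinning, shown below purely as an illustrative sketch and not as the patch's actual mechanism, is to prefix the container launch command with numactl. The node-selection policy and class names here are assumptions, and running it requires numactl to be installed on the host.
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumaPinnedLaunch {
  /**
   * Wrap a launch command with numactl so both CPU and memory are bound to
   * one NUMA node. How the node is chosen is an assumption for
   * illustration, not the YARN-5764 implementation.
   */
  static List<String> pinToNode(int numaNode, List<String> command) {
    List<String> pinned = new ArrayList<>(Arrays.asList(
        "numactl",
        "--cpunodebind=" + numaNode,   // run only on this node's CPUs
        "--membind=" + numaNode));     // allocate only this node's memory
    pinned.addAll(command);
    return pinned;
  }

  public static void main(String[] args)
      throws IOException, InterruptedException {
    List<String> cmd = pinToNode(0, Arrays.asList("bash", "-c", "exec sleep 5"));
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    System.exit(p.waitFor());
  }
}
{code}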



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2017-01-11 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817623#comment-15817623
 ] 

Ravi Prakash commented on YARN-5764:


Hi Devaraj! Thanks for all your work. Do you have any benchmarks results that 
would illustrate the kind of performance gains that could potentially be 
realised with this patch?

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
> Attachments: NUMA Awareness for YARN Containers.pdf, 
> YARN-5764-v0.patch, YARN-5764-v1.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non-SMP systems. YARN containers, on 
> launch, will be pinned to a specific NUMA node, and all subsequent memory 
> allocations will be served by the same node, reducing remote memory 
> accesses. The current default behavior is to spread memory across all NUMA 
> nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-10 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815822#comment-15815822
 ] 

Ravi Prakash commented on YARN-6054:


Thanks Naganarasimha!

Thanks for looking, Li Lu! Please feel free to comment if you find anything, 
and we'll get it in.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch, 
> YARN-6054.03.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelines
> erver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/lev
> eldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-09 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6054:
---
Attachment: YARN-6054.03.patch

Here's a patch with the improvements suggested by Naganarasimha.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch, 
> YARN-6054.03.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelines
> erver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/lev
> eldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-09 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812432#comment-15812432
 ] 

Ravi Prakash commented on YARN-6054:


Thanks, Naganarasimha, for your careful review! As I posted in the first 
comment, the repair did indeed fix the issue for us (we had a production 
incident). As I'm sure you'll understand, we can't post the LevelDB files 
publicly.
# I feel this JIRA is very specific to the TimelineServer, so I am hesitant to 
include other daemons. Also, as pointed out by Jason, graceful degradation 
would be a very hard thing to achieve (e.g. in the case of the NM). More 
likely, the state is corrupt and will cause undefined behavior.
# Fair point. Will do.
# Great idea. Will do.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804747#comment-15804747
 ] 

Ravi Prakash commented on YARN-6054:


The test failure is unrelated and appears to have already been reported in 
YARN-5934.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-06 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6054:
---
Attachment: YARN-6054.02.patch

Thanks Li Lu! I agree with you: a repair operation definitely changes the 
LevelDb files, so in this patch I create a backup of the corrupted database 
before repairing it. I am deliberately not cleaning up old backups because I 
don't expect corruption to occur often. If we want automatic cleanup of old 
backups, I propose we punt that to another JIRA.
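
A sketch of what that backup step might look like, assuming a directory-based 
leveldb store and commons-io; the helper name and the {{.backup-<millis>}} 
suffix are illustrative, not the attached patch:

{code}
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;

public final class LeveldbBackup {
  /**
   * Copies the (possibly corrupt) store aside before a repair, so the
   * original files survive for post-mortem analysis. Old backups are
   * deliberately left in place, matching the comment above.
   */
  public static File backup(File dbPath) throws IOException {
    File backupPath = new File(dbPath.getParent(),
        dbPath.getName() + ".backup-" + System.currentTimeMillis());
    FileUtils.copyDirectory(dbPath, backupPath);
    return backupPath;
  }
}
{code}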

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-05 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15802997#comment-15802997
 ] 

Ravi Prakash commented on YARN-6054:


Hi Li Lu! Thanks for your review!
As you can see, I am trying to repair only once (in the catch block) when the 
service is initialized. If the repair (or the subsequent open) fails and throws 
an IOException, we will again crash out and fail to start the TimelineServer. 
Per Jason's comment, and I agree, at that point we can't really do much 
(operations personnel would probably need to delete the entire database).
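
As a rough sketch of that open-then-repair-once shape (names are illustrative 
and this is not the patch itself; it assumes the leveldbjni factory from the 
stack trace above):

{code}
import java.io.File;
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

import static org.fusesource.leveldbjni.JniDBFactory.factory;

public final class TimelineStoreOpener {
  /** Tries the normal open; on corruption, repairs exactly once. */
  public static DB openWithOneRepair(File dbPath, Options options)
      throws IOException {
    try {
      return factory.open(dbPath, options);
    } catch (IOException e) {
      // e.g. NativeDB$DBException: "Corruption: 9 missing files".
      // Repair once; if this or the re-open also throws, the exception
      // propagates and the TimelineServer still fails to start.
      factory.repair(dbPath, options);
      return factory.open(dbPath, options);
    }
  }
}
{code}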

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-05 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-6054:
--

Assignee: Ravi Prakash

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-05 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-6054:
---
Attachment: YARN-6054.01.patch

Here's a patch along with a unit test.

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
> Attachments: YARN-6054.01.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-04 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799324#comment-15799324
 ] 

Ravi Prakash commented on YARN-6054:


Thanks to Jason's pointer 
[here|https://issues.apache.org/jira/browse/YARN-2873?focusedCommentId=14216259=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14216259]
 on repairing the LevelDb: when we tried to repair the LevelDb, the 
TimelineServer came up just fine.
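
For reference, a minimal sketch of that manual repair using the same leveldbjni 
API that appears in the stack trace; the path is illustrative, so point it at 
the actual {{leveldb-timeline-store.ldb}} directory:

{code}
import java.io.File;
import java.io.IOException;

import org.iq80.leveldb.Options;

import static org.fusesource.leveldbjni.JniDBFactory.factory;

public final class RepairTimelineStore {
  public static void main(String[] args) throws IOException {
    // Illustrative path; use the store location from yarn-site.xml.
    File dbPath = new File("/timelineserver/leveldb-timeline-store.ldb");
    // Rebuilds the database from whatever SST/log files survive; data in
    // the missing files is lost, but the store becomes openable again.
    factory.repair(dbPath, new Options());
  }
}
{code}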

> TimelineServer fails to start when some LevelDb state files are missing.
> 
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2
>Reporter: Ravi Prakash
>
> We encountered an issue recently where the TimelineServer failed to start 
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
> ; cause: org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 
> missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 9 missing files; e.g.: 
> /timelineserver/leveldb-timeline-store.ldb/127897.sst
> at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status -1
> {code}
> Ideally we shouldn't have any missing state files. However I'd posit that the 
> TimelineServer should have graceful degradation instead of failing to start 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6054) TimelineServer fails to start when some LevelDb state files are missing.

2017-01-04 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-6054:
--

 Summary: TimelineServer fails to start when some LevelDb state 
files are missing.
 Key: YARN-6054
 URL: https://issues.apache.org/jira/browse/YARN-6054
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha2
Reporter: Ravi Prakash


We encountered an issue recently where the TimelineServer failed to start 
because some state files went missing.

{code}
2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
 failed in state INITED
; cause: org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing 
files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing 
files; e.g.: /timelineserver/leveldb-timeline-store.ldb/127897.sst

2016-11-21 20:46:43,135 FATAL 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
 Error starting ApplicationHistoryServer
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing 
files; e.g.: 
/timelineserver/leveldb-timeline-store.ldb/127897.sst
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 
9 missing files; e.g.: 
/timelineserver/leveldb-timeline-store.ldb/127897.sst
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status -1
{code}
Ideally we shouldn't have any missing state files. However I'd posit that the 
TimelineServer should have graceful degradation instead of failing to start at 
all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5979) Make ApplicationReport and ApplicationResourceUsageReport @Evolving

2016-12-08 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733054#comment-15733054
 ] 

Ravi Prakash commented on YARN-5979:


Would this be a backward-incompatible change? Please mark the JIRA as such if 
yes.

> Make ApplicationReport and ApplicationResourceUsageReport @Evolving
> ---
>
> Key: YARN-5979
> URL: https://issues.apache.org/jira/browse/YARN-5979
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Reporter: Akira Ajisaka
>Assignee: Bibin A Chundatt
> Attachments: YARN-5979.0001.patch
>
>
> Abstract class ApplicationReport and ApplicationResourceUsageReport are 
> {{@Public}} and {{@Stable}}, but some methods are added between minor 
> releases and this breaks source-compatibility. We should make them 
> {{@Evolving}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2016-11-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1964:
---
Description: 
*This alpha feature has been deprecated in branch-2 and removed from trunk* 
Please see https://issues.apache.org/jira/browse/YARN-5388

Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to *package* their software into a Docker container 
(entire Linux file system incl. custom versions of perl, python etc.) and use 
it as a blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).

  was:
Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to *package* their software into a Docker container 
(entire Linux file system incl. custom versions of perl, python etc.) and use 
it as a blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).


> Create Docker analog of the LinuxContainerExecutor in YARN
> --
>
> Key: YARN-1964
> URL: https://issues.apache.org/jira/browse/YARN-1964
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.2.0
>Reporter: Arun C Murthy
>Assignee: Abin Shahab
> Fix For: 2.6.0
>
> Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch
>
>
> *This alpha feature has been deprecated in branch-2 and removed from trunk* 
> Please see https://issues.apache.org/jira/browse/YARN-5388
> Docker (https://www.docker.io/) is, increasingly, a very popular container 
> technology.
> In context of YARN, the support for Docker will provide a very elegant 
> solution to allow applications to *package* their software into a Docker 
> container (entire Linux file system incl. custom versions of perl, python 
> etc.) and use it as a blueprint to launch all their YARN containers with 
> requisite software environment. This provides both consistency (all YARN 
> containers will have the same software environment) and isolation (no 
> interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2466) Umbrella issue for Yarn launched Docker Containers

2016-11-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-2466:
---
Description: 
Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to package their software into a Docker container (entire 
Linux file system incl. custom versions of perl, python etc.) and use it as a 
blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).

In addition to software isolation mentioned above, Docker containers will 
provide resource, network, and user-namespace isolation. 

Docker provides resource isolation through cgroups, similar to 
LinuxContainerExecutor. This prevents one job from taking other jobs' 
resources (memory and CPU) on the same Hadoop cluster. 

User-namespace isolation will ensure that root in the container is mapped 
to an unprivileged user on the host. This is currently being added to Docker.

Network isolation will ensure that one user’s network traffic is completely 
isolated from another user’s network traffic. 

Last but not least, the interaction of Docker and Kerberos will have to be 
worked out. These Docker containers must work in a secure Hadoop environment.

Additional details are here: 
https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers

  was:
*This has been deprecated and removed.* Please see 
https://issues.apache.org/jira/browse/YARN-5388 .

Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to package their software into a Docker container (entire 
Linux file system incl. custom versions of perl, python etc.) and use it as a 
blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).

In addition to software isolation mentioned above, Docker containers will 
provide resource, network, and user-namespace isolation. 

Docker provides resource isolation through cgroups, similar to 
LinuxContainerExecutor. This prevents one job from taking other jobs' 
resources (memory and CPU) on the same Hadoop cluster. 

User-namespace isolation will ensure that root in the container is mapped 
to an unprivileged user on the host. This is currently being added to Docker.

Network isolation will ensure that one user’s network traffic is completely 
isolated from another user’s network traffic. 

Last but not least, the interaction of Docker and Kerberos will have to be 
worked out. These Docker containers must work in a secure Hadoop environment.

Additional details are here: 
https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers


> Umbrella issue for Yarn launched Docker Containers
> --
>
> Key: YARN-2466
> URL: https://issues.apache.org/jira/browse/YARN-2466
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.4.1
>Reporter: Abin Shahab
>Assignee: Abin Shahab
>
> Docker (https://www.docker.io/) is, increasingly, a very popular container 
> technology.
> In context of YARN, the support for Docker will provide a very elegant 
> solution to allow applications to package their software into a Docker 
> container (entire Linux file system incl. custom versions of perl, python 
> etc.) and use it as a blueprint to launch all their YARN containers with 
> requisite software environment. This provides both consistency (all YARN 
> containers will have the same software environment) and isolation (no 
> interference with whatever is installed on the physical machine).
> In addition to software isolation mentioned above, Docker containers will 
> provide resource, network, and user-namespace isolation. 
> Docker provides resource isolation through cgroups, similar to 
> LinuxContainerExecutor. This prevents one job from taking other jobs' 
> resources (memory and CPU) on the same Hadoop cluster. 
> User-namespace isolation will ensure that root in the container is mapped 
> to an unprivileged user on the host. This is currently being added to Docker.
> Network isolation will ensure that one user’s network traffic is completely 
> isolated from another user’s network traffic. 
> Last but not least, the interaction of Docker and Kerberos will have to 
> be worked out. These Docker containers must work in a secure 

[jira] [Updated] (YARN-2466) Umbrella issue for Yarn launched Docker Containers

2016-11-18 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-2466:
---
Description: 
*This has been deprecated and removed.* Please see 
https://issues.apache.org/jira/browse/YARN-5388 .

Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to package their software into a Docker container (entire 
Linux file system incl. custom versions of perl, python etc.) and use it as a 
blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).

In addition to software isolation mentioned above, Docker containers will 
provide resource, network, and user-namespace isolation. 

Docker provides resource isolation through cgroups, similar to 
LinuxContainerExecutor. This prevents one job from taking other jobs' 
resources (memory and CPU) on the same Hadoop cluster. 

User-namespace isolation will ensure that root in the container is mapped 
to an unprivileged user on the host. This is currently being added to Docker.

Network isolation will ensure that one user’s network traffic is completely 
isolated from another user’s network traffic. 

Last but not least, the interaction of Docker and Kerberos will have to be 
worked out. These Docker containers must work in a secure Hadoop environment.

Additional details are here: 
https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers

  was:
Docker (https://www.docker.io/) is, increasingly, a very popular container 
technology.

In context of YARN, the support for Docker will provide a very elegant solution 
to allow applications to package their software into a Docker container (entire 
Linux file system incl. custom versions of perl, python etc.) and use it as a 
blueprint to launch all their YARN containers with requisite software 
environment. This provides both consistency (all YARN containers will have the 
same software environment) and isolation (no interference with whatever is 
installed on the physical machine).

In addition to software isolation mentioned above, Docker containers will 
provide resource, network, and user-namespace isolation. 

Docker provides resource isolation through cgroups, similar to 
LinuxContainerExecutor. This prevents one job from taking other jobs' 
resources (memory and CPU) on the same Hadoop cluster. 

User-namespace isolation will ensure that root in the container is mapped 
to an unprivileged user on the host. This is currently being added to Docker.

Network isolation will ensure that one user’s network traffic is completely 
isolated from another user’s network traffic. 

Last but not least, the interaction of Docker and Kerberos will have to be 
worked out. These Docker containers must work in a secure Hadoop environment.

Additional details are here: 
https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers


> Umbrella issue for Yarn launched Docker Containers
> --
>
> Key: YARN-2466
> URL: https://issues.apache.org/jira/browse/YARN-2466
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.4.1
>Reporter: Abin Shahab
>Assignee: Abin Shahab
>
> *This has been deprecated and removed.* Please see 
> https://issues.apache.org/jira/browse/YARN-5388 .
> Docker (https://www.docker.io/) is, increasingly, a very popular container 
> technology.
> In context of YARN, the support for Docker will provide a very elegant 
> solution to allow applications to package their software into a Docker 
> container (entire Linux file system incl. custom versions of perl, python 
> etc.) and use it as a blueprint to launch all their YARN containers with 
> requisite software environment. This provides both consistency (all YARN 
> containers will have the same software environment) and isolation (no 
> interference with whatever is installed on the physical machine).
> In addition to software isolation mentioned above, Docker containers will 
> provide resource, network, and user-namespace isolation. 
> Docker provides resource isolation through cgroups, similar to 
> LinuxContainerExecutor. This prevents one job from taking other jobs' 
> resources (memory and CPU) on the same Hadoop cluster. 
> User-namespace isolation will ensure that root in the container is mapped 
> to an unprivileged user on the host. This is currently being added to Docker.
> Network isolation will ensure that one user’s network traffic is completely 
> isolated from another user’s network traffic. 
> Last but not least, the 

[jira] [Commented] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-11-15 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669054#comment-15669054
 ] 

Ravi Prakash commented on YARN-5762:


Hi Jian He!
I did notice the {{ApplicationBaseProtocol.getApplications}} method. It would 
return a response of size O(number of applications in the cluster). I don't 
know whether, for big clusters, that would be more expensive than O(number of 
applications on a node) RPCs of one application each.
Should we just extend the API?
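
To make the tradeoff concrete, here is a sketch using the public {{YarnClient}} 
wrapper; the {{appIds}} set standing in for the applications a node still holds 
logs for is an assumption:

{code}
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

public final class AppStatusCheck {
  public static void check(Set<ApplicationId> appIds) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    try {
      // Option A: one RPC, but the response is O(all apps in the cluster).
      List<ApplicationReport> all = client.getApplications();
      System.out.println("cluster-wide applications: " + all.size());

      // Option B: O(apps on this node) RPCs; a finished-and-forgotten app
      // surfaces as an ApplicationNotFoundException on each call.
      for (ApplicationId id : appIds) {
        try {
          client.getApplicationReport(id);
        } catch (ApplicationNotFoundException e) {
          // The RM no longer tracks this app; treat it as completed.
        }
      }
    } finally {
      client.stop();
    }
  }
}
{code}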


> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5842) spark job getting failed with memory not avail

2016-11-07 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-5842.

Resolution: Invalid

Please email u...@hadoop.apache.org for questions. JIRA is for reporting 
confirmed bugs.

> spark job getting failed with memory not avail
> --
>
> Key: YARN-5842
> URL: https://issues.apache.org/jira/browse/YARN-5842
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: applications
> Environment: spark running in emr 4.3 with hadoop 2.7 and spark 1.6.0
>Reporter: Mohamed Kajamoideen
>
> > config <- spark_config()
> > config$`sparklyr.shell.driver-memory` <- "4G"
> > config$`sparklyr.shell.executor-memory` <- "4G"
> > sc <- spark_connect(master = "yarn-client", config = config) 
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 27.0 (TID 1941, ip- .ec2.internal): org.apache.spark.SparkException: 
> Values to assemble cannot be null.
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:154)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:137)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:137)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:95)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:94)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Sou



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-11-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626879#comment-15626879
 ] 

Ravi Prakash commented on YARN-5762:


Thanks for your comment Jian He! I do agree that the optimal solution would be 
for the NodeManager to send a single request to the RM for the status of only 
the subset of applications it is currently running.
However, {{ApplicationBaseProtocol}} doesn't currently have such an interface. 
Are you suggesting we should add another method that accepts a set of 
applications? The [lookup in the 
RM|https://github.com/apache/hadoop/blob/fa1512a740b2ed2661743d6b5483ef3eb49e5634/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L356]
 is on a ConcurrentMap, so I'm guessing it's already pretty efficient.
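
Purely as a hypothetical (no such method exists on 
{{ApplicationBaseProtocol}} today), a batch lookup along those lines might 
look like:

{code}
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;

// Hypothetical extension: keeps the response O(apps on the requesting
// node) instead of O(apps in the cluster), and avoids one
// ApplicationNotFoundException-bearing RPC per finished application.
public interface BatchApplicationLookup {
  /** Ids unknown to the RM would simply be absent from the result. */
  Map<ApplicationId, ApplicationReport> getApplicationReports(
      Set<ApplicationId> appIds);
}
{code}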

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-10-20 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-5762:
--

Assignee: Ravi Prakash

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-10-20 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-5762:
---
Attachment: YARN-5762.01.patch

Here's a simple one-line patch.
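For illustration, a minimal sketch of what such a one-line fix could look like, assuming it leans on the IPC server's terse-exception facility (whether the attached patch does exactly this is an assumption; {{Server.addTerseExceptions}} is the existing Hadoop IPC API):

{code}
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

public class TerseExceptionSetup {
  // Sketch: mark ApplicationNotFoundException as "terse" so the IPC server
  // logs a one-line summary instead of the full stack trace shown above.
  public static void register(Server server) {
    server.addTerseExceptions(ApplicationNotFoundException.class);
  }
}
{code}

With something like this in place, the RM would log a single INFO line per failed lookup instead of a twelve-line trace.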

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Priority: Minor
> Attachments: YARN-5762.01.patch
>
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-10-20 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593147#comment-15593147
 ] 

Ravi Prakash commented on YARN-5762:


IMHO we should summarize these into one line each instead of logging whole stack traces.

> Summarize ApplicationNotFoundException in the RM log
> 
>
> Key: YARN-5762
> URL: https://issues.apache.org/jira/browse/YARN-5762
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 2.7.2
>Reporter: Ravi Prakash
>Priority: Minor
>
> We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
> most likely caused by the {{AggregatedLogDeletionService}} [which 
> checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
>  that the application is not running anymore. e.g.
> {code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
> handler 20 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35401 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1451' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from :12205 Call#35404 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1473396553140_1452' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-5762) Summarize ApplicationNotFoundException in the RM log

2016-10-20 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-5762:
--

 Summary: Summarize ApplicationNotFoundException in the RM log
 Key: YARN-5762
 URL: https://issues.apache.org/jira/browse/YARN-5762
 Project: Hadoop YARN
  Issue Type: Task
Affects Versions: 2.7.2
Reporter: Ravi Prakash
Priority: Minor


We found a lot of {{ApplicationNotFoundException}} in the RM logs. These were 
most likely caused by the {{AggregatedLogDeletionService}} [which 
checks|https://github.com/apache/hadoop/blob/262827cf75bf9c48cd95335eb04fd8ff1d64c538/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L156]
 that the application is not running anymore. e.g.
{code}2016-10-17 15:25:26,542 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 20 on 8032, call 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from :12205 Call#35401 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
with id 'application_1473396553140_1451' doesn't exist in RM.
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2016-10-17 15:25:26,633 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
47 on 8032, call 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from :12205 Call#35404 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
with id 'application_1473396553140_1452' doesn't exist in RM.
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4832) NM side resource value should get updated if change applied in RM side

2016-10-10 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564038#comment-15564038
 ] 

Ravi Prakash commented on YARN-4832:


Thanks for all your work, Junping and Jian! Should we also modify the limits set 
in 
[ContainersMonitorImpl.maxPmemAllottedForContainers|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java#L76]
 etc.?
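A rough sketch of the kind of hook being asked about, purely illustrative (the standalone class and the method name onNodeResourceUpdate are hypothetical, not the actual NM code path):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class MonitorLimitRefresh {
  long maxPmemAllottedForContainers; // mirrors the field linked above
  long maxVmemAllottedForContainers;

  // Hypothetical hook: when the NM learns of a new total Resource from the
  // RM, refresh the monitor's allotted limits too.
  void onNodeResourceUpdate(Resource newResource, Configuration conf) {
    long pmemBytes = newResource.getMemory() * 1024L * 1024L;
    float vmemRatio = conf.getFloat(YarnConfiguration.NM_VMEM_PMEM_RATIO,
        YarnConfiguration.DEFAULT_NM_VMEM_PMEM_RATIO);
    this.maxPmemAllottedForContainers = pmemBytes;
    this.maxVmemAllottedForContainers = (long) (vmemRatio * pmemBytes);
  }
}
{code}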

> NM side resource value should get updated if change applied in RM side
> --
>
> Key: YARN-4832
> URL: https://issues.apache.org/jira/browse/YARN-4832
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: YARN-4832-addendum.patch, YARN-4832-branch-2.patch, 
> YARN-4832-demo.patch, YARN-4832-v2.patch, YARN-4832-v3.patch, YARN-4832.patch
>
>
> Now, if we execute CLI to update node (single or multiple) resource in RM 
> side, NM will not receive any notification. It doesn't affect resource 
> scheduling but will make resource usage metrics reported by NM a bit weird. 
> We should sync up new resource between RM and NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5642) Typos in 11 log messages

2016-09-14 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491055#comment-15491055
 ] 

Ravi Prakash commented on YARN-5642:


Interesting research Mehran! Thanks for your contribution. In addition to 
typos, does your research also discover grammatical errors?

> Typos in 11 log messages 
> -
>
> Key: YARN-5642
> URL: https://issues.apache.org/jira/browse/YARN-5642
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Mehran Hassani
>Priority: Trivial
>  Labels: newbie
>
> I am conducting research on log-related bugs. I tried to make a tool to fix 
> repetitive yet simple patterns of bugs that are related to logs. Typos in log 
> messages are one of the recurring bugs. Therefore, I made a tool to find typos 
> in log statements. During my experiments, I managed to find the following 
> typos in Hadoop YARN:
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java,
>  LOG.info("AsyncDispatcher is draining to stop  igonring any new events."), 
> igonring should be ignoring
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/YarnAuthorizationProvider.java,
>  LOG.info(authorizerClass.getName() + " is instiantiated."), 
> instiantiated should be instantiated
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java,
>  LOG.info("AsyncDispatcher is draining to stop  igonring any new events."), 
> igonring should be ignoring
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/YarnAuthorizationProvider.java,
>  LOG.info(authorizerClass.getName() + " is instiantiated."),  
> instiantiated should be instantiated
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java,
>  LOG.info("Completed reading history information of all conatiners"+ " of 
> application attempt " + appAttemptId), 
> conatiners should be containers
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java,
>  LOG.info("Neither virutal-memory nor physical-memory monitoring is " 
> +"needed. Not running the monitor-thread"), 
> virutal should be virtual
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractReservationSystem.java,
>  LOG.info("Intialized plan {} based on reservable queue {}" plan.toString()  
> planQueueName), 
> Intialized should be Initialized
> In file 
> /hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java,
>  LOG.info("Initializing " + queueName + "\n" +"capacity = " + 
> queueCapacities.getCapacity() +" [= (float) configuredCapacity / 100 ]" + 
> "\n" +"asboluteCapacity = " + queueCapacities.getAbsoluteCapacity() +" [= 
> parentAbsoluteCapacity * capacity ]" + "\n" +"maxCapacity = " + 
> queueCapacities.getMaximumCapacity() +" [= configuredMaxCapacity ]" + "\n" 
> +"absoluteMaxCapacity = " + queueCapacities.getAbsoluteMaximumCapacity() +" 
> [= 1.0 maximumCapacity undefined  " +"(parentAbsoluteMaxCapacity * 
> maximumCapacity) / 100 otherwise ]" +"\n" +"userLimit = " + userLimit +" [= 
> configuredUserLimit ]" + "\n" +"userLimitFactor = " + userLimitFactor +" [= 
> configuredUserLimitFactor ]" + "\n" +"maxApplications = " + maxApplications 
> +" [= configuredMaximumSystemApplicationsPerQueue or" +" 
> (int)(configuredMaximumSystemApplications * absoluteCapacity)]" +"\n" 
> +"maxApplicationsPerUser = " + maxApplicationsPerUser +" [= 
> (int)(maxApplications * (userLimit / 100.0f) * " +"userLimitFactor) ]" + "\n" 
> +"usedCapacity = " + queueCapacities.getUsedCapacity() +" [= 
> usedResourcesMemory / " +"(clusterResourceMemory * absoluteCapacity)]" + "\n" 
> +"absoluteUsedCapacity = " + absoluteUsedCapacity +" [= usedResourcesMemory / 
> clusterResourceMemory]" + "\n" +"maxAMResourcePerQueuePercent = " + 
> maxAMResourcePerQueuePercent +" [= configuredMaximumAMResourcePercent ]" + 
> "\n" +"minimumAllocationFactor = " + minimumAllocationFactor +" [= 
> (float)(maximumAllocationMemory - minimumAllocationMemory) / " 
> +"maximumAllocationMemory ]" + "\n" +"maximumAllocation = " + 
> maximumAllocation +" [= configuredMaxAllocation ]" + "\n" +"numContainers = " 
> + numContainers +" [= 

[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens

2016-05-25 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300808#comment-15300808
 ] 

Ravi Prakash commented on YARN-2233:


Thanks for all the work Varun! 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L1622
 . Is there a way in which a custom authentication handler can be used to 
create / renew and cancel DTs?

> Implement web services to create, renew and cancel delegation tokens
> 
>
> Key: YARN-2233
> URL: https://issues.apache.org/jira/browse/YARN-2233
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Fix For: 2.5.0
>
> Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, 
> apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch, 
> apache-yarn-2233.5.patch
>
>
> Implement functionality to create, renew and cancel delegation tokens.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5072) Support comma separated list of includes and excludes files

2016-05-11 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280599#comment-15280599
 ] 

Ravi Prakash commented on YARN-5072:


Just fyi, not all of us run with the same nodes for HDFS and YARN.

> Support comma separated list of includes and excludes files
> ---
>
> Key: YARN-5072
> URL: https://issues.apache.org/jira/browse/YARN-5072
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ming Ma
>
> Normally a yarn cluster shares the same hosts as the underlying HDFS cluster. 
> To make administration easier, we have {{yarn.resourcemanager.nodes.include-path}} 
> point to the same file as (or a symlink to) the {{dfs.hosts}} file used by HDFS.
> If we want to set up a yarn cluster to run on the combined hosts of several HDFS 
> clusters, it means {{yarn.resourcemanager.nodes.include-path}} 
> should be able to point to a list of files, each of which belongs to one HDFS 
> cluster.
> For backward compatibility, it seems OK to continue to reuse 
> {{yarn.resourcemanager.nodes.include-path}} as long as it can still take a 
> single file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4067) available resource could be set negative

2015-08-21 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-4067:
---
Fix Version/s: (was: 2.7.1)

 available resource could be set negative
 

 Key: YARN-4067
 URL: https://issues.apache.org/jira/browse/YARN-4067
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: YARN-4067.patch


 As mentioned in YARN-4045 by [~leftnoteasy], available memory could become 
 negative due to reservations. Propose to use componentwiseMax in 
 updateQueueStatistics in order to cap negative values at zero.
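For illustration, a minimal sketch of the proposed capping using Hadoop's {{Resources}} utility (where exactly this lands inside updateQueueStatistics is an assumption):

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

class AvailableResourceCap {
  // Sketch: clamp each component (memory, vcores) of the available resource
  // at zero so that reservations can no longer drive it negative.
  static Resource cappedAvailable(Resource total, Resource usedPlusReserved) {
    Resource available = Resources.subtract(total, usedPlusReserved);
    return Resources.componentwiseMax(available, Resources.none());
  }
}
{code}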



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4067) available resource could be set negative

2015-08-21 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707237#comment-14707237
 ] 

Ravi Prakash commented on YARN-4067:


2.7.1 has already been released. Please choose either 2.7.2, if it's a critical 
fix, or 2.8.0.

 available resource could be set negative
 

 Key: YARN-4067
 URL: https://issues.apache.org/jira/browse/YARN-4067
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: YARN-4067.patch


 As mentioned in YARN-4045 by [~leftnoteasy], available memory could become 
 negative due to reservations. Propose to use componentwiseMax in 
 updateQueueStatistics in order to cap negative values at zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4016) docker container is still running when app is killed

2015-08-06 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-4016.

Resolution: Duplicate

Hong! Please reopen if you find that it hasn't been fixed in trunk

 docker container is still running when app is killed
 

 Key: YARN-4016
 URL: https://issues.apache.org/jira/browse/YARN-4016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hong Zhiguo
Assignee: Hong Zhiguo

 The docker_container_executor_session.sh is generated like below:
 {code}
 ### get the pid of docker container by docker inspect
 echo `/usr/bin/docker inspect --format {{.State.Pid}} 
 container_1438681002528_0001_01_02` > 
 .../container_1438681002528_0001_01_02.pid.tmp
 ### rename *.pid.tmp to *.pid
 /bin/mv -f .../container_1438681002528_0001_01_02.pid.tmp 
 .../container_1438681002528_0001_01_02.pid
 ### launch the docker container
 /usr/bin/docker run  --rm  --net=host --name 
 container_1438681002528_0001_01_02 -v ... library/mysql 
 /container_1438681002528_0001_01_02/launch_container.sh 
 {code}
 This is obviously wrong because you cannot get the pid of a docker container 
 before starting it. When the NodeManager tries to kill the container, pid zero is 
 always read from the pid file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3500) Optimize ResourceManager Web loading speed

2015-06-29 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606262#comment-14606262
 ] 

Ravi Prakash commented on YARN-3500:


I'm not sure *why* it was moved to client side. Maybe [~vicaya] knows?

 Optimize ResourceManager Web loading speed
 --

 Key: YARN-3500
 URL: https://issues.apache.org/jira/browse/YARN-3500
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Peter Shi

 After running 10k jobs, the ResourceManager web UI loads slowly. Since the 
 server side sends information for all 10k jobs in one response, parsing and 
 rendering the page takes a long time. The current paging logic runs in the 
 browser. This issue moves the paging logic to the server side, so that 
 loading will be fast.
 Loading 10k jobs costs 55 sec; loading 2k costs 7 sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3856) YARN should allocate container that is closest to the data

2015-06-29 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606400#comment-14606400
 ] 

Ravi Prakash commented on YARN-3856:


Hi Jaehoon!

Thanks for your contribution and work. Were you aware of YARN-18 ? Although 
that effort has stagnated, I wonder if it can be updated to support both 
use-cases.

 YARN should allocate container that is closest to the data
 -

 Key: YARN-3856
 URL: https://issues.apache.org/jira/browse/YARN-3856
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Hadoop cluster with multi-level network hierarchy
Reporter: jaehoon ko
 Attachments: YARN-3856.001.patch, YARN-3856.002.patch


 Currently, given a Container request for a host, ResourceManager allocates a 
 Container with following priorities (RMContainerAllocator.java):
  - Requested host
  - a host in the same rack as the requested host
  - any host
 This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on 
 multi-level networked hosts (which is typical). For example, let's suppose a 
 network architecture with one core switch, two aggregate switches, four ToR 
 switches, and 8 hosts. Each switch has two downlinks. Rack IDs of hosts are 
 as follows:
 h1, h2: /c/a1/t1
 h3, h4: /c/a1/t2
 h5, h6: /c/a2/t3
 h7, h8: /c/a2/t4
 To allocate a container for data in h1, Hadoop first tries h1 itself, then 
 h2, then any of h3 ~ h8. Clearly, h3 or h4 are better than h5~h8 in terms of 
 network distance and bandwidth. However, the current implementation chooses one 
 of h3~h8 with equal probability.
 This limitation is more obvious when considering hadoop clusters deployed on 
 VMs or containers. In this case, only the VMs or containers running on the 
 same physical host are considered rack-local, and actual rack-local hosts are 
 chosen with the same probability as far hosts.
 The root cause of this limitation is that RMContainerAllocator.java performs 
 exact matching on rack id to find a rack-local host. Alternatively, we can 
 perform longest-prefix matching to find the closest host (a sketch follows the 
 list below). Using the same network architecture as above, with longest-prefix 
 matching, hosts are selected with the following priorities:
  h1
  h2
  h3 or h4
  h5 or h6 or h7 or h8
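For illustration, a minimal sketch of longest-prefix matching over rack paths (the class and method names are made up for the example):

{code}
class RackDistance {
  // Sketch: count the shared leading path components of two rack IDs such as
  // "/c/a1/t1" and "/c/a1/t2"; a longer shared prefix means a closer host.
  static int sharedPrefixLength(String rackA, String rackB) {
    String[] a = rackA.split("/");
    String[] b = rackB.split("/");
    int n = 0;
    while (n < a.length && n < b.length && a[n].equals(b[n])) {
      n++;
    }
    return n;
  }
}
{code}

Candidate hosts would then be ordered by descending sharedPrefixLength against the requested host's rack ID, which reproduces the priority list above.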



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3302) TestDockerContainerExecutor should run automatically if it can detect docker in the usual place

2015-05-19 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550806#comment-14550806
 ] 

Ravi Prakash commented on YARN-3302:


+1. Lgtm. Committing shortly. Thanks Ravindra, Varun and Vinod.

 TestDockerContainerExecutor should run automatically if it can detect docker 
 in the usual place
 ---

 Key: YARN-3302
 URL: https://issues.apache.org/jira/browse/YARN-3302
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.6.0
Reporter: Ravi Prakash
Assignee: Ravindra Kumar Naik
 Attachments: YARN-3302-trunk.001.patch, YARN-3302-trunk.002.patch, 
 YARN-3302-trunk.003.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it

2015-05-14 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544532#comment-14544532
 ] 

Ravi Prakash commented on YARN-1519:


Thanks for the patch Eric and Radim. +1. lgtm. Will check in shortly

 check if sysconf is implemented before using it
 ---

 Key: YARN-1519
 URL: https://issues.apache.org/jira/browse/YARN-1519
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.3.0
Reporter: Radim Kolar
Assignee: Radim Kolar
  Labels: BB2015-05-TBR
 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, 
 nodemgr-sysconf.txt


 If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a 
 segfault because an invalid pointer gets passed to a libc function.
 Fix: enforce a minimum value of 1024; the same method is used in hadoop-common 
 native code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1519) check if sysconf is implemented before using it

2015-05-14 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1519:
---
Fix Version/s: 2.8.0

 check if sysconf is implemented before using it
 ---

 Key: YARN-1519
 URL: https://issues.apache.org/jira/browse/YARN-1519
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.3.0
Reporter: Radim Kolar
Assignee: Radim Kolar
  Labels: BB2015-05-TBR
 Fix For: 2.8.0

 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, 
 nodemgr-sysconf.txt


 If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a 
 segfault because an invalid pointer gets passed to a libc function.
 Fix: enforce a minimum value of 1024; the same method is used in hadoop-common 
 native code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it

2015-05-08 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535954#comment-14535954
 ] 

Ravi Prakash commented on YARN-1519:


Nitpick: We don't set an upper limit for something we are going to malloc. 
Earlier it was at least limited to INT_MAX; now it's LONG_MAX. I'd rather keep 
typecasting the long to int.
Otherwise +1. Please change that and I'm happy to commit.

 check if sysconf is implemented before using it
 ---

 Key: YARN-1519
 URL: https://issues.apache.org/jira/browse/YARN-1519
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.3.0
Reporter: Radim Kolar
Assignee: Radim Kolar
  Labels: BB2015-05-TBR
 Attachments: YARN-1519.002.patch, nodemgr-sysconf.txt


 If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a 
 segfault because an invalid pointer gets passed to a libc function.
 Fix: enforce a minimum value of 1024; the same method is used in hadoop-common 
 native code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3500) Optimize ResourceManager Web loading speed

2015-04-20 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503294#comment-14503294
 ] 

Ravi Prakash commented on YARN-3500:


We have gone back and forth between paging server side and client side. The earlier 
UI used to be paged server side. For some reason, paging was moved to the client 
side. One of the reasons was that even for 10k jobs, the amount of JSON data was 
pretty small. [~shihaoliang] Have you profiled where those 55 seconds were 
spent? How much of that was network transfer?

Personally I like server-side paging, especially if we default to showing a LOT 
of jobs. I'd be in favor of [Server-side 
Datatables|https://www.datatables.net/examples/data_sources/server_side.html] 
(but it may require a lot of effort).
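For illustration, a rough sketch of a server-side paged endpoint (JAX-RS style, as used by the RM webapp; the resource path, the List<String> stand-in for the application list, and the parameter names borrowed from DataTables' server-side protocol are all assumptions):

{code}
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;

@Path("/ws/v1/cluster/apps-paged")
public class PagedAppsResource {
  private final List<String> appIds; // stand-in for the RM's application list

  public PagedAppsResource(List<String> appIds) {
    this.appIds = appIds;
  }

  // Sketch: return one page instead of all 10k rows; "start" and "length"
  // mirror DataTables' server-side paging parameters.
  @GET
  public List<String> page(@QueryParam("start") int start,
                           @QueryParam("length") int length) {
    int from = Math.max(0, Math.min(start, appIds.size()));
    int to = Math.max(from, Math.min(from + length, appIds.size()));
    return appIds.subList(from, to);
  }
}
{code}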

 Optimize ResourceManager Web loading speed
 --

 Key: YARN-3500
 URL: https://issues.apache.org/jira/browse/YARN-3500
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Peter Shi

 after running 10k jobs, resoucemanager webui load speed become slow. As 
 server side send 10k jobs information in one response, parsing and rendering 
 page will cost a long time. Current paging logic is done in browser side. 
 This issue makes server side to do the paging logic, so that the loading will 
 be fast.
 Loading 10k jobs costs 55 sec. loading 2k costs 7 sec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3302) TestDockerContainerExecutor should run automatically if it can detect docker in the usual place

2015-04-20 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503521#comment-14503521
 ] 

Ravi Prakash commented on YARN-3302:


Thank you for the patch Ravindra! I'm sorry for the delay in the review :(
1. Could you please change the javadoc for TestContainerExecutor? Running a 
test should not require a special compile-time flag.
2. The patch doesn't seem to apply. Could you please rebase?


 TestDockerContainerExecutor should run automatically if it can detect docker 
 in the usual place
 ---

 Key: YARN-3302
 URL: https://issues.apache.org/jira/browse/YARN-3302
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.6.0
Reporter: Ravi Prakash
 Attachments: YARN-3302-trunk.001.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken

2015-04-08 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484917#comment-14484917
 ] 

Ravi Prakash commented on YARN-3429:


You may have inadvertently used the wrong JIRA number in your commit, [~rkanter]. 
It ought to be YARN-3429 (instead of YARN-2429); I see comments on YARN-2429.

 TestAMRMTokens.testTokenExpiry fails Intermittently with error 
 message:Invalid AMRMToken
 

 Key: YARN-3429
 URL: https://issues.apache.org/jira/browse/YARN-3429
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3429.000.patch


 TestAMRMTokens.testTokenExpiry fails Intermittently with error 
 message:Invalid AMRMToken from appattempt_1427804754787_0001_01
 The error logs is at 
 https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3324) TestDockerContainerExecutor should clean test docker image from local repository after test is done

2015-03-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384274#comment-14384274
 ] 

Ravi Prakash commented on YARN-3324:


Hi Ravindra! Is there a reason you want to delete the image every time? Wouldn't 
that mean that it would have to be downloaded for each test run? Unless there's 
a good reason, I'd be a -1 on the change.

 TestDockerContainerExecutor should clean test docker image from local 
 repository after test is done
 ---

 Key: YARN-3324
 URL: https://issues.apache.org/jira/browse/YARN-3324
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.6.0
Reporter: Chen He
 Attachments: YARN-3324-branch-2.6.0.002.patch, 
 YARN-3324-trunk.002.patch


 Current TestDockerContainerExecutor only cleans the temp directory in local 
 file system but leaves the test docker image in local docker repository. It 
 should be cleaned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3324) TestDockerContainerExecutor should clean test docker image from local repository after test is done

2015-03-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384280#comment-14384280
 ] 

Ravi Prakash commented on YARN-3324:


And to clarify, I see the image as a dependency for the test, much like any 
other jar that may be needed to run a test. We don't delete the jars a test 
depends on after every run, so neither should we delete docker images.

 TestDockerContainerExecutor should clean test docker image from local 
 repository after test is done
 ---

 Key: YARN-3324
 URL: https://issues.apache.org/jira/browse/YARN-3324
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.6.0
Reporter: Chen He
 Attachments: YARN-3324-branch-2.6.0.002.patch, 
 YARN-3324-trunk.002.patch


 Current TestDockerContainerExecutor only cleans the temp directory in local 
 file system but leaves the test docker image in local docker repository. It 
 should be cleaned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository

2015-03-17 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash closed YARN-3339.
--

 TestDockerContainerExecutor should pull a single image and not the entire 
 centos repository
 ---

 Key: YARN-3339
 URL: https://issues.apache.org/jira/browse/YARN-3339
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 2.6.0
 Environment: Linux
Reporter: Ravindra Kumar Naik
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3339-branch-2.6.0.001.patch, 
 YARN-3339-trunk.001.patch


 The TestDockerContainerExecutor test pulls the entire centos repository, which 
 is time-consuming.
 Pulling a specific image (e.g. centos7) will be sufficient to run the test 
 successfully and will save time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3288) Document and fix indentation in the DockerContainerExecutor code

2015-03-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-3288:
---
Attachment: YARN-3288.02.patch

This applies cleanly on trunk.

 Document and fix indentation in the DockerContainerExecutor code
 

 Key: YARN-3288
 URL: https://issues.apache.org/jira/browse/YARN-3288
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Trivial
 Attachments: YARN-3288.01.patch, YARN-3288.02.patch


 The DockerContainerExecutor has several lines over 80 chars and could use 
 some more documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3288) Document and fix indentation in the DockerContainerExecutor code

2015-03-16 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364316#comment-14364316
 ] 

Ravi Prakash commented on YARN-3288:


Thanks for the review Abin! If there are no objections, I'll commit this soon.

 Document and fix indentation in the DockerContainerExecutor code
 

 Key: YARN-3288
 URL: https://issues.apache.org/jira/browse/YARN-3288
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Trivial
 Attachments: YARN-3288.01.patch, YARN-3288.02.patch


 The DockerContainerExecutor has several lines over 80 chars and could use 
 some more documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository

2015-03-16 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363032#comment-14363032
 ] 

Ravi Prakash commented on YARN-3339:


Looks good to me. Will commit by EOD

 TestDockerContainerExecutor should pull a single image and not the entire 
 centos repository
 ---

 Key: YARN-3339
 URL: https://issues.apache.org/jira/browse/YARN-3339
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 2.6.0
 Environment: Linux
Reporter: Ravindra Kumar Naik
Priority: Minor
 Fix For: 2.6.0

 Attachments: YARN-3339-branch-2.6.0.001.patch, 
 YARN-3339-trunk.001.patch


 The TestDockerContainerExecutor test pulls the entire centos repository, which 
 is time-consuming.
 Pulling a specific image (e.g. centos7) will be sufficient to run the test 
 successfully and will save time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository

2015-03-16 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364194#comment-14364194
 ] 

Ravi Prakash commented on YARN-3339:


Also, the fix version is set by the committer. It's the target version you 
should set (to one that has yet to be released) as a contributor, so that when 
a release manager looks at what is targeted for the next release, they'll see 
your JIRA :-)

 TestDockerContainerExecutor should pull a single image and not the entire 
 centos repository
 ---

 Key: YARN-3339
 URL: https://issues.apache.org/jira/browse/YARN-3339
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 2.6.0
 Environment: Linux
Reporter: Ravindra Kumar Naik
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3339-branch-2.6.0.001.patch, 
 YARN-3339-trunk.001.patch


 The TestDockerContainerExecutor test pulls the entire centos repository, which 
 is time-consuming.
 Pulling a specific image (e.g. centos7) will be sufficient to run the test 
 successfully and will save time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository

2015-03-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-3339:
---
Fix Version/s: (was: 2.6.0)
   2.8.0

 TestDockerContainerExecutor should pull a single image and not the entire 
 centos repository
 ---

 Key: YARN-3339
 URL: https://issues.apache.org/jira/browse/YARN-3339
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 2.6.0
 Environment: Linux
Reporter: Ravindra Kumar Naik
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3339-branch-2.6.0.001.patch, 
 YARN-3339-trunk.001.patch


 The TestDockerContainerExecutor test pulls the entire centos repository, which 
 is time-consuming.
 Pulling a specific image (e.g. centos7) will be sufficient to run the test 
 successfully and will save time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3339) TestDockerContainerExecutor should pull a single image and not the entire centos repository

2015-03-16 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364187#comment-14364187
 ] 

Ravi Prakash commented on YARN-3339:


Thanks for the review Abin!

Thanks a lot for the contribution Ravindra! I've merged this into trunk and 
branch-2. Since branch-2.7 has already been cut, it'll only make it into 2.8.

Just FYI Ravindra, you only need to provide a patch for a specific branch if the 
trunk patch doesn't apply cleanly. In this case the trunk patch was sufficient. 
Thanks again!

 TestDockerContainerExecutor should pull a single image and not the entire 
 centos repository
 ---

 Key: YARN-3339
 URL: https://issues.apache.org/jira/browse/YARN-3339
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Affects Versions: 2.6.0
 Environment: Linux
Reporter: Ravindra Kumar Naik
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3339-branch-2.6.0.001.patch, 
 YARN-3339-trunk.001.patch


 The TestDockerContainerExecutor test pulls the entire centos repository, which 
 is time-consuming.
 Pulling a specific image (e.g. centos7) will be sufficient to run the test 
 successfully and will save time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3302) TestDockerContainerExecutor should run automatically if it can detect docker in the usual place

2015-03-06 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-3302:
--

 Summary: TestDockerContainerExecutor should run automatically if 
it can detect docker in the usual place
 Key: YARN-3302
 URL: https://issues.apache.org/jira/browse/YARN-3302
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Ravi Prakash






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2981) DockerContainerExecutor must support a Cluster-wide default Docker image

2015-03-05 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349159#comment-14349159
 ] 

Ravi Prakash commented on YARN-2981:


I believe this is a good change. Could you please add a unit test? I'm +1 on 
the change after that.

 DockerContainerExecutor must support a Cluster-wide default Docker image
 

 Key: YARN-2981
 URL: https://issues.apache.org/jira/browse/YARN-2981
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Abin Shahab
Assignee: Abin Shahab
 Attachments: YARN-2981.patch, YARN-2981.patch, YARN-2981.patch, 
 YARN-2981.patch


 This allows the yarn administrator to add a cluster-wide default docker image 
 that will be used when there is no per-job override of the docker image. With 
 this feature, it would be convenient for newer applications like slider to 
 launch inside a cluster-default docker container.
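For illustration, a minimal sketch of the fallback the description asks for (treating the executor's existing image-name property as the cluster-wide default is the proposal here; the helper class is hypothetical):

{code}
import org.apache.hadoop.conf.Configuration;

class DockerImageResolver {
  // Sketch: prefer the per-job image if the submitter set one, otherwise
  // fall back to the cluster-wide default from the NM configuration.
  static String resolveImage(String jobImage, Configuration conf) {
    if (jobImage != null && !jobImage.isEmpty()) {
      return jobImage;
    }
    return conf.get("yarn.nodemanager.docker-container-executor.image-name");
  }
}
{code}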



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2015-03-04 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash reassigned YARN-1964:
--

Assignee: Ravi Prakash  (was: Abin Shahab)

 Create Docker analog of the LinuxContainerExecutor in YARN
 --

 Key: YARN-1964
 URL: https://issues.apache.org/jira/browse/YARN-1964
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Ravi Prakash
 Fix For: 2.6.0

 Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch


 Docker (https://www.docker.io/) is, increasingly, a very popular container 
 technology.
 In the context of YARN, support for Docker will provide a very elegant 
 solution to allow applications to *package* their software into a Docker 
 container (entire Linux file system incl. custom versions of perl, python 
 etc.) and use it as a blueprint to launch all their YARN containers with 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2015-03-04 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1964:
---
Assignee: Abin Shahab  (was: Ravi Prakash)

 Create Docker analog of the LinuxContainerExecutor in YARN
 --

 Key: YARN-1964
 URL: https://issues.apache.org/jira/browse/YARN-1964
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Abin Shahab
 Fix For: 2.6.0

 Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch


 Docker (https://www.docker.io/) is, increasingly, a very popular container 
 technology.
 In the context of YARN, support for Docker will provide a very elegant 
 solution to allow applications to *package* their software into a Docker 
 container (entire Linux file system incl. custom versions of perl, python 
 etc.) and use it as a blueprint to launch all their YARN containers with 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3289) Docker images should be downloaded during localization

2015-03-03 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-3289:
--

 Summary: Docker images should be downloaded during localization
 Key: YARN-3289
 URL: https://issues.apache.org/jira/browse/YARN-3289
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Ravi Prakash


We currently call docker run on images while launching containers. If the image 
size is sufficiently big, the task will time out. We should download the image 
we want to run during localization (if possible) to prevent this.
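For illustration, a minimal sketch of the prefetch idea (where exactly this would hook into localization is left open; the helper class is hypothetical):

{code}
import java.io.IOException;

class DockerImagePrefetch {
  // Sketch: pull the image up front, e.g. during localization, so the later
  // "docker run" does not block on a large download and trip the task timeout.
  static void pullImage(String image) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("/usr/bin/docker", "pull", image)
        .inheritIO()
        .start();
    if (p.waitFor() != 0) {
      throw new IOException("docker pull failed for image " + image);
    }
  }
}
{code}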



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3288) Document and fix indentation in the DockerContainerExecutor code

2015-03-03 Thread Ravi Prakash (JIRA)
Ravi Prakash created YARN-3288:
--

 Summary: Document and fix indentation in the 
DockerContainerExecutor code
 Key: YARN-3288
 URL: https://issues.apache.org/jira/browse/YARN-3288
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Priority: Trivial


The DockerContainerExecutor has several lines over 80 chars and could use some 
more documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2981) DockerContainerExecutor must support a Cluster-wide default Docker image

2015-03-03 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345406#comment-14345406
 ] 

Ravi Prakash commented on YARN-2981:


Hi Abin! The patch doesn't apply because the documentation has been converted from 
apt to markdown. Could you please update it?
Could you please limit lines to 80 chars?
Could you please also split out the functionality you are proposing to limit 
cpu shares and memory into another JIRA, and likewise for the user the container 
is run as?



 DockerContainerExecutor must support a Cluster-wide default Docker image
 

 Key: YARN-2981
 URL: https://issues.apache.org/jira/browse/YARN-2981
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Abin Shahab
Assignee: Abin Shahab
 Attachments: YARN-2981.patch


 This allows the yarn administrator to add a cluster-wide default docker image 
 that will be used when there is no per-job override of the docker image. With 
 this feature, it would be convenient for newer applications like slider to 
 launch inside a cluster-default docker container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1853) Allow containers to be ran under real user even in insecure mode

2015-02-25 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336805#comment-14336805
 ] 

Ravi Prakash commented on YARN-1853:


This seems like a dupe of YARN-2424. Andrey, could you please confirm?

 Allow containers to be ran under real user even in insecure mode
 

 Key: YARN-1853
 URL: https://issues.apache.org/jira/browse/YARN-1853
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Affects Versions: 2.3.0
Reporter: Andrey Stepachev
 Attachments: YARN-1853-trunk.patch, YARN-1853.patch


 Currently an unsecure cluster runs all containers under one user (typically 
 nobody). That is not appropriate, because yarn applications don't play well 
 with hdfs when permissions are enabled: yarn applications try to write data (as 
 expected) into /user/nobody regardless of the user who launched the application.
 Another side effect is that it is not possible to configure cgroups for 
 particular users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-977) Interface for users/AM to know actual usage by the container

2015-02-25 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336808#comment-14336808
 ] 

Ravi Prakash commented on YARN-977:
---

Is this related to YARN-1856? 

 Interface for users/AM to know actual usage by the container
 

 Key: YARN-977
 URL: https://issues.apache.org/jira/browse/YARN-977
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Omkar Vinit Joshi

 Today we allocate resources (memory and cpu) and the node manager starts the 
 container with the requested resources [I am assuming they are using cgroups]. 
 But users may well request more than they actually need during the execution 
 of their container/job-task. If we add a way for users/the AM to know the 
 actual usage of a requested/completed container, they can optimize it for the 
 next run.
 This will help the AM optimize cpu/memory resource requests by querying the 
 NM/RM for the avg/max cpu/memory usage of a container, or perhaps of all 
 containers belonging to the application.
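 A hypothetical sketch of what such an AM-facing report could look like; the 
 interface and method names are invented to illustrate the proposal, not an 
 existing YARN API:
 {noformat}
 // Invented for illustration; no such interface exists in YARN today.
 public interface ContainerUsageReport {
   long getMaxRssMemoryMB();  // peak physical memory observed by the NM
   long getAvgRssMemoryMB();  // average physical memory over the run
   long getCumulativeCpuMs(); // total CPU time consumed by the container
 }

 // An AM could then size its next request from the previous run, e.g.
 // request maxRss plus some headroom instead of the original ask.
 {noformat}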



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1943) Multitenant LinuxContainerExecutor is incompatible with Simple Security mode.

2015-02-25 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-1943.

Resolution: Duplicate

Marking as a dupe of YARN-2424. Please reopen if my understanding is incorrect.

 Multitenant LinuxContainerExecutor is incompatible with Simple Security mode.
 -

 Key: YARN-1943
 URL: https://issues.apache.org/jira/browse/YARN-1943
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: jay vyas
Priority: Critical
  Labels: linux
 Fix For: 2.3.0


 As of hadoop 2.3.0, commit cc74a18c makes it so that nonsecureLocalUser 
 replaces the user who submits a job if security is disabled: 
 {noformat}
  return UserGroupInformation.isSecurityEnabled() ? user : nonsecureLocalUser;
 {noformat}
 However, the only way to enable security is to NOT use SIMPLE authentication 
 mode:
 {noformat}
 public static boolean isSecurityEnabled() {
   return !isAuthenticationMethodEnabled(AuthenticationMethod.SIMPLE);
 }
 {noformat}
 Thus, the framework ENFORCES that SIMPLE login security implies that the 
 nonsecure local user is used for submissions through the 
 LinuxContainerExecutor.
 This results in a confusing issue, wherein we submit a job as sally and 
 then get an exception that user nobody is not whitelisted and has UID < 
 MAX_ID.
 My proposed solution is that we should be able to leverage 
 LinuxContainerExecutor regardless of hadoop's view of the security settings on 
 the cluster, i.e. decouple the LinuxContainerExecutor logic from the 
 isSecurityEnabled return value.
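 For context, YARN-2424 took roughly this decoupling approach: an explicit 
 limit-users flag, rather than isSecurityEnabled() alone, decides the mapping. 
 A simplified sketch with paraphrased names (not the real class):
 {noformat}
 import org.apache.hadoop.security.UserGroupInformation;

 public class RunAsUserSketch {
   // limitUsers stands in for a config flag along the lines of
   // yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users.
   static String getRunAsUser(String user, boolean limitUsers,
       String nonsecureLocalUser) {
     if (UserGroupInformation.isSecurityEnabled() || !limitUsers) {
       // Secure cluster, or the admin opted out: keep the submitting user.
       return user;
     }
     // Insecure cluster with limiting on: map everyone to the local user.
     return nonsecureLocalUser;
   }
 }
 {noformat}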



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2348) ResourceManager web UI should display server-side time instead of UTC time

2015-02-09 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash resolved YARN-2348.

Resolution: Won't Fix

Managing clusters that may be in different timezones requires that they all 
display the same time zone. UTC was chosen for this reason and is now widely 
accepted as the industry standard.

 ResourceManager web UI should display server-side time instead of UTC time
 --

 Key: YARN-2348
 URL: https://issues.apache.org/jira/browse/YARN-2348
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Leitao Guo
Assignee: Leitao Guo
 Attachments: YARN-2348.2.patch, YARN-2348.3.patch, afterpatch.jpg


 The ResourceManager web UI, including the application list and scheduler, 
 displays UTC time by default, which confuses users who do not use UTC time. 
 The web UI should display server-side time by default.
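 For what it's worth, rendering a timestamp in the server's zone versus a 
 pinned zone is a one-line difference; a minimal Java illustration (not the RM 
 web UI's actual code):
 {noformat}
 import java.text.SimpleDateFormat;
 import java.util.Date;
 import java.util.TimeZone;

 public class TimeZoneDemo {
   public static void main(String[] args) {
     SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy");
     // Server-side time: whatever zone the RM host happens to be set to.
     System.out.println("server: " + fmt.format(new Date()));
     // UTC: the same instant rendered in one cluster-wide standard zone.
     fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
     System.out.println("utc:    " + fmt.format(new Date()));
   }
 }
 {noformat}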



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile

2015-01-21 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-3080:
---
Priority: Major  (was: Critical)

 The DockerContainerExecutor could not write the right pid to container pidFile
 --

 Key: YARN-3080
 URL: https://issues.apache.org/jira/browse/YARN-3080
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Beckham007

 The docker_container_executor_session.sh is like this:
 {quote}
 #!/usr/bin/env bash
 echo `/usr/bin/docker inspect --format {{.State.Pid}} container_1421723685222_0008_01_02` > /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp
 /bin/mv -f /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid
 /usr/bin/docker run --rm --name container_1421723685222_0008_01_02 -e GAIA_HOST_IP=c162 -e GAIA_API_SERVER=10.6.207.226:8080 -e GAIA_CLUSTER_ID=shpc-nm_restart -e GAIA_QUEUE=root.tdwadmin -e GAIA_APP_NAME=test_nm_docker -e GAIA_INSTANCE_ID=1 -e GAIA_CONTAINER_ID=container_1421723685222_0008_01_02 --memory=32M --cpu-shares=1024 -v /data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02 -v /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02 -P -e A=B --privileged=true docker.oa.com:8080/library/centos7 bash /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02/launch_container.sh
 {quote}
 The DockerContainerExecutor runs docker inspect before docker run, so docker 
 inspect cannot get the right pid for the container; as a result, 
 signalContainer() and NM restart fail.
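 One way to see the ordering problem: docker inspect can only report a pid 
 once the container exists, so the pid would have to be captured after docker 
 run, e.g. by polling. A rough, illustrative Java sketch using Hadoop's Shell 
 utility (not the eventual patch):
 {noformat}
 import org.apache.hadoop.util.Shell.ShellCommandExecutor;

 public class DockerPidPoller {
   // Poll `docker inspect` until the named container reports a non-zero pid.
   // Illustrative only; a real fix must also handle containers that exit
   // before a pid is ever observed.
   static String waitForPid(String containerName) throws Exception {
     for (int i = 0; i < 30; i++) { // roughly a 30-second upper bound
       ShellCommandExecutor exec = new ShellCommandExecutor(new String[] {
           "docker", "inspect", "--format", "{{.State.Pid}}", containerName});
       try {
         exec.execute();
         String pid = exec.getOutput().trim();
         if (!pid.isEmpty() && !pid.equals("0")) {
           return pid; // container is up and running
         }
       } catch (Exception e) {
         // inspect fails until `docker run` has created the container; retry
       }
       Thread.sleep(1000);
     }
     throw new Exception("Timed out waiting for pid of " + containerName);
   }
 }
 {noformat}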



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

