[jira] [Resolved] (YARN-6586) YARN to facilitate HTTPS in AM web server

2018-10-23 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-6586.
-
   Resolution: Fixed
Fix Version/s: 3.3.0

All subtasks are now complete.

Thanks for the reviews, especially [~haibochen], and for the help with the 
dependency issues, especially [~eyang].

> YARN to facilitate HTTPS in AM web server
> -
>
> Key: YARN-6586
> URL: https://issues.apache.org/jira/browse/YARN-6586
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.0-alpha2
>Reporter: Haibo Chen
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Design Document v1.pdf, Design Document v2.pdf, 
> YARN-6586.poc.patch
>
>
> MR AM today does not support HTTPS in its web server, so the traffic between 
> RMWebproxy and MR AM is in clear text.
> MR cannot easily achieve this, mainly because MR AMs are untrusted by YARN. A 
> potential solution purely within MR, similar to what Spark has implemented, 
> is to allow users, when they enable HTTPS for an MR job, to provide their own 
> keystore file, which is then uploaded to the distributed cache and localized 
> for the MR AM container. The configuration users need to do is complex. More 
> importantly, in typical deployments web browsers go through RMWebProxy to 
> indirectly access the MR AM web server. In order to support MR AM HTTPS, 
> RMWebProxy would therefore need to trust the user-provided keystore, which 
> is problematic.
> Alternatively, we can add an endpoint in the NM web server that acts as a proxy 
> between the AM web server and RMWebProxy. RMWebProxy, when configured to do so, 
> will send requests over HTTPS to the NM on which the AM is running, and the NM 
> can then communicate with the local AM web server over HTTP. This adds one hop 
> between RMWebProxy and the AM, but both MR and Spark can use such a solution.
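
The user-provided-keystore alternative described above essentially means shipping 
a keystore through the distributed cache. A minimal sketch of that idea, assuming 
hypothetical paths and property names (this is not the design that was ultimately 
implemented):

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UserKeystoreForAm {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "https-am-example");
    // Ship the user's keystore so it gets localized for the MR AM container.
    // The HDFS path and the "#keystore.jks" link name are placeholders.
    job.addCacheFile(new URI("hdfs:///user/alice/am-keystore.jks#keystore.jks"));
    // Hypothetical property the AM would read to locate the localized keystore;
    // passwords, truststores, etc. are the configuration burden called out above.
    job.getConfiguration().set("example.mr.am.https.keystore", "keystore.jks");
    job.submit();
  }
}
{code}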






[jira] [Created] (YARN-8922) Fix test-container-executor

2018-10-19 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8922:
---

 Summary: Fix test-container-executor
 Key: YARN-8922
 URL: https://issues.apache.org/jira/browse/YARN-8922
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.3.0
Reporter: Robert Kanter
Assignee: Robert Kanter


YARN-8448 attempted to fix the {{test-container-executor}} C test to be able to 
run as root.  The test claims that it should be possible to run as root; in 
fact, there are some tests that only run if you use root.  

One of the fixes was to change the permissions of the test's config dir to 0777 
from 0755.  The problem was that the directory was owned by root, but then 
other users would need to write files/directories under it, which would fail 
with 0755.  YARN-8448 fixed this by making it 0777.  However, this breaks 
running cetest because it expects the directory to be 0755, and it's run 
afterwards.

The proper fix for all this is to leave the directory at 0755, but to make sure 
it's owned by the "nodemanager" user.  Confusingly, in 
{{test-container-executor}}, that appears to be the {{username}} and not the 
{{yarn_username}} (i.e. {{username}} is the user running the NM while 
{{yarn_username}} is just some user running a Yarn app).






[jira] [Created] (YARN-8857) Upgrade BouncyCastle

2018-10-08 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8857:
---

 Summary: Upgrade BouncyCastle
 Key: YARN-8857
 URL: https://issues.apache.org/jira/browse/YARN-8857
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.2.0
Reporter: Robert Kanter
Assignee: Robert Kanter


As part of my work on YARN-6586, I noticed that we're using a very old version 
of BouncyCastle:
{code:xml}
<dependency>
   <groupId>org.bouncycastle</groupId>
   <artifactId>bcprov-jdk16</artifactId>
   <version>1.46</version>
   <scope>test</scope>
</dependency>
{code}
The *-jdk16 artifacts have been discontinued and are not recommended (see 
[http://bouncy-castle.1462172.n4.nabble.com/Bouncycaslte-bcprov-jdk15-vs-bcprov-jdk16-td4656252.html]).
 
 In particular, the newest release, 1.46, is from {color:#FF0000}2011{color}! 
 [https://mvnrepository.com/artifact/org.bouncycastle/bcprov-jdk16]

The currently maintained and recommended artifacts are *-jdk15on:
 [https://www.bouncycastle.org/latest_releases.html]
 They're currently on version 1.60, released only a few months ago.

We should update BouncyCastle to the *-jdk15on artifacts and the 1.60 release. 
It's currently a test-only artifact, so there should be no 
backwards-compatibility issues with updating this. It's also needed for 
YARN-6586, where we'll actually be shipping it.
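
For reference, the updated coordinates would look something like this (a sketch 
using the standard jdk15on Maven artifact; whether it stays {{test}}-scoped 
depends on the YARN-6586 work mentioned above):

{code:xml}
<dependency>
   <groupId>org.bouncycastle</groupId>
   <artifactId>bcprov-jdk15on</artifactId>
   <version>1.60</version>
   <scope>test</scope>
</dependency>
{code}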






[jira] [Created] (YARN-8582) Documentation for AM HTTPS Support

2018-07-25 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8582:
---

 Summary: Documentation for AM HTTPS Support
 Key: YARN-8582
 URL: https://issues.apache.org/jira/browse/YARN-8582
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: docs
Reporter: Robert Kanter
Assignee: Robert Kanter


Documentation for YARN-6586.






[jira] [Created] (YARN-8449) RM HA for AM HTTPS Support

2018-06-21 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8449:
---

 Summary: RM HA for AM HTTPS Support
 Key: YARN-8449
 URL: https://issues.apache.org/jira/browse/YARN-8449
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Kanter
Assignee: Robert Kanter









[jira] [Created] (YARN-8448) AM HTTPS Support

2018-06-21 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8448:
---

 Summary: AM HTTPS Support
 Key: YARN-8448
 URL: https://issues.apache.org/jira/browse/YARN-8448
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Kanter
Assignee: Robert Kanter









[jira] [Created] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-16 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8310:
---

 Summary: Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
ContainerTokenIdentifier formats
 Key: YARN-8310
 URL: https://issues.apache.org/jira/browse/YARN-8310
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter


In some recent upgrade testing, we saw this error causing the NodeManager to 
fail to start up afterwards:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained 
an invalid tag (zero).
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message 
contained an invalid tag (zero).
at 
com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
at 
com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
at 
org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
at 
org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
at 
org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
at 
org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
at 
com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at 
org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
at 
org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
at 
org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
at 
org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
... 5 more
{noformat}
The NodeManager fails because it's trying to read a 
{{ContainerTokenIdentifier}} in the "old" format before we changed them to 
protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
similar problem with the ResourceManager and RM Delegation Tokens.

To provide a better experience, we should make the code fall back to reading the 
old format whenever a token can't be read in the new format.  We didn't run into 
any errors with the other two token types that YARN-668 incompatibly changed 
(NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix those while 
we're at it.
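
A minimal sketch of that fallback inside {{ContainerTokenIdentifier#readFields}} 
(the {{readFieldsInOldFormat}} helper and the byte-buffering are illustrative, 
not the actual patch):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import com.google.protobuf.InvalidProtocolBufferException;

// Buffer the remaining identifier bytes so we can retry with the old format.
private static byte[] readAllBytes(DataInput in) throws IOException {
  ByteArrayOutputStream buf = new ByteArrayOutputStream();
  try {
    while (true) {
      buf.write(in.readByte());
    }
  } catch (EOFException eof) {
    // reached the end of the serialized identifier
  }
  return buf.toByteArray();
}

@Override
public void readFields(DataInput in) throws IOException {
  byte[] data = readAllBytes(in);
  try {
    // New (post-YARN-668) format: a ContainerTokenIdentifierProto.
    proto = ContainerTokenIdentifierProto.parseFrom(data);
  } catch (InvalidProtocolBufferException e) {
    // Old pre-protobuf Writable layout: re-read it field by field.
    readFieldsInOldFormat(
        new DataInputStream(new ByteArrayInputStream(data)));
  }
}
{code}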






[jira] [Created] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey

2018-03-19 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-8051:
---

 Summary: TestRMEmbeddedElector#testCallbackSynchronization is 
flakey
 Key: YARN-8051
 URL: https://issues.apache.org/jira/browse/YARN-8051
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 3.2.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We've seen some rare flakey failures in 
{{TestRMEmbeddedElector#testCallbackSynchronization}}:
{noformat}
org.mockito.exceptions.verification.WantedButNotInvoked: 

Wanted but not invoked:
adminService.transitionToStandby();
-> at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
Actually, there were zero interactions with this mock.

at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109)
{noformat}






[jira] [Created] (YARN-7645) TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is flakey with FairScheduler

2017-12-12 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7645:
---

 Summary: 
TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers is 
flakey with FairScheduler
 Key: YARN-7645
 URL: https://issues.apache.org/jira/browse/YARN-7645
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We've noticed some flakiness in 
{{TestContainerResourceUsage#testUsageAfterAMRestartWithMultipleContainers}} 
when using {{FairScheduler}}:
{noformat}
java.lang.AssertionError: Attempt state is not correct (timeout). 
expected: but was:
at 
org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.amRestartTests(TestContainerResourceUsage.java:275)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers(TestContainerResourceUsage.java:254)
{noformat}






[jira] [Created] (YARN-7458) TestContainerManagerSecurity is still flakey

2017-11-07 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7458:
---

 Summary: TestContainerManagerSecurity is still flakey
 Key: YARN-7458
 URL: https://issues.apache.org/jira/browse/YARN-7458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0-beta1, 2.9.0
Reporter: Robert Kanter
Assignee: Robert Kanter


YARN-6150 made this less flakey, but we're still seeing an occasional issue 
here:
{noformat}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:420)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:356)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:167)
{noformat}






[jira] [Created] (YARN-7389) Make TestResourceManager Scheduler agnostic

2017-10-24 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7389:
---

 Summary: Make TestResourceManager Scheduler agnostic
 Key: YARN-7389
 URL: https://issues.apache.org/jira/browse/YARN-7389
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.9.0, 3.0.0
Reporter: Robert Kanter
Assignee: Robert Kanter


Many of the tests in {{TestResourceManager}} override the scheduler to always 
be {{CapacityScheduler}}.  However, these tests should be made scheduler 
agnostic (they are testing the RM, not the scheduler).
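
One way to get there is to parameterize the test over the scheduler class, 
roughly like this (a sketch assuming JUnit 4 parameterization; the actual patch 
may wire the scheduler in differently):

{code:java}
import java.util.Arrays;
import java.util.Collection;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestResourceManager {

  @Parameterized.Parameters
  public static Collection<Object[]> schedulers() {
    // Run every test once per scheduler instead of hard-coding CapacityScheduler.
    return Arrays.asList(new Object[][] {
        {"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"},
        {"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"}});
  }

  private final String schedulerClass;

  public TestResourceManager(String schedulerClass) {
    this.schedulerClass = schedulerClass;
  }

  // Each test would then set yarn.resourcemanager.scheduler.class to
  // schedulerClass in its Configuration before starting the RM.
}
{code}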






[jira] [Created] (YARN-7385) TestFairScheduler#testUpdateDemand and TestFSLeafQueue#testUpdateDemand are failing with NPE

2017-10-24 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7385:
---

 Summary: TestFairScheduler#testUpdateDemand and 
TestFSLeafQueue#testUpdateDemand are failing with NPE
 Key: YARN-7385
 URL: https://issues.apache.org/jira/browse/YARN-7385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.9.0, 3.0.0
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestFairScheduler#testUpdateDemand}} and {{TestFSLeafQueue#testUpdateDemand}} 
are failing with NPE:

{noformat}
java.lang.NullPointerException: null
at 
org.apache.hadoop.yarn.util.resource.Resources.addTo(Resources.java:180)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.incUsedResource(FSQueue.java:494)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.addApp(FSLeafQueue.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testUpdateDemand(TestFairScheduler.java:5264)
{noformat}

{noformat}
java.lang.NullPointerException: null
at 
org.apache.hadoop.yarn.util.resource.Resources.addTo(Resources.java:180)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.incUsedResource(FSQueue.java:494)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.addApp(FSLeafQueue.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSLeafQueue.testUpdateDemand(TestFSLeafQueue.java:92)
{noformat}







[jira] [Created] (YARN-7382) NoSuchElementException in FairScheduler after failover causes RM crash

2017-10-23 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7382:
---

 Summary: NoSuchElementException in FairScheduler after failover 
causes RM crash
 Key: YARN-7382
 URL: https://issues.apache.org/jira/browse/YARN-7382
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.9.0, 3.0.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker


While running an MR job (e.g. sleep), if an RM failover occurs once the maps 
get to 100%, the now-active RM will crash due to:
{noformat}
2017-10-18 15:02:05,347 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1508361403235_0001_01_02 Container Transitioned from RUNNING to 
COMPLETED
2017-10-18 15:02:05,347 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
APPID=application_1508361403235_0001
CONTAINERID=container_1508361403235_0001_01_02  RESOURCE=
2017-10-18 15:02:05,349 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
Error in handling event type NODE_UPDATE to the Event Dispatcher
java.util.NoSuchElementException
at 
java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
at 
java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:371)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:901)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1326)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:371)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1019)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:887)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1104)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:128)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:748)
2017-10-18 15:02:05,360 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
Exiting, bbye..
{noformat}
This leaves the cluster with no RMs!






[jira] [Created] (YARN-7341) TestRouterWebServiceUtil#testMergeMetrics is flakey

2017-10-16 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7341:
---

 Summary: TestRouterWebServiceUtil#testMergeMetrics is flakey
 Key: YARN-7341
 URL: https://issues.apache.org/jira/browse/YARN-7341
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Affects Versions: 3.0.0-beta1, 2.9.0
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestRouterWebServiceUtil#testMergeMetrics}} is flakey.  It sometimes fails 
with something like:
{noformat}
Running org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil
testMergeMetrics(org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil)
  Time elapsed: 0.005 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1092> but was:<584>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServiceUtil.testMergeMetrics(TestRouterWebServiceUtil.java:473)
{noformat}






[jira] [Created] (YARN-7310) TestAMRMProxy#testAMRMProxyE2E fails with FairScheduler

2017-10-09 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7310:
---

 Summary: TestAMRMProxy#testAMRMProxyE2E fails with FairScheduler
 Key: YARN-7310
 URL: https://issues.apache.org/jira/browse/YARN-7310
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestAMRMProxy#testAMRMProxyE2E}} fails with FairScheduler:
{noformat}
[ERROR] testAMRMProxyE2E(org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy)  
Time elapsed: 29.047 s  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyE2E(TestAMRMProxy.java:124)
{noformat}






[jira] [Created] (YARN-7309) TestClientRMService#testUpdateApplicationPriorityRequest and TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource test functionality not supported by Fa

2017-10-09 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7309:
---

 Summary: TestClientRMService#testUpdateApplicationPriorityRequest 
and TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource 
test functionality not supported by FairScheduler
 Key: YARN-7309
 URL: https://issues.apache.org/jira/browse/YARN-7309
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestClientRMService#testUpdateApplicationPriorityRequest}} and 
{{TestClientRMService#testUpdatePriorityAndKillAppWithZeroClusterResource}} 
test functionality (i.e. Application Priorities) not supported by 
FairScheduler.  We should skip these two tests when using FairScheduler or 
they'll fail.
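
A minimal sketch of the skip, assuming JUnit 4's {{Assume}} and the {{MockRM}} 
the tests already start:

{code:java}
import org.junit.Assume;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;

// Called at the top of the two priority tests; rm is the MockRM under test.
private static void assumePrioritySupport(MockRM rm) {
  Assume.assumeFalse("Application priorities are not supported by FairScheduler",
      rm.getResourceScheduler() instanceof FairScheduler);
}
{code}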






[jira] [Resolved] (YARN-7138) Fix incompatible API change for YarnScheduler involved by YARN-5521

2017-10-09 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-7138.
-
Resolution: Won't Fix

Ok.  I've created YARN-7301.

> Fix incompatible API change for YarnScheduler involved by YARN-5521
> ---
>
> Key: YARN-7138
> URL: https://issues.apache.org/jira/browse/YARN-7138
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Junping Du
>Priority: Critical
>
> The JACC report for 2.8.2 against 2.7.4 indicates that we have incompatible 
> changes in YarnScheduler:
> {noformat}
> hadoop-yarn-server-resourcemanager-2.7.4.jar, YarnScheduler.class
> package org.apache.hadoop.yarn.server.resourcemanager.scheduler
> YarnScheduler.allocate ( ApplicationAttemptId p1, List p2, 
> List p3, List p4, List p5 ) [abstract]  :  
> Allocation 
> {noformat}
> The root cause is YARN-5221. We should change it back or work around this by 
> adding back the original API (marked as deprecated if it is no longer used).






[jira] [Created] (YARN-7301) Create stable Scheduler API

2017-10-09 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7301:
---

 Summary: Create stable Scheduler API
 Key: YARN-7301
 URL: https://issues.apache.org/jira/browse/YARN-7301
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Robert Kanter


Currently, it's not practical for a user to create their own scheduler.  
Besides it being a large undertaking, the API is a mess.  A few of the problems:
# We make incompatible changes to {{YarnScheduler}} sometimes (see YARN-7138).
# Many methods in {{YarnScheduler}} are marked as {{\@Public}} {{\@Stable}}, 
but the class itself has no annotations, which defaults to {{\@Private}}.
# We often cast a {{YarnScheduler}} to an {{AbstractYarnScheduler}}, which 
means that custom schedulers must also subclass {{AbstractYarnScheduler}} or 
they'll get a {{ClassCastException}}.  However, {{AbstractYarnScheduler}} is 
{{\@Private}} {{\@Unstable}}.

It could be useful to provide a proper usable API for custom schedulers.
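
As a rough illustration only, a properly annotated entry point could look like 
this (the interface and type names here are made up, not actual YARN classes):

{code:java}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// A hypothetical public, stable surface that a custom scheduler could implement
// without having to subclass the @Private AbstractYarnScheduler.
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface PluggableScheduler {
  /** Handle one heartbeat's worth of resource asks and return the allocation. */
  SchedulerAllocation allocate(SchedulerAllocationRequest request);
}
{code}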






[jira] [Created] (YARN-7262) Add a hierarchy into the ZKRMStateStore for delegation token znodes to prevent jute buffer overflow

2017-09-27 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7262:
---

 Summary: Add a hierarchy into the ZKRMStateStore for delegation 
token znodes to prevent jute buffer overflow
 Key: YARN-7262
 URL: https://issues.apache.org/jira/browse/YARN-7262
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We've seen users running into a problem where the RM stores so many delegation 
tokens in the {{ZKRMStateStore}} that the _listing_ of those znodes is larger 
than the jute buffer limit. This is fine during normal operation, but becomes a 
problem on a failover because the RM will try to read in all of the token 
znodes (i.e. call {{getChildren}} on the parent znode).  This is particularly 
bad because everything appears to be okay until a failover occurs, at which 
point you end up with no active RMs.

There was a similar problem with the Yarn application data that was fixed in 
YARN-2962 by adding a (configurable) hierarchy of znodes so the RM could pull 
subchildren without overflowing the jute buffer (though it's off by default).
We should add a hierarchy similar to that of YARN-2962, but for the delegation 
token znodes.
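
A minimal sketch of the bucketing idea (the names and the split rule are 
illustrative; the real layout should mirror what YARN-2962 did for applications):

{code:java}
// Bucket token znodes by a prefix of their zero-padded sequence number so that a
// single getChildren() call never has to return every token znode at once.
static String delegationTokenZNodePath(String tokensRoot, int sequenceNumber,
    int splitIndex) {
  String node = "RMDelegationToken_" + sequenceNumber;
  if (splitIndex <= 0) {
    return tokensRoot + "/" + node;              // flat layout (today's behavior)
  }
  String padded = String.format("%010d", sequenceNumber);
  String bucket = padded.substring(0, padded.length() - splitIndex);
  return tokensRoot + "/" + bucket + "/" + node;
}
{code}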






[jira] [Created] (YARN-7162) Remove XML excludes file format

2017-09-05 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7162:
---

 Summary: Remove XML excludes file format
 Key: YARN-7162
 URL: https://issues.apache.org/jira/browse/YARN-7162
 Project: Hadoop YARN
  Issue Type: Bug
  Components: graceful
Affects Versions: 2.9.0, 3.0.0-beta1
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker


YARN-5536 aims to replace the XML format for the excludes file with a JSON 
format.  However, it looks like we won't have time for that for Hadoop 3 Beta 
1.  The concern is that if we release it as-is, we'll have to support the XML 
format for all of Hadoop 3.x, even though we're planning either to remove it or 
to rewrite it on top of a pluggable framework.  

[This comment in 
YARN-5536|https://issues.apache.org/jira/browse/YARN-5536?focusedCommentId=16126194=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16126194]
 proposed two quick solutions to prevent this compat issue.  In this JIRA, 
we're going to remove the XML format.  If we later want to add it back in, 
YARN-5536 can add it back, rewriting it to be in the pluggable framework.






[jira] [Created] (YARN-7146) Many RM unit tests failing with FairScheduler

2017-08-31 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7146:
---

 Summary: Many RM unit tests failing with FairScheduler
 Key: YARN-7146
 URL: https://issues.apache.org/jira/browse/YARN-7146
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0-beta1
Reporter: Robert Kanter
Assignee: Robert Kanter


Many of the RM unit tests are failing when using the FairScheduler.  

Here is a list of affected test classes:
{noformat}
TestYarnClient
TestApplicationCleanup
TestApplicationMasterLauncher
TestDecommissioningNodesWatcher
TestKillApplicationWithRMHA
TestNodeBlacklistingOnAMFailures
TestRM
TestRMAdminService
TestRMRestart
TestResourceTrackerService
TestWorkPreservingRMRestart
TestAMRMRPCNodeUpdates
TestAMRMRPCResponseId
TestAMRestart
TestApplicationLifetimeMonitor
TestNodesListManager
TestRMContainerImpl
TestAbstractYarnScheduler
TestSchedulerUtils
TestFairOrderingPolicy
TestAMRMTokens
TestDelegationTokenRenewer
{noformat}
Most of the test methods in these classes are failing, though some do succeed.


There are two main categories of issues:
# The test submits an application to the {{MockRM}} and waits for it to enter a 
specific state, which it never does, and the test times out.  We need to call 
{{update()}} on the scheduler.
# The test throws a {{ClassCastException}} when casting {{FSQueueMetrics}} to 
{{CSQueueMetrics}}.  This is because {{QueueMetrics}} metrics are static, and a 
previous test using FairScheduler initialized them while the current test is 
using CapacityScheduler.  We need to reset the metrics (see the sketch after 
this list).
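
For the second category, the reset could look roughly like this (a sketch 
assuming a JUnit 4 {{@Before}} hook; the patch may reset things elsewhere):

{code:java}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.Before;

@Before
public void resetMetrics() {
  // QueueMetrics caches instances statically per queue name, so metrics created
  // by an earlier FairScheduler test must be cleared before a CapacityScheduler
  // test registers its own.
  QueueMetrics.clearQueueMetrics();
  DefaultMetricsSystem.shutdown();
}
{code}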






[jira] [Created] (YARN-7094) Document that server-side graceful decom is currently not recommended

2017-08-24 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7094:
---

 Summary: Document that server-side graceful decom is currently not 
recommended
 Key: YARN-7094
 URL: https://issues.apache.org/jira/browse/YARN-7094
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: graceful
Affects Versions: 3.0.0-beta1
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker


Server-side NM graceful decom currently does not work correctly when an RM 
failover occurs because we don't persist the info in the state store (see 
YARN-5464).  Given time constraints for Hadoop 3 beta 1, we've decided to 
document this limitation and recommend client-side NM graceful decom in the 
meantime if you need this functionality (see [this 
comment|https://issues.apache.org/jira/browse/YARN-5464?focusedCommentId=16126119=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16126119]).
  Once YARN-5464 is done, we can undo this doc change.






[jira] [Created] (YARN-7020) TestAMRMProxy#testAMRMProxyTokenRenewal is flakey

2017-08-15 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-7020:
---

 Summary: TestAMRMProxy#testAMRMProxyTokenRenewal is flakey
 Key: YARN-7020
 URL: https://issues.apache.org/jira/browse/YARN-7020
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-beta1
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestAMRMProxy#testAMRMProxyTokenRenewal}} is flakey.  It infrequently fails 
with:
{noformat}
testAMRMProxyTokenRenewal(org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy) 
 Time elapsed: 19.036 sec  <<< ERROR!
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: 
Application attempt appattempt_1502837054903_0001_01 doesn't exist in 
ApplicationMasterService cache.
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:355)
at 
org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224)
at 
org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135)
at 
org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:279)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1490)
at org.apache.hadoop.ipc.Client.call(Client.java:1436)
at org.apache.hadoop.ipc.Client.call(Client.java:1346)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy90.allocate(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy91.allocate(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyTokenRenewal(TestAMRMProxy.java:190)
{noformat}






[jira] [Created] (YARN-6993) getResourceCalculatorPlugin for the default should not intercept throwable

2017-08-11 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6993:
---

 Summary: getResourceCalculatorPlugin for the default should not 
intercept throwable
 Key: YARN-6993
 URL: https://issues.apache.org/jira/browse/YARN-6993
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha1, 2.8.0
Reporter: Robert Kanter


YARN-3917 made it so that when {{getResourceCalculatorPlugin}} tries to load 
the default calculator and something bad happens, it catches {{Throwable}} and 
simply logs a warning.  This should be {{Exception}} - we don't want to eat 
things like an {{OutOfMemoryError}}.
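
A minimal sketch of the proposed narrowing (simplified; the logger and the 
wrapper class are illustrative, not the actual plugin-loading code):

{code:java}
import org.apache.hadoop.yarn.util.ResourceCalculatorPlugin;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DefaultPluginLoader {
  private static final Logger LOG =
      LoggerFactory.getLogger(DefaultPluginLoader.class);

  // Narrow the catch so Errors (e.g. OutOfMemoryError) still propagate, while a
  // genuine failure to load the default plugin just disables it as before.
  static ResourceCalculatorPlugin loadDefault() {
    try {
      return new ResourceCalculatorPlugin();   // stand-in for the default plugin
    } catch (Exception e) {                    // was: catch (Throwable t)
      LOG.warn("Failed to instantiate default resource calculator plugin", e);
      return null;
    }
  }
}
{code}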






[jira] [Created] (YARN-6974) Make CuratorBasedElectorService the default

2017-08-08 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6974:
---

 Summary: Make CuratorBasedElectorService the default
 Key: YARN-6974
 URL: https://issues.apache.org/jira/browse/YARN-6974
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 3.0.0-beta1
Reporter: Robert Kanter


YARN-4438 (and cleanup in YARN-5709) added the {{CuratorBasedElectorService}}, 
which does leader election via Curator.  The intention was to leave it off by 
default to allow time for it to bake, and eventually make it the default and 
remove the {{ActiveStandbyElectorBasedElectorService}}.  

We should do that.






[jira] [Created] (YARN-6643) TestRMFailover fails rarely due to port conflict

2017-05-24 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6643:
---

 Summary: TestRMFailover fails rarely due to port conflict
 Key: YARN-6643
 URL: https://issues.apache.org/jira/browse/YARN-6643
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.9.0, 3.0.0-alpha3
Reporter: Robert Kanter
Assignee: Robert Kanter


We've seen various tests in {{TestRMFailover}} fail very rarely with a message 
like "org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.io.IOException: ResourceManager failed to start. Final state is STOPPED".  

After some digging, it turns out that it's due to a port conflict with the 
embedded ZooKeeper in the tests.  The embedded ZooKeeper uses 
{{ServerSocketUtil#getPort}} to choose a free port, but the RMs are configured 
to 1 + <the default port> and 2 + <the default port> (e.g. the default port for 
the RM is 8032, so you'd use 18032 and 28032).

When I was able to reproduce this, I saw that ZooKeeper was using port 18033, 
which is 1 + 8033, the default RM Admin port.  It results in an error like 
this, causing the RM to be unable to start, and hence the original error 
message in the test failure:
{noformat}
2017-05-24 01:16:52,735 INFO  service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service ResourceManager failed in 
state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.net.BindException: Problem binding to [0.0.0.0:18033] 
java.net.BindException: Address already in use; For more details see:  
http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: 
Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in 
use; For more details see:  http://wiki.apache.org/hadoop/BindException
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:171)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:158)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1147)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:310)
Caused by: java.net.BindException: Problem binding to [0.0.0.0:18033] 
java.net.BindException: Address already in use; For more details see:  
http://wiki.apache.org/hadoop/BindException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720)
at org.apache.hadoop.ipc.Server.bind(Server.java:482)
at org.apache.hadoop.ipc.Server$Listener.(Server.java:688)
at org.apache.hadoop.ipc.Server.(Server.java:2376)
at org.apache.hadoop.ipc.RPC$Server.(RPC.java:1042)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:535)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
... 9 more
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.apache.hadoop.ipc.Server.bind(Server.java:465)
... 17 more
2017-05-24 01:16:52,736 DEBUG service.AbstractService 
(AbstractService.java:enterState(452)) - Service: ResourceManager entered state 
STOPPED
{noformat}





[jira] [Created] (YARN-6602) Impersonation does not work if standby RM is contacted first

2017-05-15 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6602:
---

 Summary: Impersonation does not work if standby RM is contacted 
first
 Key: YARN-6602
 URL: https://issues.apache.org/jira/browse/YARN-6602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 3.0.0-alpha3
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker


When RM HA is enabled, impersonation does not work correctly if the Yarn Client 
connects to the standby RM first.  When this happens, the impersonation is 
"lost" and the client does things on behalf of the impersonator user.  We saw 
this with the OOZIE-1770 Oozie on Yarn feature.

I need to investigate this some more, but it appears to be related to 
delegation tokens.  When this issue occurs, the tokens have the owner as 
"oozie" instead of the actual user.  On a hunch, we found a workaround that 
explicitly adding a correct RM HA delegation token fixes the problem:
{code:java}
org.apache.hadoop.yarn.api.records.Token token = 
yarnClient.getRMDelegationToken(ClientRMProxy.getRMDelegationTokenService(conf));
org.apache.hadoop.security.token.Token token2 = new 
org.apache.hadoop.security.token.Token(token.getIdentifier().array(), 
token.getPassword().array(), new Text(token.getKind()), new 
Text(token.getService()));
UserGroupInformation.getCurrentUser().addToken(token2);
{code}







[jira] [Resolved] (YARN-5894) fixed license warning caused by de.ruedigermoeller:fst:jar:2.24

2017-04-28 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-5894.
-
   Resolution: Fixed
Fix Version/s: 2.8.1
   2.9.0

Sure thing - committed to branch-2, branch-2.8, and branch-2.8.1!

> fixed license warning caused by de.ruedigermoeller:fst:jar:2.24
> ---
>
> Key: YARN-5894
> URL: https://issues.apache.org/jira/browse/YARN-5894
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0-alpha1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Blocker
> Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3
>
> Attachments: YARN-5894.00.patch, YARN-5894.01.patch
>
>
> The artifact de.ruedigermoeller:fst:jar:2.24, that ApplicationHistoryService 
> depends on,  shows its license being LGPL 2.1 in our license checking.






[jira] [Created] (YARN-6527) Provide a better out-of-the-box experience for SLS

2017-04-25 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6527:
---

 Summary: Provide a better out-of-the-box experience for SLS
 Key: YARN-6527
 URL: https://issues.apache.org/jira/browse/YARN-6527
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler-load-simulator
Affects Versions: 3.0.0-alpha3
Reporter: Robert Kanter


The example provided with SLS appears to be broken - I didn't see any jobs 
running.  On top of that, it seems like getting SLS to run properly requires a 
lot of hadoop site configs, scheduler configs, etc.  I was only able to get 
something running after [~yufeigu] provided a lot of config files.

We should provide a better out-of-the-box experience for SLS.  






[jira] [Created] (YARN-6359) TestRM#testApplicationKillAtAcceptedState fails rarely due to race condition

2017-03-16 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6359:
---

 Summary: TestRM#testApplicationKillAtAcceptedState fails rarely 
due to race condition
 Key: YARN-6359
 URL: https://issues.apache.org/jira/browse/YARN-6359
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.9.0, 3.0.0-alpha3
Reporter: Robert Kanter
Assignee: Robert Kanter


We've seen (very rarely) a test failure in 
{{TestRM#testApplicationKillAtAcceptedState}}

{noformat}
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRM.testApplicationKillAtAcceptedState(TestRM.java:645)
{noformat}







[jira] [Created] (YARN-6356) Allow different values of yarn.log-aggregation.retain-seconds for succeeded and failed jobs

2017-03-16 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6356:
---

 Summary: Allow different values of 
yarn.log-aggregation.retain-seconds for succeeded and failed jobs
 Key: YARN-6356
 URL: https://issues.apache.org/jira/browse/YARN-6356
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Reporter: Robert Kanter


It would be useful to have a value of {{yarn.log-aggregation.retain-seconds}} 
for succeeded jobs and a different value for failed/killed jobs.  For jobs that 
succeeded, you typically don't care about the logs, so a shorter retention time 
is fine (and saves space/blocks in HDFS).  For jobs that failed or were killed, 
the logs are much more important, and you're likely to want to keep them around 
longer so you have time to look at them.

For instance, you could set it to keep logs for succeeded jobs for 1 day and 
logs for failed/killed jobs for 1 week.
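
For example, the split could end up looking something like this in yarn-site.xml 
(the failed/killed property name is purely illustrative; no such key exists today):

{code:xml}
<!-- Succeeded applications: keep aggregated logs for 1 day. -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>86400</value>
</property>

<!-- Hypothetical new knob for failed/killed applications: keep logs for 1 week. -->
<property>
  <name>yarn.log-aggregation.retain-seconds.failed</name>
  <value>604800</value>
</property>
{code}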






[jira] [Created] (YARN-6051) Create CS test for YARN-6050

2017-01-03 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6051:
---

 Summary: Create CS test for YARN-6050
 Key: YARN-6051
 URL: https://issues.apache.org/jira/browse/YARN-6051
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Kanter
Assignee: Wangda Tan









[jira] [Created] (YARN-6050) AMs can't be scheduled on racks or nodes

2017-01-03 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-6050:
---

 Summary: AMs can't be scheduled on racks or nodes
 Key: YARN-6050
 URL: https://issues.apache.org/jira/browse/YARN-6050
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.9.0, 3.0.0-alpha2
Reporter: Robert Kanter
Assignee: Robert Kanter


Yarn itself supports rack/node aware scheduling for AMs; however, there 
currently are two problems:
# To specify hard or soft rack/node requests, you have to specify more than one 
{{ResourceRequest}}.  For example, if you want to schedule an AM only on 
"rackA", you have to create two {{ResourceRequest}}, like this:
{code}
ResourceRequest.newInstance(PRIORITY, ANY, CAPABILITY, NUM_CONTAINERS, false);
ResourceRequest.newInstance(PRIORITY, "rackA", CAPABILITY, NUM_CONTAINERS, 
true);
{code}
The problem is that the Yarn API doesn't actually allow you to specify more 
than one {{ResourceRequest}} in the {{ApplicationSubmissionContext}}.  The 
current behavior is to either build one from {{getResource}} or directly from 
{{getAMContainerResourceRequest}}, depending on if 
{{getAMContainerResourceRequest}} is null or not.  We'll need to add a third 
method, say {{getAMContainerResourceRequests}}, which takes a list of 
{{ResourceRequest}}s so that clients can specify multiple resource requests (see 
the sketch after this list).
# There are some places where things are hardcoded to overwrite what the client 
specifies.  These are pretty straightforward to fix.
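
A sketch of how a client could express that once the proposed list-valued method 
exists (same placeholder constants as the snippet above; 
{{setAMContainerResourceRequests}} mirrors the proposed getter and is not an 
existing API):

{code:java}
// The ANY request has relaxLocality=false, so the AM may only be placed on rackA.
List<ResourceRequest> amRequests = Arrays.asList(
    ResourceRequest.newInstance(PRIORITY, ResourceRequest.ANY, CAPABILITY,
        NUM_CONTAINERS, false),
    ResourceRequest.newInstance(PRIORITY, "rackA", CAPABILITY,
        NUM_CONTAINERS, true));

// appContext is the ApplicationSubmissionContext being built for the submission.
appContext.setAMContainerResourceRequests(amRequests);  // proposed method
{code}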








[jira] [Created] (YARN-5837) NPE when getting node status of a decommissioned node after an RM restart

2016-11-04 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5837:
---

 Summary: NPE when getting node status of a decommissioned node 
after an RM restart
 Key: YARN-5837
 URL: https://issues.apache.org/jira/browse/YARN-5837
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha1, 2.7.3
Reporter: Robert Kanter
Assignee: Robert Kanter


If you decommission a node, the {{yarn node}} command shows it like this:
{noformat}
>> bin/yarn node -list -all
2016-11-04 08:54:37,169 INFO client.RMProxy: Connecting to ResourceManager at 
0.0.0.0/0.0.0.0:8032
Total Nodes:1
 Node-Id Node-State Node-Http-Address   
Number-of-Running-Containers
192.168.1.69:57560   DECOMMISSIONED 192.168.1.69:8042   
   0
{noformat}
And a full report like this:
{noformat}
>> bin/yarn node -status 192.168.1.69:57560
2016-11-04 08:55:08,928 INFO client.RMProxy: Connecting to ResourceManager at 
0.0.0.0/0.0.0.0:8032
Node Report :
Node-Id : 192.168.1.69:57560
Rack : /default-rack
Node-State : DECOMMISSIONED
Node-Http-Address : 192.168.1.69:8042
Last-Health-Update : Fri 04/Nov/16 08:53:58:802PDT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores
Node-Labels :
Resource Utilization by Node :
Resource Utilization by Containers : PMem:0 MB, VMem:0 MB, VCores:0.0
{noformat}

If you then restart the ResourceManager, you get this report:
{noformat}
>> bin/yarn node -list -all
2016-11-04 08:57:18,512 INFO client.RMProxy: Connecting to ResourceManager at 
0.0.0.0/0.0.0.0:8032
Total Nodes:4
 Node-Id Node-State Node-Http-Address   
Number-of-Running-Containers
 192.168.1.69:-1 DECOMMISSIONED   192.168.1.69:-1   
   0
{noformat}
And when you try to get the full report on the now "-1" node, you get an NPE:
{noformat}
>> bin/yarn node -status 192.168.1.69:-1
2016-11-04 08:57:57,385 INFO client.RMProxy: Connecting to ResourceManager at 
0.0.0.0/0.0.0.0:8032
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.yarn.client.cli.NodeCLI.printNodeStatus(NodeCLI.java:296)
at org.apache.hadoop.yarn.client.cli.NodeCLI.run(NodeCLI.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.yarn.client.cli.NodeCLI.main(NodeCLI.java:63)
{noformat}






[jira] [Resolved] (YARN-5750) YARN-4126 broke Oozie on unsecure cluster

2016-10-24 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-5750.
-
Resolution: Duplicate

YARN-4126 has been reverted from branch-2 and 2.8.  It's now only in 3, where 
it's okay to break this.

> YARN-4126 broke Oozie on unsecure cluster
> -
>
> Key: YARN-5750
> URL: https://issues.apache.org/jira/browse/YARN-5750
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: Peter Cseh
>
> Oozie is using a DummyRenewer on unsecure clusters and can't submit workflows 
> on an unsecure cluster after YARN-4126.
> {noformat}
> org.apache.oozie.action.ActionExecutorException: JA009: 
> org.apache.hadoop.yarn.exceptions.YarnException: java.io.IOException: 
> Delegation Token can be issued only with kerberos authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1092)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:335)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:663)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2423)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2419)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1790)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2419)
> Caused by: java.io.IOException: Delegation Token can be issued only with 
> kerberos authentication
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1065)
>   ... 10 more
>   at 
> org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:457)
>   at 
> org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:437)
>   at 
> org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1128)
>   at 
> org.apache.oozie.action.hadoop.TestJavaActionExecutor.submitAction(TestJavaActionExecutor.java:343)
>   at 
> org.apache.oozie.action.hadoop.TestJavaActionExecutor.submitAction(TestJavaActionExecutor.java:363)
>   at 
> org.apache.oozie.action.hadoop.TestJavaActionExecutor.testKill(TestJavaActionExecutor.java:602)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at junit.framework.TestCase.runTest(TestCase.java:168)
>   at junit.framework.TestCase.runBare(TestCase.java:134)
>   at junit.framework.TestResult$1.protect(TestResult.java:110)
>   at junit.framework.TestResult.runProtected(TestResult.java:128)
>   at junit.framework.TestResult.run(TestResult.java:113)
>   at junit.framework.TestCase.run(TestCase.java:124)
>   at junit.framework.TestSuite.runTest(TestSuite.java:232)
>   at junit.framework.TestSuite.run(TestSuite.java:227)
>   at 
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:24)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 
> org.apache.hadoop.yarn.exceptions.YarnException: java.io.IOException: 
> Delegation Token can be issued only with kerberos authentication
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1092)
>   at 
> 

[jira] [Resolved] (YARN-3220) Create a Service in the RM to concatenate aggregated logs

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-3220.
-
Resolution: Won't Fix

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Create a Service in the RM to concatenate aggregated logs
> -
>
> Key: YARN-3220
> URL: https://issues.apache.org/jira/browse/YARN-3220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>
> Create an {{RMAggregatedLogsConcatenationService}} in the RM that will 
> concatenate the aggregated log files written by the NM (which are in the new 
> {{ConcatableAggregatedLogFormat}} format) when an application finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3728) Add an rmadmin command to compact concatenated aggregated logs

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-3728.
-
Resolution: Won't Fix

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Add an rmadmin command to compact concatenated aggregated logs
> --
>
> Key: YARN-3728
> URL: https://issues.apache.org/jira/browse/YARN-3728
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>
> Create an {{rmadmin}} command to compact any concatenated aggregated log 
> files it finds in the aggregated logs directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3729) Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-3729.
-
Resolution: Won't Fix

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat
> 
>
> Key: YARN-3729
> URL: https://issues.apache.org/jira/browse/YARN-3729
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>
> When serving logs, the {{yarn}} CLI needs to be able to read the 
> ConcatenatableAggregatedLogFormat or the AggregatedLogFormat transparently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3219) Modify the NM to write logs using the ConcatenatableAggregatedLogFormat

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-3219.
-
Resolution: Won't Fix

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Modify the NM to write logs using the ConcatenatableAggregatedLogFormat
> ---
>
> Key: YARN-3219
> URL: https://issues.apache.org/jira/browse/YARN-3219
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>
> The NodeManager should use the {{ConcatenatableAggregatedLogFormat}} from 
> YARN-3218 instead of the {{AggregatedLogFormat}} for writing aggregated log 
> files to HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-2942) Aggregated Log Files should be combined

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-2942.
-
  Resolution: Won't Fix
Target Version/s:   (was: 2.8.0)

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Aggregated Log Files should be combined
> ---
>
> Key: YARN-2942
> URL: https://issues.apache.org/jira/browse/YARN-2942
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: CombinedAggregatedLogsProposal_v3.pdf, 
> CombinedAggregatedLogsProposal_v6.pdf, CombinedAggregatedLogsProposal_v7.pdf, 
> CompactedAggregatedLogsProposal_v1.pdf, 
> CompactedAggregatedLogsProposal_v2.pdf, 
> ConcatableAggregatedLogsProposal_v4.pdf, 
> ConcatableAggregatedLogsProposal_v5.pdf, 
> ConcatableAggregatedLogsProposal_v8.pdf, YARN-2942-preliminary.001.patch, 
> YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
> YARN-2942.003.patch
>
>
> Turning on log aggregation allows users to easily store container logs in 
> HDFS and subsequently view them in the YARN web UIs from a central place.  
> Currently, there is a separate log file for each Node Manager.  This can be a 
> problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
> accumulating many (possibly small) files per YARN application.  The current 
> “solution” for this problem is to configure YARN (actually the JHS) to 
> automatically delete these files after some amount of time.  
> We should improve this by compacting the per-node aggregated log files into 
> one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3218) Implement ConcatenatableAggregatedLogFormat Reader and Writer

2016-10-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-3218.
-
Resolution: Won't Fix

Closing this as "Won't Fix" given we have MAPREDUCE-6415.

> Implement ConcatenatableAggregatedLogFormat Reader and Writer
> -
>
> Key: YARN-3218
> URL: https://issues.apache.org/jira/browse/YARN-3218
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-3218.001.patch, YARN-3218.002.patch
>
>
> We need to create a Reader and Writer for the 
> {{ConcatenatableAggregatedLogFormat}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5566) client-side NM graceful decom doesn't trigger when jobs finish

2016-08-26 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5566:
---

 Summary: client-side NM graceful decom doesn't trigger when jobs 
finish
 Key: YARN-5566
 URL: https://issues.apache.org/jira/browse/YARN-5566
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.
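
For illustration, a rough sketch of what an early-exit wait loop on the client side could look like; this assumes a {{YarnClient}} and the node's {{NodeId}} are already in scope and is not the actual RMAdminCLI code:
{code:java}
// Illustrative sketch only: instead of sleeping for the full timeout, poll the
// node and decommission it as soon as it has no running containers.
void waitForGracefulDecom(YarnClient yarnClient, NodeId nodeId,
    long timeoutMs, long pollIntervalMs) throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (System.currentTimeMillis() < deadline) {
    boolean stillBusy = false;
    for (NodeReport report : yarnClient.getNodeReports(NodeState.DECOMMISSIONING)) {
      if (report.getNodeId().equals(nodeId) && report.getNumContainers() > 0) {
        stillBusy = true;
        break;
      }
    }
    if (!stillBusy) {
      break;  // nothing is running on the node anymore, no need to keep waiting
    }
    Thread.sleep(pollIntervalMs);
  }
  // issue the forceful decommission for nodeId here
}
{code}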



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5515) Compatibility Docs should clarify the policy for what takes precedence when a conflict is found

2016-08-11 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5515:
---

 Summary: Compatibility Docs should clarify the policy for what 
takes precedence when a conflict is found
 Key: YARN-5515
 URL: https://issues.apache.org/jira/browse/YARN-5515
 Project: Hadoop YARN
  Issue Type: Task
  Components: documentation
Affects Versions: 2.7.2
Reporter: Robert Kanter


The Compatibility Docs 
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/Compatibility.html#Java_API)
 list the policies for Private, Public, not annotated, etc. classes and members, 
but they don't say what happens when there's a conflict.  We should obviously 
try to avoid this situation, but it would be good to explicitly state 
what takes precedence.

As an example, until YARN-3225 made it consistent, {{RefreshNodesRequest}} 
looked like this:
{code:java}
@Private
@Stable
public abstract class RefreshNodesRequest {
  @Public
  @Stable
  public static RefreshNodesRequest newInstance() {
    RefreshNodesRequest request = Records.newRecord(RefreshNodesRequest.class);
    return request;
  }
}
{code}
Note that the class is marked {{\@Private}}, but the method is marked 
{{\@Public}}.

In this example, I'd say that the class level should have priority.
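
To make the "class level wins" idea concrete, here is a small, purely illustrative sketch; the nested {{Public}}/{{Private}} annotations below are local stand-ins, not Hadoop's real {{InterfaceAudience}} classes:
{code:java}
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

public class AudienceResolver {
  // Stand-in annotations for the sketch; Hadoop's real ones live in
  // org.apache.hadoop.classification.InterfaceAudience.
  @Retention(RetentionPolicy.RUNTIME) @interface Public {}
  @Retention(RetentionPolicy.RUNTIME) @interface Private {}

  /** Class-level annotation takes precedence over the method-level one. */
  static String effectiveAudience(Method m) {
    Class<?> declaring = m.getDeclaringClass();
    if (declaring.isAnnotationPresent(Private.class)) {
      return "Private";   // class wins, even if the method says @Public
    }
    if (declaring.isAnnotationPresent(Public.class)) {
      return "Public";
    }
    // fall back to the method's own annotation if the class is unannotated
    if (m.isAnnotationPresent(Private.class)) {
      return "Private";
    }
    if (m.isAnnotationPresent(Public.class)) {
      return "Public";
    }
    return "unannotated";
  }
}
{code}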



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5465) Server-Side NM Graceful Decommissioning subsequent call behavior

2016-08-02 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5465:
---

 Summary: Server-Side NM Graceful Decommissioning subsequent call 
behavior
 Key: YARN-5465
 URL: https://issues.apache.org/jira/browse/YARN-5465
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Kanter


The Server-Side NM Graceful Decommissioning feature added by YARN-4676 has the 
following behavior when subsequent calls are made:
# Start a long-running job that has containers running on nodeA
# Add nodeA to the exclude file
# Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully 
decommissioning nodeA
# Wait 30 seconds
# Add nodeB to the exclude file
# Run {{-refreshNodes -g 30 -server}} (30sec)
# After 30 seconds, both nodeA and nodeB shut down

In a nutshell, issuing a subsequent call to gracefully decommission nodes 
updates the timeout for any currently decommissioning nodes.  This makes it 
impossible to gracefully decommission different sets of nodes with different 
timeouts.  Though it does let you easily update the timeout of currently 
decommissioning nodes.

Another behavior we could do is this:
# {color:grey}Start a long-running job that has containers running on nodeA{color}
# {color:grey}Add nodeA to the exclude file{color}
# {color:grey}Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully 
decommissioning nodeA{color}
# {color:grey}Wait 30 seconds{color}
# {color:grey}Add nodeB to the exclude file{color}
# {color:grey}Run {{-refreshNodes -g 30 -server}} (30sec){color}
# After 30 seconds, nodeB shuts down
# After 60 more seconds, nodeA shuts down

This keeps the nodes affected by each call to gracefully decommission nodes 
independent.  You can now have different sets of decommissioning nodes with 
different timeouts.  However, to update the timeout of a currently 
decommissioning node, you'd have to first recommission it, and then 
decommission it again.
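
A sketch of the per-node bookkeeping the second behavior implies (illustrative only, not the actual RM code): each graceful {{-refreshNodes}} call records a deadline only for the nodes it newly excludes, so later calls don't disturb nodes that are already decommissioning:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: keep an independent decommission deadline per node. */
public class DecommissionDeadlines {
  private final Map<String, Long> deadlineMs = new HashMap<>();

  /** Called on each graceful -refreshNodes; only newly excluded nodes get this timeout. */
  public synchronized void onRefresh(Set<String> excludedNodes, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    for (String node : excludedNodes) {
      deadlineMs.putIfAbsent(node, deadline);  // don't overwrite an earlier call's deadline
    }
  }

  public synchronized boolean isExpired(String node, long now) {
    Long deadline = deadlineMs.get(node);
    return deadline != null && now >= deadline;
  }

  /** Recommissioning clears the deadline, so a later call can set a new one. */
  public synchronized void onRecommission(String node) {
    deadlineMs.remove(node);
  }
}
{code}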



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA

2016-08-02 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5464:
---

 Summary: Server-Side NM Graceful Decommissioning with RM HA
 Key: YARN-5464
 URL: https://issues.apache.org/jira/browse/YARN-5464
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Robert Kanter
Assignee: Robert Kanter






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5434) Add -client|server argument for graceful decom

2016-07-26 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-5434:
---

 Summary: Add -client|server argument for graceful decom
 Key: YARN-5434
 URL: https://issues.apache.org/jira/browse/YARN-5434
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: graceful
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We should add {{-client|server}} argument to allow the user to specify if they 
want to use the client-side graceful decom tracking, or the server-side 
tracking (YARN-4676).

Even though the server-side tracking won't go into 2.8, we should add the 
arguments to 2.8 for compatibility between 2.8 and 2.9, when it's added.  In 
2.8, using {{-server}} would just throw an Exception.
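
A rough sketch of that argument handling; the parsing below is illustrative, not the actual RMAdminCLI code:
{code:java}
// Illustrative only: accept -client|-server for forward compatibility, but
// reject -server in a release that lacks server-side tracking.
boolean serverSide = false;
for (String arg : args) {
  if ("-server".equals(arg)) {
    serverSide = true;
  } else if ("-client".equals(arg)) {
    serverSide = false;
  }
}
if (serverSide) {
  throw new UnsupportedOperationException(
      "Server-side graceful decommission tracking is not supported in this release; use -client");
}
{code}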



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4366) Fix Lint Warnings in YARN Common

2016-07-12 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-4366.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.9.0

Thanks [~templedf].  Committed to trunk and branch-2!

> Fix Lint Warnings in YARN Common
> 
>
> Key: YARN-4366
> URL: https://issues.apache.org/jira/browse/YARN-4366
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
> Fix For: 2.9.0
>
> Attachments: YARN-4366.001.patch
>
>
> {noformat}
> [WARNING] 
> /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Router.java:[100,45]
>  non-varargs call of varargs method with inexact argument type for last 
> parameter;
>   cast to java.lang.Class for a varargs call
>   cast to java.lang.Class[] for a non-varargs call and to suppress this 
> warning
> [WARNING] 
> /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factory/providers/RpcFactoryProvider.java:[62,46]
>  non-varargs call of varargs method with inexact argument type for last 
> parameter;
>   cast to java.lang.Class for a varargs call
>   cast to java.lang.Class[] for a non-varargs call and to suppress this 
> warning
> [WARNING] 
> /Users/daniel/NetBeansProjects/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factory/providers/RpcFactoryProvider.java:[64,34]
>  non-varargs call of varargs method with inexact argument type for last 
> parameter;
>   cast to java.lang.Object for a varargs call
>   cast to java.lang.Object[] for a non-varargs call and to suppress this 
> warning
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-4946) RM should write out Aggregated Log Completion file flag next to logs

2016-04-11 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-4946:
---

 Summary: RM should write out Aggregated Log Completion file flag 
next to logs
 Key: YARN-4946
 URL: https://issues.apache.org/jira/browse/YARN-4946
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Haibo Chen


MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
Yarn App into a HAR file.  When run, it seeds the list by looking at the 
aggregated logs directory, and then filters out ineligible apps.  One of the 
criteria involves checking with the RM that an Application's log aggregation 
status is not still running and has not failed.  When the RM "forgets" about an 
older completed Application (e.g. RM failover, enough time has passed, etc), 
the tool won't find the Application in the RM and will just assume that its log 
aggregation succeeded, even if it actually failed or is still running.

We can solve this problem by doing the following:
# When the RM sees that an Application has successfully finished aggregating 
its logs, it will write a flag file next to that Application's log files
# The tool no longer talks to the RM at all.  When looking at the FileSystem, 
it now uses that flag file to determine if it should process those log files.  
If the file is there, it archives, otherwise it does not.
# As part of the archiving process, it will delete the flag file
# (If you don't run the tool, the flag file will eventually be cleaned up by 
the JHS when it cleans up the aggregated logs because it's in the same 
directory)

This improvement has several advantages:
# The edge case about "forgotten" Applications is fixed
# The tool no longer has to talk to the RM; it only has to consult HDFS.  This 
is simpler.
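
A minimal sketch of the flag-file approach described above, using the Hadoop {{FileSystem}} API; the flag file name and directory layout below are assumptions for illustration only:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AggregationFlagFile {
  // Assumed marker name for the sketch; the real name would be decided in the patch.
  private static final String FLAG = "_aggregation_succeeded";

  /** RM side: drop a flag file next to the app's aggregated logs on success. */
  public static void markSucceeded(Configuration conf, Path appLogDir) throws Exception {
    FileSystem fs = appLogDir.getFileSystem(conf);
    fs.create(new Path(appLogDir, FLAG), true).close();
  }

  /** Tool side: only archive apps whose flag file exists (and delete it afterwards). */
  public static boolean shouldArchive(Configuration conf, Path appLogDir) throws Exception {
    FileSystem fs = appLogDir.getFileSystem(conf);
    return fs.exists(new Path(appLogDir, FLAG));
  }
}
{code}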




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2736) Job.getHistoryUrl returns empty string

2016-02-23 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-2736.
-
Resolution: Fixed

> Job.getHistoryUrl returns empty string
> --
>
> Key: YARN-2736
> URL: https://issues.apache.org/jira/browse/YARN-2736
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 2.5.1
>Reporter: Kannan Rajah
>Priority: Critical
>
> getHistoryUrl() method in Job class is returning empty string. Example code:
> job = Job.getInstance(conf);
> job.setJobName("MapReduceApp");
> job.setJarByClass(MapReduceApp.class);
> job.setMapperClass(Mapper1.class);
> job.setCombinerClass(Reducer1.class);
> job.setReducerClass(Reducer1.class);
> job.setMapOutputKeyClass(Text.class);
> job.setMapOutputValueClass(IntWritable.class);
> job.setOutputKeyClass(Text.class);
> job.setOutputValueClass(IntWritable.class);
> job.setNumReduceTasks(1);
> job.setOutputFormatClass(TextOutputFormat.class);
> job.setInputFormatClass(TextInputFormat.class);
> FileInputFormat.addInputPath(job, inputPath);
> FileOutputFormat.setOutputPath(job, outputPath);
>  job.waitForCompletion(true);
>  job.getHistoryUrl();
> It is always returning empty string. Looks like getHistoryUrl() support was 
> removed in YARN-321.
> getTrackingURL() returns correct url though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4408) NodeManager still reports negative running containers

2015-12-01 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-4408:
---

 Summary: NodeManager still reports negative running containers
 Key: YARN-4408
 URL: https://issues.apache.org/jira/browse/YARN-4408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Robert Kanter
Assignee: Robert Kanter


YARN-1697 fixed a problem where the NodeManager metrics could report a negative 
number of running containers.  However, it missed a rare case where this can 
still happen.

YARN-1697 added a flag to indicate if the container was actually launched 
({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which is 
then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
{{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge if 
we actually ran the container and incremented the gauge.  However, this flag 
is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to {{DONE}}.
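
Conceptually, the fix is to apply the same guard on the {{EXITED_WITH_SUCCESS}} to {{DONE}} transition.  A simplified sketch, not the actual ContainerImpl/NodeManagerMetrics code:
{code:java}
// Simplified sketch: decrement the running-containers gauge only if it was
// incremented, i.e. only if the container really went LOCALIZED -> RUNNING.
class ContainerDoneHandler {
  private final java.util.concurrent.atomic.AtomicInteger runningContainers =
      new java.util.concurrent.atomic.AtomicInteger();

  void onLaunched() {
    runningContainers.incrementAndGet();
  }

  void onDone(boolean wasLaunched) {
    if (wasLaunched) {                      // the guard missing on EXITED_WITH_SUCCESS -> DONE
      runningContainers.decrementAndGet();  // never drops below the launches we counted
    }
  }
}
{code}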



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4086) Allow Aggregated Log readers to handle HAR files

2015-08-26 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-4086:
---

 Summary: Allow Aggregated Log readers to handle HAR files
 Key: YARN-4086
 URL: https://issues.apache.org/jira/browse/YARN-4086
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


This is for the YARN changes for MAPREDUCE-6415.  It allows the yarn CLI and 
web UIs to read aggregated logs from HAR files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4019) Add JvmPauseMonitor to ResourceManager and NodeManager

2015-08-04 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-4019:
---

 Summary: Add JvmPauseMonitor to ResourceManager and NodeManager
 Key: YARN-4019
 URL: https://issues.apache.org/jira/browse/YARN-4019
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the ResourceManager 
and NodeManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell

2015-07-21 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3950:
---

 Summary: Add unique SHELL_ID environment variable to 
DistributedShell
 Key: YARN-3950
 URL: https://issues.apache.org/jira/browse/YARN-3950
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


As discussed in [this 
comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027],
 it would be useful to have a monotonically increasing and independent ID of 
some kind that is unique per shell in the distributed shell program.

We can do that by adding a SHELL_ID env var.
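
A sketch of how the AM could assign the ID while building each container's launch context; the env var name and counter below are assumptions for illustration, not the actual DistributedShell code:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class ShellIdAssigner {
  // Monotonically increasing, unique per shell launched by this AM.
  private final AtomicInteger nextShellId = new AtomicInteger(1);

  public ContainerLaunchContext newLaunchContext() {
    Map<String, String> env = new HashMap<>();
    env.put("SHELL_ID", String.valueOf(nextShellId.getAndIncrement()));

    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);  // commands, local resources, etc. omitted in this sketch
    return ctx;
  }
}
{code}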



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3812:
---

 Summary: TestRollingLevelDBTimelineStore fails in trunk due to 
HADOOP-11347
 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 3.0.0
Reporter: Robert Kanter


{{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk.  
I did a git bisect and found that it was due to HADOOP-11347, which changed 
something with umasks in {{FsPermission}}.

{noformat}
Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec  
FAILURE! - in 
org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
  Time elapsed: 1.533 sec   ERROR!
java.lang.UnsupportedOperationException: null
at 
org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)

testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
  Time elapsed: 0.085 sec   ERROR!
java.lang.UnsupportedOperationException: null
at 
org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)

testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
  Time elapsed: 0.07 sec   ERROR!
java.lang.UnsupportedOperationException: null
at 
org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)

testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
  Time elapsed: 0.061 sec   ERROR!
java.lang.UnsupportedOperationException: null
at 
org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
at 

[jira] [Created] (YARN-3729) Modify the yarn CLI to be able to read the ConcatenatableAggregatedLogFormat

2015-05-27 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3729:
---

 Summary: Modify the yarn CLI to be able to read the 
ConcatenatableAggregatedLogFormat
 Key: YARN-3729
 URL: https://issues.apache.org/jira/browse/YARN-3729
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


When serving logs, the {{yarn}} CLI needs to be able to read the 
ConcatenatableAggregatedLogFormat or the AggregatedLogFormat transparently.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3728) Add an rmadmin command to compact concatenated aggregated logs

2015-05-27 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3728:
---

 Summary: Add an rmadmin command to compact concatenated aggregated 
logs
 Key: YARN-3728
 URL: https://issues.apache.org/jira/browse/YARN-3728
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


Create an {{rmadmin}} command to compact any concatenated aggregated log files 
it finds in the aggregated logs directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails

2015-05-05 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3580:
---

 Summary: [JDK 8] TestClientRMService.testGetLabelsToNodes fails
 Key: YARN-3580
 URL: https://issues.apache.org/jira/browse/YARN-3580
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
 Environment: JDK 8
Reporter: Robert Kanter
Assignee: Robert Kanter


When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails:
{noformat}
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3219) Use CombinedAggregatedLogFormat Writer to combine aggregated log files

2015-02-18 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3219:
---

 Summary: Use CombinedAggregatedLogFormat Writer to combine 
aggregated log files
 Key: YARN-3219
 URL: https://issues.apache.org/jira/browse/YARN-3219
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


The NodeManager should use the {{CombinedAggregatedLogFormat}} from YARN-3218 
to append its aggregated log to the per-app log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3220) JHS should display Combined Aggregated Logs when available

2015-02-18 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3220:
---

 Summary: JHS should display Combined Aggregated Logs when available
 Key: YARN-3220
 URL: https://issues.apache.org/jira/browse/YARN-3220
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


The JHS should read the Combined Aggregated Log files created by YARN-3219 when 
the user asks it for logs.  When unavailable, it should fallback to the regular 
Aggregated Log files (the current behavior).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3218) Implement CombinedAggregatedLogFormat Reader and Writer

2015-02-18 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3218:
---

 Summary: Implement CombinedAggregatedLogFormat Reader and Writer
 Key: YARN-3218
 URL: https://issues.apache.org/jira/browse/YARN-3218
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We need to create a Reader and Writer for the CombinedAggregatedLogFormat



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3183) Some classes define hashcode() but not equals()

2015-02-11 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-3183:
---

 Summary: Some classes define hashcode() but not equals()
 Key: YARN-3183
 URL: https://issues.apache.org/jira/browse/YARN-3183
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Minor


These files all define {{hashCode}}, but don't define {{equals}}:
{noformat}
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationAttemptFinishEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationAttemptStartEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationFinishEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingApplicationStartEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingContainerFinishEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/WritingContainerStartEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/AppAttemptFinishedEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/AppAttemptRegisteredEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationCreatedEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationFinishedEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ContainerCreatedEvent.java
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ContainerFinishedEvent.java
{noformat}

This one unnecessarily defines {{equals}}:
{noformat}
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceRetentionSet.java
{noformat}
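
For reference, the usual pattern is to define the two methods together over the same fields.  A generic illustration, unrelated to the classes listed above:
{code:java}
import java.util.Objects;

public class EventKey {
  private final String appId;
  private final long timestamp;

  public EventKey(String appId, long timestamp) {
    this.appId = appId;
    this.timestamp = timestamp;
  }

  @Override
  public int hashCode() {
    return Objects.hash(appId, timestamp);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof EventKey)) {
      return false;
    }
    EventKey other = (EventKey) o;
    return timestamp == other.timestamp && Objects.equals(appId, other.appId);
  }
}
{code}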



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2766) [JDK 8] TestApplicationHistoryClientService fails

2014-10-28 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2766:
---

 Summary: [JDK 8] TestApplicationHistoryClientService fails
 Key: YARN-2766
 URL: https://issues.apache.org/jira/browse/YARN-2766
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Robert Kanter
Assignee: Robert Kanter


{{TestApplicationHistoryClientService.testContainers}} and 
{{TestApplicationHistoryClientService.testApplicationAttempts}} both fail 
because the test assertions are assuming a returned Collection is in a certain 
order.  The collection comes from a HashMap, so the order is not guaranteed, 
plus, according to [this 
page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html],
 there are situations where the iteration order of a HashMap will be different 
between Java 7 and 8.

We should fix the test code to not assume a specific ordering.
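
A small illustration of an order-independent assertion style (hypothetical expected values, not the actual test code):
{code:java}
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderIndependentAssertExample {
  public void checkContainers(List<String> returnedContainerIds) {
    // Compare as sets so the assertion does not depend on HashMap iteration order.
    Set<String> expected = new HashSet<>(Arrays.asList("container_1", "container_2"));
    assertEquals(expected, new HashSet<>(returnedContainerIds));
  }
}
{code}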



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2241) Show nicer messages when ZNodes already exist in ZKRMStateStore on startup

2014-07-01 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2241:
---

 Summary: Show nicer messages when ZNodes already exist in 
ZKRMStateStore on startup
 Key: YARN-2241
 URL: https://issues.apache.org/jira/browse/YARN-2241
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Minor


When using the RMZKStateStore, if you restart the RM, you get a bunch of stack 
traces with messages like 
{{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /rmstore}}.  This is expected as these nodes already exist from 
before.  We should catch these and print nicer messages.
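
The fix amounts to catching the exception on startup and logging a short message instead of a stack trace.  A sketch against the plain ZooKeeper client API (the println stands in for a proper log call):
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZNodeSetup {
  /** Create the znode if needed; if it already exists, just note it quietly. */
  public static void createIfMissing(ZooKeeper zk, String path) throws Exception {
    try {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      System.out.println(path + " already exists, reusing it");  // placeholder for log.info()
    }
  }
}
{code}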



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2199) FairScheduler: Allow max-AM-share to be specified in the root queue

2014-06-24 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2199:
---

 Summary: FairScheduler: Allow max-AM-share to be specified in the 
root queue
 Key: YARN-2199
 URL: https://issues.apache.org/jira/browse/YARN-2199
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.0
Reporter: Robert Kanter
Assignee: Robert Kanter


If users want to specify the max-AM-share, they have to do it for each leaf 
queue individually.  It would be convenient if they could also specify it in 
the root queue so they'd only have to specify it once to apply to all queues.  
It could still be overridden in a specific leaf queue though.
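
Conceptually the lookup just falls back to the root queue's value when a leaf doesn't set its own.  An illustrative sketch, not the actual FairScheduler configuration code (the default value here is an assumption):
{code:java}
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: leaf value wins, otherwise fall back to root, then a default. */
public class MaxAmShareLookup {
  private static final float DEFAULT_MAX_AM_SHARE = 0.5f;  // assumed default for the sketch
  private final Map<String, Float> configuredShares = new HashMap<>();

  public void set(String queue, float share) {
    configuredShares.put(queue, share);
  }

  public float effectiveMaxAmShare(String leafQueue) {
    Float leaf = configuredShares.get(leafQueue);
    if (leaf != null) {
      return leaf;             // explicit per-queue override
    }
    Float root = configuredShares.get("root");
    return root != null ? root : DEFAULT_MAX_AM_SHARE;
  }
}
{code}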



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2187) FairScheduler should have a way of disabling the max AM share check for launching new AMs

2014-06-20 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2187:
---

 Summary: FairScheduler should have a way of disabling the max AM 
share check for launching new AMs
 Key: YARN-2187
 URL: https://issues.apache.org/jira/browse/YARN-2187
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.0
Reporter: Robert Kanter
Assignee: Robert Kanter


Say you have a small cluster with 8gb memory and 5 queues.  This means that 
each queue can have 8gb / 5 = 1.6gb, but an AM requires 2gb to start, so no AMs 
can be started.  We should have a way of disabling this check to prevent this 
problem.
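
A sketch of how a disable sentinel could work (illustrative only, not the actual FairScheduler check); a negative configured share simply skips the check:
{code:java}
// Illustrative only: a negative configured share disables the max-AM-share check.
boolean canRunAm(float queueFairShareMb, float maxAmShare, float amDemandMb, float runningAmMb) {
  if (maxAmShare < 0) {
    return true;  // check disabled, e.g. on a very small cluster
  }
  return runningAmMb + amDemandMb <= queueFairShareMb * maxAmShare;
}
{code}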



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2015) HTTPS doesn't work properly for daemons (RM, JHS, NM)

2014-05-01 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-2015:
---

 Summary: HTTPS doesn't work properly for daemons (RM, JHS, NM)
 Key: YARN-2015
 URL: https://issues.apache.org/jira/browse/YARN-2015
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker


Enabling SSL in the site files and setting up a certificate, keystore, etc. 
doesn't actually enable HTTPS.  The RM, NMs, and JHS will use their HTTPS ports, 
but serve only plain HTTP on them instead of HTTPS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2015) HTTPS doesn't work properly for daemons (RM, JHS, NM)

2014-05-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-2015.
-

Resolution: Invalid

Never mind, this appears to be fixed by YARN-1553.

 HTTPS doesn't work properly for daemons (RM, JHS, NM)
 -

 Key: YARN-2015
 URL: https://issues.apache.org/jira/browse/YARN-2015
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.3.0, 2.4.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker

 Enabling SSL in the site files and setting up a certificate, keystore, etc 
 doesn't actually enable HTTPS.  The RM, NMs, and JHS will use their https 
 port, but use only HTTP on them instead of only HTTPS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens

2014-03-17 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-1795.
-

Resolution: Duplicate
  Assignee: Robert Kanter  (was: Karthik Kambatla)

I tried the patch posted at YARN-1839 and it fixes the problem.  Marking this 
as a duplicate of that.

 After YARN-713, using FairScheduler can cause an InvalidToken Exception for 
 NMTokens
 

 Key: YARN-1795
 URL: https://issues.apache.org/jira/browse/YARN-1795
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker
 Attachments: 
 org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog


 Running the Oozie unit tests against a Hadoop build with YARN-713 causes many 
 of the tests to be flakey.  Doing some digging, I found that they were 
 failing because some of the MR jobs were failing; I found this in the syslog 
 of the failed jobs:
 {noformat}
 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics 
 report from attempt_1394064846476_0013_m_00_0: Container launch failed 
 for container_1394064846476_0013_01_03 : 
 org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
 for 192.168.1.77:50759
at 
 org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
at 
 org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
at 
 org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 {noformat}
 I did some debugging and found that the NMTokenCache has a different port 
 number than what's being looked up.  For example, the NMTokenCache had one 
 token with address 192.168.1.77:58217 but 
 ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. 
 The 58213 address comes from ContainerLauncherImpl's constructor. So when the 
 Container is being launched it somehow has a different port than when the 
 token was created.
 Any ideas why the port numbers wouldn't match?
 Update: This also happens in an actual cluster, not just Oozie's unit tests



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1822) Revisit AM link being broken for work preserving restart

2014-03-13 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter resolved YARN-1822.
-

Resolution: Invalid

YARN-1811 is being done differently, so this is no longer needed.

 Revisit AM link being broken for work preserving restart
 

 Key: YARN-1822
 URL: https://issues.apache.org/jira/browse/YARN-1822
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Robert Kanter

 We should revisit the issue in YARN-1811 as it may require changes once we 
 have work-preserving restarts.  
 Currently, the AmIpFilter is given the active RM at AM 
 initialization/startup, so when the RM fails over and the AM is restarted, 
 this gets recalculated properly.  However, with work-preserving restart, this 
 will now point to the inactive RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1822) Revisit AM link being broken for RM restart

2014-03-11 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-1822:
---

 Summary: Revisit AM link being broken for RM restart
 Key: YARN-1822
 URL: https://issues.apache.org/jira/browse/YARN-1822
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Robert Kanter


We should revisit the issue in YARN-1811 as it may require changes once we have 
work-preserving restarts.  

Currently, the AmIpFilter is given the active RM at AM initialization/startup, 
so when the RM fails over and the AM is restarted, this gets recalculated 
properly.  However, with work-preserving restart, this will now point to the 
inactive RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1811) Error 500 when clicking the Application Master link in the RM UI while a job is running with RM HA

2014-03-10 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-1811:
---

 Summary: Error 500 when clicking the Application Master link in 
the RM UI while a job is running with RM HA
 Key: YARN-1811
 URL: https://issues.apache.org/jira/browse/YARN-1811
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Robert Kanter
Assignee: Robert Kanter


When using RM HA, if you click on the Application Master link in the RM web 
UI while the job is running, you get an Error 500:
{noformat}
HTTP ERROR 500

Problem accessing /proxy/application_1381788742937_0003/. Reason:

Connection refused
Caused by:

java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.init(Socket.java:425)
at java.net.Socket.init(Socket.java:280)
at 
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at 
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at 
org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:334)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1077)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at 

[jira] [Created] (YARN-1795) Oozie tests are flakey after YARN-713

2014-03-06 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-1795:
---

 Summary: Oozie tests are flakey after YARN-713
 Key: YARN-1795
 URL: https://issues.apache.org/jira/browse/YARN-1795
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Robert Kanter


Running the Oozie unit tests against a Hadoop build with YARN-713 causes many 
of the tests to be flakey.  Doing some digging, I found that they were failing 
because some of the MR jobs were failing; I found this in the syslog of the 
failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1394064846476_0013_m_00_0: Container launch failed for 
container_1394064846476_0013_01_03 : 
org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
for 192.168.1.77:50759
   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
{noformat}

I did some debugging and found that the NMTokenCache has a different port 
number than what's being looked up.  For example, the NMTokenCache had one 
token with address 192.168.1.77:58217 but 
ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. 
The 58213 address comes from ContainerLauncherImpl's constructor. So when the 
Container is being launched it somehow has a different port than when the token 
was created.

Any ideas why the port numbers wouldn't match?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1731) ResourceManager should record killed ApplicationMasters for History

2014-02-13 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-1731:
---

 Summary: ResourceManager should record killed ApplicationMasters 
for History
 Key: YARN-1731
 URL: https://issues.apache.org/jira/browse/YARN-1731
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.2.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: YARN-1731.patch

Yarn changes required for MAPREDUCE-5641 to make the RM record when an AM is 
killed so the JHS (or something else) can know about it.  See MAPREDUCE-5641 
for the design I'm trying to follow.  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)