[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529224#comment-14529224 ] Craig Welch commented on YARN-1680: --- I've been looking over [~airbots] prior patches, the discussion, etc.; this is what I was going to suggest as an approach. As I mentioned before, I think that accuracy will unfortunately require holding on to the blacklist in the scheduler app. I think this is OK because these should be relatively small, but it is still a drawback. We could impose a size limit as a mitigating factor, but that could affect accuracy in some cases as well. In any event, this is the approach I'm suggesting:
- Retain a node/rack blacklist in the scheduler application based on additions/removals from the application master
- Add a last-change timestamp or incrementing counter to track node addition/removal at the cluster level (which is what exists for cluster black/white listing afaict), updated when those events occur
- Add a last-change timestamp/counter to the application to track blacklist changes
- Keep last-updated values on the application tracking the above two last-change values, updated when blacklist values are recalculated
- On headroom calculation, the app checks whether it has any entries in the blacklist or a blacklist deduction value in its ResourceUsage entry (see below) to determine if the blacklist must be taken into account
- If the blacklist must be taken into account, check the last-updated values for both cluster and app blacklist changes; if and only if either is stale (last updated != last change), recalculate the blacklist deduction
- When calculating the blacklist deduction, use [~airbots] basic logic from existing patches, and place the deduction value into a new enumeration index type in ResourceUsage. NodeLabels could be taken into account as well: there is some logic about label(s) of interest on the application, so in addition to a no-label value which is generally applicable, a value for the label(s) of interest could be generated
- Whenever the headroom is handed out by the provider, add a step which applies the proper blacklist deduction if present
Thoughts on the approach? availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupied 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted it. All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation still counts the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that counts the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
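A minimal sketch of the staleness check described above, in Java; the names here ({{BlacklistedHeadroom}}, {{availableByNode}}, the version counters) are illustrative assumptions, not the actual patch. The deduction is recomputed only when either the cluster node set or the app's blacklist has changed since the last calculation.
{code}
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

class BlacklistedHeadroom {
  private long clusterVersionSeen = -1;  // cluster node add/remove counter last applied
  private long appVersionSeen = -1;      // app blacklist add/remove counter last applied
  private Resource deduction = Resources.createResource(0, 0);

  Resource headroomFor(Resource headroom, long clusterVersion, long appVersion,
      Set<String> blacklist, Map<String, Resource> availableByNode) {
    if (blacklist.isEmpty() && deduction.getMemory() == 0) {
      return headroom; // no blacklist in play: nothing to deduct
    }
    if (clusterVersion != clusterVersionSeen || appVersion != appVersionSeen) {
      // Stale: recompute the deduction by summing free resource on blacklisted nodes.
      Resource d = Resources.createResource(0, 0);
      for (String host : blacklist) {
        Resource avail = availableByNode.get(host);
        if (avail != null) {
          Resources.addTo(d, avail);
        }
      }
      deduction = d;
      clusterVersionSeen = clusterVersion;
      appVersionSeen = appVersion;
    }
    return Resources.subtract(headroom, deduction);
  }
}
{code}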
[jira] [Created] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
Sunil G created YARN-3579: - Summary: getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String, so it does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529155#comment-14529155 ] Vinod Kumar Vavilapalli commented on YARN-3385: --- Actually, there is a checkstyle warning and a test-related problem. Please look at them. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529194#comment-14529194 ] Dustin Cote commented on YARN-2369: --- [~vinodkv] thanks for the feedback! I would expect (b) for general apps, as well as a specific config for MR jobs. Should I put the config in MRConfig or MRJobConfig instead then? I'll fix the specific comments you raised once I have the test case in too; I'll put them all in the next patch. Thanks again! Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Dustin Cote Attachments: YARN-2369-1.patch, YARN-2369-2.patch When processing environment variables for a container context, the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
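A hedged sketch of what replace-by-default could look like: only a whitelisted set of path-like variables keeps the append behavior. The helper and its whitelist are illustrative assumptions, not the proposed patch.
{code}
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EnvSetter {
  // Illustrative whitelist: only these keep path-like append semantics.
  private static final Set<String> PATH_LIKE = new HashSet<String>(
      Arrays.asList("PATH", "CLASSPATH", "LD_LIBRARY_PATH"));

  static void setEnv(Map<String, String> env, String key, String value) {
    String existing = env.get(key);
    if (existing != null && PATH_LIKE.contains(key)) {
      // Path-like semantics: append with the platform path separator.
      env.put(key, existing + File.pathSeparator + value);
    } else {
      // Everything else: replace, don't append.
      env.put(key, value);
    }
  }
}
{code}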
[jira] [Updated] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
[ https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3579: -- Issue Type: Sub-task (was: Bug) Parent: YARN-2492 getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String, so it does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528908#comment-14528908 ] Wangda Tan commented on YARN-3514: -- [~cnauroth], bq. I've seen a few mentions online that Active Directory is not case-sensitive but is case-preserving. That means it will preserve the case you used in usernames, but the case doesn't matter for comparisons. I've also seen references that DNS has similar behavior with regards to case. Good point! I've found one post about this: https://msdn.microsoft.com/en-us/library/bb726984.aspx: bq. Note: Although Windows 2000 stores user names in the case that you enter, user names aren't case sensitive. For example, you can access the Administrator account with the user name Administrator or administrator. Thus, user names are case-aware but not case-sensitive. So I think it's safe to make this change too. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and is configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. The error that comes out of the RM logs:
2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output:
main : command provided 0
main : user is DOMAIN\hadoopuser
main : requested yarn user is domain\hadoopuser
org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
.Failing this attempt.. Failing the application.'
However, when we look on the node launching the AM, we see this:
[root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
[root@rpb-cdh-kerb-2 usercache]# ls -l
drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I'm not sure where or why the URL escaping converts the \ to %5C or why this is not consistent. I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
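For what it's worth, standard URL encoding does turn a backslash into %5C, which is consistent with one code path percent-encoding the username while another uses it raw. This is a demonstration of the encoding only, not a claim about where YARN performs it:
{code}
import java.net.URLEncoder;

public class EscapeDemo {
  public static void main(String[] args) throws Exception {
    // Backslash is not an "unreserved" character, so it gets percent-encoded:
    System.out.println(URLEncoder.encode("domain\\hadoopuser", "UTF-8"));
    // prints: domain%5Chadoopuser
  }
}
{code}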
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528990#comment-14528990 ] Hudson commented on YARN-3396: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #185 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/185/]) YARN-3396. Handle URISyntaxException in ResourceLocalizationService. (Contributed by Brahma Reddy Battula) (junping_du: rev 38102420621308f5ba91cdeb6a18a63aa5acf640) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Fix For: 2.8.0 Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2918) Don't fail RM if queue's configured labels are not existed in cluster-node-labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528991#comment-14528991 ] Wangda Tan commented on YARN-2918: -- Added more details to the description. I plan to do the following in this patch:
- Stop checking a label's existence while initializing queues
- Continue checking each label's capacity setting ({{Σchild-queue.label.capacity = 100}})
- Reject an application/resource-request if its label does not exist.
Don't fail RM if queue's configured labels are not existed in cluster-node-labels - Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Wangda Tan Currently, if an admin sets up labels on queues ({{queue-path.accessible-node-labels = ...}}) and a label is not added to the RM, queue initialization will fail and the RM will fail too:
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
...
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
This is not a good user experience; we should stop failing the RM so that the admin can configure queues/labels in the following steps:
- Configure queue (with label)
- Start RM
- Add labels to RM
- Submit applications
Now the admin has to:
- Configure queue (without label)
- Start RM
- Add labels to RM
- Refresh queue's config (with label)
- Submit applications
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
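A rough sketch of the per-label capacity check that would be kept even when label existence is no longer validated at queue init; the helper and its input shape are hypothetical, not the patch itself:
{code}
import java.util.Map;

public class LabelCapacityCheck {
  // childCapacityPct: each child queue's configured capacity (in percent)
  // for one label; their sum must be 100 for the parent to be valid.
  static void checkLabelCapacities(String label, Map<String, Float> childCapacityPct) {
    float sum = 0f;
    for (float pct : childCapacityPct.values()) {
      sum += pct;
    }
    if (Math.abs(sum - 100f) > 1e-4f) {
      throw new IllegalArgumentException("Capacities of children for label '"
          + label + "' sum to " + sum + ", expected 100");
    }
  }
}
{code}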
[jira] [Commented] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528987#comment-14528987 ] Hudson commented on YARN-3552: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #185 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/185/]) YARN-3552. RM Web UI shows -1 running containers for completed apps. Contributed by Rohith (jlowe: rev 9356cf8676fd18f78655e8a6f2e6c946997dbd40) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
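One plausible guard at render time (not necessarily the committed fix) is to show a placeholder instead of the dummy negative values:
{code}
// Hypothetical display helper: dummy reports carry -1, which should not
// be presented to users as a real container count.
static String displayCount(int runningContainers) {
  return runningContainers < 0 ? "N/A" : String.valueOf(runningContainers);
}
{code}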
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529004#comment-14529004 ] Jian He commented on YARN-3343: --- cool, thanks for testing! Committing this. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message: test timed out after 3 milliseconds Stacktrace:
java.lang.Exception: test timed out after 3 milliseconds
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
at java.net.InetAddress.getAllByName(InetAddress.java:1162)
at java.net.InetAddress.getAllByName(InetAddress.java:1098)
at java.net.InetAddress.getByName(InetAddress.java:1048)
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563)
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3526) ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster
[ https://issues.apache.org/jira/browse/YARN-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-3526: - Assignee: Yang Weiwei [~cheersyang], assigning this to you. Please consider writing a test case. ApplicationMaster tracking URL is incorrectly redirected on a QJM cluster - Key: YARN-3526 URL: https://issues.apache.org/jira/browse/YARN-3526 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Environment: Red Hat Enterprise Linux Server 6.4 Reporter: Yang Weiwei Assignee: Yang Weiwei Attachments: YARN-3526.patch On a QJM HA cluster, when viewing the RM web UI to track job status, it shows "This is standby RM. Redirecting to the current active RM: http://active-RM:8088/proxy/application_1427338037905_0008/mapreduce". It refreshes every 3 seconds but never goes to the correct tracking page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528891#comment-14528891 ] zhihai xu commented on YARN-3385: - By the way, I forgot to mention: if the NoNodeException happened due to this race condition, it means the delete operations were already done. Because {{zkClient.multi}} will either execute all of the Ops or none of them, all of the delete operations must have been done. [This|http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html] is a good article about multi update for ZooKeeper. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
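A minimal sketch of the mitigation this reasoning implies, assuming {{zkClient}} and {{LOG}} fields exist on the store: if the znode is already gone, the delete has effectively succeeded, so swallow the NoNodeException instead of escalating it to a fatal state-store error. (The actual patch works on the multi-op path, so this is illustrative only.)
{code}
import org.apache.zookeeper.KeeperException;

// Assumed context: zkClient is an org.apache.zookeeper.ZooKeeper handle,
// LOG is the store's logger.
void safeDelete(String path) throws Exception {
  try {
    zkClient.delete(path, -1); // -1 matches any znode version
  } catch (KeeperException.NoNodeException e) {
    // Node absent means a previous (retried) attempt already deleted it.
    LOG.info("Node " + path + " already deleted, ignoring");
  }
}
{code}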
[jira] [Commented] (YARN-1878) Yarn standby RM taking long to transition to active
[ https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528934#comment-14528934 ] Xuan Gong commented on YARN-1878: - Cancel the patch. Looks like we need more discussion on this one. Yarn standby RM taking long to transition to active --- Key: YARN-1878 URL: https://issues.apache.org/jira/browse/YARN-1878 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Xuan Gong Attachments: YARN-1878.1.patch In our HA tests we are noticing that it can sometimes take up to 10s for the standby RM to transition to active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528952#comment-14528952 ] Xuan Gong commented on YARN-1912: - Cancel the patch since the current patch does not apply anymore. [~iwasakims] Could you rebase the patch, please? ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In LinuxContainerExecutor.java#startLocalizer, the command does not specify any -Xmx option, which causes the ResourceLocalizer to be started with the default memory setting. On server-class hardware it will use 25% of the system memory as the max heap size, which can cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
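An illustrative sketch of what a bounded localizer launch could look like; the configuration key below is made up, and an actual fix would need an agreed property name and default:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer;

class LocalizerCommand {
  // "yarn.nodemanager.localizer.java.heap" is a hypothetical key,
  // not a real YARN property.
  static List<String> build(Configuration conf, String javaHome) {
    List<String> cmd = new ArrayList<String>();
    cmd.add(javaHome + "/bin/java");
    cmd.add("-Xmx" + conf.get("yarn.nodemanager.localizer.java.heap", "256m"));
    cmd.add(ContainerLocalizer.class.getName());
    return cmd;
  }
}
{code}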
[jira] [Updated] (YARN-2918) Don't fail RM if queue's configured labels are not existed in cluster-node-labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2918: - Summary: Don't fail RM if queue's configured labels are not existed in cluster-node-labels (was: RM starts up fails if accessible-node-labels are configured to queue without cluster lables) Don't fail RM if queue's configured labels are not existed in cluster-node-labels - Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Wangda Tan I configured accessible-node-labels for a queue, but RM startup fails with the below exception. I see the current steps to configure NodeLabels are to first add them via rmadmin and later configure them for queues. But it would be good if both cluster and queue node labels were consistent in how they are configured.
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:982)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:249)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1203)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3578) HistoryFileManager.scanDirectory() should check if the dateString path exists else it throw FileNotFoundException
Siddhi Mehta created YARN-3578: -- Summary: HistoryFileManager.scanDirectory() should check if the dateString path exists else it throw FileNotFoundException Key: YARN-3578 URL: https://issues.apache.org/jira/browse/YARN-3578 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.3.0 Reporter: Siddhi Mehta This happens when the job client tries to access counters for a recently completed job. Here is what I think is happening:
1. The job in question started and completed on 05/02/2015, so the history file location should be /mapred/history/done/2015/05/02/{02}/
2. But instead HistoryFileManager looks at directory /mapred/history/done/2015/04/02/{02}/ and fails. Looking at the logic in {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanOldDirsForJob(JobId)}} for how the idtoDateString cache is created, the key looks independent of the RM start time. So if you had 2 jobs, job_RMstarttime1_0001 and job_RMstarttime2_0001, the idtoDateString cache will have the entry 1 -> {job_RMstarttime1_0001historydir, job_RMstarttime2_0001historyDir}.
3. If job_RMstarttime1_0001 is older than mapreduce.jobhistory.max-age-ms, we delete its history info from HDFS.
4. When we then try to query job_RMstarttime2_0001historyDir, it fails with a FileNotFoundException.
Either the keys should be aware of the RM start time, or HistoryFileManager.scanDirectory should check that the path exists before doing a listStatus, to avoid the FileNotFoundException:
{code}
private static List<FileStatus> scanDirectory(Path path, FileContext fc,
    PathFilter pathFilter) throws IOException {
  path = fc.makeQualified(path);
  List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
  if (!fc.exists(path)) {
    return jhStatusList;
  }
  RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
  while (fileStatusIter.hasNext()) {
    FileStatus fileStatus = fileStatusIter.next();
    Path filePath = fileStatus.getPath();
    if (fileStatus.isFile() && pathFilter.accept(filePath)) {
      jhStatusList.add(fileStatus);
    }
  }
  return jhStatusList;
}
{code}
Complete stack trace: gslog`20150504141445.816``263424`0`0189246-10858515`754671855`/ex/UnhandledException.jsp`JAVA.FileNotFoundException - java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:147)
at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:217)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:203)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:199)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:231)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
Caused by: java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:205)
at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:189)
at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:171)
at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:171)
at org.apache.hadoop.fs.FileContext$20.next(FileContext.java:1392)
at
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529030#comment-14529030 ] Vinod Kumar Vavilapalli commented on YARN-3561: --- bq. Could this be OS specific (debian 7)? Possible. Can you post the full NM logs? Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes:
*host-07* - where Slider AM was running
*host-03* - where Storm NIMBUS container was running.
*Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment.
*NM log from host-07 where Slider AM container was running:*
{noformat}
2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01
2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE)
2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn
2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021
2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING
2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01
2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.
2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021
2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING
2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING
2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021
2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING
2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/confdir/log4j-server.properties transitioned from INIT to
[jira] [Commented] (YARN-2123) Progress bars in Web UI always at 100% due to non-US locale
[ https://issues.apache.org/jira/browse/YARN-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528855#comment-14528855 ] Xuan Gong commented on YARN-2123: - +1 LGTM. Will commit Progress bars in Web UI always at 100% due to non-US locale --- Key: YARN-2123 URL: https://issues.apache.org/jira/browse/YARN-2123 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.3.0 Reporter: Johannes Simon Assignee: Akira AJISAKA Attachments: NaN_after_launching_RM.png, YARN-2123-001.patch, YARN-2123-002.patch, YARN-2123-003.patch, YARN-2123-004.patch, YARN-2123-branch-2.7.001.patch, fair-scheduler-ajisaka.xml, screenshot-noPatch.png, screenshot-patch.png, screenshot.png, yarn-site-ajisaka.xml In our cluster setup, the YARN web UI always shows progress bars at 100% (see screenshot, progress of the reduce step is roughly at 32.82%). I opened the HTML source code to check (also see screenshot), and it seems the problem is that it uses a comma as decimal mark, where most browsers expect a dot for floating-point numbers. This could possibly be due to localized number formatting being used in the wrong place, which would also explain why this bug is not always visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1843) LinuxContainerExecutor should always log output
[ https://issues.apache.org/jira/browse/YARN-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528913#comment-14528913 ] Xuan Gong commented on YARN-1843: - [~liangly] Could you rebase the patch, please? The current patch does not apply anymore. LinuxContainerExecutor should always log output --- Key: YARN-1843 URL: https://issues.apache.org/jira/browse/YARN-1843 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Liyin Liang Assignee: Liyin Liang Priority: Trivial Attachments: YARN-1843-1.diff, YARN-1843-2.diff, YARN-1843.diff If debug is enabled, LinuxContainerExecutor should always log output after shExec.execute(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528966#comment-14528966 ] Xuan Gong commented on YARN-2684: - Cancel the patch since it does not apply anymore. [~rohithsharma] Could you rebase the patch, please? FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS; this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528969#comment-14528969 ] Hadoop QA commented on YARN-3448: -
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 42s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 40s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 50s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | native | 3m 14s | Pre-build of native portion |
| {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 3m 26s | Tests passed in hadoop-yarn-server-applicationhistoryservice. |
| {color:green}+1{color} | hdfs tests | 0m 15s | Tests passed in hadoop-hdfs-client. |
| | | 46m 17s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730542/YARN-3448.16.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 9b01f81 |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt |
| hadoop-hdfs-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/7710/artifact/patchprocess/testrun_hadoop-hdfs-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7710/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7710/console |
This message was automatically generated. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.10.patch, YARN-3448.12.patch, YARN-3448.13.patch, YARN-3448.14.patch, YARN-3448.15.patch, YARN-3448.16.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch, YARN-3448.9.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time.
An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on its unique usage patterns; with 5 separate databases each lookup is much faster. This can also help with I/O by placing the entity and index databases on separate disks. Use rolling DBs for the entity and index DBs: 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time DB removal with filesystem removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must place a constraint to always put an entity's events into its correct rolling DB instance based on start time. This allows us to stitch the data back together while reading, with artificial paging. Relax the synchronous write constraints. If we are willing to accept losing some
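A sketch of the start-time constraint just described; the one-hour rolling period is an assumption for illustration. Keying the rolling instance off the entity's start time keeps all of an entity's events together, so expired instances can be dropped with a single filesystem delete instead of record-at-a-time leveldb deletes:
{code}
import java.util.concurrent.TimeUnit;

class RollingDbSelector {
  // Assumed rolling period; the real plugin would make this configurable.
  private final long periodMillis = TimeUnit.HOURS.toMillis(1);

  // Stable for the entity's lifetime, since start time never changes:
  // every write for the entity lands in the same rolling instance.
  long dbIndexFor(long entityStartTime) {
    return entityStartTime / periodMillis;
  }
}
{code}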
[jira] [Updated] (YARN-3578) Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException)
[ https://issues.apache.org/jira/browse/YARN-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddhi Mehta updated YARN-3578: --- Summary: Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException) (was: Accessing Counters after RM restart fails with FileNotFoundException) Accessing Counters after RM restart results in stale cache (fails with FileNotFoundException) - Key: YARN-3578 URL: https://issues.apache.org/jira/browse/YARN-3578 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.3.0 Reporter: Siddhi Mehta This happens when the job client tries to access counters for a recently completed job. Here is what I think is happening:
1. The job in question started and completed on 05/02/2015, so the history file location should be /mapred/history/done/2015/05/02/{02}/
2. But instead HistoryFileManager looks at directory /mapred/history/done/2015/04/02/{02}/ and fails. Looking at the logic in {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanOldDirsForJob(JobId)}} for how the idtoDateString cache is created, the key looks independent of the RM start time. So if you had 2 jobs, job_RMstarttime1_0001 and job_RMstarttime2_0001, the idtoDateString cache will have the entry 1 -> {job_RMstarttime1_0001historydir, job_RMstarttime2_0001historyDir}.
3. If job_RMstarttime1_0001 is older than mapreduce.jobhistory.max-age-ms, we delete its history info from HDFS.
4. When we then try to query job_RMstarttime2_0001historyDir, it fails with a FileNotFoundException.
Either the keys should be aware of the RM start time, or HistoryFileManager.scanDirectory should check that the path exists before doing a listStatus, to avoid the FileNotFoundException:
{code}
private static List<FileStatus> scanDirectory(Path path, FileContext fc,
    PathFilter pathFilter) throws IOException {
  path = fc.makeQualified(path);
  List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
  if (!fc.exists(path)) {
    return jhStatusList;
  }
  RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
  while (fileStatusIter.hasNext()) {
    FileStatus fileStatus = fileStatusIter.next();
    Path filePath = fileStatus.getPath();
    if (fileStatus.isFile() && pathFilter.accept(filePath)) {
      jhStatusList.add(fileStatus);
    }
  }
  return jhStatusList;
}
{code}
Complete stack trace: gslog`20150504141445.816``263424`0`0189246-10858515`754671855`/ex/UnhandledException.jsp`JAVA.FileNotFoundException - java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does not exist.
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:147)
at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:217)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:203)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:199)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:231)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
Caused by: java.io.FileNotFoundException: File /mapred/history/done/2015/04/02/02 does
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529037#comment-14529037 ] Hudson commented on YARN-3343: -- FAILURE: Integrated in Hadoop-trunk-Commit #7739 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7739/]) YARN-3343. Increased TestCapacitySchedulerNodeLabelUpdate#testNodeUpdate timeout. Contributed by Rohith Sharmaks (jianhe: rev e4c3b52c896291012f869ebc0a21e85e643fadd1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerNodeLabelUpdate.java * hadoop-yarn-project/CHANGES.txt TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Fix For: 2.8.0 Attachments: 0001-YARN-3343.patch Error Message: test timed out after 3 milliseconds Stacktrace:
java.lang.Exception: test timed out after 3 milliseconds
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
at java.net.InetAddress.getAllByName(InetAddress.java:1162)
at java.net.InetAddress.getAllByName(InetAddress.java:1098)
at java.net.InetAddress.getByName(InetAddress.java:1048)
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563)
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178)
at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529052#comment-14529052 ] Wangda Tan commented on YARN-3521: -- [~sunilg], Thanks for updating. I had an offline sync with Vinod about using objects or strings in the API; some suggestions:
- addToClusterNodeLabel should use objects (you've done this in your patch).
- getLabelsOnNode, getNodeToLabels, and getLabelsToNodes should use objects; this lets users easily understand the attributes of labels on nodes without calling getClusterNodeLabels. (You have done some of them, but getLabelsToNodes should be updated as well.)
- replace/remove should use a list of label names only; the label name is the unique key of a node label, so using a NodeLabelInfo object here is unnecessary.
- I found in your patch that, when calling getNodeToLabels, it returns NodeLabelInfo with default attributes; we can fix this in a separate patch (we need to make changes to NodeLabelsManager too).
- The RPC API should be consistent with this; it should be addressed in a separate JIRA.
I'm fine with dropping NodeLabelNames as well, if it can keep the REST returned structure clean :). Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528194#comment-14528194 ] Rohith commented on YARN-3543: -- It would be good if it can be done in a different JIRA, since it is a different module. I feel it need not be mixed with this. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: 0001-YARN-3543.patch Updated the patch, fixing test failures. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
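To make the proposal concrete, a minimal sketch of how a client could read the flag once ApplicationReport exposes it; isUnmanagedApp() is a hypothetical accessor standing in for whatever the patch adds, everything else is the existing YarnClient API:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class AmManagedCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
    ApplicationReport report = client.getApplicationReport(appId);
    // isUnmanagedApp() is the hypothetical new accessor this JIRA proposes;
    // today the flag is only visible on ApplicationSubmissionContext at
    // submission time.
    System.out.println("unmanaged AM: " + report.isUnmanagedApp());
    client.stop();
  }
}
{code}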
[jira] [Updated] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2442: - Target Version/s: 2.8.0 Affects Version/s: 2.6.0 2.7.0 ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0, 2.6.0, 2.7.0 Reporter: Nishan Shetty Assignee: Rohith ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528230#comment-14528230 ] Rohith commented on YARN-2442: -- Apologies for coming back to this after a very long time. IMHO, some customers have a use case for monitoring the daemons over JMX. Systems similar to Ambari, i.e. Hadoop cluster operation managers, use JMX to monitor the daemons, e.g. for health-check services, and start the daemons with JMX enabled by default. Currently ClusterMetrics, QueueMetrics and RMNMInfo are registered with JMX, and these metrics can be retrieved through it. Similarly, another MBean, say an RMInfoMBean, registered with basic RM info such as the HA state, securityEnabled and other required RM attributes would be helpful for JMX-dependent users. This is very similar to HDFS's NameNodeStatusMXBean. Kindly give your opinion and thoughts on this. ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Assignee: Rohith ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
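A minimal sketch of what such an MBean could look like, modeled on HDFS's NameNodeStatusMXBean; the interface name, attributes, and the RMStateSource handle are assumptions for illustration, not an actual patch:
{code}
import org.apache.hadoop.metrics2.util.MBeans;

// Hypothetical handle to the RM's runtime state.
interface RMStateSource {
  String getHAStateName();
  boolean isSecurityEnabled();
}

// Sketch only: attribute set is illustrative.
public interface RMInfoMXBean {
  String getHAState();            // INITIALIZING / ACTIVE / STANDBY / STOPPED
  boolean isSecurityEnabled();
}

class RMInfo implements RMInfoMXBean {
  private final RMStateSource rm;

  RMInfo(RMStateSource rm) {
    this.rm = rm;
    // Appears as Hadoop:service=ResourceManager,name=RMInfo to JMX clients.
    MBeans.register("ResourceManager", "RMInfo", this);
  }

  @Override public String getHAState() { return rm.getHAStateName(); }
  @Override public boolean isSecurityEnabled() { return rm.isSecurityEnabled(); }
}
{code}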
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528247#comment-14528247 ] Junping Du commented on YARN-3396: -- Thanks [~brahmareddy] for delivering a patch here. Some quick feedback:
1. Why are we setting the log level to INFO in the first case while setting ERROR for the second? I think we should keep them consistent; probably ERROR is suitable for both cases.
2. This particular exception gets thrown when decoding the path from the URL it contains, so we should log rsrc.getResource() and next.getResource() there instead of the values being logged now.
3. Use more informative words than just "Got exception parsing." - maybe something like "Got exception in parsing URL of LocalResource: " + next.getResource()?
Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Attachments: YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
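A minimal sketch of the shape of fix being discussed, assuming the decode happens via ConverterUtils.getPathFromYarnURL; the helper class and the null-return policy are illustrative only, not the actual patch:
{code}
import java.net.URISyntaxException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.util.ConverterUtils;

// Sketch: resolve the LocalResource's URL to a Path, logging the offending
// resource at ERROR (per the feedback above) instead of leaving the TODO.
class LocalResourcePathHelper {
  private static final Log LOG =
      LogFactory.getLog(LocalResourcePathHelper.class);

  static Path pathOf(LocalResource rsrc) {
    try {
      return ConverterUtils.getPathFromYarnURL(rsrc.getResource());
    } catch (URISyntaxException e) {
      LOG.error("Got exception in parsing URL of LocalResource: "
          + rsrc.getResource(), e);
      return null; // caller skips/cleans up the bad resource
    }
  }
}
{code}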
[jira] [Commented] (YARN-2918) RM startup fails if accessible-node-labels are configured for a queue without cluster labels
[ https://issues.apache.org/jira/browse/YARN-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528202#comment-14528202 ] Rohith commented on YARN-2918: -- Sure, thanks for your interest. RM startup fails if accessible-node-labels are configured for a queue without cluster labels --- Key: YARN-2918 URL: https://issues.apache.org/jira/browse/YARN-2918 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith I configured accessible-node-labels for a queue, but RM startup fails with the exception below. I see the current steps to configure node labels are to first add them via rmadmin and later configure them for queues. It would be good if cluster and queue node labels were consistent in how they are configured.
{noformat}
2014-12-03 20:11:50,126 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:556)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:982)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:249)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1203)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = x, please check.
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:109)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:324)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
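For reference, a sketch of the working order of operations the description implies, with an illustrative queue name "a" and label "x"; the property path follows the capacity-scheduler convention and should be treated as an assumption for your own queue layout:
{noformat}
# 1. Add the label to the cluster first (otherwise CapacityScheduler init fails):
yarn rmadmin -addToClusterNodeLabels "x"

# 2. Only then reference it from a queue in capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.root.a.accessible-node-labels</name>
  <value>x</value>
</property>
{noformat}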
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528216#comment-14528216 ] Hadoop QA commented on YARN-3543: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 43s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| {color:red}-1{color} | javac | 3m 27s | The patch appears to cause the build to fail. |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730462/0001-YARN-3543.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 318081c |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7705/console |
This message was automatically generated. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528248#comment-14528248 ] Rohith commented on YARN-2267: -- Thanks [~sunilg] for your views and [~zjshen] for your inputs. As of now we are not working on this, so I prefer to close it. Will make a note that when we reopen the JIRA, we will come up with a better proposal document. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3385: Attachment: YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
  at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
  at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529278#comment-14529278 ] zhihai xu commented on YARN-3385: - I attached a new patch, YARN-3385.003.patch, which fixes the checkstyle issue. Also, it is strange that the test report log didn't show any test failure. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
  at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
  at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
  at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
  at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
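The usual fix for this class of race, as in YARN-3023, is to make the retried operation idempotent. A minimal sketch under that assumption (class, method, and logging are illustrative, not the actual patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;

// Sketch: tolerate NoNodeException on delete, since a retried delete may
// run after an earlier attempt already removed the znode.
class SafeZkDelete {
  private static final Log LOG = LogFactory.getLog(SafeZkDelete.class);
  private final ZooKeeper zk;

  SafeZkDelete(ZooKeeper zk) { this.zk = zk; }

  void safeDelete(String path) throws Exception {
    try {
      zk.delete(path, -1); // -1 matches any version
    } catch (NoNodeException e) {
      // A previous attempt already removed it; treat the delete as done.
      LOG.info("Node " + path + " already deleted, ignoring NoNodeException");
    }
  }
}
{code}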
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529314#comment-14529314 ] Jonathan Eagles commented on YARN-3448: --- [~zjshen], [~jlowe], can you have another look now that I have gotten my Hadoop QA +1? Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.10.patch, YARN-3448.12.patch, YARN-3448.13.patch, YARN-3448.14.patch, YARN-3448.15.patch, YARN-3448.16.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch, YARN-3448.5.patch, YARN-3448.7.patch, YARN-3448.8.patch, YARN-3448.9.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time.
- Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on the unique usage patterns of each database. With 5 separate databases each lookup is much faster. This can also help with I/O by placing the entity and index databases on separate disks.
- Rolling DBs for the entity and index DBs. 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time removal with file-system removal if we create a rolling set of databases that age out and can be removed efficiently. To do this we must place a constraint to always put an entity's events into its correct rolling DB instance based on start time. This allows us to stitch the data back together while reading, with artificial paging.
- Relax the synchronous-write constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster.
- Prefer sequential writes. Sequential writes can be several times faster than random writes. Spend some small effort arranging the writes in a way that trends toward sequential-write performance over random-write performance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
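To illustrate the rolling-DB constraint in the second bullet, a small sketch of bucketing by entity start time; the one-hour roll period and naming scheme are assumptions, not the patch's actual values:
{code}
import java.util.concurrent.TimeUnit;

// Bucket every entity by its start time so all of its events land in one
// rolling leveldb instance; expired buckets are dropped by deleting whole
// database directories instead of record-at-a-time deletes.
class RollingBuckets {
  private static final long ROLL_PERIOD_MS = TimeUnit.HOURS.toMillis(1);

  static long bucketFor(long entityStartTimeMs) {
    return entityStartTimeMs - (entityStartTimeMs % ROLL_PERIOD_MS);
  }

  static String dbNameFor(long entityStartTimeMs) {
    return "entity-" + bucketFor(entityStartTimeMs); // one leveldb dir per bucket
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Reads iterate buckets in time order to stitch an entity back together.
    System.out.println(dbNameFor(now));
  }
}
{code}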
[jira] [Resolved] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-2267. -- Resolution: Won't Fix Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3396: - Affects Version/s: (was: 2.7.0) Labels: newbie (was: ) Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528256#comment-14528256 ] Naganarasimha G R commented on YARN-3543: -
bq. It would be good if it can be done in a different JIRA, since it is a different module. I feel it need not be mixed with this.
Well, as it is only a small change to store it in ATS, and in earlier JIRAs most of the data displayed in the RM web UI was also made available in ATS, I feel it is better to capture the ATS modifications in this JIRA itself... ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can only be done at the time the user submits the job. We should have access to this info from the ApplicationReport as well, so that we can check whether an app is AM managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3423) RM HA setup, Cluster tab links populated with AM hostname instead of RM
[ https://issues.apache.org/jira/browse/YARN-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528268#comment-14528268 ] Junping Du commented on YARN-3423: -- The latest patch LGTM. getResolvedRemoteRMWebAppURLWithoutScheme() makes more sense in the HA case, and it should work in the non-HA case too. [~kasha] and [~xgong], do you think we should replace all uses of getResolvedRMWebAppURLWithoutScheme with getResolvedRemoteRMWebAppURLWithoutScheme for the RM HA case? RM HA setup, Cluster tab links populated with AM hostname instead of RM -- Key: YARN-3423 URL: https://issues.apache.org/jira/browse/YARN-3423 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Environment: centOS-6.x Reporter: Aroop Maliakkal Priority: Minor Attachments: YARN-3423.patch In an RM HA setup (e.g. http://rm-1.vip.abc.com:50030/proxy/application_1427789305393_0002/ ), go to the job details and click on the Cluster tab at the top left. Click on any of the links: About, Applications, Scheduler. You can see that the hyperlink points to http://am-1.vip.abc.com:port/cluster. The default ports for unsecure and secure clusters are given below:
8088 ( DEFAULT_RM_WEBAPP_PORT = 8088 )
8090 ( DEFAULT_RM_WEBAPP_HTTPS_PORT = 8090 )
Ideally, it should point to the ResourceManager hostname instead of the AM hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
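A small sketch of the resolution involved; getResolvedRMWebAppURLWithoutScheme is the existing WebAppUtils helper, and the Remote variant named in the comment is assumed to take the same Configuration argument:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.webapp.util.WebAppUtils;

// Resolve the RM web address from configuration rather than the local host,
// so that proxied "Cluster" links point at the RM and not the AM's host.
public class RmWebUrlDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // In an HA setup, the Remote variant discussed above would resolve the
    // currently active RM's address instead (signature assumed to match).
    String rmWeb = WebAppUtils.getResolvedRMWebAppURLWithoutScheme(conf);
    System.out.println("RM web app address: " + rmWeb);
  }
}
{code}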
[jira] [Commented] (YARN-3460) Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM
[ https://issues.apache.org/jira/browse/YARN-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528277#comment-14528277 ] Junping Du commented on YARN-3460: -- A quick comment: can we use %n instead of \n? The former is platform-independent, while the latter works only on Linux. Test TestSecureRMRegistryOperations failed with IBM_JAVA JVM Key: YARN-3460 URL: https://issues.apache.org/jira/browse/YARN-3460 Project: Hadoop YARN Issue Type: Test Affects Versions: 3.0.0, 2.6.0 Environment:
$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T11:37:52-06:00)
Maven home: /opt/apache-maven-3.2.1
Java version: 1.7.0, vendor: IBM Corporation
Java home: /usr/lib/jvm/ibm-java-ppc64le-71/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 3.10.0-229.ael7b.ppc64le, arch: ppc64le, family: unix
Reporter: pascal oliva Attachments: HADOOP-11810-1.patch TestSecureRMRegistryOperations failed with the IBM Java JVM:
mvn test -X -Dtest=org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations
|| Module || Total || Failure || Error || Skipped ||
| hadoop-yarn-registry | 12 | 0 | 12 | 0 |
| Total | 12 | 0 | 12 | 0 |
with javax.security.auth.login.LoginException: Bad JAAS configuration: unrecognized option: isInitiator and Bad JAAS configuration: unrecognized option: storeKey -- This message was sent by Atlassian JIRA (v6.3.4#6332)
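A quick self-contained illustration of the %n vs \n difference (plain JDK, nothing project-specific):
{code}
// %n is resolved by the formatter to the platform line separator,
// while a literal \n is always a bare LF.
public class LineSepDemo {
  public static void main(String[] args) {
    String withSlashN = String.format("option: isInitiator\n");    // always "\n"
    String withPercentN = String.format("option: isInitiator%n");  // "\r\n" on Windows, "\n" on Unix
    System.out.print(withSlashN);
    System.out.print(withPercentN);
  }
}
{code}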
[jira] [Updated] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3396: --- Attachment: YARN-3396-002.patch Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528279#comment-14528279 ] Brahma Reddy Battula commented on YARN-3396: [~djp] thanks for taking a look at this issue. Updated the patch based on your comments. Kindly review! Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet: {code} //TODO fail? Already translated several times... {code} It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)