[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100316#comment-14100316 ] Wangda Tan commented on YARN-2411: -- Ram, Thanks for updating, LGTM, +1. Wangda [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. -- This message was sent by Atlassian JIRA (v6.2#6252)
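For illustration of the two properties proposed above, here is a minimal sketch of setting them programmatically with the standard Hadoop Configuration API. The property names come from the proposal and the mapping string reuses the first example usage; the snippet itself is hypothetical and not part of the attached patches (in practice the values would live in capacity-scheduler.xml).
{code}
import org.apache.hadoop.conf.Configuration;

public class QueueMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Allow the mapping to override a queue the user explicitly asked for.
    conf.setBoolean("yarn.scheduler.capacity.queue-mappings-override.enable", true);
    // user1 -> queue1, members of group1 -> queue2 (first matching rule wins).
    conf.set("yarn.scheduler.capacity.queue-mappings",
        "u:user1:queue1,g:group1:queue2");
  }
}
{code}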
[jira] [Updated] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1919: - Attachment: YARN-1919.2.patch Refreshed a patch on trunk. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
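A minimal sketch of the kind of guard this issue asks for is shown below: fail fast with a readable message when yarn.resourcemanager.cluster-id is missing, instead of hitting the NPE later in serviceStop. The helper is illustrative only and not taken from the attached patches.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

public class ClusterIdCheck {
  static String getClusterId(Configuration conf) {
    String clusterId = conf.get(YarnConfiguration.RM_CLUSTER_ID);
    if (clusterId == null || clusterId.trim().isEmpty()) {
      // Report a clear configuration error instead of throwing an NPE later on.
      throw new YarnRuntimeException(YarnConfiguration.RM_CLUSTER_ID
          + " is required when HA is enabled");
    }
    return clusterId;
  }
}
{code}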
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100354#comment-14100354 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-trunk-Commit #6084 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6084/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100356#comment-14100356 ] Zhijie Shen commented on YARN-2033: --- [~djp], thanks for your comments. I addressed most of your comments in the new patch, and fixed one bug I found locally. Below are some responses w.r.t. your concerns. bq. Why do we need getApplication(appAttemptId.getApplicationId(), ApplicationReportField.NONE) here? Because I want to check whether the application exists in the timeline store or not, before retrieving the application attempt information. If the application doesn't exist, we need to throw ApplicationNotFoundException. BTW, in YARN-1250, getting the app is going to be required for each API, because we need to check whether the user has access to this application or not. bq. If user's config is slightly wrong (let's assume: YarnConfiguration.APPLICATION_HISTORY_STORE != null, YarnConfiguration.RM_METRICS_PUBLISHER_ENABLED=true), then here we disable yarnMetricsEnabled sliently which make trouble-shooting effort a little harder. Suggest to log warn messages when user's wrong configuration happens. Better to move logic operations inside of if() to a separated method and log the error for wrong configuration. I rethought the backward compatibility, and I think it's not good to rely on checking APPLICATION_HISTORY_STORE, because its default is already the FS-based history store. The users may use this store without explicitly setting it in their config file. Instead, I think it's more reasonable to check APPLICATION_HISTORY_ENABLED to determine whether the user is using the old history store, because it is false by default. Unless the user sets it explicitly in the config file, he's not able to use the old history store. Therefore I changed the logic in YarnClientImpl, ApplicationHistoryServer and YarnMetricsPublisher to rely on APPLICATION_HISTORY_ENABLED for backward compatibility. Per the suggestion, if the old history service stack is used, a warn-level log will be recorded. In addition, when APPLICATION_HISTORY_ENABLED = true, YarnMetricsPublisher cannot be enabled, preventing RMApplicationHistoryWriter and YarnMetricsPublisher from writing the application history simultaneously. bq. The method of convertToApplicationReport seems a little too sophisticated in creating applicationReport. Another option is to wrapper it as Builder pattern (plz refer in MiniDFSCluster) should be better. I agree the builder pattern would be cleaner, but it seems to require changing the Report classes, which currently use newInstance to construct the instance. Let's file a separate Jira to deal with building a big record with quite a few fields. bq. We should replace hadoop.tmp.dir and /yarn/timeline/generic-history with constant string in YarnConfiguration. BTW, hadoop.tmp.dir may not be necessary? This is because conf.get(hadoop.tmp.dir) cannot be determined in advance. bq. For public API (although marked as unstable), adding a new exception will break compatibility of RPC as old version client don't know how to deal with new exception. ApplicationContext is actually not an RPC interface, but is used internally in the server daemons. We previously refactored the code and created this common interface for RM and GHS to source the application/attempt/container report(s) (although RM still pulls the information from RMContext directly), so that we could use the same CLI/webUI/service but hook onto different data sources.
Anyway, the annotations here are misleading, so I deleted them. bq. I am not sure if this change (and other changes in this class) is necessary. If not, we can remove it. I did this intentionally. In fact, I wanted to discard {code} protected int allocatedMB; protected int allocatedVCores; {code} because the history information doesn't include the runtime resource usage information. If we keep the two fields here, in the web services output we will always see allocatedMB=0 and allocatedVCores=0. bq. We already have the same implementation of MultiThreadedDispatcher in RMApplicationHistoryWriter.java. That's right. Again, it's duplicated on purpose. After this patch, I'm going to deprecate the classes of the old generic history read/write layer, including RMApplicationHistoryWriter (YARN-2320), so that in the next big release (e.g. Hadoop 3.0) we can remove the deprecated code. MultiThreadedDispatcher should be a sub-component of YarnMetricsPublisher unless it is going to be used by other components. If that happens, we can promote it to a first-class citizen. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue
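As a rough illustration of the backward-compatibility check discussed in the comment above, a minimal sketch follows. The property name is spelled out as a string literal and is an assumption about how the flag is named; the actual constants and wiring in the patch may differ.
{code}
import org.apache.hadoop.conf.Configuration;

public class HistoryConfigCheck {
  // Returns true when the old generic-history store should stay in charge.
  static boolean useOldHistoryStore(Configuration conf) {
    boolean oldHistoryEnabled = conf.getBoolean(
        "yarn.timeline-service.generic-application-history.enabled", false);  // assumed name
    if (oldHistoryEnabled) {
      // Old store explicitly requested: warn, and keep the timeline-based
      // publisher off so both writers never run at the same time.
      System.err.println("WARN: old application history store is enabled; "
          + "the timeline-based metrics publisher will be disabled");
    }
    return oldHistoryEnabled;
  }
}
{code}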
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.6.patch Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.3.patch Refreshed the v2 patch on trunk. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
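For context on what such a benchmark measures, a minimal hypothetical sketch is shown below: it simply times a single RMStateStore#loadState call against a pre-populated store. The class and wiring are invented for illustration and do not reflect the attached patch.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;

public class LoadStateBenchmark {
  // Measures how long one loadState() call takes, in milliseconds.
  static long timeLoadState(RMStateStore store) throws Exception {
    long start = System.currentTimeMillis();
    store.loadState();
    return System.currentTimeMillis() - start;
  }
}
{code}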
[jira] [Resolved] (YARN-1348) Batching optimization for ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-1348. -- Resolution: Fixed Batching optimization for ZKRMStateStore Key: YARN-1348 URL: https://issues.apache.org/jira/browse/YARN-1348 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha We rethought the znode structure in YARN-1307. We can reduce the number of znodes for DelegationKey and DelegationToken by batching their store operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1326: - Affects Version/s: 2.5.0 RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there are no way to know which RMStore RM uses. It's useful to log the information at RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1753) FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read()
[ https://issues.apache.org/jira/browse/YARN-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1753: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read() -- Key: YARN-1753 URL: https://issues.apache.org/jira/browse/YARN-1753 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Priority: Minor Attachments: YARN-1753.patch Here is related code: {code} byte[] value = new byte[entry.getValueLength()]; dis.read(value); {code} entry.getValueLength() bytes are expected to be read. The return value from dis.read() should be checked against value length. -- This message was sent by Atlassian JIRA (v6.2#6252)
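A conventional way to satisfy this check, sketched below, is to use readFully, which either fills the buffer completely or throws EOFException. This is a generic illustration, not necessarily what the attached patch does.
{code}
import java.io.DataInputStream;
import java.io.IOException;

public class ReadEntryValue {
  static byte[] readValue(DataInputStream dis, int valueLength) throws IOException {
    byte[] value = new byte[valueLength];
    // readFully reads exactly valueLength bytes or throws EOFException,
    // unlike read(), whose return value may be less than the buffer size.
    dis.readFully(value);
    return value;
  }
}
{code}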
[jira] [Commented] (YARN-1348) Batching optimization for ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100394#comment-14100394 ] Tsuyoshi OZAWA commented on YARN-1348: -- This ticket looks obsolete and has already been implemented. Closing this as resolved. Batching optimization for ZKRMStateStore Key: YARN-1348 URL: https://issues.apache.org/jira/browse/YARN-1348 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha We rethought the znode structure in YARN-1307. We can reduce the number of znodes for DelegationKey and DelegationToken by batching their store operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
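For reference, batching znode writes of this kind is typically done with ZooKeeper's multi() API. The sketch below is a hypothetical illustration (paths and data are placeholders) rather than the actual ZKRMStateStore implementation.
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BatchedZnodeStore {
  // Creates several znodes in one round trip instead of one call per znode.
  static void storeBatch(ZooKeeper zk, List<String> paths, byte[] data) throws Exception {
    List<Op> ops = new ArrayList<Op>();
    for (String path : paths) {
      ops.add(Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
    }
    zk.multi(ops);  // all creates succeed or fail atomically
  }
}
{code}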
[jira] [Commented] (YARN-1753) FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read()
[ https://issues.apache.org/jira/browse/YARN-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100404#comment-14100404 ] Hadoop QA commented on YARN-1753: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662438/YARN-1753.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4659//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4659//console This message is automatically generated. FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read() -- Key: YARN-1753 URL: https://issues.apache.org/jira/browse/YARN-1753 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Priority: Minor Attachments: YARN-1753.patch Here is related code: {code} byte[] value = new byte[entry.getValueLength()]; dis.read(value); {code} entry.getValueLength() bytes are expected to be read. The return value from dis.read() should be checked against value length. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100406#comment-14100406 ] Hadoop QA commented on YARN-1919: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662434/YARN-1919.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4658//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4658//console This message is automatically generated. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100442#comment-14100442 ] Hadoop QA commented on YARN-1514: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662441/YARN-1514.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4661//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4661//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.4.patch Forgot to add YarnTestDriver.java. This patch includes it. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100444#comment-14100444 ] Tsuyoshi OZAWA commented on YARN-1326: -- Thanks for your review, Karthik and Vinod. I'll update it. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there are no way to know which RMStore RM uses. It's useful to log the information at RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100450#comment-14100450 ] Hadoop QA commented on YARN-2033: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662437/YARN-2033.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.client.TestRMFailover org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4660//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4660//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.4.patch Fixed the test failure. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100499#comment-14100499 ] Hadoop QA commented on YARN-1514: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662456/YARN-1514.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4662//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4662//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100526#comment-14100526 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-Yarn-trunk #650 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/650/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2425) When an application is submitted via the YARN RM web services, log aggregation does not happen
Karam Singh created YARN-2425: - Summary: When Application submitted by via Yarn RM WS, log aggregation does not happens Key: YARN-2425 URL: https://issues.apache.org/jira/browse/YARN-2425 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.0, 2.6.0 Environment: Secure (Kerberos enabled) hadoop cluster. With SPNEGO for Yarn RM enabled Reporter: Karam Singh When submit App to Yarn RM using Web service we need to pass credentials/tokens in json object/xml object to webservice As HDFS namenode does not provides any DT over WS (base64 encoded) like webhdfs/timeline server does. (HDFS fetch dt commad fetch java writable object and writes it to target file, we we cannot forward via application Submission WS objects) Looks like there is not way to pass HDFS token to NodeManager. While starting Application container also tries to create Application log aggregation dir and fails with following type exception {code} java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: hostname/ip; destination host is: NameNodeHost:FSPort; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:725) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy35.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1781) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1069) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1065) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1065) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:240) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:64) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:253) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:344) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:310) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:421) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:64) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:679) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100593#comment-14100593 ] Junping Du commented on YARN-160: - Thanks [~vvasudev] for working on this. I just took a quick glance; a few comments: - The old way of configuring NM resources is still useful, especially when there are other agents running (like an HBase RegionServer). Thus, users need the flexibility to calculate resources themselves in some cases, so we should provide a new option instead of removing the old way completely. - Given this is a new feature, we shouldn't change a cluster's behavior with the old configuration from an upgrade perspective. We should keep the previous configuration working as usual, especially when users use the default settings. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2426) NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing
Karam Singh created YARN-2426: - Summary: NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing Key: YARN-2426 URL: https://issues.apache.org/jira/browse/YARN-2426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.6.0 Environment: Hadoop Kerberos (secure) cluster with LinuxContainerExecutor enabled, with SPNEGO on for the new YARN RM web services for application submission. While using kinit we pass -C (to specify the cache path), and before executing we set export KRB5CCNAME to the path provided with -C, so there is no Kerberos ticket in the default KRB5 cache path, which is /tmp. Reporter: Karam Singh Encountered this issue while using YARN's new RM WS for application submission, on a single-node cluster, while submitting a Distributed Shell application using the RM WS (web service). For this we need to pass the custom script and the AppMaster jar along with the webhdfs token to the NodeManager for localization. The Distributed Shell application was failing because the node was failing to localize the AppMaster jar. Following is the NM log while localizing the AppMaster jar: {code} 2014-08-18 01:53:52,434 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 01:53:52,757 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARpPATH, 1408352019488, FILE, null }, Authentication required 2014-08-18 01:53:52,758 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARPATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408351986532_0001/filecache/10/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED 2014-08-18 01:53:52,758 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1408351986532_0001_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED {code} This is similar to what we get when we try to access webhdfs in a secure (Kerberos) cluster without doing kinit. Whereas if we do curl -i -k -s 'http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH?op=listStatusdelegation=same webhdfs token used in app submission structure, it works properly. I also tried using http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/hadoopqa/JAR_PATH in the app submission object instead of the webhdfs:// URI format. Then the NodeManager fails to localize because there is no FileSystem for the http scheme {code} 14-08-18 02:03:31,343 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache. hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 02:03:31,583 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH 1408352576841, FILE, null }, No FileSystem for scheme: http 2014-08-18 02:03:31,583 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408352544163_0002/filecache/11/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED {code} Now do kinit without providing the -C option for the KRB5 cache path.
So the ticket goes to the default KRB5 cache, /tmp. Again submit the same application object to the YARN WS, with webhdfs:// URI format paths and the webhdfs token. This time the NM is able to download the jar and the custom shell script, and the application runs fine. It looks like the following is happening: webhdfs looks for a Kerberos ticket on the NM while localizing. 1. In the first case there was no Kerberos ticket in the default cache, so the application failed while localizing the AppMaster jar. 2. In the second case, kinit had already been done and a Kerberos ticket was present in /tmp (the default KRB5 cache), so the AppMaster got localized successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100621#comment-14100621 ] Varun Vasudev commented on YARN-160: [~djp] {quote} The old way to configure resource of NM is still useful, especially when there are other agents running (like: HBase RegionServer). Thus, user need flexibility to calculate resource themselves in some cases, so we should provide another new option instead of removing old way completely. {quote} The patch supports the old way. If a user has set values for memory and vcores, they're used without looking at the underlying hardware. I've added test cases to verify that behaviour as well. Have I missed a use case? {quote} Given this is a new feature, we shouldn't change cluster's behavior with old configuration in upgrade prospective. We should keep previous configuration work as usual especially when user use some default settings. {quote} There are two scenarios here - 1. A configuration file with custom settings for memory and cpu - nothing will change for these users. 2. A configuration file with no settings for memory and cpu - in this case, the memory and cpu resources will be calculated based on the underlying hardware instead of them being set to 8192 and 8 respectively. Isn't calculating the values from the hardware a better option? If people feel strongly about sticking to 8192 and 8, I don't have any problems changing them but it seems a bit odd. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
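A rough sketch of the fallback behaviour described in scenario 2 above follows; the helper is hypothetical, and the detected values are assumed to come from whatever hardware probe the patch introduces.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeResourceDefaults {
  // Use the configured value when present; otherwise fall back to a
  // hardware-derived value supplied by the caller.
  static int memoryMb(Configuration conf, int detectedMemoryMb) {
    int configured = conf.getInt("yarn.nodemanager.resource.memory-mb", -1);
    return configured > 0 ? configured : detectedMemoryMb;
  }

  static int vcores(Configuration conf, int detectedVcores) {
    int configured = conf.getInt("yarn.nodemanager.resource.cpu-vcores", -1);
    return configured > 0 ? configured : detectedVcores;
  }
}
{code}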
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100623#comment-14100623 ] Junping Du commented on YARN-2033: -- bq. Because I want to check whether the application exists in the timeline store or not, before retrieving the application attempt information. If the application doesn't exist, we need to throw ApplicationNotFoundException. IMO, this is not necessary, as the application should exist in most cases and we don't need to visit LevelDB twice. If the application doesn't exist, we can throw ApplicationNotFoundException when retrieving the app attempt info, can't we? bq. I rethought the backward compatibility, and I think it's not good to rely on checking APPLICATION_HISTORY_STORE, because its default is already the FS-based history store. The users may use this store without explicitly setting it in their config file. Instead, I think it's more reasonable to check APPLICATION_HISTORY_ENABLED to determine whether the user is using the old history store, because it is false by default. Backward compatibility is only one concern I had. Another concern here is the usability of these (old and new) configurations. I just listed one possible wrong configuration above, but didn't want to judge which wrong configuration is more likely to happen. The point is that we should check the combination of related configurations and warn on all wrong combinations. Any concern about doing this? bq. This is because conf.get(hadoop.tmp.dir) cannot be determined in advance. I meant to define hadoop.tmp.dir in YarnConfiguration as something like HADOOP_TMP_DIR, which sounds more uniform when dealing with config. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2425) When an application is submitted via the YARN RM web services, log aggregation does not happen
[ https://issues.apache.org/jira/browse/YARN-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-2425: --- Assignee: Varun Vasudev When Application submitted by via Yarn RM WS, log aggregation does not happens -- Key: YARN-2425 URL: https://issues.apache.org/jira/browse/YARN-2425 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.0, 2.6.0 Environment: Secure (Kerberos enabled) hadoop cluster. With SPNEGO for Yarn RM enabled Reporter: Karam Singh Assignee: Varun Vasudev When submit App to Yarn RM using Web service we need to pass credentials/tokens in json object/xml object to webservice As HDFS namenode does not provides any DT over WS (base64 encoded) like webhdfs/timeline server does. (HDFS fetch dt commad fetch java writable object and writes it to target file, we we cannot forward via application Submission WS objects) Looks like there is not way to pass HDFS token to NodeManager. While starting Application container also tries to create Application log aggregation dir and fails with following type exception {code} java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: hostname/ip; destination host is: NameNodeHost:FSPort; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:725) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy35.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1781) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1069) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1065) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1065) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:240) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:64) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:253) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:344) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:310) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:421) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:64) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at
[jira] [Assigned] (YARN-2426) NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing
[ https://issues.apache.org/jira/browse/YARN-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-2426: --- Assignee: Varun Vasudev NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing --- Key: YARN-2426 URL: https://issues.apache.org/jira/browse/YARN-2426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.6.0 Environment: Hadoop Kerberos (secure) cluster with LinuxContainerExecutor enabled. With SPNEGO on for the new Yarn RM web services for application submission. While using kinit we are using -C (to specify the cache path), then executing export KRB5CCNAME = path provided with the -C option. There is no Kerberos ticket in the default KRB5 cache path, which is /tmp Reporter: Karam Singh Assignee: Varun Vasudev Encountered this issue while using the new YARN RM WS for application submission, on a single-node cluster while submitting a Distributed Shell application using the RM WS (webservice). For this we need to pass the custom script and AppMaster jar, along with the webhdfs token, to the NodeManager for localization. The Distributed Shell application was failing as the node was failing to localize the AppMaster jar. Following is the NM log while localizing the AppMaster jar: {code} 2014-08-18 01:53:52,434 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 01:53:52,757 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARpPATH, 1408352019488, FILE, null }, Authentication required 2014-08-18 01:53:52,758 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARPATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408351986532_0001/filecache/10/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED 2014-08-18 01:53:52,758 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1408351986532_0001_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED {code} This is similar to what we get when we try to access webhdfs in a secure (Kerberos) cluster without doing kinit. Whereas if we do curl -i -k -s 'http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH?op=listStatus&delegation=<the same webhdfs token used in the app submission structure>' it works properly. I also tried using http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/hadoopqa/JAR_PATH in the app submission object instead of the webhdfs:// URI format; then the NodeManager fails to localize as there is no http filesystem scheme {code} 14-08-18 02:03:31,343 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 02:03:31,583 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH 1408352576841, FILE, null }, No FileSystem for scheme: http 2014-08-18 02:03:31,583 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408352544163_0002/filecache/11/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED {code} Now do kinit without providing the -C option for the KRB5 cache path, so the ticket goes to the default KRB5 cache /tmp. Again submit the same application object to the Yarn WS, with webhdfs:// URI format paths and the webhdfs token. This time the NM is able to download the jar and the custom shell script, and the application runs fine. Looks like the following is happening: webhdfs tries to look for a Kerberos ticket on the NM while localizing. 1. In the first case there was no Kerberos ticket in the default cache, so the application failed while localizing the AppMaster jar. 2. In the second case kinit had already been done and a Kerberos ticket was present in /tmp (the default KRB5 cache), so the AppMaster got localized successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li reassigned YARN-2421: -- Assignee: chang li CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: chang li I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100671#comment-14100671 ] Hadoop QA commented on YARN-160: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662475/apache-yarn-160.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-gridmix hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4664//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4664//console This message is automatically generated. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100680#comment-14100680 ] Junping Du commented on YARN-160: - bq. The patch supports the old way. Thanks for the clarification. Yes, I saw the details of getYARNContainerMemoryMB(), which sounds like it honors the previous NM resource configuration. bq. Isn't calculating the values from the hardware a better option? Agree. But if the calculated result is not reasonable (like 0 or a negative value), shall we use the previous NM default value instead? At least, experienced users (especially those testing) already had some expectations even when they don't set any resource value here. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
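A minimal sketch of the fallback behaviour discussed in the comment above, assuming a hypothetical helper that derives the NM memory from the probed hardware value and falls back to the configured default when the probe is unusable; the method and constant names are illustrative, not the ones in the patch.
{code}
public class NodeMemoryProbeSketch {
  // Matches the historical default of yarn.nodemanager.resource.memory-mb (8 GB).
  static final int DEFAULT_NM_MEMORY_MB = 8192;

  /** Memory to advertise to the RM, in MB. */
  static int yarnContainerMemoryMb(long probedPhysicalMemoryBytes, int reservedMb) {
    long probedMb = probedPhysicalMemoryBytes / (1024 * 1024) - reservedMb;
    // If the hardware probe yields 0 or a negative value, keep the configured default
    // so existing setups (and test clusters) behave as before.
    return probedMb <= 0 ? DEFAULT_NM_MEMORY_MB : (int) Math.min(probedMb, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    System.out.println(yarnContainerMemoryMb(64L * 1024 * 1024 * 1024, 2048)); // 63488
    System.out.println(yarnContainerMemoryMb(0, 2048));                        // 8192 (fallback)
  }
}
{code}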
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100724#comment-14100724 ] Sunil G commented on YARN-2390: --- Hi [~zjshen] bq. is the right fix to be correcting the ACLs on RM side? +1. Yes, I also feel it will be better if we remove the ACL checks on the RM side for those apps which are completed. If the rmApp state is not *FinalApplicationStatus.UNDEFINED*, such applications must have moved to FAILED/SUCCEEDED/KILLED, and queue ACLs for such applications need not be checked. *ClientRMService#checkAccess* can be modified with this change. If this approach is fine, I would like to take over this JIRA. Kindly let me know your suggestion. Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
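As an illustration of the check described in the comment above, a minimal standalone sketch (the enum and class here are local stand-ins, not the actual ClientRMService code):
{code}
// Illustrative sketch of the proposal: skip the queue-ACL check once an application
// has reached a terminal final status, since it is no longer in any queue.
enum FinalApplicationStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }

class QueueAclCheckSketch {
  static boolean checkAccess(boolean queueAclAllows, FinalApplicationStatus finalStatus) {
    if (finalStatus != FinalApplicationStatus.UNDEFINED) {
      // Completed apps are no longer subject to queue ACLs.
      return true;
    }
    // Running apps still go through the regular queue ACL check.
    return queueAclAllows;
  }
}
{code}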
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100749#comment-14100749 ] Karthik Kambatla commented on YARN-415: --- [~eepayne] - Sorry again for coming in so late. I am not completely sure myself (yet) how we can use the timeline server or if it makes sense to do that. I guess I need to first understand what we are trying to accomplish here. Could you please correct me/comment on the following items. # The goal is to capture memory utilization at the app-level for chargeback. I like the goal, but would like to understand the usecases we have in mind. Is the chargeback simply to track the usage and may be financially charge the users. Or, is to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? # Do we want to charge the app only for the resources used to do meaningful work or do we also want to include failed/preempted containers? If we don't charge the app for failed containers, who are they charged to? Are we okay with letting some resources go uncharged? # How soon do we want this usage information? It might make sense to collect/expose this once the app is finished for certain kinds of applications. What is our story for long-running applications? As Jian suggested, I would be up for getting in those parts that we are clear about and file follow-up JIRAs for those that need more discussion. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
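For concreteness, a small worked example of the MB-seconds formula in the YARN-415 description above (container sizes and lifetimes are made up):
{code}
// Hypothetical numbers illustrating:
//   sum over containers of (reserved memory in MB) * (container lifetime in seconds)
public class MemorySecondsExample {
  public static void main(String[] args) {
    long[][] containers = {
        {2048, 600},   // AM container: 2 GB reserved for 10 minutes
        {1024, 300},   // task container: 1 GB reserved for 5 minutes
        {1024, 120}    // failed task container: 1 GB for 2 minutes (still charged)
    };
    long mbSeconds = 0;
    for (long[] c : containers) {
      mbSeconds += c[0] * c[1];
    }
    System.out.println(mbSeconds + " MB-seconds"); // 1658880
  }
}
{code}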
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100763#comment-14100763 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1841 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1841/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
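As a concrete illustration of the queue-mapping syntax quoted in the YARN-2411 messages above, a sketch of setting the two new properties on a Hadoop Configuration (the user, group, and queue names are made up):
{code}
import org.apache.hadoop.conf.Configuration;

public class QueueMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Map user1 to queue1, members of group1 to queue2, and everyone else to a
    // queue named after the submitting user; mappings are evaluated left to right
    // and the first valid one wins.
    conf.set("yarn.scheduler.capacity.queue-mappings",
        "u:user1:queue1,g:group1:queue2,u:%user:%user");
    // Allow the mapping to override a queue the user explicitly specified.
    conf.setBoolean("yarn.scheduler.capacity.queue-mappings-override.enable", true);
    System.out.println(conf.get("yarn.scheduler.capacity.queue-mappings"));
  }
}
{code}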
[jira] [Created] (YARN-2427) Add support for moving apps between queues in RM web services
Varun Vasudev created YARN-2427: --- Summary: Add support for moving apps between queues in RM web services Key: YARN-2427 URL: https://issues.apache.org/jira/browse/YARN-2427 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Support for moving apps from one queue to another is now present in CapacityScheduler and FairScheduler. We should expose the functionality via RM web services as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100822#comment-14100822 ] Hudson commented on YARN-2411: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1867 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1867/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100873#comment-14100873 ] Zhijie Shen commented on YARN-2033: --- The test failures seem to be related to RM HA. bq. IMO, this is not necessary as application should exist in most cases and we don't need to visit LevelDB twice. If application doesn't exist, we can throw ApplicationNotFoundException in retrieving app attempt info. Isn't it? First of all, the two queries are not duplicates: one reads the application entity and the other reads the app attempt entity, and we previously distinguished ApplicationNotFoundException from ApplicationAttemptNotFoundException. It is always possible that App1 exists in the store with the only attempt AppAttempt1 while the user looks up AppAttempt2. In this case, we know App1 is there, but AppAttempt2 isn't, so we will throw ApplicationAttemptNotFoundException. Moreover, when we go on with generic history ACLs, we will anyway visit the app entity once to pull the user info for the access check. bq. The point here is we should check on the combination of related configurations and make all wrong combinations get warned. Any concern on doing this? Right, so in the new patch I've enhanced the configuration check logic to make sure either the old or the new history service stack will be used, but not both. However, I don't cover misconfiguration within the scope of the old history service stack itself, for example, a null ApplicationHistoryStore while the history service is enabled. That didn't work in the previous situation either. bq. I mean to define hadoop.tmp.dir in YarnConfiguration to be something like: HADOOP_TMP_DIR which sounds more uniform when dealing with config. hadoop.tmp.dir shouldn't be part of YarnConfiguration. If it really needs to be added, it should be placed in CommonConfigurationKeys. However, I'm afraid it's not a good idea to do that, either. Let's look into its default. {code} <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> {code} The default comes with a parameter, which cannot be determined upfront either. AFAIK, such defaults are not contained in config classes. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100874#comment-14100874 ] Sunil G commented on YARN-2310: --- YARN-1867 has added queue ACL checks, and hasAccess is already invoked by the getApp and getApps APIs. If queue ACL access is available, then information of an application such as *start/finished/elapsed time* and *AM container information* will be filled into the AppInfo object. Do you mean some more extra information is taken from the customized yarn filter added in YARN-2247? Could you please give some more insight? Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100880#comment-14100880 ] Zhijie Shen commented on YARN-2390: --- [~sunilg], please feel free to assign the ticket to yourself. bq. If the rmApp state is not FinalApplicationStatus.UNDEFINED, Is this check necessary? The application can do unregistration without specifying a FinalApplicationStatus. I'm not sure whether the RM will conclude a FinalApplicationStatus on behalf of the app. Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.5.patch Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover. Therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100883#comment-14100883 ] Zhijie Shen commented on YARN-2310: --- Thanks for notifying me of that. Would you please check the other app-related getter methods? For example, getAppAttempts. It seems that they can be accessed without any access control. Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-2190: Attachment: YARN-2190.4.patch Attach a new patch to address the audit warning. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2390: - Assignee: Sunil G Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Sunil G According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100949#comment-14100949 ] Sunil G commented on YARN-2390: --- Thank you, [~zjshen]. I have checked *RMAppImpl#getFinalApplicationStatus*. If *currentAttempt.getFinalApplicationStatus()* is null (cases where the AM has unregistered without specifying the final status), then the final status is created by the RM (by calling *RMAppImpl#createFinalApplicationStatus()*). How do you feel about this? Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Sunil G According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100963#comment-14100963 ] Sunil G commented on YARN-2310: --- Yes, getAppAttempts and getAppState could also fall under this ACL check. The only problem is that *getAppAttempts* does not have an HttpServletRequest hsr @Context parameter. {code} public AppAttemptsInfo getAppAttempts(@PathParam("appid") String appId){code} Hence, getting the UGI information is not possible for the getAppAttempts API without an HttpServletRequest. Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
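A rough sketch of the signature change being discussed above: inject the servlet request with @Context so the caller's UGI can be derived, mirroring what getApp/getApps already do. This is a fragment for illustration only; the UGI/ACL helper calls are assumptions, not the exact RMWebServices methods.
{code}
public AppAttemptsInfo getAppAttempts(@Context HttpServletRequest hsr,
    @PathParam("appid") String appId) {
  // With the request available, the caller's identity can be resolved...
  UserGroupInformation callerUGI = getCallerUserGroupInformation(hsr, true);
  // ...and the response can be gated on hasAccess(app, callerUGI), the same check
  // getApp/getApps already perform, before building the AppAttemptsInfo response.
}
{code}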
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101024#comment-14101024 ] Jason Lowe commented on YARN-2034: -- The description looks OK, but the whitespace formatting of the other entries for this property was (inadvertently?) changed and the entry is now inconsistently indented. Could you please update the patch so that just the description line is modified? Thanks! Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101034#comment-14101034 ] Subramaniam Krishnan commented on YARN-2385: [~sunilg], [~leftnoteasy], [~zjshen] I suggest we either open a new JIRA to discuss splitting of getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue or update the current JIRA to reflect the discussion? Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2385: --- Assignee: (was: Karthik Kambatla) Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101045#comment-14101045 ] zhihai xu commented on YARN-2315: - Karthik, thanks for the review. I will implement a test case. Also, setCurrentCapacity should be getResourceUsage().getMemory()/getFairShare().getMemory() (current capacity is the percentage of your fair share that is used). I will make this change as well. Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In function getQueueInfo of FSQueue.java, we call setCapacity twice with different parameters, so the first call is overridden by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
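A sketch of what the corrected FSQueue#getQueueInfo fragment could look like, based on the snippet quoted in this issue rather than the final patch; the divide-by-zero guard is an addition here, not something from the discussion.
{code}
// Keep the fair share as the queue's "capacity"...
queueInfo.setCapacity((float) getFairShare().getMemory()
    / scheduler.getClusterResource().getMemory());
// ...and report usage relative to the fair share as "currentCapacity".
if (getFairShare().getMemory() == 0) {
  queueInfo.setCurrentCapacity(0.0f);
} else {
  queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory()
      / getFairShare().getMemory());
}
{code}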
[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101097#comment-14101097 ] Ivan Mitic commented on YARN-2190: -- Thanks Chuan for the new patch. I have a few minor comments left: 1. {code} jcrci.CpuRate = max(1, vcores * 1 / sysinfo.dwNumberOfProcessors); {code} Did you want {{min}} here? 2. {{vcores * 1 / sysinfo.dwNumberOfProcessors}} Can you please add braces to signify that multiplication should be done before division? I think this is correct but I personally think it is better to be explicit. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-2190: Attachment: YARN-2190.5.patch Attach a patch addressing latest comments. Thanks for review! Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2034: -- Attachment: YARN-2034-2.patch Thank you for reviewing this, [~jlowe]. Patch updated. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2395) Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2395: - Assignee: Wei Yan Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Currently in fair scheduler, the preemption logic considers fair share starvation only at the leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, so that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2394: - Assignee: Wei Yan Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2415) Expose MiniYARNCluster for use outside of YARN
[ https://issues.apache.org/jira/browse/YARN-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-2415: -- Assignee: Wei Yan (was: Karthik Kambatla) Wei is looking into this. Expose MiniYARNCluster for use outside of YARN -- Key: YARN-2415 URL: https://issues.apache.org/jira/browse/YARN-2415 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 2.5.0 Reporter: Hari Shreedharan Assignee: Wei Yan The MR/HDFS equivalents are available for applications to use in tests, but the YARN Mini cluster is not. It would be really useful to test applications that are written to run on YARN (like Spark) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101250#comment-14101250 ] Zhijie Shen commented on YARN-2249: --- 1. Do the following in AbstractYarnScheduler.serviceInit? {code} +super.nmExpireInterval = +conf.getInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, + YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS); {code} {code} +createReleaseCache(); {code} 2. Add RM_NM_EXPIRY_INTERVAL_MS in yarn-default.xml? 3. Not sure it's going to be an efficient data structure. Different apps' containers should not affect each other, right? A mutex on the whole collection seems too coarse a granularity (it blocks the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and give each app a separate mutex? {code} + private Set<ContainerId> pendingRelease = null; + private final Object mutex = new Object(); {code} AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
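To make point 3 above concrete, a standalone sketch of per-attempt pending-release bookkeeping, so that synchronization is scoped to one application attempt rather than one global collection; the type and method names are illustrative, not from the patch.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PendingReleaseSketch<AttemptId, ContainerId> {
  // One pending-release set per application attempt, so attempts do not block each other.
  private final Map<AttemptId, Set<ContainerId>> pendingRelease = new ConcurrentHashMap<>();

  /** Record containers the AM asked to release before the scheduler has recovered them. */
  public void addPending(AttemptId attempt, Set<ContainerId> containers) {
    pendingRelease.computeIfAbsent(attempt, a -> ConcurrentHashMap.<ContainerId>newKeySet())
        .addAll(containers);
  }

  /** Called when a container is recovered: release it immediately if the AM already asked. */
  public boolean shouldReleaseOnRecovery(AttemptId attempt, ContainerId container) {
    Set<ContainerId> pending = pendingRelease.get(attempt);
    return pending != null && pending.remove(container);
  }

  /** Drop the bookkeeping for a finished attempt. */
  public void clearAttempt(AttemptId attempt) {
    pendingRelease.remove(attempt);
  }
}
{code}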
[jira] [Commented] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101260#comment-14101260 ] Wei Yan commented on YARN-2394: --- I'll look into this. Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2422) yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101274#comment-14101274 ] Sandy Ryza commented on YARN-2422: -- I think it's weird to have a nodemanager property impact what goes on in the ResourceManager. Using this property would be especially weird on heterogeneous clusters where resources vary from node to node. Preferable would be to, independently of yarn.scheduler.maximum-allocation-mb, make the ResourceManager reject any requests that are larger than the largest node in the cluster. And then default yarn.scheduler.maximum-allocation-mb to infinite. yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml - Key: YARN-2422 URL: https://issues.apache.org/jira/browse/YARN-2422 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.6.0 Reporter: Gopal V Priority: Minor Attachments: YARN-2422.1.patch A cluster with a 40Gb NM refuses to run containers larger than 8Gb. It was finally tracked down to yarn-default.xml hard-coding it to 8Gb. In the absence of a better override, it should default to ${yarn.nodemanager.resource.memory-mb} instead of a hard-coded 8Gb. -- This message was sent by Atlassian JIRA (v6.2#6252)
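To make the suggested alternative concrete, a small standalone sketch of the kind of validation being described, assuming the RM knows the size of the largest registered node; names and numbers are illustrative only.
{code}
public class RequestValidationSketch {
  // Reject requests that no node in the cluster could ever satisfy, instead of
  // relying on a hard-coded yarn.scheduler.maximum-allocation-mb.
  static void validateRequest(int requestedMb, int largestNodeMb) {
    if (requestedMb > largestNodeMb) {
      throw new IllegalArgumentException("Requested memory " + requestedMb
          + " MB exceeds the largest node in the cluster (" + largestNodeMb + " MB)");
    }
  }

  public static void main(String[] args) {
    validateRequest(8192, 40960);   // fine on a cluster with a 40 GB node
    validateRequest(65536, 40960);  // throws: no node can ever run this container
  }
}
{code}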
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201408181938.txt [~jianhe], thank you for your continuing reviews and comments. {quote} Particularly in work-preserving AM restart, current AM is actually the one who's managing previous running containers. Running containers in scheduler are already transferred to the current AM. So running containers metrics are transferred as well. I think it'll be confusing if finished containers are still charged back against the previous dead attempt. Btw, YARN-1809 will add the attempt web page where we could show attempt-specific metrics also. {quote} You are correct. In the work-preserving AM restart case, the live containers are transferred to the new attempt for the remaining lifetime of the container, and then when the container completes, the original attempt gets the CONTAINER_FINISHED event. But I see your point about being consistent in the work-preserving AM restart case. So, I have attached a patch which will charge container usage to the current attempt, whether the container is running or completed. {quote} Regarding the problem of metrics persistency. Agree that it doesn't solve the problem for running apps in general. Maybe we can have the state store changes in a separate jira and discuss more there, so that we can get this in first. {quote} Yes, I would appreciate it if we could continue this discussion on a separate JIRA. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101341#comment-14101341 ] Eric Payne commented on YARN-415: - [~kkambatl], thank you for taking the time to review this patch. I would like to see if [~kthrapp] could comment on your use case questions, but here are my initial thoughts: {quote} 1. Is the chargeback simply to track the usage and may be financially charge the users. Or, is to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? {quote} Potential goals could be: # report (and charge for) grid usage # eventually limit job submission based on a users' budget {quote} 2. Do we want to charge the app only for the resources used to do meaningful work or do we also want to include failed/preempted containers? If we don't charge the app for failed containers, who are they charged to? Are we okay with letting some resources go uncharged? {quote} This implementation does charge the app for failed containers. This was debated somewhat previously in this JIRA, because if the failure was due to preemption or a bug that wasn't the app's fault, it may be unfair to charge the app for those. However, it is very unclear how one could programmatically determine whose fault the failure is. {quote} 3. How soon do we want this usage information? It might make sense to collect/expose this once the app is finished for certain kinds of applications. What is our story for long-running applications? {quote} There is a specific use case for determine the usage at runtime. Again, I would hope that [~kthrapp] could elaborate on this. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. 
We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
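To make the proposed metric concrete, here is a minimal sketch of how MB-seconds could be accumulated per application from reserved memory and container lifetimes. It is purely illustrative; the class and method names are invented and this is not the RM's actual accounting code.
{code:java}
// Illustrative accumulator for the proposed MB-seconds chargeback metric.
// "reservedMb" and the start/finish timestamps stand in for the memory reserved
// for a container and the interval during which it held that reservation.
public class MemorySecondsAccumulator {

  private long memorySeconds = 0;

  /** Charge one container: reserved MB multiplied by its lifetime in seconds. */
  public void chargeContainer(long reservedMb, long startTimeMs, long finishTimeMs) {
    long lifetimeSeconds = Math.max(0, (finishTimeMs - startTimeMs) / 1000);
    memorySeconds += reservedMb * lifetimeSeconds;
  }

  /** Total MB-seconds charged to the application so far. */
  public long getMemorySeconds() {
    return memorySeconds;
  }
}
{code}
Summing chargeContainer over all of an application's containers yields exactly the expression in the description: reserved RAM times lifetime, per container.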
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101381#comment-14101381 ] Hadoop QA commented on YARN-1514: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662518/YARN-1514.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4666//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4666//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover. Therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
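For readers outside the RM codebase, the core of such a benchmark is simply timing loadState against a populated store. A rough sketch, assuming a ZooKeeper ensemble reachable at localhost:2181 that already holds RM state under the configured znode (this is not the attached utility itself), might look like:
{code:java}
// Rough, illustrative timing of ZKRMStateStore#loadState; not the patch's utility.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore;

public class LoadStateBenchmark {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Assumption: a local ZooKeeper ensemble holding previously stored RM state.
    conf.set("yarn.resourcemanager.zk-address", "localhost:2181");

    ZKRMStateStore store = new ZKRMStateStore();
    store.init(conf);
    store.start();

    long start = System.nanoTime();
    store.loadState();  // the call whose latency drives RM-HA failover time
    long elapsedMs = (System.nanoTime() - start) / 1000000;
    System.out.println("loadState took " + elapsedMs + " ms");

    store.stop();
  }
}
{code}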
[jira] [Updated] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Attachment: YARN-2249.5.patch AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101443#comment-14101443 ] Jian He commented on YARN-2249: --- Thanks Zhijie for the review! bq. Do the following in AbstractYarnScheduler.serviceInit? fixed. bq. Add RM_NM_EXPIRY_INTERVAL_MS in yarn-default.xml? It is already present. bq. Not sure it's going to be an efficient data structure. Different apps' containers should not affect each other, right? A mutex on the whole collection seems to be too coarse a granularity (blocking the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and make each app have a separate mutex? I moved the pendingReleases to SchedulerApplicationAttempt and lock the attempt object instead. AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
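A bare-bones sketch of that per-attempt bookkeeping (names are illustrative, not the exact code in YARN-2249.5.patch): pending release requests are remembered on the attempt and consumed when the container is recovered from the NM report.
{code:java}
// Illustrative per-attempt buffer for release requests that arrive before the
// corresponding container has been recovered; not the actual patch.
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;

public class AttemptPendingReleases {

  // Containers the AM asked to release that the scheduler has not recovered yet.
  private final Set<ContainerId> pendingRelease = new HashSet<ContainerId>();

  /** Called when the AM requests release of a container unknown to the scheduler. */
  public synchronized void recordPendingRelease(ContainerId containerId) {
    pendingRelease.add(containerId);
  }

  /** Called when the scheduler recovers a container from an NM status report. */
  public synchronized boolean shouldReleaseOnRecovery(ContainerId containerId) {
    // If the AM already asked for release, the recovered container can be released now.
    return pendingRelease.remove(containerId);
  }
}
{code}
Keeping the set on the attempt and synchronizing on the attempt object avoids a single lock across all applications, which was the granularity concern raised above.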
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101445#comment-14101445 ] Kendall Thrapp commented on YARN-415: - {quote} 1. Is the chargeback simply to track the usage and maybe financially charge the users? Or is it to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? {quote} In addition to the goals [~eepayne] listed, another goal is to make it easier for users to compare how code changes to a particular recurring Hadoop job affect its resource usage. Assuming the input data size didn't significantly change, it'd be much more apparent to the user after a code change if there was a resulting significant change in the resource usage for their job. Even without charging, I'm hoping that having the resource usage shown to the user, without any extra work on their part, will make more people think about their overall grid resource usage, instead of just run times. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101509#comment-14101509 ] Ravi Prakash commented on YARN-2424: Thanks Tucu for pointing out the security implications of allowing unauthenticated users to run tasks as themselves (or impersonate others) on nodes. I agree that is not something we should turn on by default. That is why I think it is necessary for the default value of DEFAULT_NM_NONSECURE_MODE_LIMIT_USERS to be true. However, there is a use case, as pointed out by Allen (as a stepping stone towards turning on Kerberos), that we at Altiscale and presumably others also have (e.g. Jay's last comment on YARN-1253). Thanks for this patch Allen! I'll take a look at it. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Labels: regression Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2394: -- Attachment: YARN-2394-1.patch Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2394-1.patch Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
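To make the proposal concrete: the check being generalized compares a queue's usage against a fraction of its fair share. A simplified sketch with a configurable, per-queue threshold (the accessor and class names here are hypothetical; the current code effectively hardcodes 0.5) might look like:
{code:java}
// Simplified sketch of the fair-share starvation test with a configurable threshold.
// fairSharePreemptionThreshold is the proposed per-queue value; today it is a fixed 0.5.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class StarvationCheck {

  /** A queue is starved for fair share when its usage falls below threshold * fairShare. */
  static boolean isStarvedForFairShare(Resource usage, Resource fairShare,
      double fairSharePreemptionThreshold) {
    Resource starvationLine = Resources.multiply(fairShare, fairSharePreemptionThreshold);
    return usage.getMemory() < starvationLine.getMemory();
  }
}
{code}
The per-queue configuration would then feed a different threshold into this comparison for each queue instead of the global 50%.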
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101533#comment-14101533 ] Alejandro Abdelnur commented on YARN-2424: -- I really don't like it; it is not my business how you run your clusters, but this is dangerous, especially in a multi-tenancy scenario. From Allen's comment (the one I highlighted) it is not clear to me that this is meant only for setup/troubleshooting usage. I would not -1 this JIRA if... * the property has 'use-only-for-troubleshooting' in its name. * the NM logs print a WARN at startup and on every started container, stating the flag and its insecure nature. * the container stdout/stderr also print a WARN to alert the user of the cluster setup. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Labels: regression Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-2424: --- Labels: (was: regression) LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-115) yarn commands shouldn't add m to the heapsize
[ https://issues.apache.org/jira/browse/YARN-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved YARN-115. --- Resolution: Duplicate Between HADOOP-9902 and HADOOP-10950, this issue will be fully covered. Closing as a dupe. yarn commands shouldn't add m to the heapsize --- Key: YARN-115 URL: https://issues.apache.org/jira/browse/YARN-115 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 0.23.3 Reporter: Thomas Graves Labels: usability The yarn commands add "m" to the heapsize. This is unlike the HDFS side and what the old JT/TT used to do. JAVA_HEAP_MAX=-Xmx$YARN_RESOURCEMANAGER_HEAPSIZEm JAVA_HEAP_MAX=-Xmx$YARN_NODEMANAGER_HEAPSIZEm We should not be adding in the "m"; we should allow the user to specify units. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101568#comment-14101568 ] Jian He commented on YARN-1372: --- bq. Not sure if there is an easier way to link the two right now as the application cleanup lifecycle also converts into a Container Kill just like any other container Kill. I meant: can we remove all the containers in NMContext once we receive the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration? Because applications are already completed by the time the applicationsToCleanUp is received, the containers kept in NMContext may not be needed any more. bq. This is to allow a separate set of justFinishedContainers that can be used for returning to AM and at the same time acknowledging the previously returned set to NM. Can the same justFinishedContainers set be used to return to the AM and ack the NMs? bq. DECOMMISSIONED/LOST state possible to receive the new event? Sorry for being unclear. I meant: is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle that too. The patch is not applying anymore. Can you update the patch please? Thanks. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
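The bookkeeping being described, reduced to its essentials (the class and method names below are invented for illustration; this is not the attached patch), is a map of completed-container statuses that survives until the RM confirms the AM has pulled them:
{code:java}
// Illustrative NM-side tracking of completed containers across an RM restart.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class CompletedContainerTracker {

  // Completed containers reported to the RM but not yet acknowledged as pulled by the AM.
  private final Map<ContainerId, ContainerStatus> unacked =
      new HashMap<ContainerId, ContainerStatus>();

  public synchronized void containerCompleted(ContainerStatus status) {
    unacked.put(status.getContainerId(), status);
  }

  /** The RM (via a heartbeat response) confirmed the AM pulled these statuses. */
  public synchronized void ackPulledByAM(List<ContainerId> pulled) {
    for (ContainerId id : pulled) {
      unacked.remove(id);
    }
  }

  /** On re-register after an RM restart, resend everything never acknowledged. */
  public synchronized List<ContainerStatus> statusesToResend() {
    return new ArrayList<ContainerStatus>(unacked.values());
  }
}
{code}
As the issue description notes, resending unacknowledged statuses means some completions may be reported more than once, which the RM/AM side has to tolerate.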
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101605#comment-14101605 ] Jian He commented on YARN-1919: --- looks good to me. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2386) Refactor common scheduler configurations into a base ResourceSchedulerConfig class
[ https://issues.apache.org/jira/browse/YARN-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan resolved YARN-2386. Resolution: Invalid I took a look at both scheduler configs and, unfortunately, the configurations are so disparate that there isn't much in common to refactor out. Refactor common scheduler configurations into a base ResourceSchedulerConfig class -- Key: YARN-2386 URL: https://issues.apache.org/jira/browse/YARN-2386 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan As discussed with [~leftnoteasy], [~jianhe] and [~kasha], this JIRA proposes refactoring common configuration from the Capacity and Fair Schedulers into a common base class to avoid duplicating configs. Currently the Capacity and Fair Scheduler configs directly extend Configuration; adding a common base ResourceScheduler config class would also align with the ResourceScheduler hierarchy and enable other systems like the reservation system (YARN-2080) to be scheduler-implementation agnostic. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2428) LCE default banned user list should have yarn
Allen Wittenauer created YARN-2428: -- Summary: LCE default banned user list should have yarn Key: YARN-2428 URL: https://issues.apache.org/jira/browse/YARN-2428 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer When task-controller was retrofitted to YARN, the default banned user list didn't add yarn. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2429) LCE should blacklist based upon group
Allen Wittenauer created YARN-2429: -- Summary: LCE should blacklist based upon group Key: YARN-2429 URL: https://issues.apache.org/jira/browse/YARN-2429 Project: Hadoop YARN Issue Type: New Feature Reporter: Allen Wittenauer It should be possible to list a group to ban, not just individual users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101677#comment-14101677 ] Ravi Prakash commented on YARN-2424: Hi Tucu! Thanks for your comment. There is currently the capability to blacklist/whitelist users in the container-executor.cfg file. Given this capability, do you think that, in a properly configured cluster, YARN tasks launching as different users could create problems? This is with the assumption that most clusters do not have NFS mounts on the slave nodes. As an aside, I think it would be good to add a blacklist + whitelist for groups as well. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101686#comment-14101686 ] Maysam Yabandeh commented on YARN-2430: --- Here are the current alternative solutions: 1. A simple, quick fix would be to cache the result of getResourceUsage in a field of Schedulable and invalidate the cache after each scheduling. The invalidation requires iteration over all schedulables, with cost O(n). 2. Alternatively, as suggested by Karthik, the cached result could be updated periodically as part of the UpdateThread. This approach would also encourage moving the sorting to the UpdateThread, since the sort algorithm would no longer be provided with the most up-to-date data. 3. Karthik also brought up the option of a bottom-up update of the resource usage when something gets updated: each Schedulable pushes up the change in its resource usage after each change. This would require invoking the push-up method at the right places. Care must be taken in future changes not to forget calling the push-up method. I would highly appreciate comments. FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
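To illustrate the simplest of these (option 1), a cached-and-invalidated usage field might look roughly like the following; the names are hypothetical and the real Schedulable/FSLeafQueue code differs:
{code:java}
// Sketch of option 1: cache the aggregated usage, recompute only after invalidation.
import org.apache.hadoop.yarn.api.records.Resource;

public abstract class CachedUsageSchedulable {

  private Resource cachedUsage;  // null means "stale, recompute on next read"

  /** Expensive aggregation over child apps/queues; provided by the concrete class. */
  protected abstract Resource computeResourceUsage();

  public synchronized Resource getResourceUsage() {
    if (cachedUsage == null) {
      cachedUsage = computeResourceUsage();
    }
    return cachedUsage;
  }

  /** Called after each scheduling pass (an O(n) sweep over schedulables). */
  public synchronized void invalidateUsage() {
    cachedUsage = null;
  }
}
{code}
Options 2 and 3 trade this per-pass invalidation for, respectively, a periodic refresh in the UpdateThread and incremental bottom-up propagation of usage deltas.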
[jira] [Created] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
Maysam Yabandeh created YARN-2430: - Summary: FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101732#comment-14101732 ] Sandy Ryza commented on YARN-2430: -- I believe #3 is the best approach, as it's more performant than #1, and #2 has correctness issues. I actually implemented it a little while ago as part of YARN-1297 and will try to get that in. FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101822#comment-14101822 ] Hadoop QA commented on YARN-2034: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662550/YARN-2034-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4668//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4668//console This message is automatically generated. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1297) Miscellaneous Fair Scheduler speedups
[ https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101836#comment-14101836 ] Karthik Kambatla commented on YARN-1297: I can take a look at an updated patch. Miscellaneous Fair Scheduler speedups - Key: YARN-1297 URL: https://issues.apache.org/jira/browse/YARN-1297 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1297-1.patch, YARN-1297-2.patch, YARN-1297.patch, YARN-1297.patch I ran the Fair Scheduler's core scheduling loop through a profiler tool and identified a bunch of minimally invasive changes that can shave off a few milliseconds. The main one is demoting a couple INFO log messages to DEBUG, which brought my benchmark down from 16000 ms to 6000. A few others (which had way less of an impact) were * Most of the time in comparisons was being spent in Math.signum. I switched this to direct ifs and elses and it halved the percent of time spent in comparisons. * I removed some unnecessary instantiations of Resource objects * I made it so that queues' usage wasn't calculated from the applications up each time getResourceUsage was called. -- This message was sent by Atlassian JIRA (v6.2#6252)
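For readers curious about the Math.signum point above, the change amounts to something like the following simplified before/after; this is an illustration, not the exact diff in the patch.
{code:java}
// Simplified before/after of the comparator tweak described above.
public final class RatioCompare {

  /** Original style: subtract and take the sign of a double. */
  static int compareWithSignum(double useToWeightA, double useToWeightB) {
    return (int) Math.signum(useToWeightA - useToWeightB);
  }

  /** Cheaper style: direct comparisons avoid the floating-point signum call. */
  static int compareWithIfs(double useToWeightA, double useToWeightB) {
    if (useToWeightA < useToWeightB) {
      return -1;
    } else if (useToWeightA > useToWeightB) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}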