[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375867#comment-14375867
 ] 

Junping Du commented on YARN-3304:
--

Hi [~vinodkv], agreed that we can fix it to return 0 to keep it consistent with the other 
values. We can file a separate JIRA to track the improvement of handling the 
unavailable case later, which shouldn't block the 2.7 release.

 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker

 Per discussions in YARN-3296, getCpuUsagePercent() returns -1 for the 
 unavailable case while the other resource metrics return 0 in the same case, 
 which is inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2213) Change proxy-user cookie log in AmIpFilter to DEBUG

2015-03-23 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375785#comment-14375785
 ] 

Xuan Gong commented on YARN-2213:
-

Could we add 
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("Could not find " + WebAppProxyServlet.PROXY_USER_COOKIE_NAME
      + " cookie, so user will not be set");
}
{code}

 Change proxy-user cookie log in AmIpFilter to DEBUG
 ---

 Key: YARN-2213
 URL: https://issues.apache.org/jira/browse/YARN-2213
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Ted Yu
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-2213.001.patch


 I saw a lot of the following lines in AppMaster log:
 {code}
 14/06/24 17:12:36 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 {code}
 For a long-running app, this would consume considerable log space.
 The log level should be changed to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI

2015-03-23 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3225:

Attachment: YARN-3225-1.patch

 New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
 ---

 Key: YARN-3225
 URL: https://issues.apache.org/jira/browse/YARN-3225
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Devaraj K
 Attachments: YARN-3225-1.patch, YARN-3225.patch, YARN-914.patch


 A new CLI (or an existing CLI with new parameters) should put each node on the 
 decommission list into the decommissioning state and track a timeout to terminate 
 the nodes that haven't finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-03-23 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3136:
--
Attachment: 0007-YARN-3136.patch

Uploading patch to check findbugs warnings.

 getTransferredContainers can be a bottleneck during AM registration
 ---

 Key: YARN-3136
 URL: https://issues.apache.org/jira/browse/YARN-3136
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Sunil G
 Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
 0006-YARN-3136.patch, 0007-YARN-3136.patch


 While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
 stuck waiting for the scheduler lock trying to call getTransferredContainers. 
  The scheduler lock is highly contended, especially on a large cluster with 
 many nodes heartbeating, and it would be nice if we could find a way to 
 eliminate the need to grab this lock during this call.  We've already done 
 similar work during AM allocate calls to make sure they don't needlessly grab 
 the scheduler lock, and it would be good to do so here as well, if possible.
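A minimal, self-contained sketch of the direction suggested here (illustrative only, not the actual YARN-3136 patch or the real scheduler types): keep the transferred containers in a per-attempt concurrent map so AM registration can read them without contending on the scheduler's global lock.
{code}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: a lock-free lookup of "transferred" containers keyed by
// attempt id, instead of a method that synchronizes on one global scheduler lock.
public class TransferredContainersSketch {
  private final ConcurrentMap<String, List<String>> transferred =
      new ConcurrentHashMap<>();

  public void recordTransferred(String attemptId, List<String> containers) {
    transferred.put(attemptId, containers);
  }

  // Called during AM registration; no scheduler-wide lock needed.
  public List<String> getTransferredContainers(String attemptId) {
    return transferred.getOrDefault(attemptId, Collections.emptyList());
  }

  public static void main(String[] args) {
    TransferredContainersSketch s = new TransferredContainersSketch();
    s.recordTransferred("appattempt_1", Arrays.asList("container_1", "container_2"));
    System.out.println(s.getTransferredContainers("appattempt_1"));
  }
}
{code}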



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375759#comment-14375759
 ] 

Hadoop QA commented on YARN-3136:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706510/0007-YARN-3136.patch
  against trunk revision 0b9f12c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7071//console

This message is automatically generated.

 getTransferredContainers can be a bottleneck during AM registration
 ---

 Key: YARN-3136
 URL: https://issues.apache.org/jira/browse/YARN-3136
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Sunil G
 Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
 0006-YARN-3136.patch, 0007-YARN-3136.patch


 While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
 stuck waiting for the scheduler lock trying to call getTransferredContainers. 
  The scheduler lock is highly contended, especially on a large cluster with 
 many nodes heartbeating, and it would be nice if we could find a way to 
 eliminate the need to grab this lock during this call.  We've already done 
 similar work during AM allocate calls to make sure they don't needlessly grab 
 the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2213) Change proxy-user cookie log in AmIpFilter to DEBUG

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375788#comment-14375788
 ] 

Hadoop QA commented on YARN-2213:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12690459/YARN-2213.001.patch
  against trunk revision 0b9f12c.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7073//console

This message is automatically generated.

 Change proxy-user cookie log in AmIpFilter to DEBUG
 ---

 Key: YARN-2213
 URL: https://issues.apache.org/jira/browse/YARN-2213
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Ted Yu
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-2213.001.patch


 I saw a lot of the following lines in AppMaster log:
 {code}
 14/06/24 17:12:36 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 14/06/24 17:12:39 WARN web.SliderAmIpFilter: Could not find proxy-user 
 cookie, so user will not be set
 {code}
 For a long-running app, this would consume considerable log space.
 The log level should be changed to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3389) Two attempts might operate on same data structures concurrently

2015-03-23 Thread Jun Gong (JIRA)
Jun Gong created YARN-3389:
--

 Summary: Two attempts might operate on same data structures 
concurrently
 Key: YARN-3389
 URL: https://issues.apache.org/jira/browse/YARN-3389
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong


In AttemptFailedTransition, the new attempt gets references to the failed attempt's 
state ('justFinishedContainers' and 'finishedContainersSentToAM'). The two attempts 
might then operate on these two variables concurrently, e.g. they might both update 
'justFinishedContainers' while handling CONTAINER_FINISHED events.
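A small, self-contained sketch of the hazard (illustrative names, not the actual RMAppAttemptImpl code): if the new attempt reuses the failed attempt's list reference, both attempts can mutate it at once; giving the new attempt a copy, or using a thread-safe collection, avoids the shared mutable state.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: the failed attempt and the new attempt must not share the
// same mutable 'justFinishedContainers' collection.
public class SharedStateSketch {
  public static void main(String[] args) {
    List<String> justFinishedContainers = new ArrayList<>();
    justFinishedContainers.add("container_1");

    // Buggy pattern: both attempts hold the same ArrayList and may call
    // add()/remove() from different dispatcher threads.
    List<String> newAttemptShared = justFinishedContainers;

    // Safer patterns: give the new attempt its own copy, or use a thread-safe
    // collection such as ConcurrentLinkedQueue for both attempts.
    List<String> newAttemptCopy = new ArrayList<>(justFinishedContainers);
    ConcurrentLinkedQueue<String> threadSafe =
        new ConcurrentLinkedQueue<>(justFinishedContainers);

    System.out.println(newAttemptShared == justFinishedContainers); // true
    System.out.println(newAttemptCopy == justFinishedContainers);   // false
    System.out.println(threadSafe.size());                          // 1
  }
}
{code}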



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3389) Two attempts might operate on same data structures concurrently

2015-03-23 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-3389:
---
Attachment: YARN-3389.01.patch

 Two attempts might operate on same data structures concurrently
 ---

 Key: YARN-3389
 URL: https://issues.apache.org/jira/browse/YARN-3389
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3389.01.patch


 In AttemptFailedTransition, the new attempt gets references to the failed attempt's 
 state ('justFinishedContainers' and 'finishedContainersSentToAM'). The two attempts 
 might then operate on these two variables concurrently, e.g. they might both update 
 'justFinishedContainers' while handling CONTAINER_FINISHED events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3384) Test failures since TestLogAggregationService.verifyContainerLogs fails after YARN-2777

2015-03-23 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-3384:
-
Summary: Test failures since TestLogAggregationService.verifyContainerLogs 
fails after YARN-2777  (was: test case failures in TestLogAggregationService)

 Test failures since TestLogAggregationService.verifyContainerLogs fails after 
 YARN-2777
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3384) TestLogAggregationService.verifyContainerLogs fails after YARN-2777

2015-03-23 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-3384:
-
Summary: TestLogAggregationService.verifyContainerLogs fails after 
YARN-2777  (was: Test failures since 
TestLogAggregationService.verifyContainerLogs fails after YARN-2777)

 TestLogAggregationService.verifyContainerLogs fails after YARN-2777
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3388) userlimit isn't playing well with DRF calculator

2015-03-23 Thread Nathan Roberts (JIRA)
Nathan Roberts created YARN-3388:


 Summary: userlimit isn't playing well with DRF calculator
 Key: YARN-3388
 URL: https://issues.apache.org/jira/browse/YARN-3388
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


When there are multiple active users in a queue, it should be possible for 
those users to make use of capacity up to max_capacity (or close). The 
resources should be fairly distributed among the active users in the queue. 
This works pretty well when there is a single resource being scheduled. 
However, when there are multiple resources the situation gets more complex and 
the current algorithm tends to get stuck at Capacity. 

Example illustrated in subsequent comment.











--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3384) test case failures in TestLogAggregationService

2015-03-23 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376035#comment-14376035
 ] 

Tsuyoshi Ozawa commented on YARN-3384:
--

+1, committing this shortly.

 test case failures in TestLogAggregationService
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3384) TestLogAggregationService.verifyContainerLogs fails after YARN-2777

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376059#comment-14376059
 ] 

Naganarasimha G R commented on YARN-3384:
-

Thanks [~ozawa], for reviewing and committing the patch :)

 TestLogAggregationService.verifyContainerLogs fails after YARN-2777
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Fix For: 2.7.0

 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.

2015-03-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3363:

Attachment: YARN-3363.000.patch

 add localization and container launch time to ContainerMetrics at NM to show 
 these timing information for each active container.
 

 Key: YARN-3363
 URL: https://issues.apache.org/jira/browse/YARN-3363
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu
  Labels: metrics, supportability
 Attachments: YARN-3363.000.patch


 Add localization and container launch time to ContainerMetrics at the NM to show 
 this timing information for each active container.
 Currently ContainerMetrics has the container's actual memory usage (YARN-2984), 
 actual CPU usage (YARN-3122), resource and pid (YARN-3022). It would be better 
 to also have localization and container launch time in ContainerMetrics for each 
 active container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3111) Fix ratio problem on FairScheduler page

2015-03-23 Thread Peng Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Zhang updated YARN-3111:
-
Attachment: YARN-3111.v2.patch

 Fix ratio problem on FairScheduler page
 ---

 Key: YARN-3111
 URL: https://issues.apache.org/jira/browse/YARN-3111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Peng Zhang
Priority: Minor
 Attachments: YARN-3111.1.patch, YARN-3111.png, YARN-3111.v2.patch, 
 parenttooltip.png


 Found 3 problems on the FairScheduler page:
 1. Only memory is used to compute the ratio, even when the queue schedulingPolicy is DRF.
 2. When min resources are configured larger than the real resources, the steady 
 fair share ratio is so large that it runs off the page.
 3. When cluster resources are 0 (no NodeManager has started), the ratio is displayed as 
 NaN% used (illustrated in the sketch below).
 The attached image shows a snapshot of the above problems. 
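A tiny illustration of problem 3 (how the page computes the percentage is an assumption here, not the actual web UI code): dividing used resources by a zero cluster total gives NaN in Java, which then renders as "NaN% used".
{code}
// Illustrative only: with no NodeManager registered, the cluster total is 0.
public class NaNRatioDemo {
  public static void main(String[] args) {
    float used = 0f, clusterTotal = 0f;
    float ratio = used / clusterTotal * 100f;
    System.out.println(ratio + "% used");        // prints "NaN% used"

    // A guard before formatting avoids it.
    float safe = clusterTotal == 0f ? 0f : used / clusterTotal * 100f;
    System.out.println(safe + "% used");         // prints "0.0% used"
  }
}
{code}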



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3388) userlimit isn't playing well with DRF calculator

2015-03-23 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376060#comment-14376060
 ] 

Nathan Roberts commented on YARN-3388:
--

Example (there are lots of things going on in this algorithm; I simplified it to just the 
key pieces for clarity).
tuples are resources [memory] or [memory,cpu]

just memory:
-
Queue Capacity is [100]
2 active users, both request [10] at a time
User1 is at [45]
User2 is at [40]
Limit is calculated to be 100/2=50, both users can allocate
User2 goes to [50] - now used Capacity is 45+50=95
Limit is still 50
User1 goes to [55] - used Capacity now 50+55=105
Limit is now 105/2
User2 goes to [60] - used Capacity is now 60+55=115
Limit is now 115/2
So on and so forth until maxCapacity is hit.
Notice how the users essentially leap frog one another, allowing the Limit to 
continually move higher.

memory and cpu

Queue Capacity is [100,100]
2 active users, User1 asks for [10,20], User2 asks for [20,10]
User1 is at [35,45]
User2 is at [45,35]
Limit is calculated to be [100/2=50,100/2=50], both users can allocate
User2 goes to [65,45] - used Capacity is now [65+35=100,45+45=90]
Limit is still [50,50]
User1 goes to [45,65] - used Capacity is now [65+45=110,45+65=110]
Limit is now [110/2=55, 110/2=55]
User1 and User2 are now both considered over limit and neither can allocate. 
User1 is over on cpu, User2 is over on memory.

Open to suggestions on simple ways to fix this. I'm currently thinking a 
reasonable (simple, effective, computationally cheap, mostly fair) approach 
might be to give some small percentage of additional leeway for userLimit. 
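A compact, self-contained simulation of the [memory,cpu] scenario above (the per-dimension limit check is an assumption used for illustration, not the exact CapacityScheduler/DRF code): with the limit at [55,55], User1 is blocked on cpu and User2 on memory, so neither can allocate again.
{code}
// Illustrative simulation of the [memory,cpu] example above, not CapacityScheduler code.
public class UserLimitDeadlockSketch {
  public static void main(String[] args) {
    int[] user1 = {45, 65};                 // [memory, cpu] after its last allocation
    int[] user2 = {65, 45};
    int[] queueUsed = {user1[0] + user2[0], user1[1] + user2[1]};   // [110, 110]
    int[] limit = {queueUsed[0] / 2, queueUsed[1] / 2};             // [55, 55]

    System.out.println("user1 may allocate: " + underLimit(user1, limit)); // false (cpu 65 > 55)
    System.out.println("user2 may allocate: " + underLimit(user2, limit)); // false (mem 65 > 55)
  }

  // Assumption for illustration: a user is over the limit if any resource
  // dimension exceeds it, matching "User1 is over on cpu, User2 is over on memory".
  static boolean underLimit(int[] used, int[] limit) {
    for (int i = 0; i < used.length; i++) {
      if (used[i] > limit[i]) {
        return false;
      }
    }
    return true;
  }
}
{code}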



 userlimit isn't playing well with DRF calculator
 

 Key: YARN-3388
 URL: https://issues.apache.org/jira/browse/YARN-3388
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts

 When there are multiple active users in a queue, it should be possible for 
 those users to make use of capacity up to max_capacity (or close). The 
 resources should be fairly distributed among the active users in the queue. 
 This works pretty well when there is a single resource being scheduled. 
 However, when there are multiple resources the situation gets more complex 
 and the current algorithm tends to get stuck at Capacity. 
 Example illustrated in subsequent comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2777) Mark the end of individual log in aggregated log

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376066#comment-14376066
 ] 

Hudson commented on YARN-2777:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7402 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7402/])
YARN-3384. TestLogAggregationService.verifyContainerLogs fails after YARN-2777. 
Contributed by Naganarasimha G R. (ozawa: rev 
82eda771e05cf2b31788ee1582551e65f1c0f9aa)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java


 Mark the end of individual log in aggregated log
 

 Key: YARN-2777
 URL: https://issues.apache.org/jira/browse/YARN-2777
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ted Yu
Assignee: Varun Saxena
  Labels: log-aggregation
 Fix For: 2.7.0

 Attachments: YARN-2777.001.patch, YARN-2777.02.patch


 Below is snippet of aggregated log showing hbase master log:
 {code}
 LogType: hbase-hbase-master-ip-172-31-34-167.log
 LogUploadTime: 29-Oct-2014 22:31:55
 LogLength: 24103045
 Log Contents:
 Wed Oct 29 15:43:57 UTC 2014 Starting master on ip-172-31-34-167
 ...
   at 
 org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
   at org.apache.hadoop.hbase.Chore.run(Chore.java:80)
   at java.lang.Thread.run(Thread.java:745)
 LogType: hbase-hbase-master-ip-172-31-34-167.out
 {code}
 Since logs from various daemons are aggregated in one log file, it would be 
 desirable to mark the end of one log before starting with the next.
 e.g. with such a line:
 {code}
 End of LogType: hbase-hbase-master-ip-172-31-34-167.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376086#comment-14376086
 ] 

Hadoop QA commented on YARN-3111:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706530/YARN-3111.v2.patch
  against trunk revision 0b9f12c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7074//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7074//console

This message is automatically generated.

 Fix ratio problem on FairScheduler page
 ---

 Key: YARN-3111
 URL: https://issues.apache.org/jira/browse/YARN-3111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Peng Zhang
Priority: Minor
 Attachments: YARN-3111.1.patch, YARN-3111.png, YARN-3111.v2.patch, 
 parenttooltip.png


 Found 3 problems on the FairScheduler page:
 1. Only memory is used to compute the ratio, even when the queue schedulingPolicy is DRF.
 2. When min resources are configured larger than the real resources, the steady 
 fair share ratio is so large that it runs off the page.
 3. When cluster resources are 0 (no NodeManager has started), the ratio is displayed as 
 NaN% used
 The attached image shows a snapshot of the above problems. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3384) TestLogAggregationService.verifyContainerLogs fails after YARN-2777

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376067#comment-14376067
 ] 

Hudson commented on YARN-3384:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7402 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7402/])
YARN-3384. TestLogAggregationService.verifyContainerLogs fails after YARN-2777. 
Contributed by Naganarasimha G R. (ozawa: rev 
82eda771e05cf2b31788ee1582551e65f1c0f9aa)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java


 TestLogAggregationService.verifyContainerLogs fails after YARN-2777
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Fix For: 2.7.0

 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375763#comment-14375763
 ] 

Hadoop QA commented on YARN-3225:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706513/YARN-3225-1.patch
  against trunk revision 0b9f12c.

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7072//console

This message is automatically generated.

 New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
 ---

 Key: YARN-3225
 URL: https://issues.apache.org/jira/browse/YARN-3225
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Devaraj K
 Attachments: YARN-3225-1.patch, YARN-3225.patch, YARN-914.patch


 A new CLI (or an existing CLI with new parameters) should put each node on the 
 decommission list into the decommissioning state and track a timeout to terminate 
 the nodes that haven't finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3394) WebApplication proxy documentation is incomplete

2015-03-23 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377320#comment-14377320
 ] 

Tsuyoshi Ozawa commented on YARN-3394:
--

+1 for having the document.

 WebApplication  proxy documentation is incomplete
 -

 Key: YARN-3394
 URL: https://issues.apache.org/jira/browse/YARN-3394
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor

 The Web proxy documentation is incomplete:
 hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
 1. Configuration to start/stop the service as a separate server
 2. Steps to start it as a daemon service
 3. Secure mode for the Web proxy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3394) WebApplication proxy documentation is incomplete

2015-03-23 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3394:
--

 Summary: WebApplication  proxy documentation is incomplete
 Key: YARN-3394
 URL: https://issues.apache.org/jira/browse/YARN-3394
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor



The Web proxy documentation is incomplete:

hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html

1. Configuration to start/stop the service as a separate server
2. Steps to start it as a daemon service
3. Secure mode for the Web proxy





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376602#comment-14376602
 ] 

Junping Du commented on YARN-3040:
--

Hi [~zjshen], thanks for the patch! I am still reviewing the patch but have 
some quick comments so far:
{code}
+  public static String generateDefaultClusterIdBasedOnAppId(
+      ApplicationId appId) {
+    return "cluster_" + appId.getClusterTimestamp();
+  }
{code}
It seems the appId's ClusterTimestamp comes from the RM and changes every time the RM 
restarts. I think here we need a ClusterID that stays consistent across RM restarts, 
doesn't it? Otherwise applications submitted to the same cluster could get different 
ClusterIDs just because the RM failed over, which shouldn't be users' expectation. 
I suggest adding a configuration for the user to specify a ClusterID, falling back to 
the generated (and variable) default value for test purposes.
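A minimal sketch of that suggestion (the configuration key name and fallback shown here are assumptions for illustration, not the YARN-3040 patch):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: prefer an admin-configured cluster ID; fall back to the
// timestamp-derived default only when nothing is configured.
public class ClusterIdSketch {
  // Assumed key name, used here purely for illustration.
  static final String CLUSTER_ID_KEY = "yarn.timeline-service.cluster-id";

  static String getClusterId(Configuration conf, long rmClusterTimestamp) {
    String configured = conf.getTrimmed(CLUSTER_ID_KEY);
    if (configured != null && !configured.isEmpty()) {
      return configured;                      // stable across RM restarts
    }
    return "cluster_" + rmClusterTimestamp;   // generated default, varies per RM start
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    System.out.println(getClusterId(conf, 1427100000000L)); // cluster_1427100000000
    conf.set(CLUSTER_ID_KEY, "prod-cluster");
    System.out.println(getClusterId(conf, 1427100000000L)); // prod-cluster
  }
}
{code}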

{code}
+  rpc getTimelienCollectorContext (GetTimelineCollectorContextRequestProto) 
returns (GetTimelineCollectorContextResponseProto);
{code}
One typo here and in other places: Timelien should be Timeline.

{code}
-import java.util.ArrayList;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
-import java.util.Vector;
+import java.util.*;
{code}
We shouldn't do this, as it could pull in unnecessary classes.

{code}
+   * The aggregator needs to get the context information including user, flow
{code}
aggregator -> collector

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376684#comment-14376684
 ] 

Zhijie Shen commented on YARN-3390:
---

It shouldn't. Storage layer implementations only depend on the writer 
interface, which is covered in YARN-3040.

 RMTimelineCollector should have the context info of each app
 

 Key: YARN-3390
 URL: https://issues.apache.org/jira/browse/YARN-3390
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 RMTimelineCollector should have the context info of each app whose entity  
 has been put



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-03-23 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376765#comment-14376765
 ] 

Yongjun Zhang commented on YARN-3021:
-

Hi [~jianhe],

Thanks a lot for the clarification, I did a new rev (06) to address your latest 
comment, and also tested it against real clusters. Would you please take a  
further look? Thanks.





 YARN's delegation-token handling disallows certain trust setups to operate 
 properly over DistCp
 ---

 Key: YARN-3021
 URL: https://issues.apache.org/jira/browse/YARN-3021
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.3.0
Reporter: Harsh J
Assignee: Yongjun Zhang
 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
 YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
 YARN-3021.006.patch, YARN-3021.patch


 Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, 
 and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN 
 clusters.
 Now if one logs in with a COMMON credential, and runs a job on A's YARN that 
 needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
 as it attempts a renewDelegationToken(…) synchronously during application 
 submission (to validate the managed token before it adds it to a scheduler 
 for automatic renewal). The call obviously fails because the B realm will not trust 
 A's credentials (here, the RM's principal is the renewer).
 In the 1.x JobTracker the same call is present, but it is done asynchronously 
 and once the renewal attempt failed we simply ceased to schedule any further 
 attempts of renewals, rather than fail the job immediately.
 We should change the logic such that we attempt the renewal but go easy on 
 the failure and skip the scheduling alone, rather than bubble back an error 
 to the client, failing the app submission. This way the old behaviour is 
 retained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376150#comment-14376150
 ] 

Sangjin Lee commented on YARN-3034:
---

LGTM. Let's wait to hear from Zhijie.

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376109#comment-14376109
 ] 

Zhijie Shen commented on YARN-3047:
---

bq. We can probably move it to yarn-api.

I prefer keeping it in the server module, unless it's supposed to be public 
to users.

bq. This has to be discussed though as Zhijie Shen thinks we can use the same 
v1 config.

My opinion is that the collector should bind to a random port, which will be 
reported to the timeline client. The reader, as a single daemon, should start on a 
configured port, and users know it from the config.

bq. TimelineReaderWebServer

If you'd like to keep "reader", I'm fine with it, but let's still say 
TimelineReaderServer. Meanwhile, TimelineReaderWebService -> 
TimelineReaderWebService*s*.

 [Data Serving] Set up ATS reader with basic request serving structure and 
 lifecycle
 ---

 Key: YARN-3047
 URL: https://issues.apache.org/jira/browse/YARN-3047
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Varun Saxena
 Attachments: YARN-3047.001.patch, YARN-3047.003.patch, 
 YARN-3047.02.patch


 Per design in YARN-2938, set up the ATS reader as a service and implement the 
 basic structure as a service. It includes lifecycle management, request 
 serving, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376444#comment-14376444
 ] 

Hadoop QA commented on YARN-3225:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706513/YARN-3225-1.patch
  against trunk revision 7e6f384.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRM

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7077//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7077//console

This message is automatically generated.

 New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
 ---

 Key: YARN-3225
 URL: https://issues.apache.org/jira/browse/YARN-3225
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Devaraj K
 Attachments: YARN-3225-1.patch, YARN-3225.patch, YARN-914.patch


 A new CLI (or an existing CLI with new parameters) should put each node on the 
 decommission list into the decommissioning state and track a timeout to terminate 
 the nodes that haven't finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376169#comment-14376169
 ] 

Zhijie Shen commented on YARN-3040:
---

bq.  It sounds not quite scalable if we have one client for each app in the 
RM...

In the RM/NM, I think we can and should implement a wrapper layer, which may 
contain multiple applications, to have a delegator write the data for 
multiple applications.

bq. One most significant advantage to have run ids as integers is we can easily 
sort all existing runs for one flow in ascending or descending order. This 
might be a solid use case in general?

I can see the benefit. For example, if it represents the timestamp, we can 
filter the flow runs and say, give me the runs in the last 5 mins. But my 
concern is whether it's the general way to let users describe a run.

bq. Hmm, I didn't think the version as part of the flow id.

I can understand the particular case described above. Like my prior comment 
about the flow run ID, my concern is whether the explicit flow/version/run hierarchy 
is general enough to capture most use cases. IMHO, by nature, the hierarchy is a 
tree of flows, and a flow can be a flow of flows or a flow of apps. 
However, if other users just want to use one level of flow, the version/run info 
seems redundant. On the other hand, if we use the recursive flow structure, 
it's elastic enough to go from one flow level to many. We can treat the first level 
as the flow, the second as the version and the third as the run. I don't have expert 
knowledge about workflow systems such as Oozie, but just want to think my concern 
out loud. That said, if flow/version/run is the general description of a flow, I 
agree we should pass in these three env vars together and separately.

bq. Mostly fine, but I have some concerns about rolling upgrades.
bq. I'm still not sure why it would make sense to have different logical 
cluster id's every time the RM/cluster restarts. 

I meant the admin can configure a cluster ID explicitly, which won't have the 
timestamp appended. I added it to the default value to distinguish 
clusters started by you and me, but thinking about it again, the 
RM restart problem makes sense. I'll change the default not to 
append the timestamp.


 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3389) Two attempts might operate on same data structures concurrently

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376180#comment-14376180
 ] 

Hadoop QA commented on YARN-3389:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706549/YARN-3389.01.patch
  against trunk revision 0b9f12c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens
  org.apache.hadoop.yarn.server.resourcemanager.TestRM

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7075//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7075//console

This message is automatically generated.

 Two attempts might operate on same data structures concurrently
 ---

 Key: YARN-3389
 URL: https://issues.apache.org/jira/browse/YARN-3389
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3389.01.patch


 In AttemptFailedTransition, the new attempt will get 
 state('justFinishedContainers' and 'finishedContainersSentToAM') reference 
 from the failed attempt. Then the two attempts might operate on these two 
 variables concurrently, e.g. they might update 'justFinishedContainers' 
 concurrently when they are both handling CONTAINER_FINISHED event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376200#comment-14376200
 ] 

Junping Du commented on YARN-3034:
--

Thanks [~Naganarasimha] for updating the patch!
bq.  Also, we should add a warning message log if user put something illegal 
here or it just get silent without any warn. This i feel is not required as we 
don't do this for any other configuration and also we have clearly captured the 
possible values in the yarn-default.xml.
Most configurations are loaded as boolean values or int numbers. Some String 
configurations are for loading classes, so a ClassNotFoundException will be thrown 
immediately if the name is wrong. This is a different case, so I still suggest 
adding a check and warning here.

For the context info, [~zjshen], can we put that work in your patch in YARN-3040? 
Or do you suggest something else? 


 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376415#comment-14376415
 ] 

Naganarasimha G R commented on YARN-3034:
-

Also [~zjshen], the earlier thought process for exposing RMTimelineCollector to the RM 
and its context was to gradually replace SystemMetricsPublisher with 
RMTimelineCollector, as I felt that once we deprecate and completely remove ATSv1, we 
might not require much of the functionality of SystemMetricsPublisher and it 
would just be delegating the calls to RMTimelineCollector. Your thoughts?

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3241) FairScheduler handles invalid queue names inconsistently

2015-03-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3241:
---
Summary: FairScheduler handles invalid queue names inconsistently  (was: 
Leading space, trailing space and empty sub queue name may cause 
MetricsException for fair scheduler)

 FairScheduler handles invalid queue names inconsistently
 --

 Key: YARN-3241
 URL: https://issues.apache.org/jira/browse/YARN-3241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-3241.000.patch, YARN-3241.001.patch, 
 YARN-3241.002.patch


 A leading space, trailing space or empty sub-queue name may cause a 
 MetricsException (Metrics source XXX already exists!) when adding an application to 
 the FairScheduler.
 The reason is that QueueMetrics parses the queue name differently from the 
 QueueManager.
 QueueMetrics uses Q_SPLITTER to parse the queue name; it removes leading 
 and trailing spaces in the sub-queue names and also removes empty sub-queue 
 names.
 {code}
   static final Splitter Q_SPLITTER =
       Splitter.on('.').omitEmptyStrings().trimResults();
 {code}
 But QueueManager won't remove leading spaces, trailing spaces or empty sub-queue 
 names.
 This causes FSQueue and FSQueueMetrics to get out of sync:
 QueueManager thinks the two queue names are different, so it tries to 
 create a new queue,
 but FSQueueMetrics treats the two queue names as the same queue, which 
 raises the Metrics source XXX already exists! MetricsException.
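A tiny, self-contained demonstration of the mismatch described above (it uses Guava's Splitter exactly as quoted; the naive String.split comparison merely stands in for QueueManager's stricter parsing):
{code}
import com.google.common.base.Splitter;
import java.util.Arrays;
import java.util.List;

// Illustrative only: Q_SPLITTER trims and drops empty parts, while a naive
// split keeps them, so the two components can disagree on whether
// "root. a ..b" and "root.a.b" name the same queue.
public class QueueNameSplitSketch {
  static final Splitter Q_SPLITTER =
      Splitter.on('.').omitEmptyStrings().trimResults();

  public static void main(String[] args) {
    String queueName = "root. a ..b";
    List<String> metricsView = Q_SPLITTER.splitToList(queueName);
    List<String> naiveView = Arrays.asList(queueName.split("\\."));

    System.out.println(metricsView); // [root, a, b]
    System.out.println(naiveView);   // [root,  a , , b]
  }
}
{code}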



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376498#comment-14376498
 ] 

Naganarasimha G R commented on YARN-3034:
-

Thanks for the comments [~zjshen],
bq.  and in this approach, I don't think we should couple RMTimelineCollector 
and SystemMetricsPublisher. Keeping SystemMetricsPublisher separate, we can 
easily deprecate and even remove it from the code base later. 
Maybe I am missing something here. If the RM or RM context is not aware of it, then the 
only way RMTimelineCollector can be invoked is through SystemMetricsPublisher's 
(SMP) public methods like appCreated, appFinished, appAttemptRegistered, or 
RMTimelineCollector can have its own event handler and during initialization 
SMP can select either the event handler present in its own class or that of 
RMTimelineCollector. But there will still be a dependency on the event source calling 
SMP's public methods. So I feel it will not be smooth to deprecate and 
remove SystemMetricsPublisher, as it will have the code for creating 
RMTimelineCollector and for sending events to RMTimelineCollector to publish to ATS v2.

bq. Moreover, we can keep the existing config as what it is now, and create a 
new config to control starting v2 RM writing data stack.
IMHO the current config is better because in ATS v2 container events are 
planned to be moved to the NM side (YARN-3045). So we require a 
configuration on the NM side too, but we cannot use the existing 
{{yarn.resourcemanager.system-metrics-publisher.enabled}}, as it reads 
like an RM-side-only configuration.
The approach in the patch uses a single config for both the NM and the RM. 

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376556#comment-14376556
 ] 

Karthik Kambatla commented on YARN-3024:


[~chengbing.liu] - thanks for the clarifications. Makes sense.

For the TODOs, it would be nice to have follow-up JIRAs. If it is not too much 
trouble, can you create them so interested contributors could follow up? 

 LocalizerRunner should give DIE action when all resources are localized
 ---

 Key: YARN-3024
 URL: https://issues.apache.org/jira/browse/YARN-3024
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.7.0

 Attachments: YARN-3024.01.patch, YARN-3024.02.patch, 
 YARN-3024.03.patch, YARN-3024.04.patch


 We have observed that {{LocalizerRunner}} always gives a LIVE action at the 
 end of the localization process.
 The problem is that {{findNextResource()}} can return null even when {{pending}} 
 was not empty prior to the call. This method removes localized resources from 
 {{pending}}; therefore we should check the return value and give a DIE action 
 when it returns null.
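A self-contained sketch of the control flow described above (the enum and queue here are stand-ins for illustration, not the actual LocalizerRunner/LocalizerAction API):
{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: when no pending resource is left to hand out,
// answer DIE instead of LIVE.
public class LocalizerActionSketch {
  enum Action { LIVE, DIE }

  static Action nextAction(Queue<String> pending) {
    String next = pending.poll();   // models findNextResource(): null when nothing is left
    return next == null ? Action.DIE : Action.LIVE;
  }

  public static void main(String[] args) {
    Queue<String> pending = new ArrayDeque<>();
    pending.add("job.jar");
    System.out.println(nextAction(pending)); // LIVE: one resource handed out
    System.out.println(nextAction(pending)); // DIE: nothing left to localize
  }
}
{code}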



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3362) Add node label usage in RM CapacityScheduler web UI

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376563#comment-14376563
 ] 

Naganarasimha G R commented on YARN-3362:
-

Thanks for the feedback [~leftnoteasy], 
bq. different labels under same queue can have different 
user-limit/capacity/maximum-capacity/max-am-resource, etc.
If this is the case then the approach you specified makes sense, but by "can" do you 
mean it is not there currently and could come in the future?

Beyond the repeated info, another drawback I can see: suppose for a particular 
label the user limit is not reached, but overall at the queue level the user has 
reached his limit; it will be difficult for the user to go through all labels and 
find out whether the queue limit has been reached. Correct me if my understanding 
of this is wrong.


 Add node label usage in RM CapacityScheduler web UI
 ---

 Key: YARN-3362
 URL: https://issues.apache.org/jira/browse/YARN-3362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager, webapp
Reporter: Wangda Tan
Assignee: Naganarasimha G R

 We don't show node label usage in the RM CapacityScheduler web UI now; without 
 this, it is hard for users to understand what happened to nodes that have labels 
 assigned to them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376446#comment-14376446
 ] 

Zhijie Shen commented on YARN-3034:
---

bq. so i think its not an incompatible change. Please provide your opinion on 
the same.

Sorry, I missed that piece.

bq. IIUC SystemMetricsPublisher.publish*Event methods can determine which 
version of ATS to publish and can post it accordingly ?

I meant that in the current approach SystemMetricsPublisher can be self-contained. 
RMTimelineCollector can be a private member of SystemMetricsPublisher, 
constructed and started there. It doesn't need to be visible in the RM and its 
context objects. 

bq. we might not require much of the functionality of SystemMetricsPublisher 
and it will be just delegating the calls to RMTimelineCollector.

I'm not sure if there has been previous discussion about the way for the RM to put 
entities, but this approach sounds cleaner, and in this approach I don't think 
we should couple RMTimelineCollector and SystemMetricsPublisher. Keeping 
SystemMetricsPublisher separate, we can easily deprecate and even remove it 
from the code base later. Moreover, we can keep the existing config as it 
is now and create a new config to control starting the v2 RM data-writing stack.

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376447#comment-14376447
 ] 

zhihai xu commented on YARN-3336:
-

Thanks [~cnauroth] for valuable feedback and committing the patch! Greatly 
appreciated.

 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Fix For: 2.7.0

 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch, YARN-3336.004.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
             UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
                 UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happens when FileSystem.get(getConfig()) is called with a new 
 proxy user, because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) => 
 FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
 public static UserGroupInformation createProxyUser(String user,
     UserGroupInformation realUser) {
   if (user == null || user.isEmpty()) {
     throw new IllegalArgumentException("Null user");
   }
   if (realUser == null) {
     throw new IllegalArgumentException("Null real user");
   }
   Subject subject = new Subject();
   Set<Principal> principals = subject.getPrincipals();
   principals.add(new User(user));
   principals.add(new RealUser(realUser));
   UserGroupInformation result = new UserGroupInformation(subject);
   result.setAuthenticationMethod(AuthenticationMethod.PROXY);
   return result;
 }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
     authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key) obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.
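
For reference, one common way to keep such proxy-user entries from piling up is to release the proxy user's cached FileSystem instances once the tokens have been obtained. A minimal sketch, without asserting that this is the fix committed for this JIRA:
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Minimal sketch, intended to be called after the doAs() above has returned the
// tokens. FileSystem.closeAllForUGI is an existing Hadoop API; whether it is the
// exact fix applied for this JIRA is not asserted here.
class ProxyFsCleanup {
  static void release(UserGroupInformation proxyUser) throws IOException {
    FileSystem.closeAllForUGI(proxyUser);
  }
}
{code}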



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3241) Leading space, trailing space and empty sub queue name may cause MetricsException for fair scheduler

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376559#comment-14376559
 ] 

Karthik Kambatla commented on YARN-3241:


+1. Checking this in. 

 Leading space, trailing space and empty sub queue name may cause 
 MetricsException for fair scheduler
 

 Key: YARN-3241
 URL: https://issues.apache.org/jira/browse/YARN-3241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-3241.000.patch, YARN-3241.001.patch, 
 YARN-3241.002.patch


 Leading space, trailing space and empty sub queue names may cause a 
 MetricsException (Metrics source XXX already exists!) when adding an application 
 to the FairScheduler.
 The reason is that QueueMetrics parses the queue name differently from the 
 QueueManager.
 QueueMetrics uses Q_SPLITTER to parse the queue name; it removes leading and 
 trailing spaces in sub queue names and also drops empty sub queue names.
 {code}
   static final Splitter Q_SPLITTER =
       Splitter.on('.').omitEmptyStrings().trimResults();
 {code}
 But QueueManager won't remove leading spaces, trailing spaces or empty sub 
 queue names.
 This causes FSQueue and FSQueueMetrics to get out of sync:
 QueueManager thinks the two queue names are different, so it tries to create a 
 new queue, but FSQueueMetrics treats them as the same queue, which produces the 
 Metrics source XXX already exists! MetricsException.
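
To make the mismatch concrete, a small illustration of how Guava's splitter normalizes a name that QueueManager would treat verbatim (the queue string here is only an example):
{code}
import java.util.List;

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

public class QueueNameSplitDemo {
  static final Splitter Q_SPLITTER =
      Splitter.on('.').omitEmptyStrings().trimResults();

  public static void main(String[] args) {
    // QueueMetrics-style parsing trims spaces and drops empty parts...
    List<String> parts = Lists.newArrayList(Q_SPLITTER.split("root. q1 ..q2"));
    System.out.println(parts);  // prints [root, q1, q2]
    // ...while a raw string comparison still treats "root. q1 ..q2" as its own
    // distinct queue name.
  }
}
{code}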



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.

2015-03-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375465#comment-14375465
 ] 

zhihai xu commented on YARN-3363:
-

I uploaded a new patch YARN-3363.000.patch for review.

 add localization and container launch time to ContainerMetrics at NM to show 
 these timing information for each active container.
 

 Key: YARN-3363
 URL: https://issues.apache.org/jira/browse/YARN-3363
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu
  Labels: metrics, supportability
 Attachments: YARN-3363.000.patch


 add localization and container launch time to ContainerMetrics at NM to show 
 these timing information for each active container.
 Currently ContainerMetrics has container's actual memory usage(YARN-2984),  
 actual CPU usage(YARN-3122), resource  and pid(YARN-3022). It will be better 
 to have localization and container launch time in ContainerMetrics for each 
 active container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread sandflee (JIRA)
sandflee created YARN-3387:
--

 Summary: container complete message couldn't pass to am if am 
restarted and rm changed
 Key: YARN-3387
 URL: https://issues.apache.org/jira/browse/YARN-3387
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: sandflee


Suppose AM work-preserving restart and RM HA are enabled.
The container complete message is passed to appAttempt.justFinishedContainers in 
the RM. Normally, all attempts of one app share the same justFinishedContainers, 
but after an RM failover every attempt has its own justFinishedContainers, so in 
the situation below the container complete message can't be passed to the AM:
1, the AM restarts
2, the RM changes
3, a container launched by the first AM completes
The container complete message will be passed to appAttempt1, not appAttempt2, 
but the AM pulls finished containers from appAttempt2 (currentAppAttempt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375493#comment-14375493
 ] 

Hadoop QA commented on YARN-3363:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706461/YARN-3363.000.patch
  against trunk revision 0b9f12c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7070//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7070//console

This message is automatically generated.

 add localization and container launch time to ContainerMetrics at NM to show 
 these timing information for each active container.
 

 Key: YARN-3363
 URL: https://issues.apache.org/jira/browse/YARN-3363
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu
  Labels: metrics, supportability
 Attachments: YARN-3363.000.patch


 add localization and container launch time to ContainerMetrics at NM to show 
 these timing information for each active container.
 Currently ContainerMetrics has container's actual memory usage(YARN-2984),  
 actual CPU usage(YARN-3122), resource  and pid(YARN-3022). It will be better 
 to have localization and container launch time in ContainerMetrics for each 
 active container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376749#comment-14376749
 ] 

Sangjin Lee commented on YARN-3040:
---

Thanks [~zjshen] for the updated patch!

I am comfortable with continuing to work on the flow-related items in the 
separate JIRA. I'll jot down the key points in that JIRA shortly.

I went over the latest patch, and overall it looks good. I do have a few 
comments:

(AppLevelTimelineCollector.java)
{code}
50protected void serviceInit(Configuration conf) throws Exception {
51  context.setClusterId(conf.get(YarnConfiguration.RM_CLUSTER_ID,
52  YarnConfiguration.DEFAULT_RM_CLUSTER_ID));
53  
context.setUserId(UserGroupInformation.getCurrentUser().getShortUserName());
54  
context.setFlowId(TimelineUtils.generateDefaultFlowIdBasedOnAppId(appId));
55  context.setFlowRunId(0);
56  context.setAppId(appId.toString());
{code}

I'm not sure of these set calls. Are these here just to initialize the context 
to default values? For example, UGI.getCurrentUser().getShortUserName() would 
return the user under which the daemon was started (whether it is NM or a 
standalone daemon) in case of a per-node daemon, which is highly likely to be 
incorrect. Do we need to bother setting default values if they are going to be 
incorrect anyway, for example, for user?

At minimum, it would be helpful to have a comment here why this is being done.

(AMLauncher.java)
- Do we need to be case-insensitive here? I think we can be strict about the 
tag names?
- You might want to be a bit defensive about the tag not carrying any value (e.g. 
TIMELINE_FLOW_ID_TAG:). If the value is empty, tag.substring() would throw an 
IndexOutOfBoundsException.
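
For illustration, a defensive variant of the tag parsing being discussed; the prefix constant and the "prefix:value" tag format are assumptions for the sketch, not the patch's exact code:
{code}
// Hedged sketch: TIMELINE_FLOW_ID_TAG and the "prefix:value" tag format are
// assumptions for illustration only.
String prefix = TIMELINE_FLOW_ID_TAG + ":";
if (tag.startsWith(prefix) && tag.length() > prefix.length()) {
  String flowId = tag.substring(prefix.length());
  // use flowId; the exact-case, non-empty match avoids the
  // IndexOutOfBoundsException mentioned above.
}
{code}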

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch, YARN-3040.3.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376575#comment-14376575
 ] 

Zhijie Shen commented on YARN-3034:
---

bq. then only way RMTimelineCollector can be invoked is through 
SystemMetricsPublisher's (SMP) public methods

Oh, probably I misunderstood your intention. I used to think this is the way 
you want to put the data into RMTimelineCollector. So in this case, we could put 
RMTimelineCollector inside SystemMetricsPublisher, and wherever we invoke the 
timeline client, we call RMTimelineCollector for v2.

According to this comments, it seems that you want to create a separate stack 
to put entities into RMTimelineCollector, right? If so, the current design 
makes sense.

bq. So in NM side too we req a configuration but we cannot use the existing one

I meant we keep {{yarn.resourcemanager.system-metrics-publisher.enabled}} for 
v1 SystemMetricsPublisher. For v2, both RM and NM read 
{{yarn.system-metrics-publisher.enabled}}? No need to have a v1/v2 flag?
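
In code terms, the split being proposed would look roughly like this; note that the v2 key is only the name suggested in this comment, not an existing YarnConfiguration constant, and {{conf}} is assumed to be a Hadoop Configuration:
{code}
// Hedged sketch; "yarn.system-metrics-publisher.enabled" is only the key name
// proposed in this comment, not a released configuration property.
boolean v1Enabled = conf.getBoolean(
    "yarn.resourcemanager.system-metrics-publisher.enabled", false);
boolean v2Enabled = conf.getBoolean(
    "yarn.system-metrics-publisher.enabled", false);
{code}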

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3241) FairScheduler handles invalid queue names inconsistently

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376577#comment-14376577
 ] 

Hudson commented on YARN-3241:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7406 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7406/])
YARN-3241. FairScheduler handles invalid queue names inconsistently. (Zhihai Xu 
via kasha) (kasha: rev 2bc097cd14692e6ceb06bff959f28531534eb307)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationFileLoaderService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestAllocationFileLoaderService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/InvalidQueueNameException.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestQueueManager.java


 FairScheduler handles invalid queue names inconsistently
 --

 Key: YARN-3241
 URL: https://issues.apache.org/jira/browse/YARN-3241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-3241.000.patch, YARN-3241.001.patch, 
 YARN-3241.002.patch


 Leading space, trailing space and empty sub queue names may cause a 
 MetricsException (Metrics source XXX already exists!) when adding an application 
 to the FairScheduler.
 The reason is that QueueMetrics parses the queue name differently from the 
 QueueManager.
 QueueMetrics uses Q_SPLITTER to parse the queue name; it removes leading and 
 trailing spaces in sub queue names and also drops empty sub queue names.
 {code}
   static final Splitter Q_SPLITTER =
       Splitter.on('.').omitEmptyStrings().trimResults();
 {code}
 But QueueManager won't remove leading spaces, trailing spaces or empty sub 
 queue names.
 This causes FSQueue and FSQueueMetrics to get out of sync:
 QueueManager thinks the two queue names are different, so it tries to create a 
 new queue, but FSQueueMetrics treats them as the same queue, which produces the 
 Metrics source XXX already exists! MetricsException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app

2015-03-23 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376670#comment-14376670
 ] 

Li Lu commented on YARN-3390:
-

Hi [~zjshen], could you please confirm whether this JIRA will also block all 
storage layer implementations, or whether we can proceed after YARN-3040 is in? Thanks! 

 RMTimelineCollector should have the context info of each app
 

 Key: YARN-3390
 URL: https://issues.apache.org/jira/browse/YARN-3390
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 RMTimelineCollector should have the context info of each app whose entity  
 has been put



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2868) FairScheduler: Metric for latency to allocate first container for an application

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376672#comment-14376672
 ] 

Hudson commented on YARN-2868:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7407 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7407/])
YARN-2868. FairScheduler: Metric for latency to allocate first container for an 
application. (Ray Chiang via kasha) (kasha: rev 
972f1f1ab94a26ec446a272ad030fe13f03ed442)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* hadoop-yarn-project/CHANGES.txt


 FairScheduler: Metric for latency to allocate first container for an 
 application
 

 Key: YARN-2868
 URL: https://issues.apache.org/jira/browse/YARN-2868
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: metrics, supportability
 Fix For: 2.8.0

 Attachments: YARN-2868-01.patch, YARN-2868.002.patch, 
 YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, 
 YARN-2868.006.patch, YARN-2868.007.patch, YARN-2868.008.patch, 
 YARN-2868.009.patch, YARN-2868.010.patch, YARN-2868.011.patch, 
 YARN-2868.012.patch


 Add a metric to measure the latency between starting container allocation 
 and first container actually allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable

2015-03-23 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3392:
---

 Summary: Change NodeManager metrics to not populate resource usage 
metrics if they are unavailable 
 Key: YARN-3392
 URL: https://issues.apache.org/jira/browse/YARN-3392
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3393) Getting application(s) goes wrong when app finishes before starting the attempt

2015-03-23 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3393:
-

 Summary: Getting application(s) goes wrong when app finishes 
before starting the attempt
 Key: YARN-3393
 URL: https://issues.apache.org/jira/browse/YARN-3393
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Critical


When generating app report in ApplicationHistoryManagerOnTimelineStore, it 
checks if appAttempt == null.
{code}
ApplicationAttemptReport appAttempt = 
getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId());
if (appAttempt != null) {
  app.appReport.setHost(appAttempt.getHost());
  app.appReport.setRpcPort(appAttempt.getRpcPort());
  app.appReport.setTrackingUrl(appAttempt.getTrackingUrl());
  
app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl());
}
{code}

However, {{getApplicationAttempt}} doesn't return null but throws 
ApplicationAttemptNotFoundException:
{code}
if (entity == null) {
  throw new ApplicationAttemptNotFoundException(
      "The entity for application attempt " + appAttemptId +
      " doesn't exist in the timeline store");
} else {
  return convertToApplicationAttemptReport(entity);
}
{code}
The code isn't coupled well.
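
One way to reconcile the two snippets above, sketched here without claiming it is the eventual fix, is to catch the exception at the call site and fall back to the default report fields:
{code}
// Hedged sketch of one possible handling, mirroring the code quoted above.
ApplicationAttemptReport appAttempt = null;
try {
  appAttempt =
      getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId());
} catch (ApplicationAttemptNotFoundException e) {
  // The app finished before any attempt entity was written; leave the
  // host/port/tracking URL fields of the report at their defaults.
}
if (appAttempt != null) {
  app.appReport.setHost(appAttempt.getHost());
  app.appReport.setRpcPort(appAttempt.getRpcPort());
  app.appReport.setTrackingUrl(appAttempt.getTrackingUrl());
  app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl());
}
{code}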



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)

2015-03-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376605#comment-14376605
 ] 

Wangda Tan commented on YARN-2495:
--

Hmm.. {{StringArrayProto.stringElement -> elements}} is still not changed in the 
latest patch, could you take a look again?
I meant to remove the string prefix, since StringArrayProto already 
indicates that. Beyond that, the patch LGTM.

 Allow admin specify labels from each NM (Distributed configuration)
 ---

 Key: YARN-2495
 URL: https://issues.apache.org/jira/browse/YARN-2495
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Naganarasimha G R
 Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, 
 YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, 
 YARN-2495.20141119-1.patch, YARN-2495.20141126-1.patch, 
 YARN-2495.20141204-1.patch, YARN-2495.20141208-1.patch, 
 YARN-2495.20150305-1.patch, YARN-2495.20150309-1.patch, 
 YARN-2495.20150318-1.patch, YARN-2495.20150320-1.patch, 
 YARN-2495.20150321-1.patch, YARN-2495_20141022.1.patch


 Target of this JIRA is to allow admin specify labels in each NM, this covers
 - User can set labels in each NM (by setting yarn-site.xml (YARN-2923) or 
 using script suggested by [~aw] (YARN-2729) )
 - NM will send labels to RM via ResourceTracker API
 - RM will set labels in NodeLabelManager when NM register/update labels



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3383) AdminService should use warn instead of info to log exception when operation fails

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376629#comment-14376629
 ] 

Hadoop QA commented on YARN-3383:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12706096/YARN-3383-032015.patch
  against trunk revision 2bc097c.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7079//console

This message is automatically generated.

 AdminService should use warn instead of info to log exception when 
 operation fails
 --

 Key: YARN-3383
 URL: https://issues.apache.org/jira/browse/YARN-3383
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Li Lu
 Attachments: YARN-3383-032015.patch


 Now it uses info:
 {code}
   private YarnException logAndWrapException(IOException ioe, String user,
       String argName, String msg) throws YarnException {
     LOG.info("Exception " + msg, ioe);
 {code}
 But it should use warn instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2868) FairScheduler: Metric for latency to allocate first container for an application

2015-03-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2868:
---
Summary: FairScheduler: Metric for latency to allocate first container for 
an application  (was: Add metric for initial container launch time to 
FairScheduler)

 FairScheduler: Metric for latency to allocate first container for an 
 application
 

 Key: YARN-2868
 URL: https://issues.apache.org/jira/browse/YARN-2868
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: metrics, supportability
 Attachments: YARN-2868-01.patch, YARN-2868.002.patch, 
 YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, 
 YARN-2868.006.patch, YARN-2868.007.patch, YARN-2868.008.patch, 
 YARN-2868.009.patch, YARN-2868.010.patch, YARN-2868.011.patch, 
 YARN-2868.012.patch


 Add a metric to measure the latency between starting container allocation 
 and first container actually allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376695#comment-14376695
 ] 

Anubhav Dhoot commented on YARN-3304:
-

Hi [~djp] [~vinodkv],

If we use a default of zero we cannot distinguish the unavailable case from 
genuinely zero usage.
That would make the future improvement to handle the unavailable case nearly 
impossible to do.
I propose we make all the defaults consistently -1. 
I can also fix the metrics to use this to track the unavailable case. Opened 
YARN-3392 for that
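
As a sketch of the convention proposed here (assuming -1 is the agreed sentinel), callers would be able to skip publishing rather than report a bogus zero; the publish call below is a placeholder, not an existing NM method:
{code}
// Hedged sketch of the proposed convention: -1 means "unavailable", so callers
// can distinguish it from a genuine reading of 0. publishCpuGauge(...) is a
// placeholder, not an existing NM method.
static final int UNAVAILABLE = -1;

void maybePublishCpu(ResourceCalculatorProcessTree processTree) {
  float cpuUsagePercent = processTree.getCpuUsagePercent();
  if (cpuUsagePercent != UNAVAILABLE) {
    publishCpuGauge(cpuUsagePercent);  // only publish when a real value exists
  }
}
{code}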




 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for 
 unavailable case while other resource metrics are return 0 in the same case 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage

2015-03-23 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3391:
--
Description: 
To continue the discussion in YARN-3040, let's figure out the best way to 
describe the flow.

Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into 
the collector and to the storage eventually?
- Flow run id should be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. client did not 
set it)


  was:To continue the discussion in YARN-3040, let's figure out the best way to 
describe the flow.


 Clearly define flow ID/ flow run / flow version in API and storage
 --

 Key: YARN-3391
 URL: https://issues.apache.org/jira/browse/YARN-3391
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 To continue the discussion in YARN-3040, let's figure out the best way to 
 describe the flow.
 Some key issues that we need to conclude on:
 - How do we include the flow version in the context so that it gets passed 
 into the collector and to the storage eventually?
 - Flow run id should be a number as opposed to a generic string?
 - Default behavior for the flow run id if it is missing (i.e. client did not 
 set it)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-03-23 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated YARN-3021:

Attachment: YARN-3021.006.patch

 YARN's delegation-token handling disallows certain trust setups to operate 
 properly over DistCp
 ---

 Key: YARN-3021
 URL: https://issues.apache.org/jira/browse/YARN-3021
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.3.0
Reporter: Harsh J
Assignee: Yongjun Zhang
 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
 YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
 YARN-3021.006.patch, YARN-3021.patch


 Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, 
 and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN 
 clusters.
 Now if one logs in with a COMMON credential, and runs a job on A's YARN that 
 needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
 as it attempts a renewDelegationToken(…) synchronously during application 
 submission (to validate the managed token before it adds it to a scheduler 
 for automatic renewal). The call obviously fails because realm B will not trust 
 A's credentials (here, the RM's principal is the renewer).
 In the 1.x JobTracker the same call is present, but it is done asynchronously 
 and once the renewal attempt failed we simply ceased to schedule any further 
 attempts of renewals, rather than fail the job immediately.
 We should change the logic such that we attempt the renewal but go easy on 
 the failure and skip the scheduling alone, rather than bubble back an error 
 to the client, failing the app submission. This way the old behaviour is 
 retained.
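
A minimal sketch of the proposed behaviour, assuming placeholder method names rather than the actual DelegationTokenRenewer API:
{code}
import java.io.IOException;
import org.apache.hadoop.security.token.Token;

// Hedged sketch of the behaviour proposed above: try the renewal, but on failure
// only skip scheduling further renewals instead of failing app submission.
// renewToken(...), scheduleRenewal(...) and LOG are placeholders, not actual
// DelegationTokenRenewer members.
void handleAppToken(Token<?> token) {
  try {
    renewToken(token);       // validate that this RM can renew the token at all
    scheduleRenewal(token);  // only schedule automatic renewal if validation worked
  } catch (IOException e) {
    LOG.warn("Cannot renew token " + token
        + " (possibly a one-way cross-realm trust); skipping automatic renewal"
        + " instead of failing the submission", e);
  }
}
{code}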



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2868) Add metric for initial container launch time to FairScheduler

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376633#comment-14376633
 ] 

Karthik Kambatla commented on YARN-2868:


+1, checking this in. 

 Add metric for initial container launch time to FairScheduler
 -

 Key: YARN-2868
 URL: https://issues.apache.org/jira/browse/YARN-2868
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: metrics, supportability
 Attachments: YARN-2868-01.patch, YARN-2868.002.patch, 
 YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, 
 YARN-2868.006.patch, YARN-2868.007.patch, YARN-2868.008.patch, 
 YARN-2868.009.patch, YARN-2868.010.patch, YARN-2868.011.patch, 
 YARN-2868.012.patch


 Add a metric to measure the latency between starting container allocation 
 and first container actually allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle

2015-03-23 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376719#comment-14376719
 ] 

Li Lu commented on YARN-3047:
-

Hi [~varun_saxena], thanks for the new patch. Could you please elaborate more 
about which exact comment will be addressed in YARN-3051? Thanks! BTW, in 003 
patch I can still see TimelineEvents.java. Do we still need that? 

 [Data Serving] Set up ATS reader with basic request serving structure and 
 lifecycle
 ---

 Key: YARN-3047
 URL: https://issues.apache.org/jira/browse/YARN-3047
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Varun Saxena
 Attachments: YARN-3047.001.patch, YARN-3047.003.patch, 
 YARN-3047.02.patch


 Per design in YARN-2938, set up the ATS reader as a service and implement the 
 basic structure as a service. It includes lifecycle management, request 
 serving, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3241) FairScheduler handles invalid queue names inconsistently

2015-03-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376737#comment-14376737
 ] 

zhihai xu commented on YARN-3241:
-

Thanks [~kasha] for valuable feedback and committing the patch!

 FairScheduler handles invalid queue names inconsistently
 --

 Key: YARN-3241
 URL: https://issues.apache.org/jira/browse/YARN-3241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3241.000.patch, YARN-3241.001.patch, 
 YARN-3241.002.patch


 Leading space, trailing space and empty sub queue names may cause a 
 MetricsException (Metrics source XXX already exists!) when adding an application 
 to the FairScheduler.
 The reason is that QueueMetrics parses the queue name differently from the 
 QueueManager.
 QueueMetrics uses Q_SPLITTER to parse the queue name; it removes leading and 
 trailing spaces in sub queue names and also drops empty sub queue names.
 {code}
   static final Splitter Q_SPLITTER =
       Splitter.on('.').omitEmptyStrings().trimResults();
 {code}
 But QueueManager won't remove leading spaces, trailing spaces or empty sub 
 queue names.
 This causes FSQueue and FSQueueMetrics to get out of sync:
 QueueManager thinks the two queue names are different, so it tries to create a 
 new queue, but FSQueueMetrics treats them as the same queue, which produces the 
 Metrics source XXX already exists! MetricsException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3387:
---
Priority: Critical  (was: Major)
Target Version/s: 2.7.0

 container complete message couldn't pass to am if am restarted and rm changed
 -

 Key: YARN-3387
 URL: https://issues.apache.org/jira/browse/YARN-3387
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: sandflee
Priority: Critical

 Suppose AM work-preserving restart and RM HA are enabled.
 The container complete message is passed to appAttempt.justFinishedContainers in 
 the RM. Normally, all attempts of one app share the same justFinishedContainers, 
 but after an RM failover every attempt has its own justFinishedContainers, so in 
 the situation below the container complete message can't be passed to the AM:
 1, the AM restarts
 2, the RM changes
 3, a container launched by the first AM completes
 The container complete message will be passed to appAttempt1, not appAttempt2, 
 but the AM pulls finished containers from appAttempt2 (currentAppAttempt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376640#comment-14376640
 ] 

Karthik Kambatla commented on YARN-3387:


Does this imply our work-preserving AM restart is broken on a RM failover? 

 container complete message couldn't pass to am if am restarted and rm changed
 -

 Key: YARN-3387
 URL: https://issues.apache.org/jira/browse/YARN-3387
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: sandflee
Priority: Critical

 Suppose AM work-preserving restart and RM HA are enabled.
 The container complete message is passed to appAttempt.justFinishedContainers in 
 the RM. Normally, all attempts of one app share the same justFinishedContainers, 
 but after an RM failover every attempt has its own justFinishedContainers, so in 
 the situation below the container complete message can't be passed to the AM:
 1, the AM restarts
 2, the RM changes
 3, a container launched by the first AM completes
 The container complete message will be passed to appAttempt1, not appAttempt2, 
 but the AM pulls finished containers from appAttempt2 (currentAppAttempt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-3040:
--
Attachment: YARN-3040.3.patch

Uploaded a new patch to address the comments so far. The notable changes in this 
patch are removing the timestamp suffix and adding a default for RM_CLUSTER_ID, 
such that the ID won't change across RM restarts or failover.

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch, YARN-3040.3.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3386) Cgroups feature should work with default hierarchy settings of CentOS 7

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376645#comment-14376645
 ] 

Karthik Kambatla commented on YARN-3386:


YARN-2194 seems to imply there are more changes required for cgroups to work 
with RHEL/CentOS 7. Should this be marked as a duplicate of the other? 

 Cgroups feature should work with default hierarchy settings of CentOS 7
 ---

 Key: YARN-3386
 URL: https://issues.apache.org/jira/browse/YARN-3386
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki

 The path found by CgroupsLCEResourcesHandler#parseMtab contains a comma and 
 results in a failure of container-executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3383) AdminService should use warn instead of info to log exception when operation fails

2015-03-23 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-3383:

Attachment: YARN-3383-032315.patch

Rebase the patch with the latest trunk. 

 AdminService should use warn instead of info to log exception when 
 operation fails
 --

 Key: YARN-3383
 URL: https://issues.apache.org/jira/browse/YARN-3383
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Li Lu
 Attachments: YARN-3383-032015.patch, YARN-3383-032315.patch


 Now it uses info:
 {code}
   private YarnException logAndWrapException(IOException ioe, String user,
       String argName, String msg) throws YarnException {
     LOG.info("Exception " + msg, ioe);
 {code}
 But it should use warn instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly

2015-03-23 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot reassigned YARN-2605:
---

Assignee: Anubhav Dhoot

 [RM HA] Rest api endpoints doing redirect incorrectly
 -

 Key: YARN-2605
 URL: https://issues.apache.org/jira/browse/YARN-2605
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: bc Wong
Assignee: Anubhav Dhoot
  Labels: newbie

 The standby RM's webui tries to do a redirect via meta-refresh. That is fine 
 for pages designed to be viewed by web browsers. But the API endpoints 
 shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd 
 suggest HTTP 303, or return a well-defined error message (json or xml) 
 stating that the standby status and a link to the active RM.
 The standby RM is returning this today:
 {noformat}
 $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics
 HTTP/1.1 200 OK
 Cache-Control: no-cache
 Expires: Thu, 25 Sep 2014 18:34:53 GMT
 Date: Thu, 25 Sep 2014 18:34:53 GMT
 Pragma: no-cache
 Expires: Thu, 25 Sep 2014 18:34:53 GMT
 Date: Thu, 25 Sep 2014 18:34:53 GMT
 Pragma: no-cache
 Content-Type: text/plain; charset=UTF-8
 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
 Content-Length: 117
 Server: Jetty(6.1.26)
 This is standby RM. Redirecting to the current active RM: 
 http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
 {noformat}
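
For illustration, a minimal servlet-level sketch of the suggested 303 behaviour; "activeRmUrl" is a placeholder for however the active RM address would be resolved, and this is not the actual RM webapp code:
{code}
import javax.servlet.http.HttpServletResponse;

// Hedged sketch of the suggested behaviour for API endpoints on a standby RM:
// answer with HTTP 303 plus a Location header instead of a meta-refresh page.
final class StandbyRedirect {
  static void toActiveRm(HttpServletResponse response, String activeRmUrl) {
    response.setStatus(HttpServletResponse.SC_SEE_OTHER);  // 303 See Other
    response.setHeader("Location", activeRmUrl);
  }
}
{code}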



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage

2015-03-23 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3391:
--
Description: 
To continue the discussion in YARN-3040, let's figure out the best way to 
describe the flow.

Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into 
the collector and to the storage eventually?
- Flow run id should be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. client did not 
set it)
- How do we handle flow attributes in case of nested levels of flows?


  was:
To continue the discussion in YARN-3040, let's figure out the best way to 
describe the flow.

Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into 
the collector and to the storage eventually?
- Flow run id should be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. client did not 
set it)



 Clearly define flow ID/ flow run / flow version in API and storage
 --

 Key: YARN-3391
 URL: https://issues.apache.org/jira/browse/YARN-3391
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 To continue the discussion in YARN-3040, let's figure out the best way to 
 describe the flow.
 Some key issues that we need to conclude on:
 - How do we include the flow version in the context so that it gets passed 
 into the collector and to the storage eventually?
 - Flow run id should be a number as opposed to a generic string?
 - Default behavior for the flow run id if it is missing (i.e. client did not 
 set it)
 - How do we handle flow attributes in case of nested levels of flows?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376229#comment-14376229
 ] 

Naganarasimha G R commented on YARN-3034:
-

Thanks [~sjlee0]  [~djp] for the reviews, 
{{so I still suggest to add some check and warning here}}: well, currently 
I have logged a warning message, {{RMTimelineCollector has not been 
configured to publish System Metrics in ATS V2}}, if it is not configured to 
publish system metrics for ATS v2. Will that suffice?
bq. Zhijie Shen, can we put that work on your patch in YARN-3040? Or you 
suggest something else?
We can do it in 2 ways:
* as Zhijie suggested 
[earlier|https://issues.apache.org/jira/browse/YARN-3034?focusedCommentId=14372342page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14372342],
 we can handle it in a separate jira
* we can handle it as part of YARN-3044 (which I am working on)

I would prefer the former as it would be simpler to review. Please 
provide your opinion.
 

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-03-23 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3136:
--
Attachment: 0008-YARN-3136.patch

 getTransferredContainers can be a bottleneck during AM registration
 ---

 Key: YARN-3136
 URL: https://issues.apache.org/jira/browse/YARN-3136
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Sunil G
 Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch


 While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
 stuck waiting for the scheduler lock trying to call getTransferredContainers. 
  The scheduler lock is highly contended, especially on a large cluster with 
 many nodes heartbeating, and it would be nice if we could find a way to 
 eliminate the need to grab this lock during this call.  We've already done 
 similar work during AM allocate calls to make sure they don't needlessly grab 
 the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376228#comment-14376228
 ] 

Zhijie Shen commented on YARN-3034:
---

Let me elaborate my previous comments. In YARN-3040, I'm working on the issue 
to make the context info available in app-level collector, such that when we 
use timeline client to put entity inside AM and NM, the entity will be 
automatically associated to this context.

This jira is to create the RM collector. To achieve the same thing, the RM 
collector should have the context info available too. The RM has all this 
information (it should be inside RMApp), so the RM collector needs to make sure 
it is available in some way when putting an entity. I'm okay if you want to 
exclude this work here, and I'll file a separate jira for it. However, I want to 
exclude it from YARN-3040 to prevent the patch there growing even bigger. That 
one is required to unblock the frameworks to write their specific data, and I 
wish it could get in asap.

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3390) RMTimelineCollector should have the context info of each app

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376323#comment-14376323
 ] 

Naganarasimha G R commented on YARN-3390:
-

Hi [~zjshen], 
shall I work on this jira, as I can utilize the same in YARN-3044?

 RMTimelineCollector should have the context info of each app
 

 Key: YARN-3390
 URL: https://issues.apache.org/jira/browse/YARN-3390
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 RMTimelineCollector should have the context info of each app whose entity  
 has been put



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3390) RMTimelineCollector should have the context info of each app

2015-03-23 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3390:
-

 Summary: RMTimelineCollector should have the context info of each 
app
 Key: YARN-3390
 URL: https://issues.apache.org/jira/browse/YARN-3390
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


RMTimelineCollector should have the context info of each app whose entity  has 
been put



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376280#comment-14376280
 ] 

Sangjin Lee commented on YARN-3040:
---

{quote}
I can understand this particular case described above. Like my prior comment 
about flow run ID, my concern is whether flow/version/run's explicit hierarchy 
is so general to capture most use cases. IMHO, by nature, the hierarchy is the 
tree of flows, and a flow can be the flow of flows or the flow of apps. 
However, if other users just want to use one level of flow, version/run info 
seems to be redundant. On the other side, if use the flow recursion structure, 
it's elastic to have flow levels from one to many. We can treat the first level 
as the flow, the second as version and third and run. I don't have expertise 
knowledge about workflow such as Oozie, but just want to think out my concern 
loudly. That said, if flow/version/run is the general description of a flow, I 
agree we should pass in these three env vars together and separately.
{quote}

Agreed that we need to consider both use cases (single level and multi-level). 
I just want to clarify that even with one level of flows, it is possible (and 
in fact it is more common) that there are multiple runs for a given flow 
version, and multiple versions for a given flow name; e.g. foo.pig/v.1/1, 
foo.pig/v.1/2, ..., foo.pig/v.2/10, foo.pig/v.2/11, ...

Also, my mental model is that flow id/version/run-id is not a hierarchy. It's 
just a group of 3 attributes (although there is some implied contains 
relationship).

Also, when we store these 3 attributes in the storage, I suspect schemas like 
HBase/phoenix will probably make only the flow id (name) and the flow run id as 
part of the primary/row key, and store the flow version in a separate table.
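
For illustration only (not the actual schema), that would put just the flow name and run id into the row key and keep the version elsewhere; the separator and field order here are assumptions:
{code}
// Hedged sketch of the row-key idea mentioned above; separator and field order
// are assumptions, not the actual HBase/Phoenix schema.
String rowKey = clusterId + "!" + userId + "!" + flowName + "!" + flowRunId;
// flowVersion would be stored as a column value (or in a separate table)
// rather than as part of the key.
{code}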

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376400#comment-14376400
 ] 

Zhijie Shen commented on YARN-3040:
---

[~sjlee0], thanks for more comments, but would you mind continuing the flow 
attributes discussion in YARN-3391 to unblock this jira? In this jira, how 
about focusing on the data flow to passing this context info to the collector? 
For flow info, no matter what it should be specifically, this patch works out 
the path to collect it from user via application submission context and pass it 
to RM, NM and finally to the collector. If we're okay with this approach, it is 
easy for us to add new flow info or correct existing flow info later on. I 
filed YARN-3391 to fork the flow related discussion.

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376300#comment-14376300
 ] 

Hudson commented on YARN-3336:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7405 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7405/])
YARN-3336. FileSystem memory leak in DelegationTokenRenewer. (cnauroth: rev 
6ca1f12024fd7cec7b01df0f039ca59f3f365dc1)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java


 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Fix For: 2.7.0

 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch, YARN-3336.004.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
           UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
               UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) => 
 FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
   public static UserGroupInformation createProxyUser(String user,
       UserGroupInformation realUser) {
     if (user == null || user.isEmpty()) {
       throw new IllegalArgumentException("Null user");
     }
     if (realUser == null) {
       throw new IllegalArgumentException("Null real user");
     }
     Subject subject = new Subject();
     Set<Principal> principals = subject.getPrincipals();
     principals.add(new User(user));
     principals.add(new RealUser(realUser));
     UserGroupInformation result = new UserGroupInformation(subject);
     result.setAuthenticationMethod(AuthenticationMethod.PROXY);
     return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme()==null?"":uri.getScheme().toLowerCase();
     authority = uri.getAuthority()==null?"":uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key)obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.
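 One way to avoid growing the cache (a sketch only, not necessarily the fix 
 that was committed) is to obtain the tokens through an uncached FileSystem 
 instance and close it afterwards:
 {code}
 // Sketch: obtain tokens with an uncached FileSystem and close it afterwards,
 // so no per-proxy-user entry is left behind in FileSystem.CACHE.
 Token<?>[] newTokens = proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
   @Override
   public Token<?>[] run() throws Exception {
     FileSystem fs = FileSystem.newInstance(getConfig());  // bypasses the cache
     try {
       return fs.addDelegationTokens(
           UserGroupInformation.getLoginUser().getUserName(), credentials);
     } finally {
       fs.close();
     }
   }
 });
 {code}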



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376349#comment-14376349
 ] 

Naganarasimha G R commented on YARN-3034:
-

Thanks for your comments [~zjshen]
bq. RM_SYSTEM_METRICS_PUBLISHER_ENABLED -> SYSTEM_METRICS_PUBLISHER_ENABLED is 
an incompatible change.
I incorporated this based on [~vinodkv]'s 
[comment|https://issues.apache.org/jira/browse/YARN-3034?focusedCommentId=14360797page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14360797],
 and I have also added the old keys as part of {{addDeprecatedKeys}}, so I think 
it's not an incompatible change. Please share your opinion on the same.

bq. RMTimelineCollector doesn't need to be exposed to RM and its context. It 
seems to be enough to construct it inside SystemMetricsPublisher only.
IIUC, the SystemMetricsPublisher.publish*Event methods can determine which 
version of ATS to publish to and post accordingly?
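As a rough illustration of what I mean (all names below are placeholders, not from the patch):

{code}
// Sketch only; names are placeholders, not the actual patch. The idea:
// SystemMetricsPublisher keeps one publish entry point per event and the
// configured ATS version decides which sink receives the entity.
class MetricsPublisherSketch {
  private final boolean v1Enabled;
  private final boolean v2Enabled;

  MetricsPublisherSketch(boolean v1Enabled, boolean v2Enabled) {
    this.v1Enabled = v1Enabled;
    this.v2Enabled = v2Enabled;
  }

  void publishAppCreated(String appId, long createdTime) {
    if (v2Enabled) {
      postToV2Collector(appId, createdTime);   // e.g. an RMTimelineCollector path
    } else if (v1Enabled) {
      postToV1Client(appId, createdTime);      // existing TimelineClient path
    }
  }

  private void postToV2Collector(String appId, long createdTime) { /* ... */ }
  private void postToV1Client(String appId, long createdTime) { /* ... */ }
}
{code}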

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage

2015-03-23 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3391:
-

 Summary: Clearly define flow ID/ flow run / flow version in API 
and storage
 Key: YARN-3391
 URL: https://issues.apache.org/jira/browse/YARN-3391
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


To continue the discussion in YARN-3040, let's figure out the best way to 
describe the flow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3304:
-
Attachment: YARN-3304.patch

Deliver a quick patch to fix it, given this is a blocker for release.

 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for the 
 unavailable case while other resource metrics return 0 in the same case, 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376380#comment-14376380
 ] 

Hadoop QA commented on YARN-3136:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706581/0008-YARN-3136.patch
  against trunk revision 36af4a9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 14 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7076//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/7076//artifact/patchprocess/newPatchFindbugsWarningshadoop-sls.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/7076//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7076//console

This message is automatically generated.

 getTransferredContainers can be a bottleneck during AM registration
 ---

 Key: YARN-3136
 URL: https://issues.apache.org/jira/browse/YARN-3136
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Sunil G
 Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch


 While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
 stuck waiting for the scheduler lock trying to call getTransferredContainers. 
  The scheduler lock is highly contended, especially on a large cluster with 
 many nodes heartbeating, and it would be nice if we could find a way to 
 eliminate the need to grab this lock during this call.  We've already done 
 similar work during AM allocate calls to make sure they don't needlessly grab 
 the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376256#comment-14376256
 ] 

Zhijie Shen commented on YARN-3034:
---

Some comments about the patch:

1. RM_SYSTEM_METRICS_PUBLISHER_ENABLED -> SYSTEM_METRICS_PUBLISHER_ENABLED is 
an incompatible change.

2. RMTimelineCollector doesn't need to be exposed to RM and its context. It 
seems to be enough to construct it inside SystemMetricsPublisher only.

bq. I would prefer for the former one as it would be simpler to review. Please 
provide your opinion

I filed a separate Jira: YARN-3390

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376255#comment-14376255
 ] 

Sangjin Lee commented on YARN-3040:
---

bq. I can see the benefit. For example, if it represents the timestamp, we can 
filter the flow runs and say give me the runs in the last 5 mins. But my 
concern is whether it's the general way to let user to describe a run.

The design doc says the flow runs for a given flow must have unique and 
totally ordered run identifiers. We obviously had numbers in mind when we had 
that (mostly coming from the ease of sorting and ordering in the storage). And 
that's the convention we will push frameworks to use. I think it is important 
that we make it a number (long). However, there is a difference between having 
numbers as run id's and having timestamps as run id's. I don't think we need to 
go so far as requiring timestamps as run id's. As long as they are numbers, I 
think it would be fine. I can imagine some flows using run id's like 1, 2, 
...

We could allow any arbitrary scheme to generate the run id's, but the challenge 
is it might seriously hamper the ability to store and sort them efficiently. 
And, in most cases, the timestamp of the flow start is a quite natural scheme, 
and I would think most frameworks will just adopt that scheme. What do you 
think?

On a related note, we should also generate the default run id if it is missing. 
I realize this could be a bit tricky. If the flow id is also missing, then we're 
treating this single YARN app as a flow in and of itself. Then we can do 
flow/version/run id = (yarn app name)/(1)/(app submission timestamp). This is 
also mentioned in the design doc.

However, if the flow id is provided but not the flow run id, it can be tricky 
as there can be multiple YARN apps for the given flow run. One obvious solution 
might be to reject app submission if the flow client (not the timeline client) 
sets the flow id but not the flow run id. For that we'd need some kind of a 
common layer for checks. Thoughts?
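Roughly, something like this (the FlowContext holder here is just illustrative, not from the patch):

{code}
// Sketch only. Defaults follow the rule above: a lone YARN app becomes its own
// flow; a flow id without a run id is rejected.
class FlowContext {
  String flowId;
  String flowVersion;
  Long flowRunId;
}

class FlowDefaults {
  static void apply(FlowContext ctx, String appName, long submissionTime) {
    if (ctx.flowId == null) {
      ctx.flowId = appName;              // (yarn app name)
      ctx.flowVersion = "1";             // (1)
      ctx.flowRunId = submissionTime;    // (app submission timestamp)
    } else if (ctx.flowRunId == null) {
      // Ambiguous when a flow run spans multiple YARN apps; one option is to
      // reject the submission here.
      throw new IllegalArgumentException(
          "flow run id must be set when flow id is provided");
    }
  }
}
{code}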


 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376285#comment-14376285
 ] 

Sangjin Lee commented on YARN-3040:
---

{quote}
I can understand this particular case described above. Like my prior comment 
about flow run ID, my concern is whether flow/version/run's explicit hierarchy 
is so general to capture most use cases. IMHO, by nature, the hierarchy is the 
tree of flows, and a flow can be the flow of flows or the flow of apps. 
However, if other users just want to use one level of flow, version/run info 
seems to be redundant. On the other side, if use the flow recursion structure, 
it's elastic to have flow levels from one to many. We can treat the first level 
as the flow, the second as version and third and run. I don't have expertise 
knowledge about workflow such as Oozie, but just want to think out my concern 
loudly. That said, if flow/version/run is the general description of a flow, I 
agree we should pass in these three env vars together and separately.
{quote}

Agreed that we need to consider both use cases (single level and multi-level). 
I just want to clarify that even with one level of flows, it is possible (and 
in fact it is more common) that there are multiple runs for a given flow 
version, and multiple versions for a given flow name; e.g. foo.pig/v.1/1, 
foo.pig/v.1/2, ..., foo.pig/v.2/10, foo.pig/v.2/11, ...

Also, my mental model is that flow id/version/run-id is not a hierarchy. It's 
just a group of 3 attributes (although there is some implied contains 
relationship).

Also, when we store these 3 attributes in the storage, I suspect schemas like 
HBase/Phoenix will probably make only the flow id (name) and the flow run id 
part of the primary/row key, and store the flow version in a separate table.

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Attachments: YARN-3040.1.patch, YARN-3040.2.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376402#comment-14376402
 ] 

Naganarasimha G R commented on YARN-3044:
-

Hi [~zjshen] & [~sjlee0],
As part of this jira, I am planning to capture the following basic App and 
AppAttempt life cycle events in {{RMTimelineCollector}}:
* ApplicationCreated
* ApplicationFinished
* ApplicationACLsUpdated
* AppAttemptRegistered
* AppAttemptFinished

Apart from these, are there any other events you have thought about capturing 
(as I remember, Sangjin had mentioned somewhere capturing all the life cycle 
events/states)?
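For reference, a rough sketch of how those event types could be enumerated (illustrative only, not part of a patch):

{code}
// Illustrative only: candidate system metrics event types for RMTimelineCollector.
public enum RMTimelineEventType {
  APPLICATION_CREATED,
  APPLICATION_FINISHED,
  APPLICATION_ACLS_UPDATED,
  APP_ATTEMPT_REGISTERED,
  APP_ATTEMPT_FINISHED
}
{code}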




 [Event producers] Implement RM writing app lifecycle events to ATS
 --

 Key: YARN-3044
 URL: https://issues.apache.org/jira/browse/YARN-3044
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R

 Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3383) AdminService should use warn instead of info to log exception when operation fails

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376830#comment-14376830
 ] 

Hadoop QA commented on YARN-3383:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12706717/YARN-3383-032315.patch
  against trunk revision 972f1f1.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7080//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7080//console

This message is automatically generated.

 AdminService should use warn instead of info to log exception when 
 operation fails
 --

 Key: YARN-3383
 URL: https://issues.apache.org/jira/browse/YARN-3383
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Li Lu
 Attachments: YARN-3383-032015.patch, YARN-3383-032315.patch


 Now it uses info:
 {code}
   private YarnException logAndWrapException(IOException ioe, String user,
       String argName, String msg) throws YarnException {
     LOG.info("Exception " + msg, ioe);
 {code}
 But it should use warn instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376920#comment-14376920
 ] 

Junping Du commented on YARN-3304:
--

Thanks [~adhoot] for the comments!
bq. If we use a default of zero we cannot distinguish when its unavailable 
versus zero usage. That will make the future track the improvement to handle 
unavailable case later near impossible to do.
Maybe we don't have to leverage -1 in resource usage to distinguish the 
unavailable case? E.g. we can have some boolean value to identify whether the 
resource is available or not, which sounds more correct than using an odd 
value, as [~ka...@cloudera.com] mentioned before.

bq.  I propose we make all the defaults consistently -1.
That's an incompatible change, which sounds unnecessary for now.

bq. I can fix the metrics as well to use this to implement tracking unavailable 
case. Opened YARN-3392 for that.
Agree that we should have some fix on the metrics side later. But even then, 
changing all default values to -1 is still behavior that is incompatible with 
older released versions. So I propose to go with the patch here (after fixing a 
minor test failure) in 2.7, given this is a blocker, and we can fix YARN-3392 
later in 2.8. Thoughts?
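To illustrate the boolean idea, a rough sketch (names and defaults here are assumptions, not the attached patch):

{code}
// Sketch only, not the attached patch: keep getter defaults consistent (0)
// and expose availability through a separate boolean, rather than overloading
// -1 as a sentinel in the return value.
public class ProcessTreeUsageSketch {
  /** Consistent default when usage cannot be measured. */
  public float getCpuUsagePercent() {
    return 0;
  }

  /** Subclasses override to return true once they can actually measure CPU. */
  public boolean isCpuUsageAvailable() {
    return false;
  }
}
{code}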

 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for the 
 unavailable case while other resource metrics return 0 in the same case, 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2901) Add errors and warning stats to RM, NM web UI

2015-03-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377040#comment-14377040
 ] 

Wangda Tan commented on YARN-2901:
--

Hi [~vvasudev],

I spent some time taking a look at the Log4jMetricsAppender implementation (will 
review the other modified components in the next round).

1) Log4jMetricsAppender, 
1.1 Better to place it in yarn-server-common?
1.2 If you agree with the above, how about putting it into package 
o.a.h.y.server.metrics (or utils)?
1.3 Rename it to Log4jWarnErrorMetricsAppender?
1.4 Comments about the implementation:
I think the cleanup implementation can be improved. Currently the cutoff 
process for messages/counts basically loops over all stored items, which could 
be inefficient (imagine the number of stored messages exceeding the threshold); 
the existing logic in the patch could leave lots of messages stored (tons of 
messages can be generated in 5 minutes, which is the purge task's run interval).

If you can make the data structure:
SortedMap<String, SortedMap<Long, Integer>> errors (and warnings), where the 
outer map is sorted by value (the SortedMap with the smallest timestamp goes 
first) and the inner map is sorted by key (smallest timestamp first), then the 
purge can happen whenever we add an event; it will take at most log(N=500) time, 
and no extra timer task is needed.

To make a SortedMap sort by value, one way is described in 
http://stackoverflow.com/questions/109383/how-to-sort-a-mapkey-value-on-the-values-in-java
 (first answer).

Here, value = SortedMap<Long, Integer>, and we can sort the SortedMaps 
according to the smallest key in each SortedMap.

One corner case we may need to consider: the same message can have lots of 
different timestamps, so we need to purge the inner SortedMap too.

For better code readability, you can wrap the SortedMap in an inner class 
like MessageInfo.
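To illustrate, here is a rough sketch of the purge-on-add bookkeeping (illustrative only, not the patch; it uses an explicit oldest-timestamp index instead of a map literally sorted by value, but the effect is the same):

{code}
import java.util.*;

// Sketch: counts keeps timestamp -> occurrence count per message; byOldest
// orders messages by their oldest timestamp so evicting the oldest message is
// O(log N) on every add, with no separate timer task.
class WarnErrorMessageStore {
  private static final int MAX_MESSAGES = 500;
  private final Map<String, SortedMap<Long, Integer>> counts = new HashMap<>();
  private final TreeMap<Long, Set<String>> byOldest = new TreeMap<>();

  synchronized void add(String message, long timestamp) {
    SortedMap<Long, Integer> perMessage =
        counts.computeIfAbsent(message, m -> new TreeMap<>());
    if (perMessage.isEmpty() || timestamp < perMessage.firstKey()) {
      // This message's oldest timestamp changed; update the index.
      if (!perMessage.isEmpty()) {
        removeFromIndex(message, perMessage.firstKey());
      }
      byOldest.computeIfAbsent(timestamp, t -> new HashSet<>()).add(message);
    }
    perMessage.merge(timestamp, 1, Integer::sum);
    if (counts.size() > MAX_MESSAGES) {          // purge as part of add
      Map.Entry<Long, Set<String>> oldest = byOldest.firstEntry();
      String victim = oldest.getValue().iterator().next();
      removeFromIndex(victim, oldest.getKey());
      counts.remove(victim);
    }
  }

  private void removeFromIndex(String message, long oldestTs) {
    Set<String> s = byOldest.get(oldestTs);
    if (s != null) {
      s.remove(message);
      if (s.isEmpty()) {
        byOldest.remove(oldestTs);
      }
    }
  }
}
{code}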

 Add errors and warning stats to RM, NM web UI
 -

 Key: YARN-2901
 URL: https://issues.apache.org/jira/browse/YARN-2901
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen 
 Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, 
 apache-yarn-2901.1.patch


 It would be really useful to have statistics on the number of errors and 
 warnings in the RM and NM web UI. I'm thinking about -
 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day
 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 
 hours/day
 By errors and warnings I'm referring to the log level.
 I suspect we can probably achieve this by writing a custom appender? (I'm open 
 to suggestions on alternate mechanisms for implementing this).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376944#comment-14376944
 ] 

Hadoop QA commented on YARN-3021:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706735/YARN-3021.006.patch
  against trunk revision 972f1f1.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesHttpStaticUserPermissions
  org.apache.hadoop.yarn.server.resourcemanager.TestRM
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7081//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7081//console

This message is automatically generated.

 YARN's delegation-token handling disallows certain trust setups to operate 
 properly over DistCp
 ---

 Key: YARN-3021
 URL: https://issues.apache.org/jira/browse/YARN-3021
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.3.0
Reporter: Harsh J
Assignee: Yongjun Zhang
 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
 YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
 YARN-3021.006.patch, YARN-3021.patch


 Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, 
 and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN 
 clusters.
 Now if one logs in with a COMMON credential, and runs a job on A's YARN that 
 needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
 as it attempts a renewDelegationToken(…) synchronously during application 
 submission (to validate the managed token before it adds it to a scheduler 
 for automatic renewal). The call obviously fails cause B realm will not trust 
 A's credentials (here, the RM's principal is the renewer).
 In the 1.x JobTracker the same call is present, but it is done asynchronously 
 and once the renewal attempt failed we simply ceased to schedule any further 
 attempts of renewals, rather than fail the job immediately.
 We should change the logic such that we attempt the renewal but go easy on 
 the failure and skip the scheduling alone, rather than bubble back an error 
 to the client, failing the app submission. This way the old behaviour is 
 retained.
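 A rough sketch of the proposed leniency (method names only approximate 
 DelegationTokenRenewer internals and are assumptions, not an attached patch):
 {code}
 // Sketch only: attempt the renewal, but on failure log and skip scheduling
 // automatic renewal for this token instead of failing the app submission.
 try {
   renewToken(dttr);                   // synchronous renewal attempt, as today
   setTimerForTokenRenewal(dttr);      // schedule future renewals only on success
 } catch (IOException e) {
   LOG.warn("Unable to renew token for " + applicationId
       + ", skipping automatic renewal for this token", e);
   // intentionally not rethrown: submission proceeds, matching 1.x behaviour
 }
 {code}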



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376973#comment-14376973
 ] 

Karthik Kambatla commented on YARN-3304:


bq. That's an incompatible change which sounds not necessary for now.
In previous releases, we have never called these APIs Public even if they were 
intended to be sub-classed. In my mind, this is the last opportunity to decide 
what the API should do. I think consistent and reasonable return values 
should be given a higher priority over compatibility. 

bq. May be we don't have to leverage -1 in resource usage to distinguish 
unavailable case? e.g. we can have some boolean value to identify the resource 
is available or not which sounds more correct than using odd value like Karthik 
Kambatla mentioned before.

I am okay with adding boolean methods to capture unavailability, but that seems 
a little overboard. Using -1 in the ResourceCalculatorProcessTree is okay by 
me. My concern was with logging this -1 value in the metrics. In either case, I 
would like the container usage metrics to check whether the usage is available 
before logging it.

bq. So I propose to go patch here (after fixing a minor test failure) in 2.7 
given this is a blocker and we can fix YARN-3392 later in 2.8. Thoughts?
Since it is not too much work or risk, I would prefer we fix both in 2.7. This 
is me solely wearing my Apache hat. My Cloudera hat doesn't really mind it 
being in 2.8 vs 2.7. 



 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304-v2.patch, YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for the 
 unavailable case while other resource metrics return 0 in the same case, 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3347) Improve YARN log command to get AMContainer logs

2015-03-23 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-3347:

Attachment: YARN-3347.2.patch

Fix the findbugs -1.

 Improve YARN log command to get AMContainer logs
 

 Key: YARN-3347
 URL: https://issues.apache.org/jira/browse/YARN-3347
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, 
 YARN-3347.2.patch


 Right now, we could specify applicationId, node http address and container ID 
 to get a specific container log. Or we could only specify applicationId to 
 get all the container logs. It is very hard for the users to get logs for AM 
 container since the AMContainer logs have more useful information. Users need 
 to know the AMContainer's container ID and related Node http address.
 We could improve the YARN Log Command to allow users to get AMContainer logs 
 directly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377008#comment-14377008
 ] 

Hadoop QA commented on YARN-3304:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706767/YARN-3304-v2.patch
  against trunk revision 2c238ae.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7082//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7082//console

This message is automatically generated.

 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304-v2.patch, YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for the 
 unavailable case while other resource metrics return 0 in the same case, 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3387) container complete message couldn't pass to am if am restarted and rm changed

2015-03-23 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377019#comment-14377019
 ] 

sandflee commented on YARN-3387:


yes

 container complete message couldn't pass to am if am restarted and rm changed
 -

 Key: YARN-3387
 URL: https://issues.apache.org/jira/browse/YARN-3387
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: sandflee
Priority: Critical

 Suppose AM work preserving and RM HA are enabled.
 The container complete message is passed to appAttempt.justFinishedContainers 
 in the RM. In the normal situation, all attempts in one app share the same 
 justFinishedContainers, but when the RM changes, every attempt has its own 
 justFinishedContainers, so in the situation below the container complete 
 message can't be passed to the AM:
 1. the AM restarts
 2. the RM changes
 3. a container launched by the first AM completes
 The container complete message will be passed to appAttempt1, not appAttempt2, 
 but the AM pulls finished containers from appAttempt2 (currentAppAttempt).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377031#comment-14377031
 ] 

Hadoop QA commented on YARN-3347:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706770/YARN-3347.2.patch
  against trunk revision 2c238ae.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7083//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7083//console

This message is automatically generated.

 Improve YARN log command to get AMContainer logs
 

 Key: YARN-3347
 URL: https://issues.apache.org/jira/browse/YARN-3347
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, 
 YARN-3347.2.patch


 Right now, we could specify applicationId, node http address and container ID 
 to get a specific container log. Or we could only specify applicationId to 
 get all the container logs. It is very hard for the users to get logs for AM 
 container since the AMContainer logs have more useful information. Users need 
 to know the AMContainer's container ID and related Node http address.
 We could improve the YARN Log Command to allow users to get AMContainer logs 
 directly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-23 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376914#comment-14376914
 ] 

Chris Nauroth commented on YARN-3336:
-

[~zxu], I apologize, but I missed entering your name on the git commit message:

{code}
commit 6ca1f12024fd7cec7b01df0f039ca59f3f365dc1
Author: cnauroth cnaur...@apache.org
Date:   Mon Mar 23 10:45:50 2015 -0700

YARN-3336. FileSystem memory leak in DelegationTokenRenewer.
{code}

Unfortunately, this isn't something we can change, because it could mess up the 
git history.

You're still there in CHANGES.txt though, so you get the proper credit for the 
patch:

{code}
YARN-3336. FileSystem memory leak in DelegationTokenRenewer.
(Zhihai Xu via cnauroth)
{code}


 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Fix For: 2.7.0

 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch, YARN-3336.004.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
           UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
               UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) => 
 FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
   public static UserGroupInformation createProxyUser(String user,
       UserGroupInformation realUser) {
     if (user == null || user.isEmpty()) {
       throw new IllegalArgumentException("Null user");
     }
     if (realUser == null) {
       throw new IllegalArgumentException("Null real user");
     }
     Subject subject = new Subject();
     Set<Principal> principals = subject.getPrincipals();
     principals.add(new User(user));
     principals.add(new RealUser(realUser));
     UserGroupInformation result = new UserGroupInformation(subject);
     result.setAuthenticationMethod(AuthenticationMethod.PROXY);
     return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme()==null?"":uri.getScheme().toLowerCase();
     authority = uri.getAuthority()==null?"":uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key)obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters

2015-03-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3304:
-
Attachment: YARN-3304-v2.patch

Update patch to v2 to fix the test failure for 1st patch.

 ResourceCalculatorProcessTree#getCpuUsagePercent default return value is 
 inconsistent with other getters
 

 Key: YARN-3304
 URL: https://issues.apache.org/jira/browse/YARN-3304
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-3304-v2.patch, YARN-3304.patch


 Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for the 
 unavailable case while other resource metrics return 0 in the same case, 
 which sounds inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-03-23 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated YARN-3021:

Attachment: YARN-3021.006.patch

The test failure seems to be unrelated; uploading the same patch 06 to trigger 
another Jenkins run. 


 YARN's delegation-token handling disallows certain trust setups to operate 
 properly over DistCp
 ---

 Key: YARN-3021
 URL: https://issues.apache.org/jira/browse/YARN-3021
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.3.0
Reporter: Harsh J
Assignee: Yongjun Zhang
 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
 YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
 YARN-3021.006.patch, YARN-3021.patch


 Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, 
 and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN 
 clusters.
 Now if one logs in with a COMMON credential, and runs a job on A's YARN that 
 needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
 as it attempts a renewDelegationToken(…) synchronously during application 
 submission (to validate the managed token before it adds it to a scheduler 
 for automatic renewal). The call obviously fails cause B realm will not trust 
 A's credentials (here, the RM's principal is the renewer).
 In the 1.x JobTracker the same call is present, but it is done asynchronously 
 and once the renewal attempt failed we simply ceased to schedule any further 
 attempts of renewals, rather than fail the job immediately.
 We should change the logic such that we attempt the renewal but go easy on 
 the failure and skip the scheduling alone, rather than bubble back an error 
 to the client, failing the app submission. This way the old behaviour is 
 retained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-03-23 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated YARN-3021:

Attachment: (was: YARN-3021.006.patch)

 YARN's delegation-token handling disallows certain trust setups to operate 
 properly over DistCp
 ---

 Key: YARN-3021
 URL: https://issues.apache.org/jira/browse/YARN-3021
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.3.0
Reporter: Harsh J
Assignee: Yongjun Zhang
 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
 YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
 YARN-3021.006.patch, YARN-3021.patch


 Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, 
 and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN 
 clusters.
 Now if one logs in with a COMMON credential, and runs a job on A's YARN that 
 needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
 as it attempts a renewDelegationToken(…) synchronously during application 
 submission (to validate the managed token before it adds it to a scheduler 
 for automatic renewal). The call obviously fails cause B realm will not trust 
 A's credentials (here, the RM's principal is the renewer).
 In the 1.x JobTracker the same call is present, but it is done asynchronously 
 and once the renewal attempt failed we simply ceased to schedule any further 
 attempts of renewals, rather than fail the job immediately.
 We should change the logic such that we attempt the renewal but go easy on 
 the failure and skip the scheduling alone, rather than bubble back an error 
 to the client, failing the app submission. This way the old behaviour is 
 retained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext

2015-03-23 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-3244:

Attachment: YARN-3244.2.patch

Address all the latest comments

 Add user specified information for clean-up container in 
 ApplicationSubmissionContext
 -

 Key: YARN-3244
 URL: https://issues.apache.org/jira/browse/YARN-3244
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-3244.1.patch, YARN-3244.2.patch


 To launch a user-specified clean-up container, users need to provide proper 
 information to YARN.
 It should at least have the following properties:
 * A flag to indicate whether the clean-up container needs to be launched
 * A time-out period to indicate how long the clean-up container can run
 * maxRetry times
 * containerLaunchContext for the clean-up container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext

2015-03-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377175#comment-14377175
 ] 

Hadoop QA commented on YARN-3244:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706802/YARN-3244.2.patch
  against trunk revision 2c238ae.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7086//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7086//console

This message is automatically generated.

 Add user specified information for clean-up container in 
 ApplicationSubmissionContext
 -

 Key: YARN-3244
 URL: https://issues.apache.org/jira/browse/YARN-3244
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-3244.1.patch, YARN-3244.2.patch


 To launch a user-specified clean-up container, users need to provide proper 
 information to YARN.
 It should at least have the following properties:
 * A flag to indicate whether the clean-up container needs to be launched
 * A time-out period to indicate how long the clean-up container can run
 * maxRetry times
 * containerLaunchContext for the clean-up container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3034) [Collector wireup] Implement RM starting its timeline collector

2015-03-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377196#comment-14377196
 ] 

Naganarasimha G R commented on YARN-3034:
-

Hi [~zjshen],
bq. According to this comments, it seems that you want to create a separate 
stack to put entities into RMTimelineCollector, right? If so, the current 
design makes sense.
Yes, I wanted to create a separate stack similar to SystemMetricsPublisher, so 
that ATS v1 and v2 are less coupled and removing SMP, once it is completely 
deprecated, is smoother.

bq.  yarn.resourcemanager.system-metrics-publisher.enabled for v1 
SystemMetricsPublisher. For v2, both RM and NM reads 
yarn.system-metrics-publisher.enabled? No need to have v1/v2 flag?
On second thought, I feel this approach is better, as once we deprecate SMP 
there would be no need for an additional configuration of the version type. If 
all are fine with it, I will move back to the approach mentioned by Zhijie.

 [Collector wireup] Implement RM starting its timeline collector
 ---

 Key: YARN-3034
 URL: https://issues.apache.org/jira/browse/YARN-3034
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3034-20150312-1.patch, YARN-3034.20150205-1.patch, 
 YARN-3034.20150316-1.patch, YARN-3034.20150318-1.patch, 
 YARN-3034.20150320-1.patch


 Per design in YARN-2928, implement resource managers starting their own ATS 
 writers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3393) Getting application(s) goes wrong when app finishes before starting the attempt

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377216#comment-14377216
 ] 

Hudson commented on YARN-3393:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7409 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7409/])
YARN-3393. Getting application(s) goes wrong when app finishes before (xgong: 
rev 9fae455e26e0230107e1c6db58a49a5b6b296cf4)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java


 Getting application(s) goes wrong when app finishes before starting the 
 attempt
 ---

 Key: YARN-3393
 URL: https://issues.apache.org/jira/browse/YARN-3393
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Critical
 Fix For: 2.7.0

 Attachments: YARN-3393.1.patch


 When generating app report in ApplicationHistoryManagerOnTimelineStore, it 
 checks if appAttempt == null.
 {code}
 ApplicationAttemptReport appAttempt = 
 getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId());
 if (appAttempt != null) {
   app.appReport.setHost(appAttempt.getHost());
   app.appReport.setRpcPort(appAttempt.getRpcPort());
   app.appReport.setTrackingUrl(appAttempt.getTrackingUrl());
   
 app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl());
 }
 {code}
 However, {{getApplicationAttempt}} doesn't return null but throws 
 ApplicationAttemptNotFoundException:
 {code}
  if (entity == null) {
    throw new ApplicationAttemptNotFoundException(
        "The entity for application attempt " + appAttemptId +
        " doesn't exist in the timeline store");
  } else {
    return convertToApplicationAttemptReport(entity);
  }
 {code}
 The two pieces of code aren't coupled well.
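 A sketch of one way to reconcile the two (illustrative only): catch the 
 not-found exception where the report is assembled instead of expecting a null 
 return:
 {code}
 // Sketch: treat a missing attempt as "no attempt info yet" rather than
 // relying on a null return from getApplicationAttempt.
 ApplicationAttemptReport appAttempt = null;
 try {
   appAttempt =
       getApplicationAttempt(app.appReport.getCurrentApplicationAttemptId());
 } catch (ApplicationAttemptNotFoundException e) {
   // app finished before any attempt was recorded; leave report fields unset
 }
 if (appAttempt != null) {
   app.appReport.setHost(appAttempt.getHost());
   app.appReport.setRpcPort(appAttempt.getRpcPort());
   app.appReport.setTrackingUrl(appAttempt.getTrackingUrl());
   app.appReport.setOriginalTrackingUrl(appAttempt.getOriginalTrackingUrl());
 }
 {code}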



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)

2015-03-23 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2495:

Attachment: YARN-2495.20150324-1.patch

 Allow admin specify labels from each NM (Distributed configuration)
 ---

 Key: YARN-2495
 URL: https://issues.apache.org/jira/browse/YARN-2495
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Naganarasimha G R
 Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, 
 YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, 
 YARN-2495.20141119-1.patch, YARN-2495.20141126-1.patch, 
 YARN-2495.20141204-1.patch, YARN-2495.20141208-1.patch, 
 YARN-2495.20150305-1.patch, YARN-2495.20150309-1.patch, 
 YARN-2495.20150318-1.patch, YARN-2495.20150320-1.patch, 
 YARN-2495.20150321-1.patch, YARN-2495.20150324-1.patch, 
 YARN-2495_20141022.1.patch


 The target of this JIRA is to allow admins to specify labels on each NM; this covers
 - Users can set labels on each NM (by setting yarn-site.xml (YARN-2923) or 
 using the script suggested by [~aw] (YARN-2729))
 - NM will send labels to RM via the ResourceTracker API
 - RM will set labels in NodeLabelManager when NMs register/update labels



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext

2015-03-23 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377098#comment-14377098
 ] 

Xuan Gong commented on YARN-3244:
-

Created a new object named CleanupContainer which includes the launch context 
for the cleanup container and maxCleanupContainerAttempts. Also added two 
global YARN configurations: RM_CLEAN_UP_CONTAINER_TIMEOUT_MS and 
RM_CLEAN_UP_CONTAINER_MAX_ATTEMPTS
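
Roughly, the new object could look like this (a sketch based on the description above, not the actual patch):

{code}
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Illustrative only: what such a clean-up container record could carry; the
// time-out would come from the proposed RM_CLEAN_UP_CONTAINER_TIMEOUT_MS
// configuration rather than being stored per application.
public class CleanupContainerInfo {
  private final ContainerLaunchContext launchContext;   // how to launch cleanup
  private final int maxCleanupContainerAttempts;        // retry budget

  public CleanupContainerInfo(ContainerLaunchContext launchContext,
      int maxCleanupContainerAttempts) {
    this.launchContext = launchContext;
    this.maxCleanupContainerAttempts = maxCleanupContainerAttempts;
  }

  public ContainerLaunchContext getLaunchContext() {
    return launchContext;
  }

  public int getMaxCleanupContainerAttempts() {
    return maxCleanupContainerAttempts;
  }
}
{code}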

 Add user specified information for clean-up container in 
 ApplicationSubmissionContext
 -

 Key: YARN-3244
 URL: https://issues.apache.org/jira/browse/YARN-3244
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-3244.1.patch, YARN-3244.2.patch


 To launch a user-specified clean-up container, users need to provide proper 
 information to YARN.
 It should at least have the following properties:
 * A flag to indicate whether the clean-up container needs to be launched
 * A time-out period to indicate how long the clean-up container can run
 * maxRetry times
 * containerLaunchContext for the clean-up container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

