[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177376#comment-15177376
 ] 

Rohith Sharma K S commented on YARN-4755:
-

I was looking at the 2.7.1 version of the code base :-( !! Please ignore it.

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, an appACLsUpdated event 
> is sent to the timeline server for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it becomes a 
> bottleneck when max-completed applications is configured to a very high 
> value, say 100K.
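For illustration, a minimal sketch of the optimization suggested above, assuming the guard can simply reuse the {{isRecovery}} flag already passed to {{createAndPopulateNewRMApp}} (a sketch, not the committed patch):

{code}
// Sketch only: skip publishing appACLsUpdated while recovering completed
// applications, reusing the isRecovery flag that the method already receives.
String appViewACLs = submissionContext.getAMContainerSpec()
    .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
if (!isRecovery) {
  rmContext.getSystemMetricsPublisher().appACLsUpdated(
      application, appViewACLs, System.currentTimeMillis());
}
return application;
{code}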



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177369#comment-15177369
 ] 

Sunil G commented on YARN-4755:
---

Yes, correct. But I think the appACLsUpdated event is also sent in the 
app-rejected case. Maybe Naga can double-confirm this. Is it really needed 
there? If not, we can club the events and that's really fine. 

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, an appACLsUpdated event 
> is sent to the timeline server for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it becomes a 
> bottleneck when max-completed applications is configured to a very high 
> value, say 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177366#comment-15177366
 ] 

Rohith Sharma K S commented on YARN-4755:
-

bq. But appCreated will be sent only when RMApp is created and START event is 
fired
The appCreated event is sent in the RMAppImpl constructor, which always runs 
first. So I think the appCreated and appACLsUpdated events can be clubbed 
together, unless there are dependencies on the timeline server end.
 

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, an appACLsUpdated event 
> is sent to the timeline server for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it becomes a 
> bottleneck when max-completed applications is configured to a very high 
> value, say 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177309#comment-15177309
 ] 

Varun Saxena commented on YARN-4754:


This should be closed, as per my understanding of the Jersey API 
documentation. [~rohithsharma] can confirm whether the scenario is the same in 
his case or not.

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the TimelineServer 
> while publishing entities via SystemMetricsPublisher. This sometimes causes a 
> resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177306#comment-15177306
 ] 

Varun Saxena commented on YARN-4754:


I still see 2 places where we are not closing the ClientResponse: when we call 
{{putDomain}}, and in {{doPosting}} when the response is not 200 OK.
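For context, a minimal sketch of the close-on-every-path idiom being asked for here; the helper below is hypothetical (not the actual TimelineClient code) and only shows the Jersey 1.x pattern of releasing the {{ClientResponse}} even when the status is not 200 OK:

{code}
import javax.ws.rs.core.MediaType;

import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

public class TimelinePostSketch {
  // Hypothetical helper: post an entity and always close the response so the
  // underlying connection is released instead of lingering in CLOSE_WAIT.
  static int postAndClose(WebResource resource, Object entityBody) {
    ClientResponse resp = null;
    try {
      resp = resource.accept(MediaType.APPLICATION_JSON)
          .type(MediaType.APPLICATION_JSON)
          .post(ClientResponse.class, entityBody);
      return resp.getStatus();   // caller decides how to handle non-200 codes
    } finally {
      if (resp != null) {
        resp.close();            // releases the underlying connection
      }
    }
  }
}
{code}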

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the TimelineServer 
> while publishing entities via SystemMetricsPublisher. This sometimes causes a 
> resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177177#comment-15177177
 ] 

Naganarasimha G R commented on YARN-4712:
-

Thanks for the comments [~djp] & [~varun_saxena].
bq. Regarding checkstyle, you can fix them for now.
As you can see in the latest patch, the line length issues are already taken 
care of.

bq. We shouldn't let Eclipse's bug affect our code convention.
Well, it's not that I don't want to do it, but I presume Eclipse optimizes it 
in some way and does so only when required. Anyway, I have taken care of it, 
but it would be easier to rely on the editor's formatter if that were 
accepted :) 

bq.  it seems more things need to be fixed for UNAVAILABLE case,
Agreed; milliVcoresUsed can be set to 0 in the UNAVAILABLE case, right?

bq. It sounds weird if cpuUsageTotalCoresPercentage is -1 in UNAVAILABLE case.
We set it to -1 to indicate that this value should not be stored in the ATS. 
If it is *unavailable, do we need to store it as 0 or not store it at all*?

bq. it make cpu metric to be either 0 or 1 which is not expected here?
As [~varun_saxena] explained, it directly gives percent values (no need to 
multiply by 100), and we round off only to remove the decimal part.

[~djp], if you can confirm these queries, I can finish the patch.
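To make the UNAVAILABLE discussion concrete, a rough sketch of the guard in question, built around the formula quoted in the issue description below; the surrounding variables are illustrative and this is not the exact ContainersMonitor/NMTimelinePublisher code:

{code}
// Sketch only: do not divide the UNAVAILABLE marker by the core count,
// otherwise the UNAVAILABLE check downstream never triggers.
float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
float cpuUsageTotalCoresPercentage;
if (cpuUsagePercentPerCore == ResourceCalculatorProcessTree.UNAVAILABLE) {
  // Keep the marker (and, per the open question above, possibly set
  // milliVcoresUsed to 0) so NMTimelinePublisher can still detect it.
  cpuUsageTotalCoresPercentage = ResourceCalculatorProcessTree.UNAVAILABLE;
} else {
  cpuUsageTotalCoresPercentage =
      cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors();
}
{code}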

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection:
> * I was able to observe that many times the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor 
> still does the calculation, i.e. {{cpuUsageTotalCoresPercentage = 
> cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, 
> because of which the UNAVAILABLE check in 
> {{NMTimelinePublisher.reportContainerResourceUsage}} is never hit. So proper 
> checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor publishes decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177140#comment-15177140
 ] 

Hadoop QA commented on YARN-4740:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
19s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 0 new + 117 unchanged - 3 fixed = 117 total (was 120) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 30s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 36s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 150m 8s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177119#comment-15177119
 ] 

Hadoop QA commented on YARN-4700:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
32s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s 
{color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
36s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice:
 patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 23s 
{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed 
with JDK v1.8.0_72. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 30s 
{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed 
with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 49s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791090/YARN-4700-YARN-2928.v1.004.patch
 |
| JIRA Issue | YARN-4700 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 391c03561be0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | 

[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177097#comment-15177097
 ] 

Hadoop QA commented on YARN-4686:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 42s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 13s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 3s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 3s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 1 new + 
30 unchanged - 0 fixed = 31 total (was 30) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 55s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 44s {color} 
| {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_72. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 12s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_72. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 22s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 57s {color} 
| {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 31s 

[jira] [Updated] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-4700:

Attachment: YARN-4700-YARN-2928.v1.004.patch

Thanks for the comments [~vrushalic] & [~sjlee0].
I have uploaded a patch with fixes for the test cases, javadoc and the other 
comments.

> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, 
> YARN-4700-YARN-2928.v1.004.patch, YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (still held in the RM state 
> store) each time the RM is restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, so each time we're 
> creating a new record for the same application (the cluster id is part of 
> the row key). We need to fix this behavior, probably by having a better 
> default cluster id. 
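A small illustration of why a timestamp-based default cluster id produces the extra record; the id format and separator below are made up for the example and do not match the real row-key encoding:

{code}
// Sketch only: if the default cluster id embeds the RM start timestamp, the
// same application maps to a different row key after every RM restart.
String appId = "application_1456880825000_0001";
String clusterIdBeforeRestart = "yarn-cluster_" + 1456880825000L; // old RM start
String clusterIdAfterRestart  = "yarn-cluster_" + 1456967225000L; // new RM start

String rowBefore = clusterIdBeforeRestart + "!" + appId;
String rowAfter  = clusterIdAfterRestart  + "!" + appId;
// rowBefore != rowAfter, so the recovered application is written as a new
// record; a stable default cluster id keeps the row key identical.
{code}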



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177054#comment-15177054
 ] 

Hadoop QA commented on YARN-4737:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 53s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 51s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
7s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 48s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 9s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 59s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 41s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 16s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 12s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 12s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 52s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 52s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 8s 
{color} | {color:red} root: patch generated 2 new + 436 unchanged - 4 fixed = 
438 total (was 440) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 
50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 55s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 53s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_72. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 18s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_72. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 47s 
{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the 
patch passed with JDK v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 13s {color} 

[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader

2016-03-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176999#comment-15176999
 ] 

Sangjin Lee commented on YARN-3863:
---

I did another pass at the latest patch.

One high level question: am I correct in understanding that if a relations 
filter is specified for example but relation was *not* specified as part of 
fields to retrieve, we would try to fetch the relation? So, in a sense, would 
specifying a filter override/modify the fields to retrieve behavior? If so, how 
much additional complexity is added by trying to support that behavior? What if 
we simply reject or ignore the filters if they do not match the fields to 
retrieve? Would it make the implementation simpler or harder? To me, supporting 
more contents even if the filters and the fields to retrieve are not consistent 
seems very much optional, and I'm not sure if it is worth it especially if it 
adds a lot more complexity. What do you think?

(TimelineEntityFilters.java)
- l.49: typo: "ids'" -> "id's" (also in l.60)
- l.62: should be a link for {{TimelineKeyValuesFilter}}
- For limit, createdTimeBegin, and createdTimeEnd, we're ensuring they can 
never be null. In that vein, I think it might make sense to start using 
{{long}} over {{Long}} as part of the method interface. Thoughts?

(TimelineCompareFilter.java)
- l.36: Is the default constructor useful at all? It doesn't sound like it if 
key and value are empty/null. Should we remove it?

(TimelineKeyValuesFilter.java)
- l.34-36: nit: let's make them final
- l.55-56: super-nit: an empty line between the methods would be good
- l.68: another super-nit: the C-style equality pattern is not needed/helpful 
in java; let's just do {{values == null}}

(GenericEntityReader.java)
- l.90: Do we need to check if {{getFilters()}} returns null? When I check all 
callers of {{getFilters()}}, some check for null and some don't. It would be 
good to make it clear and consistent either way.
- l.95: See above comment in {{TimelineEntityFilters.java}}. If we switch to 
{{long}}, this becomes easier to understand (no need to reason about unboxing 
yielding a NPE).

(ApplicationEntityReader.java)
- l.89: see above

(TimelineReaderWebServicesUtils.java)
- l.257: I'm still not sure I understand. Is this a temporary thing until 
YARN-4447 is addressed? What if the metric value happens to be negative? That 
would be a non-match, then?

(TimelineStorageUtils.java)
- l.466: {{equals()}} should be replaced with {{==}}
- l.591: same
- l.666: a comment here that states we're using the latest value of the metric 
might be helpful

(ColumnHelper.java)
- I'm not sure if this is an issue or not, but I remember there were cases 
where we join an empty string at the end to have the qualifier end with the 
separator, and that was for a reason. I hope this patch did not change an 
occurrence of that inadvertently.

(unit tests)
- I know [~vrushalic] had some thoughts on how to split this monolithic 
{{TestHBaseTimelineStorage}}. It might be good to come to a consensus on how to 
split it...
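For reference, the AND/OR semantics the issue description below aims for map naturally onto HBase's {{FilterList}} with {{MUST_PASS_ALL}}/{{MUST_PASS_ONE}}. The helper class, column family and qualifier names in this sketch are invented for illustration and are not part of the patch:

{code}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ComplexFilterSketch {
  // OR: entity matches if the column equals any of the given values.
  static Filter equalsAnyOf(String qualifier, String... values) {
    FilterList orList = new FilterList(Operator.MUST_PASS_ONE);
    for (String v : values) {
      orList.addFilter(new SingleColumnValueFilter(
          Bytes.toBytes("i"), Bytes.toBytes(qualifier),
          CompareOp.EQUAL, Bytes.toBytes(v)));
    }
    return orList;
  }

  // AND: entity matches only if every sub-filter passes.
  static Filter allOf(Filter... filters) {
    FilterList andList = new FilterList(Operator.MUST_PASS_ALL);
    for (Filter f : filters) {
      andList.addFilter(f);
    }
    return andList;
  }
}
{code}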

> Support complex filters in TimelineReader
> -
>
> Key: YARN-3863
> URL: https://issues.apache.org/jira/browse/YARN-3863
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3863-YARN-2928.v2.01.patch, 
> YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, 
> YARN-3863-feature-YARN-2928.wip.003.patch, 
> YARN-3863-feature-YARN-2928.wip.01.patch, 
> YARN-3863-feature-YARN-2928.wip.02.patch, 
> YARN-3863-feature-YARN-2928.wip.04.patch, 
> YARN-3863-feature-YARN-2928.wip.05.patch
>
>
> Currently, filters in the timeline reader will return an entity only if all 
> the filter conditions hold true, i.e. only the AND operation is supported. We 
> can support the OR operation for the filters as well. Additionally, as the 
> primary backend implementation is HBase, we can design our filters in a 
> manner where they closely resemble HBase filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-02 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176947#comment-15176947
 ] 

sandflee commented on YARN-4740:


Thanks for your suggestions; I have attached a new patch to fix these.

> container complete msg may lost while AM restart in race condition
> --
>
> Key: YARN-4740
> URL: https://issues.apache.org/jira/browse/YARN-4740
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-4740.01.patch, YARN-4740.02.patch
>
>
> 1. A container completes, and the message is stored in 
> RMAppAttempt.justFinishedContainers.
> 2. The AM calls allocate, and before the allocateResponse reaches the AM, 
> the AM crashes.
> 3. The AM restarts and cannot get the container-complete message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-02 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4740:
---
Attachment: YARN-4740.02.patch

> container complete msg may lost while AM restart in race condition
> --
>
> Key: YARN-4740
> URL: https://issues.apache.org/jira/browse/YARN-4740
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>Assignee: sandflee
> Attachments: YARN-4740.01.patch, YARN-4740.02.patch
>
>
> 1. A container completes, and the message is stored in 
> RMAppAttempt.justFinishedContainers.
> 2. The AM calls allocate, and before the allocateResponse reaches the AM, 
> the AM crashes.
> 3. The AM restarts and cannot get the container-complete message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176899#comment-15176899
 ] 

Sidharta Seethana commented on YARN-4744:
-

Testing note: this patch makes minor logging changes; I tested the patch 
manually using distributed shell. 

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
> Attachments: YARN-4744.001.patch
>
>
> Install an HA cluster in secure mode.
> Enable LCE with cgroups.
> Start the server with the dsperf user.
> Submit a MapReduce application (terasort/teragen) with user yarn/dsperf.
> Too many signal-to-container failures are observed.
> When submitting as that user, the following exception is thrown:
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
> ... 9 more
> 2014-03-02 09:20:43,113 INFO 
> 

[jira] [Updated] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started

2016-03-02 Thread Eric Badger (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4686:
--
Attachment: YARN-4686.002.patch

This patch fixes a race issue between the NMs resyncing with the RM and the NMs 
stopping via serviceStop. 

> MiniYARNCluster.start() returns before cluster is completely started
> 
>
> Key: YARN-4686
> URL: https://issues.apache.org/jira/browse/YARN-4686
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Eric Badger
> Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, 
> YARN-4686.002.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 
> sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but 
> was:<3>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}
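As a test-side illustration (not the attached patch), one way to avoid depending on {{MiniYARNCluster.start()}} alone is to wait until the RM actually reports the expected number of nodes; the helper below is a sketch under that assumption:

{code}
import java.util.concurrent.TimeoutException;

import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterWaitSketch {
  // Block until the RM has registered at least the expected number of NMs,
  // polling every 100 ms for up to 60 seconds.
  static void waitForLiveNodes(final MiniYARNCluster cluster, final int expected)
      throws TimeoutException, InterruptedException {
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return cluster.getResourceManager().getRMContext()
            .getRMNodes().size() >= expected;
      }
    }, 100, 60000);
  }
}
{code}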



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176743#comment-15176743
 ] 

Hadoop QA commented on YARN-4744:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
49s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
59s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 30s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_72. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 47s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 34m 29s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791050/YARN-4744.001.patch |
| JIRA Issue | YARN-4744 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux bfc907ba6af7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Updated] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Sidharta Seethana (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sidharta Seethana updated YARN-4744:

Attachment: YARN-4744.001.patch

Uploaded a patch that removes 'invalid pid' signal failure logging. [~jlowe], 
could you please take a look?

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
> Attachments: YARN-4744.001.patch
>
>
> Install an HA cluster in secure mode.
> Enable LCE with cgroups.
> Start the server with the dsperf user.
> Submit a MapReduce application (terasort/teragen) with user yarn/dsperf.
> Too many signal-to-container failures are observed.
> When submitting as that user, the following exception is thrown:
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
> ... 9 more
> 2014-03-02 09:20:43,113 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: 

[jira] [Commented] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176665#comment-15176665
 ] 

Hadoop QA commented on YARN-4359:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 9 new + 25 unchanged - 2 fixed = 34 total (was 27) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
21s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 20s 
{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed 
with JDK v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 2m 37s 
{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95
 with JDK v1.7.0_95 generated 4 new + 2 unchanged - 0 fixed = 6 total (was 2) 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 38s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 7s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
20s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 155m 22s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | 

[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176644#comment-15176644
 ] 

Karthik Kambatla commented on YARN-4719:


[~leftnoteasy] - thanks for chiming in, you make some valid points. 

Since we are building a library for node tracking, I would like us to restrict 
access to the map/set of tracked nodes to addNode and removeNode only, so that 
total_cluster_resources, total_inflated_cluster_resources (for YARN-1011) and 
max_cluster_resources cannot be affected by other scheduler code. Do you think 
this is a reasonable goal, at least as long as it doesn't hurt performance?

If yes, we should decide how to handle the cases where the scheduler code needs 
to iterate through the nodes: (1) we could hand out a snapshot copy of the 
map/set of nodes/nodeIds, or (2) we could let callers iterate under the right 
locks by adding additional methods or an abstraction (similar to lambdas) that 
applies to multiple methods.
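
To make the two options concrete, here is a minimal sketch (hypothetical names 
and signatures, not the actual patch) of a tracker that mutates only through 
addNode/removeNode and supports both styles of iteration:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Hypothetical sketch only: all mutation goes through addNode/removeNode, and
// callers that need to walk the nodes either take a snapshot (option 1) or run
// their logic under the tracker's read lock (option 2).
public class NodeTrackerSketch<K, N> {
  private final Map<K, N> nodes = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  public void addNode(K id, N node) {
    lock.writeLock().lock();
    try {
      nodes.put(id, node);       // cluster-resource totals would be updated here
    } finally {
      lock.writeLock().unlock();
    }
  }

  public void removeNode(K id) {
    lock.writeLock().lock();
    try {
      nodes.remove(id);          // and decremented here
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Option (1): hand out a snapshot copy the caller can iterate without locks.
  public List<N> snapshot() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(nodes.values());
    } finally {
      lock.readLock().unlock();
    }
  }

  // Option (2): run the caller's logic (a lambda) while holding the read lock.
  public void forEachNode(Consumer<N> action) {
    lock.readLock().lock();
    try {
      nodes.values().forEach(action);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}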

Thoughts? 

PS: Thanks for pointing out the javadoc for values(); I will clean that up 
based on the outcome of the discussion here.

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176611#comment-15176611
 ] 

Sangjin Lee commented on YARN-4700:
---

It seems that the unit test failures are real, and so is the javadoc error. 
Could you please look into them? Thanks!

> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, 
> YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (but still hold in the RM state 
> store) each time the RM got restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, thus each time we're 
> creating a new record for one application (cluster id is a part of the row 
> key). We need to fix this behavior, probably by having a better default 
> cluster id. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176609#comment-15176609
 ] 

Sidharta Seethana commented on YARN-4744:
-

[~jlowe] That was my thinking as well - double logging is better than missed 
logs. In addition, logging in {{PrivilegedOperationExecutor}} includes 
information that isn't necessarily available when the exception is propagated. 

I'll upload a patch soon, thanks.

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
> 

[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176555#comment-15176555
 ] 

Hadoop QA commented on YARN-4719:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 3m 59s {color} 
| {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95
 with JDK v1.7.0_95 generated 1 new + 1 unchanged - 1 fixed = 2 total (was 2) 
{color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 19s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 9 new + 278 unchanged - 8 fixed = 287 total (was 286) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 27s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 47s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 156m 28s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore |
|   | 

[jira] [Assigned] (YARN-1547) Prevent DoS of ApplicationMasterProtocol by putting in limits

2016-03-02 Thread Giovanni Matteo Fumarola (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giovanni Matteo Fumarola reassigned YARN-1547:
--

Assignee: Giovanni Matteo Fumarola

> Prevent DoS of ApplicationMasterProtocol by putting in limits
> -
>
> Key: YARN-1547
> URL: https://issues.apache.org/jira/browse/YARN-1547
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Giovanni Matteo Fumarola
>
> Points of DoS in ApplicationMasterProtocol
>  - Host and trackingURL in RegisterApplicationMasterRequest
>  - Diagnostics, final trackingURL in FinishApplicationMasterRequest
>  - Unlimited number of resourceAsks, containersToBeReleased and 
> resourceBlacklistRequest in AllocateRequest
> -- Unbounded number of priorities and/or resourceRequests in each ask.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Jonathan Maron (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Maron updated YARN-4737:
-
Attachment: YARN-4737.002.patch

I believe all of the code issues have been addressed.

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch, YARN-4737.002.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176498#comment-15176498
 ] 

Jason Lowe commented on YARN-4744:
--

As long as we're not logging a bunch of warnings for benign events, I'm good. 
I still think the log-then-throw idiom can be problematic in practice, as it 
tends to lead to double logging (both by the thrower and by the catcher). That 
said, I understand the concern about missing logs, and it's safer to double-log 
than not to log at all.
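
For what it's worth, a tiny generic illustration of the idiom (plain JDK code, 
not NM code): if the thrower logs and then throws, the catcher typically logs 
again, so the same failure shows up twice in the log.

{code}
import java.util.logging.Logger;

// Generic illustration of log-then-throw leading to double logging: the same
// failure is reported once by the thrower and once by the catcher.
public class LogThenThrowDemo {
  private static final Logger LOG = Logger.getLogger("demo");

  static void thrower() {
    LOG.warning("operation failed (logged by thrower)");          // first log entry
    throw new RuntimeException("operation failed");
  }

  public static void main(String[] args) {
    try {
      thrower();
    } catch (RuntimeException e) {
      LOG.warning("operation failed (logged by catcher): " + e);  // second log entry
    }
  }
}
{code}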


> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> 

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176436#comment-15176436
 ] 

Sidharta Seethana commented on YARN-4744:
-

[~bibinchundatt],

Those two error codes are used differently: INVALID_CONTAINER_PID is used when 
errno is ESRCH, and UNABLE_TO_SIGNAL_CONTAINER is used in the other cases. See 
the code below:

{code}
int signal_container_as_user(const char *user, int pid, int sig) {
  if(pid <= 0) {
return INVALID_CONTAINER_PID;
  }

  if (change_user(user_detail->pw_uid, user_detail->pw_gid) != 0) {
return SETUID_OPER_FAILED;
  }

  //Don't continue if the process-group is not alive anymore.
  int has_group = 1;
  if (kill(-pid,0) < 0) {
if (kill(pid, 0) < 0) {
  if (errno == ESRCH) {
return INVALID_CONTAINER_PID;
  }
  fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
  pid, sig, strerror(errno));
  return -1;
} else {
  has_group = 0;
}
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
if(errno != ESRCH) {
  fprintf(LOGFILE, 
  "Error signalling process group %d with signal %d - %s\n", 
  -pid, sig, strerror(errno));
  fprintf(stderr, 
  "Error signalling process group %d with signal %d - %s\n", 
  -pid, sig, strerror(errno));
  fflush(LOGFILE);
  return UNABLE_TO_SIGNAL_CONTAINER;
} else {
  return INVALID_CONTAINER_PID;
}
  }
  fprintf(LOGFILE, "Killing process %s%d with %d\n",
  (has_group ? "group " :""), pid, sig);
  return 0;
}
{code}

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> 

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176416#comment-15176416
 ] 

Sidharta Seethana commented on YARN-4744:
-

Before {{PrivilegedOperationExecutor}} existed, there were several cases where 
not enough information was being logged about container-executor failures. 
Centralizing this captures information such as the invocation arguments and 
shell output, which has proved useful for debugging. In all cases except 
'invalid pid', a non-zero exit from container-executor is a real error; IMO, we 
shouldn't remove the error logging.

It looks like {{signalContainer}} in {{LinuxContainerExecutor}} already ignores 
the exception for the 'invalid pid' case. We could do something like this:

 * Change {{DefaultContainerRuntime}} to ignore the 'invalid pid' error as well.
 * Change {{PrivilegedOperationExecutor}} / {{PrivilegedOperation}} to add the 
notion of 'ignore failures' for certain kinds of operations. Use this only for 
{{signalContainer}} and let the runtime/executor decide what they want to do 
(a rough sketch of the idea follows below).

I'll submit a patch with these changes. 
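
For illustration only, the 'ignore failures' idea could look roughly like the 
sketch below. The class and method names are hypothetical and are not the 
actual NM classes; the exit-code value is the one quoted elsewhere in this 
thread.

{code}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;

// Hypothetical sketch: an operation can declare exit codes that the executor
// should treat as benign, so they are neither logged as errors nor rethrown.
public class IgnorableFailureSketch {

  static final int INVALID_CONTAINER_PID = 9; // value as quoted in this thread

  static class PrivilegedOperationException extends Exception {
    PrivilegedOperationException(int exitCode) {
      super("ExitCodeException exitCode=" + exitCode);
    }
  }

  /** Runs the operation; ignorable non-zero exit codes are returned silently. */
  static int execute(Callable<Integer> operation, Set<Integer> ignorableExitCodes)
      throws Exception {
    int exit = operation.call();
    if (exit != 0 && !ignorableExitCodes.contains(exit)) {
      // centralized error logging (arguments, shell output, ...) would go here
      throw new PrivilegedOperationException(exit);
    }
    return exit;
  }

  /** Only signalContainer opts in: a dead/invalid pid becomes a no-op. */
  static void signalContainer(Callable<Integer> signalOperation) throws Exception {
    execute(signalOperation, Collections.singleton(INVALID_CONTAINER_PID));
  }
}
{code}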

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> 

[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176399#comment-15176399
 ] 

Hadoop QA commented on YARN-4700:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
50s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s 
{color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
37s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 16s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice:
 patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 48s 
{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.7.0_95
 with JDK v1.7.0_95 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 3m 25s {color} 
| {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 3m 8s {color} | 
{color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 53s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | 
hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
 |
|   | 

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176397#comment-15176397
 ] 

Bibin A Chundatt commented on YARN-4744:


[~jlowe]/[~vinodkv]

I am also confused by the {{exit code 9}}. From the documentation I read, these 
are the exit codes for container-executor:
{noformat}
exit code | NAME                       | Description
----------|----------------------------|----------------------------------------------------------------------
8         | UNABLE_TO_SIGNAL_CONTAINER | The container-executor could not signal the container it was passed.
9         | INVALID_CONTAINER_PID      | The PID passed to the container-executor was negative or 0.
{noformat}

Shouldn't the exit code returned when the container doesn't exist be {{8}}? We 
should recheck the exit code coming back from container-executor; based on the 
exit code we might also be able to handle the errors (see the sketch below).
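
As an illustration only (the numeric values are the ones from the table above 
and may not match the actual container-executor sources), a caller could 
classify the exit code before deciding how to react:

{code}
// Illustration only: exit-code values are taken from the table quoted above
// and may not match the actual container-executor sources.
public final class ContainerExecutorExitCodes {
  public static final int UNABLE_TO_SIGNAL_CONTAINER = 8;
  public static final int INVALID_CONTAINER_PID = 9;

  /** True if the failure just means the process is already gone. */
  public static boolean isBenignSignalFailure(int exitCode) {
    return exitCode == INVALID_CONTAINER_PID;
  }

  public static void main(String[] args) {
    int exitCode = 9; // e.g. the value seen in the log above
    System.out.println(isBenignSignalFailure(exitCode)
        ? "container already exited; nothing to do"
        : "real signalling failure; surface the error");
  }
}
{code}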



> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> 

[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176364#comment-15176364
 ] 

Vrushali C commented on YARN-4700:
--

Thanks [~Naganarasimha Garla] for the updated patch. Overall it looks good. 

I have one extremely minor comment; please make the change _only_ if you plan 
to upload another patch, otherwise we can make it later:
- Lines 195 and 200 in TestFlowDataGenerator are commented out in the patch; we 
can remove them.

+1 otherwise.


> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, 
> YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (but still hold in the RM state 
> store) each time the RM got restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, thus each time we're 
> creating a new record for one application (cluster id is a part of the row 
> key). We need to fix this behavior, probably by having a better default 
> cluster id. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176357#comment-15176357
 ] 

Vinod Kumar Vavilapalli commented on YARN-4744:
---

bq. the NM appears to signal containers that have already exited (as a part of 
ContainerLaunch.cleanupContainer())
This is by design. We did this originally to ensure that any orphaned child 
processes or process groups get cleaned up, even if the root process has 
already exited.

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> 

[jira] [Commented] (YARN-1547) Prevent DoS of ApplicationMasterProtocol by putting in limits

2016-03-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176343#comment-15176343
 ] 

Vinod Kumar Vavilapalli commented on YARN-1547:
---

bq. Thanks for raising this Vinod Kumar Vavilapalli. I was wondering if I might 
take this up, if you are not actively working on it. 
Tx [~giovanni.fumarola], please go ahead and assign it to yourself!

We can discuss once you have a design, but I wanted to bring up one point of 
note w.r.t. this ticket and the larger YARN-1545 effort.

It is likely that we can cover 60-70% of the use case, avoiding accidental 
DoS'ing by well-behaved apps, simply by putting limits in the client, but it is 
imperative that we also enforce the limits on the server side, since an abusive 
client can circumvent any client-side restrictions.
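
To make the point concrete, here is a hedged sketch of a server-side guard 
(hypothetical names, not the actual RM code) that validates request sizes 
before any work is done:

{code}
import java.util.List;

// Hypothetical sketch of a server-side guard: the RM validates request sizes
// up front, so a client cannot bypass the limit the way it could bypass a
// purely client-side check.
public class AllocateRequestLimiter {
  private final int maxAsks;
  private final int maxDiagnosticsLength;

  public AllocateRequestLimiter(int maxAsks, int maxDiagnosticsLength) {
    this.maxAsks = maxAsks;
    this.maxDiagnosticsLength = maxDiagnosticsLength;
  }

  /** Reject requests that carry an unbounded number of resource asks. */
  public void validateAsks(List<?> asks) {
    if (asks != null && asks.size() > maxAsks) {
      throw new IllegalArgumentException("Too many resource asks in one request: "
          + asks.size() + " > " + maxAsks);
    }
  }

  /** Cap free-form strings such as diagnostics instead of rejecting outright. */
  public String truncateDiagnostics(String diagnostics) {
    if (diagnostics == null || diagnostics.length() <= maxDiagnosticsLength) {
      return diagnostics;
    }
    return diagnostics.substring(0, maxDiagnosticsLength);
  }
}
{code}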

> Prevent DoS of ApplicationMasterProtocol by putting in limits
> -
>
> Key: YARN-1547
> URL: https://issues.apache.org/jira/browse/YARN-1547
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>
> Points of DoS in ApplicationMasterProtocol
>  - Host and trackingURL in RegisterApplicationMasterRequest
>  - Diagnostics, final trackingURL in FinishApplicationMasterRequest
>  - Unlimited number of resourceAsks, containersToBeReleased and 
> resourceBlacklistRequest in AllocateRequest
> -- Unbounded number of priorities and/or resourceRequests in each ask.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4650) The AM should be launched with its own set of configs instead of using the NM's configs

2016-03-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176322#comment-15176322
 ] 

Vinod Kumar Vavilapalli commented on YARN-4650:
---

Trying to understand the problem and the solution being addressed here. Maybe I 
am missing something, but I don't actually see a major change from what we 
already have today.

From the beginning of YARN, we've been very careful about apps not relying on 
server configuration. In theory it is still possible for an app to hard-code 
and depend on server configuration (via 
{{ApplicationConstants.Environment.HADOOP_CONF_DIR}} / 
{{YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH}}), but things like 
rolling-upgrades (YARN-666) further forced our users to not play such tricks.

bq. The AM should be launched with its own set of configs instead of using the 
NM's configs
For most of our apps (MapReduce, Tez, Spark, etc.), this already doesn't happen 
by default. MR, for example, depends on the job configuration 
{{mapreduce.application.classpath}}. In all these cases, the configuration 
needed by AMs is supposed to come from the client itself. Only DistributedShell 
is the corner case that by default depends on the NM configuration via 
{{DEFAULT_YARN_APPLICATION_CLASSPATH}}.

bq. There are cases, such as a secure LDAP configuration where the NM may need 
access to credentials that should not be exposed to the user. As long as the NM 
and AM share the same configuration files, anything exposed to the NM is also 
exposed to the AM and hence the users.
This is already possible to do right now, *without* breaking most of our 
well-behaved apps: an admin can simply (a) remove HADOOP_CONF_DIR from the NM 
white-list and/or (b) make the permissions of the NM's configs very 
restrictive.

> The AM should be launched with its own set of configs instead of using the 
> NM's configs
> ---
>
> Key: YARN-4650
> URL: https://issues.apache.org/jira/browse/YARN-4650
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> There are cases, such as a secure LDAP configuration where the NM may need 
> access to credentials that should not be exposed to the user.  As long as the 
> NM and AM share the same configuration files, anything exposed to the NM is 
> also exposed to the AM and hence the users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-4700:

Attachment: YARN-4700-YARN-2928.v1.003.patch

Thanks for the review, [~varun_saxena]. Attaching a patch with fixes for the 
review comments.

> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, 
> YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (but still hold in the RM state 
> store) each time the RM got restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, thus each time we're 
> creating a new record for one application (cluster id is a part of the row 
> key). We need to fix this behavior, probably by having a better default 
> cluster id. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176278#comment-15176278
 ] 

Wangda Tan commented on YARN-4719:
--

[~kasha],

bq. Not sure I understand the suggestion. Elaborate?
In the ver.2 patch, getAllNodes uses shallowCopy; what I meant is that instead 
of copying the entire HashMap, you can use a ConcurrentMap.
In the ver.3 patch, you removed shallowCopy and return HashMap.values() 
directly; if a node is removed while someone is iterating over values(), the 
behavior is undefined. See the 
[javadoc|https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html#values()].

bq. I feel any logic that has to iterate through all nodes should go through 
ClusterNodeTracker - that way, we don't run into cases where we access the list 
of nodes without a lock.
As I commented above, we can use a ConcurrentMap instead of locking 
ClusterNodeTracker (a rough sketch follows below). Do you need strong consistency 
for addBlacklistedNodeIdsToList? (The node list could be updated while we are 
updating blacklistedNodes.)
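
To illustrate the alternative, here is a minimal sketch with hypothetical class and field names (not the real ClusterNodeTracker code): backing the tracker with a ConcurrentHashMap lets callers iterate the values() view without copying the map or holding a lock, since that view is weakly consistent, unlike a plain HashMap where removal during iteration is undefined.
{code}
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch only, not the actual ClusterNodeTracker.
class NodeTrackerSketch<N> {
  private final ConcurrentMap<String, N> nodes =
      new ConcurrentHashMap<String, N>();

  void addNode(String nodeId, N node) {
    nodes.put(nodeId, node);
  }

  void removeNode(String nodeId) {
    nodes.remove(nodeId);
  }

  // Weakly-consistent view: safe to iterate even while nodes are being
  // added or removed concurrently, so no shallow copy or external lock
  // is needed for read-only traversals.
  Collection<N> getAllNodes() {
    return nodes.values();
  }
}
{code}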

bq. Any particular reason you think this doesn't belong here?
I would prefer to keep the responsibilities of ClusterNodeTracker cleaner; if we 
add application logic here, we could end up adding any logic related to 
SchedulerNode to this class as well. This refactoring patch is mainly code cleanup 
to me, so I think it's better to keep it clean from the beginning.


> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358

2016-03-02 Thread Ishai Menache (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishai Menache updated YARN-4359:

Attachment: YARN-4359.4.patch

> Update LowCost agents logic to take advantage of YARN-4358
> --
>
> Key: YARN-4359
> URL: https://issues.apache.org/jira/browse/YARN-4359
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Carlo Curino
>Assignee: Ishai Menache
> Attachments: YARN-4359.0.patch, YARN-4359.3.patch, YARN-4359.4.patch
>
>
> Given the improvements of YARN-4358, the LowCost agent should be improved to 
> leverage this, and operate on RLESparseResourceAllocation (ideally leveraging 
> the improvements of YARN-3454 to compute avaialable resources)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176248#comment-15176248
 ] 

Karthik Kambatla commented on YARN-4719:


[~rkanter] - this code touches the adaptive max allocation you have worked on. 
Mind taking a look to make sure I didn't screw up any of that?

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4719:
---
Attachment: (was: yarn-4719-3.patch)

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4719:
---
Attachment: yarn-4719-3.patch

Updated patch should fix the test failures and findbugs warnings. 

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176186#comment-15176186
 ] 

Jason Lowe commented on YARN-4744:
--

bq. Can we use similar check like 
LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx).

That function is implemented in terms of signalContainer (so we have the same 
issue), and the process could exit between the check and the subsequent kill 
attempt.

bq. My feeling is that the PrivilegedOperationExecutor should log failures 
irrespective of the error code

There's always going to be a race where a container can exit before it gets 
killed, and I'm not sure we accomplish much besides alarming users by logging 
warnings when that occurs.  IMHO PrivilegedOperationExecutor should not be the 
one that decides what should and shouldn't be logged, since it doesn't have any 
context on whether the error is severe enough to warrant it.  Instead I think 
we should ensure the same data is present in the PrivilegedOperationException 
and let the code handling that error perform the logging if it is appropriate 
to do so.
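
A rough sketch of the pattern being argued for; the accessor names on PrivilegedOperationException (getExitCode/getOutput) and the local variables (signalOp, LOG) are assumed here for illustration and are not necessarily the existing API. The executor propagates the exit status inside the exception, and the calling code decides whether the failure is worth a warning.
{code}
// Hedged sketch: exception accessors and surrounding context are assumed.
try {
  privilegedOperationExecutor.executePrivilegedOperation(signalOp, false);
} catch (PrivilegedOperationException e) {
  // Exit code 9 (as seen in the log above) typically just means the process
  // exited before the signal was delivered; treat that as benign here.
  if (e.getExitCode() == 9) {
    LOG.debug("Container already finished before signal was delivered", e);
  } else {
    LOG.warn("Signalling container failed", e);
  }
}
{code}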


> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> 

[jira] [Commented] (YARN-4634) Scheduler UI/Metrics need to consider cases like non-queue label mappings

2016-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176164#comment-15176164
 ] 

Hadoop QA commented on YARN-4634:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 25s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 54s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 158m 47s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12790948/0004-YARN-4634.patch |
| JIRA 

[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries

2016-03-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4719:
---
Attachment: yarn-4719-3.patch

> Add a helper library to maintain node state and allows common queries
> -
>
> Key: YARN-4719
> URL: https://issues.apache.org/jira/browse/YARN-4719
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch
>
>
> The scheduler could use a helper library to maintain node state and allowing 
> matching/sorting queries. Several reasons for this:
> # Today, a lot of the node state management is done separately in each 
> scheduler. Having a single library will take us that much closer to reducing 
> duplication among schedulers.
> # Adding a filtering/matching API would simplify node labels and locality 
> significantly. 
> # An API that returns a sorted list for a custom comparator would help 
> YARN-1011 where we want to sort by allocation and utilization for 
> continuous/asynchronous and opportunistic scheduling respectively. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175997#comment-15175997
 ] 

Varun Saxena commented on YARN-4712:


bq. it make cpu metric to be either 0 or 1 which is not expected here?
Are you saying this because you are expecting cpuUsageTotalCoresPercentage to 
be in the range 0-1? I was thinking the same initially, and hence in the 
initial patch we were multiplying this value by 100. But that doesn't seem to 
be the case. Upon testing, I found that this value is not between 0 and 1. If 
there are 4 cores and 2 cores are fully used, this value will be 50.
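
A tiny worked example of that range (illustrative arithmetic only, not NM code): with 4 processors and 2 cores fully busy, the per-core usage sums to 200%, so dividing by the processor count gives 50, i.e. a 0-100 scale rather than 0-1.
{code}
// Illustrative arithmetic only.
float cpuUsagePercentPerCore = 200f;   // 2 cores fully used => 2 * 100%
int numProcessors = 4;
float cpuUsageTotalCoresPercentage =
    cpuUsagePercentPerCore / numProcessors;   // 50.0, not 0.5
{code}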

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175980#comment-15175980
 ] 

Junping Du commented on YARN-4712:
--

Back on this patch, it seems more things need to be fixed for the UNAVAILABLE 
case, like the code below:
{code}
// Multiply by 1000 to avoid losing data when converting to int
int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000
* maxVCoresAllottedForContainers /nodeCpuPercentageForYARN);
{code}
It seems weird if cpuUsageTotalCoresPercentage is -1 in the UNAVAILABLE case.

In addition, the code below doesn't look right:
{code}
+cpuMetric.addValue(currentTimeMillis,
+(long) Math.round(cpuUsageTotalCoresPercentage));
{code}
it makes the CPU metric either 0 or 1, which is not expected here?
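
For illustration, a minimal sketch (not the actual patch) of the kind of guard being asked for, reusing the names from the snippets above and assuming the surrounding ContainersMonitor/NMTimelinePublisher context: skip the CPU calculation and publication entirely when the process tree reports UNAVAILABLE, so neither milliVcoresUsed nor the timeline metric ever sees -1.
{code}
// Sketch only; surrounding fields are assumed from the snippets above.
float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
if (cpuUsagePercentPerCore != ResourceCalculatorProcessTree.UNAVAILABLE) {
  float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore
      / resourceCalculatorPlugin.getNumProcessors();
  // Multiply by 1000 to avoid losing data when converting to int.
  int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000
      * maxVCoresAllottedForContainers / nodeCpuPercentageForYARN);
  cpuMetric.addValue(currentTimeMillis,
      Math.round(cpuUsageTotalCoresPercentage));
}
{code}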

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175973#comment-15175973
 ] 

Junping Du commented on YARN-4712:
--

That's correct. We shouldn't let Eclipse's bug affect our code conventions. My 
practice is to show the print margin at 80 chars and address it myself ahead of 
time. Or you can run a local checkstyle tool before submitting the patch.

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175878#comment-15175878
 ] 

Varun Saxena commented on YARN-4712:


[~djp], the checkstyle issues I was referring to were the line > 80 characters 
ones, fixed in the last patch.
Naga was telling me offline that he uses the eclipse formatter given on the 
Hadoop Wiki page, which does not always take care of the > 80 characters issue, 
especially if it's just 5-6 characters extra. And as he uses it, these 
checkstyle issues (for > 80 chars) keep cropping up in every patch.

So for this branch, we need to fix the line > 80 characters issue. Right?

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175864#comment-15175864
 ] 

Junping Du commented on YARN-4712:
--

Regarding checkstyle, I think the YARN-2928 dev branch should follow the same 
standard/criteria as the trunk branch, or it will have trouble when merging back. 
The common practice for trunk on checkstyle issues is that we need to fix them as 
much as we can. However, for some annoying warnings like "method too long" (like 
this case), "too many method parameters", etc., we don't need to worry about them 
unless there is strong justification for refactoring the code.

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175841#comment-15175841
 ] 

Varun Saxena commented on YARN-4712:


Regarding checkstyle, you can fix the issues for now. We can confirm with 
[~sjlee0] in tomorrow's meeting whether, for this branch, we need to follow 
checkstyle or can ignore checkstyle issues that appear despite using the eclipse 
formatter given on the Hadoop Wiki.

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175838#comment-15175838
 ] 

Varun Saxena commented on YARN-4712:


Thanks [~Naganarasimha] for the patch.
Looks good overall. A couple of nits:

# In {{NMTimelinePublisher}}, the cast to long is not required, as Math#round 
returns an int or a long depending on the input type (see the illustration 
after this list).
# In TestNMTimelinePublisher line 84, use 
ResourceCalculatorProcessTree.UNAVAILABLE instead of -1.
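
A quick illustration of the first nit (generic Java behaviour, nothing patch-specific): Math#round has two overloads, so the return type follows the argument type and no explicit cast is needed when the target type already matches.
{code}
float f = 50.4f;
double d = 50.4d;
int roundedFromFloat = Math.round(f);    // Math.round(float) returns int
long roundedFromDouble = Math.round(d);  // Math.round(double) returns long
{code}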

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch
>
>
> There are 2 issues with CPU usage collection 
> * I was able to observe that that many times CPU usage got from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do 
> the calculation  i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore 
> /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE 
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not 
> encountered. so proper checks needs to be handled
> * {{EntityColumnPrefix.METRIC}} uses always LongConverter but 
> ContainerMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175802#comment-15175802
 ] 

Varun Saxena commented on YARN-4700:


[~Naganarasimha], had a glance at the patch. It looks good to me in general.

# Changes in TestTimelineReaderWebServicesHBaseStorage l.801 are not required.
# I think javadoc should be fixable.
# In FlowActivityEntityRowKey#getRowKey, the javadoc says we are passing the 
top-of-the-day timestamp, but we are not; we calculate it inside. We can change 
the param name and description (say, to something like event timestamp). See 
the sketch after this list.
# Although created time should be fine, should we use the event timestamp in 
both places? Just for consistency.
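
For reference, a sketch of the "top of the day" truncation being discussed; the helper name is hypothetical and is only meant to illustrate what the row key derives internally from the event timestamp it is given.
{code}
// Hypothetical helper, for illustration only.
static long topOfTheDayTimestamp(long eventTimestampMs) {
  final long millisPerDay = 24L * 60 * 60 * 1000;
  // Truncate to the start of the UTC day containing the event.
  return (eventTimestampMs / millisPerDay) * millisPerDay;
}
{code}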

> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (but still hold in the RM state 
> store) each time the RM got restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, thus each time we're 
> creating a new record for one application (cluster id is a part of the row 
> key). We need to fix this behavior, probably by having a better default 
> cluster id. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4484) Available Resource calculation for a queue is not correct when used with labels

2016-03-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175803#comment-15175803
 ] 

Sunil G commented on YARN-4484:
---

Hi [~leftnoteasy],
Could you please help check the patch?

> Available Resource calculation for a queue is not correct when used with 
> labels
> ---
>
> Key: YARN-4484
> URL: https://issues.apache.org/jira/browse/YARN-4484
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4484.patch, 0002-YARN-4484.patch, 
> 0003-YARN-4484.patch
>
>
> To calculate available resource for a queue, we have to get the total 
> resource allocated for all labels in queue compare to its usage. 
> Also address the comments given in 
> [YARN-4304-comments|https://issues.apache.org/jira/browse/YARN-4304?focusedCommentId=15064874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15064874
>  ] given by [~leftnoteasy] for same.
> ClusterMetrics related issues will also get handled once we fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4634) Scheduler UI/Metrics need to consider cases like non-queue label mappings

2016-03-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4634:
--
Attachment: 0004-YARN-4634.patch

Updating the patch to correct the findbugs warning.

> Scheduler UI/Metrics need to consider cases like non-queue label mappings
> -
>
> Key: YARN-4634
> URL: https://issues.apache.org/jira/browse/YARN-4634
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4634.patch, 0002-YARN-4634.patch, 
> 0003-YARN-4634.patch, 0004-YARN-4634.patch
>
>
> Currently when label-queue mappings are not available, there are few 
> assumptions taken in UI and in metrics.
> In above case where labels are enabled and available in cluster but without 
> any queue mappings, UI displays queues under labels. This is not correct.
> Currently  labels enabled check and availability of labels are considered to 
> render scheduler UI. Henceforth we also need to check whether 
> - queue-mappings are available
> - nodes are mapped with labels with proper exclusivity flags on
> This ticket also will try to see the default configurations in queue when 
> labels are not mapped. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4680) TimerTasks leak in ATS V1.5 Writer

2016-03-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/YARN-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175792#comment-15175792
 ] 

Jakob Stengård commented on YARN-4680:
--

Hi.
What are the symptoms of this issue?

I'm having a problem with hiveserver2, which is creating a lot of timer tasks 
named "LogFDsCachecleanInActiveFDsTimer" and "LogFDsCacheFlushTimer".

Eventually, hiveserver2 crashes. Could this be related to this bug?

> TimerTasks leak in ATS V1.5 Writer
> --
>
> Key: YARN-4680
> URL: https://issues.apache.org/jira/browse/YARN-4680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-4680.1.patch, YARN-4680.20160108.patch, 
> YARN-4680.20160109.patch, YARN-4680.20160222.patch
>
>
> We have seen TimerTasks leak which could cause application server done (such 
> as oozie server done due to too many active threads)
> Although we have fixed some potentially leak situations in upper application 
> level, such as
> https://issues.apache.org/jira/browse/MAPREDUCE-6618
> https://issues.apache.org/jira/browse/MAPREDUCE-6621, we still can not 
> guarantee that we fixed the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175772#comment-15175772
 ] 

Sunil G commented on YARN-4755:
---

{{appACLsUpdated}} is invoked in {{createAndPopulateNewRMApp}}, but 
{{appCreated}} will be sent only when the RMApp is created and the START event is fired.
So apps submitted the secure way, or rejected apps, won't get appACLsUpdated 
if we club these two together. I am not very sure whether this is needed. 
[~rohithsharma], do you recollect anything along these lines?

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to timelineserver for every application that get created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say if we have 10K completed applications to recover, 30K events will be 
> generated i.e app_created, app_finished and app_acl_updated. For completed 
> applications, I think need not to send app-acl-updated event with which 
> gradually reduce load on the dispatcher. 
> Eventhough MultiDispatcher is used to publish timeline events, it is bottle 
> neck when max-completed is configured very high value may be 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Jonathan Maron (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175759#comment-15175759
 ] 

Jonathan Maron commented on YARN-4737:
--

Enabling CSRF w/o auth will require the inclusion of the custom header for all 
invocations, regardless of whether they are secure invocations or not.  I don't 
believe that is the expected usage model for the filter.

As far as identifying auth mechanisms - I'm trying to find instances that would 
show the use of custom auth filters but I'm not really finding any.  One theory 
I have is that looking up a value other than "Simple" for 
"hadoop.http.authentication.type" might provide a more general indicator of 
auth being enabled?  Does that seem correct?
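
A minimal sketch of that theory (illustration only, not a proposed patch), assuming the standard org.apache.hadoop.conf.Configuration lookup and treating anything other than "simple" as "auth is enabled":
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration of the suggested heuristic only.
Configuration conf = new Configuration();
String authType = conf.get("hadoop.http.authentication.type", "simple");
boolean httpAuthEnabled = !"simple".equalsIgnoreCase(authType);
{code}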

POST requests from java clients should not be an issue - the filter only 
executes when a browser user agent is detected.

BTW, the license issues (asflicense) don't appear even remotely related to this 
patch.

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175752#comment-15175752
 ] 

Naganarasimha G R commented on YARN-4754:
-

Initially I also suspected the same, and then realised that if the 
ClientResponse is read then the stream is closed. So I am not sure what is 
causing the leak.
[~rohithsharma], any other error logs while processing the events?

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that there are too many connections are kept opened to 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> cause sometimes resource shortage for other process or RM itself
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175747#comment-15175747
 ] 

Naganarasimha G R commented on YARN-4755:
-

bq.  I think appACLsUpdated cannot be go with appCreated as its little early. 
SO if appCreated can be delayed, this can be accomodated. But it comes with a 
cost of delayed notification to timeline. 
Sorry, I didn't get this - why is it early? The ACL is actually obtained from 
the app submission context, so there is no point in sending it as part of 
another event instead of AppCreated, right? Correct me if my understanding is 
wrong.
{code}
String appViewACLs = submissionContext.getAMContainerSpec()
.getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
rmContext.getSystemMetricsPublisher().appACLsUpdated(
application, appViewACLs, System.currentTimeMillis());
{code}

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to timelineserver for every application that get created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say if we have 10K completed applications to recover, 30K events will be 
> generated i.e app_created, app_finished and app_acl_updated. For completed 
> applications, I think need not to send app-acl-updated event with which 
> gradually reduce load on the dispatcher. 
> Eventhough MultiDispatcher is used to publish timeline events, it is bottle 
> neck when max-completed is configured very high value may be 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-4754:
---
Assignee: (was: Varun Saxena)

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that there are too many connections are kept opened to 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> cause sometimes resource shortage for other process or RM itself
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175734#comment-15175734
 ] 

Varun Vasudev commented on YARN-4737:
-

bq. Is the ATS leveraging another auth mechanism (or not using WebApps to 
construct the endpoint)?

I took a look and it looks like the ATS doesn't use WebApps.Builder. Can you 
take a look at the startWebApp function in ApplicationHistoryServer.java? It 
handles the server setup. The impact of enabling CSRF on the ATS will have to 
be evaluated though - the RM and the Tez AM write to it via POST requests.

bq.  Is there another auth mechanism that can be enabled independent of API 
calls to WebApps.Builder?

Admins can set up custom web authentication filters. You can look at 
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/HttpAuthentication.html
for more details. What's the impact of enabling CSRF with no authentication?

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175731#comment-15175731
 ] 

Sunil G commented on YARN-4755:
---

Yes [~Naganarasimha Garla], those discussions and conclusions are perfectly 
fine. However, seeing this scale of events, I got the doubt again, as it's a 
tradeoff. Ideally, whichever events can be clubbed while recovering finished 
apps should be clubbed. I think appACLsUpdated cannot go with appCreated as it 
is a little early. So if appCreated can be delayed, this can be accommodated, 
but it comes at the cost of delayed notification to the timeline. I think for 
this case appACLsUpdated can be a part of appFinished, but we need to see how 
the existing code can be retained without impact.

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to timelineserver for every application that get created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say if we have 10K completed applications to recover, 30K events will be 
> generated i.e app_created, app_finished and app_acl_updated. For completed 
> applications, I think need not to send app-acl-updated event with which 
> gradually reduce load on the dispatcher. 
> Eventhough MultiDispatcher is used to publish timeline events, it is bottle 
> neck when max-completed is configured very high value may be 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175711#comment-15175711
 ] 

Varun Saxena commented on YARN-4754:


The relevant code is in {{TimelineWriter#putEntities}}; calling 
{{ClientResponse#close}} would close the underlying input stream.

{code}
  public TimelinePutResponse putEntities(
  TimelineEntity... entities) throws IOException, YarnException {
.
ClientResponse resp = doPosting(entitiesContainer, null);
return resp.getEntity(TimelinePutResponse.class);   // ClientResponse 
object is not closed here.
  }
{code}
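
A minimal sketch (an assumption about the shape of a fix, not the actual patch) of how the response could be released: read the entity first, then close the ClientResponse in a finally block so the underlying connection is returned even if deserialization fails.
{code}
ClientResponse resp = doPosting(entitiesContainer, null);
try {
  return resp.getEntity(TimelinePutResponse.class);
} finally {
  // Closes the underlying input stream and releases the connection.
  resp.close();
}
{code}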

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that there are too many connections are kept opened to 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> cause sometimes resource shortage for other process or RM itself
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Jonathan Maron (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175697#comment-15175697
 ] 

Jonathan Maron commented on YARN-4737:
--

1)  Will do.
2)  Will perform the renaming.  As for the ATS - the only three web app 
instances I identified that have an authentication mechanism enabled were the 
three I modified.  Is the ATS leveraging another auth mechanism (or not using 
WebApps to construct the endpoint)?
3)  CSRF protection doesn't make sense in the context of no auth mechanism, 
and the only auth mechanism I see enabled via WebApps is SPNEGO?  Is there 
another auth mechanism that can be enabled independent of API calls to 
WebApps.Builder?

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175693#comment-15175693
 ] 

Naganarasimha G R commented on YARN-4755:
-

I think we discussed these topics in the following JIRAs, YARN-3127 and 
YARN-4392, and the conclusion was that we were ok with republishing the events 
with the exact data rather than not publishing at all, because it's not 
guaranteed that ATS events for apps in the state store were successfully 
published.
To ensure *appACLsUpdated* need not be separately published again, we can 
directly publish this information along with the appCreated event, which avoids 
one additional entity's processing. Need to check what would be the ideal place 
to hold the ACL information and also ensure it's compatible with the current 
code.
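
Purely to illustrate the idea (a hypothetical overload, not existing API - appCreated does not currently take ACLs): the view ACLs read from the submission context would be handed to the appCreated publication instead of a separate appACLsUpdated call.
{code}
// Hypothetical sketch of the proposal only.
String appViewACLs = submissionContext.getAMContainerSpec()
    .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
rmContext.getSystemMetricsPublisher().appCreated(
    application, appViewACLs, System.currentTimeMillis());
{code}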

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175689#comment-15175689
 ] 

Naganarasimha G R commented on YARN-4700:
-

As [~varun_saxena] pointed out offline, and as the test results show, some HDFS 
modifications needed to run my mini HBase cluster got into the patch. I will 
re-upload the patch without those changes.

> ATS storage has one extra record each time the RM got restarted
> ---
>
> Key: YARN-4700
> URL: https://issues.apache.org/jira/browse/YARN-4700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Li Lu
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4700-YARN-2928.v1.001.patch, 
> YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.wip.patch
>
>
> When testing the new web UI for ATS v2, I noticed that we're creating one 
> extra record for each finished application (still held in the RM state 
> store) each time the RM gets restarted. It's quite possible that we add the 
> cluster start timestamp into the default cluster id, so each time we create 
> a new record for the same application (the cluster id is part of the row 
> key). We need to fix this behavior, probably by having a better default 
> cluster id. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175690#comment-15175690
 ] 

Varun Saxena commented on YARN-4754:


I think this is happening because we are not calling {{ClientResponse#close}}.
This should be a problem in trunk too.
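
For reference, a minimal sketch of the fix direction, assuming the publisher 
gets a Jersey 1.x {{ClientResponse}} back from the POST; the class name and URI 
handling below are illustrative, not the actual TimelineClientImpl code:
{code}
import javax.ws.rs.core.MediaType;

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

// Sketch only: always release the response so the underlying HTTP connection
// is closed/returned instead of lingering in CLOSE_WAIT on port 8188.
class TimelinePostSketch {
  static int postEntities(Client client, String timelineUri, Object entities) {
    WebResource resource = client.resource(timelineUri);
    ClientResponse response = resource
        .accept(MediaType.APPLICATION_JSON)
        .type(MediaType.APPLICATION_JSON)
        .post(ClientResponse.class, entities);
    try {
      return response.getStatus();   // consume/inspect the response as needed
    } finally {
      response.close();              // the missing call suspected above
    }
  }
}
{code}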

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena reassigned YARN-4754:
--

Assignee: Varun Saxena

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175666#comment-15175666
 ] 

Sunil G commented on YARN-4755:
---

Hi [~Naganarasimha Garla],
Does this mean that an event will be raised to the timeline only once for a 
completed app, as the AppFinished event? Is that the idea?

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175623#comment-15175623
 ] 

Naganarasimha G R commented on YARN-4755:
-

Thanks [~rohithsharma], assigning!

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-4755:
---

Assignee: Naganarasimha G R

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Naganarasimha G R
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175597#comment-15175597
 ] 

Rohith Sharma K S commented on YARN-4755:
-

The approach sounds good. You can take up this JIRA!

> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM

2016-03-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175578#comment-15175578
 ] 

Steve Loughran commented on YARN-4696:
--

The findbugs warning is in the code that determines the scheme. It's correct; 
that code is no longer needed once {{FileSystem.newInstance()}} is used to 
instantiate a new FS instance.
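
As a hedged illustration of that point, a small sketch assuming a 
{{Configuration}} and an active-dir {{Path}} are already at hand (the class and 
method names are placeholders):
{code}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: FileSystem.newInstance() hands back a non-cached FS instance for the
// path's own scheme/authority, so there is no need to hand-derive the scheme
// (the code the findbugs warning pointed at).
class FsInstanceSketch {
  static FileSystem newFsFor(Path activeRootPath, Configuration conf)
      throws IOException {
    URI uri = activeRootPath.toUri();
    return FileSystem.newInstance(uri, conf);
  }
}
{code}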


> EntityGroupFSTimelineStore to work in the absence of an RM
> --
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch, 
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, 
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, 
> YARN-4696-010.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running, with 
> the configuration pointing to it. This is a new change, and it impacts testing, 
> where you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is 
> running; it falls back to "unknown" if not. If the RM connection were 
> optional, the "unknown" codepath could be called directly, relying on the age 
> of the file as a metric of completion.
> Options
> # add a flag to disable RM connect
> # skip automatically if RM not defined/set to 0.0.0.0
> # disable retries on yarn client IPC; if it fails, tag app as unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175572#comment-15175572
 ] 

Varun Vasudev commented on YARN-4744:
-

Actually, there are two warn statements that are logged. One is in 
executePrivilegedOperation() in PrivilegedOperationExecutor and the second one 
is in signalContainer() in DefaultLinuxContainerRuntime. 

I'm unsure of how to handle this. My feeling is that the 
PrivilegedOperationExecutor should log failures irrespective of the error code, 
but that the DefaultLinuxContainerRuntime shouldn't log the warning for invalid 
pids (similar to what LinuxContainerExecutor used to do before the refactoring).

[~jlowe], [~vinodkv], [~rohithsharma] - what do you think?
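
For discussion, a rough sketch of that second option, with the invalid-pid exit 
code taken as an assumption from the log in the description and the helper 
names purely illustrative:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.util.Shell.ExitCodeException;

// Rough sketch, not the actual runtime code: keep the low-level warn in the
// executor, but let the caller downgrade the log when the signal failed only
// because the pid is already gone. The exit code is an assumption taken from
// the log in the description.
class SignalLoggingSketch {
  private static final Log LOG = LogFactory.getLog(SignalLoggingSketch.class);
  private static final int ASSUMED_INVALID_PID_EXIT_CODE = 9;

  interface SignalAction {
    void run() throws ExitCodeException;   // stands in for the privileged call
  }

  static void signalQuietly(SignalAction action, String pid) {
    try {
      action.run();
    } catch (ExitCodeException e) {
      if (e.getExitCode() == ASSUMED_INVALID_PID_EXIT_CODE) {
        LOG.debug("Process " + pid + " already exited; ignoring signal failure", e);
      } else {
        LOG.warn("Signal container failed for pid " + pid, e);
      }
    }
  }
}
{code}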

> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit a mapreduce application (terasort/teragen) with user yarn/dsperf. 
> Too many signal-to-container failures are observed. 
> On submitting with this user, the following exception is thrown:
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at 

[jira] [Comment Edited] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175562#comment-15175562
 ] 

Varun Vasudev edited comment on YARN-4737 at 3/2/16 1:11 PM:
-

Thanks for the patch [~jmaron]. 

1) Can you please address the checkstyle, javadoc, and ASF license warnings in 
the pre-commit build?

2) Rename "yarn.resourcemanager.rest-csrf.\*" to 
"yarn.resourcemanager.webapp.rest-csrf.\*". Similar changes for nodemanager and 
JHS as well. I also noticed that you haven't added CSRF protection for the ATS. 
Is that going to be done in a follow up patch?

3) Currently the CSRF protection is enabled by
{code}
+if (hasSpnegoConf && hasCSRFEnabled(params)) {
+  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
+  HttpServer2.defineFilter(server.getWebAppContext(), 
restCsrfClassName,
+   restCsrfClassName, params, new String[] 
{"/*"});
+}
{code}
which means that users with custom web auth cannot use the filter. Can we 
remove the hasSpnegoConf check?


was (Author: vvasudev):
Thanks for the patch [~jmaron]. 

1) Can you please address the checkstyle, javadoc, and ASF license warnings in 
the pre-commit build?

2) Rename "yarn.resourcemanager.rest-csrf.*" to 
"yarn.resourcemanager.webapp.rest-csrf.*". Similar changes for nodemanager and 
JHS as well. I also noticed that you haven't added CSRF protection for the ATS. 
Is that going to be done in a follow up patch?

3) Currently the CSRF protection is enabled by
{code}
+if (hasSpnegoConf && hasCSRFEnabled(params)) {
+  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
+  HttpServer2.defineFilter(server.getWebAppContext(), 
restCsrfClassName,
+   restCsrfClassName, params, new String[] 
{"/*"});
+}
{code}
which means that users with custom web auth cannot use the filter. Can we 
remove the hasSpnegoConf check?

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN

2016-03-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175562#comment-15175562
 ] 

Varun Vasudev commented on YARN-4737:
-

Thanks for the patch [~jmaron]. 

1) Can you please address the checkstyle, javadoc, and ASF license warnings in 
the pre-commit build?

2) Rename "yarn.resourcemanager.rest-csrf.*" to 
"yarn.resourcemanager.webapp.rest-csrf.*". Similar changes for nodemanager and 
JHS as well. I also noticed that you haven't added CSRF protection for the ATS. 
Is that going to be done in a follow up patch?

3) Currently the CSRF protection is enabled by
{code}
+if (hasSpnegoConf && hasCSRFEnabled(params)) {
+  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
+  HttpServer2.defineFilter(server.getWebAppContext(), 
restCsrfClassName,
+   restCsrfClassName, params, new String[] 
{"/*"});
+}
{code}
which means that users with custom web auth cannot use the filter. Can we 
remove the hasSpnegoConf check?
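
For clarity, a sketch of what dropping that check could look like, reusing the 
identifiers from the quoted snippet; only the condition changes, and this is a 
suggestion rather than the patch itself:
{code}
// Suggestion sketch: gate the filter only on the CSRF config so deployments
// with a custom web-auth filter can still enable it (the SPNEGO check from the
// quoted snippet is simply dropped).
if (hasCSRFEnabled(params)) {
  String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
  HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName,
      restCsrfClassName, params, new String[] {"/*"});
}
{code}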

> Use CSRF Filter in YARN
> ---
>
> Key: YARN-4737
> URL: https://issues.apache.org/jira/browse/YARN-4737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, webapp
>Reporter: Jonathan Maron
>Assignee: Jonathan Maron
> Attachments: YARN-4737.001.patch
>
>
> A CSRF filter was added to hadoop common 
> (https://issues.apache.org/jira/browse/HADOOP-12691).  The aim of this JIRA 
> is to come up with a mechanism to integrate this filter into the webapps for 
> which it is applicable (web apps that may establish an authenticated 
> identity).  That includes the RM, NM, and mapreduce jobhistory web app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175537#comment-15175537
 ] 

Naganarasimha G R commented on YARN-4754:
-

[~rohithsharma], is this the 2.7.2 version?

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175384#comment-15175384
 ] 

Naganarasimha G R commented on YARN-4755:
-

Hi [~rohithsharma], I was planning to add this as part of the app entity itself 
so that a new event is not required for it. Thoughts?
If that is OK, I can take this issue up.



> Optimize sending appACLsUpdated event to TimelineServer while recovering 
> completed applications
> ---
>
> Key: YARN-4755
> URL: https://issues.apache.org/jira/browse/YARN-4755
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>
> In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent 
> to the TimelineServer for every application that gets created. 
> {code}
>  private RMAppImpl createAndPopulateNewRMApp(
>   ApplicationSubmissionContext submissionContext, long submitTime,
>   String user, boolean isRecovery) throws YarnException {
>   //
> //
> String appViewACLs = submissionContext.getAMContainerSpec()
> .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
> rmContext.getSystemMetricsPublisher().appACLsUpdated(
> application, appViewACLs, System.currentTimeMillis());
> return application;
>   }
> {code}
> Say we have 10K completed applications to recover; 30K events will be 
> generated, i.e. app_created, app_finished and app_acl_updated. For completed 
> applications, I think we need not send the app-acl-updated event, which would 
> gradually reduce the load on the dispatcher. 
> Even though MultiDispatcher is used to publish timeline events, it is a 
> bottleneck when max-completed is configured to a very high value, maybe 100K.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications

2016-03-02 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-4755:
---

 Summary: Optimize sending appACLsUpdated event to TimelineServer 
while recovering completed applications
 Key: YARN-4755
 URL: https://issues.apache.org/jira/browse/YARN-4755
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Rohith Sharma K S


In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent to 
the TimelineServer for every application that gets created. 
{code}
 private RMAppImpl createAndPopulateNewRMApp(
  ApplicationSubmissionContext submissionContext, long submitTime,
  String user, boolean isRecovery) throws YarnException {
//
//
String appViewACLs = submissionContext.getAMContainerSpec()
.getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
rmContext.getSystemMetricsPublisher().appACLsUpdated(
application, appViewACLs, System.currentTimeMillis());
return application;
  }
{code}

Say we have 10K completed applications to recover; 30K events will be 
generated, i.e. app_created, app_finished and app_acl_updated. For completed 
applications, I think we need not send the app-acl-updated event, which would 
gradually reduce the load on the dispatcher. 

Even though MultiDispatcher is used to publish timeline events, it is a 
bottleneck when max-completed is configured to a very high value, maybe 100K.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-4754:

Attachment: ConnectionLeak.rar

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: ConnectionLeak.rar
>
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175325#comment-15175325
 ] 

Rohith Sharma K S commented on YARN-4754:
-

As a result of the above, sometimes the RM itself cannot get the resources 
(file descriptors) needed to publish, which causes entity publishing to fail.
Exception trace:
{noformat}
2016-03-01 11:34:34,325 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
Error when publishing entity [YARN_APPLICATION,application_1456545891178_0950]
com.sun.jersey.api.client.ClientHandlerException: java.net.SocketException: Too 
many open files
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:481)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:324)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:321)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1711)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:306)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:456)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationACLsUpdatedEvent(SystemMetricsPublisher.java:320)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:232)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:473)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:468)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:189)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:117)
at java.lang.Thread.run(Thread.java:745)
{noformat}

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   

[jira] [Moved] (YARN-4754) Too many connection opened to TimelineServer while publishing entities

2016-03-02 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S moved HADOOP-12863 to YARN-4754:
--

Key: YARN-4754  (was: HADOOP-12863)
Project: Hadoop YARN  (was: Hadoop Common)

> Too many connection opened to TimelineServer while publishing entities
> --
>
> Key: YARN-4754
> URL: https://issues.apache.org/jira/browse/YARN-4754
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
>
> It is observed that too many connections are kept open to the 
> TimelineServer while publishing entities via SystemMetricsPublisher. This 
> sometimes causes a resource shortage for other processes or for the RM itself.
> {noformat}
> tcp0  0 10.18.99.110:3999   10.18.214.60:59265  
> ESTABLISHED 115302/java 
> tcp0  0 10.18.99.110:25001  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25002  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25003  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25004  :::*LISTEN
>   115302/java 
> tcp0  0 10.18.99.110:25005  :::*LISTEN
>   115302/java 
> tcp1  0 10.18.99.110:48866  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48137  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47553  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48424  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48139  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:48096  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:47558  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> tcp1  0 10.18.99.110:49270  10.18.99.110:8188   
> CLOSE_WAIT  115302/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-02 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175284#comment-15175284
 ] 

Bibin A Chundatt commented on YARN-4744:


[~sidharta-s]

Can we use a check similar to 
{{LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx)}}?
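
As a rough, hypothetical sketch of that idea (the real liveness check and 
signal go through the privileged operation contexts; the interface below is a 
stand-in):
{code}
// Hypothetical sketch of the suggestion above: probe liveness first and only
// signal (and therefore only report failures) when the process still exists.
// isAlive() and sendSignal() are stand-ins for the real privileged operations.
class CleanupSignalSketch {
  interface ProcessOps {
    boolean isAlive(String pid);              // e.g. a 0/NULL signal probe
    void sendSignal(String pid, int signal);  // e.g. a container-executor call
  }

  static boolean signalIfAlive(ProcessOps ops, String pid, int signal) {
    if (!ops.isAlive(pid)) {
      // Container process already exited: nothing to do, nothing to warn about.
      return false;
    }
    ops.sendSignal(pid, signal);
    return true;
  }
}
{code}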


> Too many signal to container failure in case of LCE
> ---
>
> Key: YARN-4744
> URL: https://issues.apache.org/jira/browse/YARN-4744
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Bibin A Chundatt
>Assignee: Sidharta Seethana
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit a mapreduce application (terasort/teragen) with user yarn/dsperf. 
> Too many signal-to-container failures are observed. 
> On submitting with this user, the following exception is thrown:
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_13
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_09 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_09 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_09
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
> at org.apache.hadoop.util.Shell.run(Shell.java:838)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
> ... 9 more
> 2014-03-02 09:20:43,113 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn 
> OPERATION=Container