[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177376#comment-15177376 ] Rohith Sharma K S commented on YARN-4755: - I looked at the 2.7.1 code base :-( !! Ignore it. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to the timeline server for every application that gets created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say we have 10K completed applications to recover; then 30K events will be > generated, i.e. app_created, app_finished and app_acl_updated. For completed > applications, I think we need not send the app-acl-updated event, which would > gradually reduce load on the dispatcher. > Even though MultiDispatcher is used to publish timeline events, it becomes a > bottleneck when max-completed is configured to a very high value, say 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177369#comment-15177369 ] Sunil G commented on YARN-4755: --- Yes, correct. But I think the appACLsUpdated event is also sent for the app-rejected case. Maybe Naga can double-confirm this. Is it really needed? If not, we can club them and that's fine. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to the timeline server for every application that gets created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say we have 10K completed applications to recover; then 30K events will be > generated, i.e. app_created, app_finished and app_acl_updated. For completed > applications, I think we need not send the app-acl-updated event, which would > gradually reduce load on the dispatcher. > Even though MultiDispatcher is used to publish timeline events, it becomes a > bottleneck when max-completed is configured to a very high value, say 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177366#comment-15177366 ] Rohith Sharma K S commented on YARN-4755: - bq. But appCreated will be sent only when RMApp is created and the START event is fired The appCreated event is sent in the RMAppImpl constructor, which always goes first. So I think the appCreated and appACLsUpdated events can be clubbed together, unless there are dependencies on the timeline server end. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to the timeline server for every application that gets created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say we have 10K completed applications to recover; then 30K events will be > generated, i.e. app_created, app_finished and app_acl_updated. For completed > applications, I think we need not send the app-acl-updated event, which would > gradually reduce load on the dispatcher. > Even though MultiDispatcher is used to publish timeline events, it becomes a > bottleneck when max-completed is configured to a very high value, say 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
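To make the proposal above concrete, the sketch below shows one way the guard could look: skipping the appACLsUpdated publish when an application is being recovered. This is an illustrative sketch based only on the snippet quoted in the issue description, not an actual patch; whether recovery alone is the right condition (versus also checking the application's final state) is exactly what is being discussed.

{code}
// Illustrative sketch only (not the committed fix): skip publishing the
// appACLsUpdated event while recovering applications, as proposed above.
private RMAppImpl createAndPopulateNewRMApp(
    ApplicationSubmissionContext submissionContext, long submitTime,
    String user, boolean isRecovery) throws YarnException {
  // ... existing application creation logic ...
  if (!isRecovery) {
    String appViewACLs = submissionContext.getAMContainerSpec()
        .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
    rmContext.getSystemMetricsPublisher().appACLsUpdated(
        application, appViewACLs, System.currentTimeMillis());
  }
  return application;
}
{code}

If the events are clubbed instead, as suggested in the comments, the same call site would fold the ACL information into the appCreated event rather than emitting a separate one.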
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177309#comment-15177309 ] Varun Saxena commented on YARN-4754: This should be closed as per what I understand from Jersey API documentation. [~rohithsharma] can confirm if scenario is same in his case or not. > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177306#comment-15177306 ] Varun Saxena commented on YARN-4754: I still see 2 places where we are not closing ClientResponse, when we call {{putDomain}} and in {{doPosting}} if response is not 200 OK. > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
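For context, the leak pattern Varun describes maps to the usual Jersey 1.x rule: every {{ClientResponse}} must be closed, even when the status is not 200 OK, or the underlying connection is left in CLOSE_WAIT. A minimal sketch of that pattern follows; the class and method names are illustrative, not the actual TimelineClient code.

{code}
// Minimal sketch of the pattern discussed above: always close the Jersey
// ClientResponse, even on non-200 responses, so the connection is released.
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;
import javax.ws.rs.core.MediaType;

public class TimelinePostSketch {
  public static void postEntities(WebResource resource, Object entities) {
    ClientResponse resp = resource
        .accept(MediaType.APPLICATION_JSON)
        .type(MediaType.APPLICATION_JSON)
        .post(ClientResponse.class, entities);
    try {
      if (resp.getStatus() != ClientResponse.Status.OK.getStatusCode()) {
        // handle/log the error; the finally block still closes the response
      }
    } finally {
      resp.close(); // releases the connection back to the Jersey client
    }
  }
}
{code}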
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177177#comment-15177177 ] Naganarasimha G R commented on YARN-4712: - Thanks for the comments [~djp] & [~varun_saxena]. bq. Regarding checkstyle, you can fix them for now. As you can see in the latest patch, the line length issues are already taken care of. bq. We shouldn't let Eclipse's bug affect our code convention. Well, it's not that I don't want to do it, but I presume Eclipse optimizes it in some way and does so only when required. Anyway, I have taken care of it, though it would be easier to rely on the editor's formatter if that were acceptable :) bq. it seems more things need to be fixed for UNAVAILABLE case Agreed; milliVcoresUsed can be set to 0 in the UNAVAILABLE case, right? bq. It sounds weird if cpuUsageTotalCoresPercentage is -1 in UNAVAILABLE case. We set it to -1 to indicate that this value should not be stored in the ATS. If it is *unavailable, do we need to store it as 0 or not store it at all*? bq. it makes cpu metric to be either 0 or 1 which is not expected here? As [~varun_saxena] explained, it directly gives percent values (no need to multiply by 100), and we round off only to remove the decimal part. [~djp], if you can confirm these queries, I can finish the patch. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that many times the CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor does > the calculation, i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered, so proper checks need to be handled. > * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but > ContainersMonitor publishes decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
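A rough sketch of the UNAVAILABLE guard being discussed is shown below. The variable names follow the issue description ({{pTree}}, {{resourceCalculatorPlugin}}, {{cpuUsageTotalCoresPercentage}}); this is an assumption about the shape of the fix, not the final patch.

{code}
// Sketch of the guard under discussion: compute derived CPU figures only when
// the process tree reports a real value; otherwise keep the UNAVAILABLE
// sentinel so NMTimelinePublisher can skip (or zero out) the metric.
float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
float cpuUsageTotalCoresPercentage = ResourceCalculatorProcessTree.UNAVAILABLE;
if (cpuUsagePercentPerCore != ResourceCalculatorProcessTree.UNAVAILABLE) {
  cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore
      / resourceCalculatorPlugin.getNumProcessors();
}
// NMTimelinePublisher.reportContainerResourceUsage should then check the same
// UNAVAILABLE sentinel before writing the metric to the timeline service.
{code}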
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177140#comment-15177140 ] Hadoop QA commented on YARN-4740: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 117 unchanged - 3 fixed = 117 total (was 120) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 30s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 36s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 150m 8s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL |
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177119#comment-15177119 ] Hadoop QA commented on YARN-4700: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 32s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice: patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 23s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 30s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 22m 49s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791090/YARN-4700-YARN-2928.v1.004.patch | | JIRA Issue | YARN-4700 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 391c03561be0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool |
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177097#comment-15177097 ] Hadoop QA commented on YARN-4686: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 42s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 58s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 3s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 1 new + 30 unchanged - 0 fixed = 31 total (was 30) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace 
issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 55s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 44s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 12s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 22s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 57s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 31s
[jira] [Updated] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4700: Attachment: YARN-4700-YARN-2928.v1.004.patch Thanks for the comments [~vrushalic] & [~sjlee0]. I have uploaded a patch with fixes for the test cases, javadoc, and the other comments. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, > YARN-4700-YARN-2928.v1.004.patch, YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id; thus each time we're > creating a new record for the application (the cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177054#comment-15177054 ] Hadoop QA commented on YARN-4737: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 53s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 51s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 48s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 48s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 9s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 59s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 41s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 16s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 12s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 52s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 52s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 8s {color} | {color:red} root: patch generated 2 new + 436 unchanged - 4 fixed = 438 total (was 440) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 46s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 55s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 34s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 53s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 18s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 47s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 13s {color}
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176999#comment-15176999 ] Sangjin Lee commented on YARN-3863: --- I did another pass at the latest patch. One high level question: am I correct in understanding that if a relations filter is specified for example but relation was *not* specified as part of fields to retrieve, we would try to fetch the relation? So, in a sense, would specifying a filter override/modify the fields to retrieve behavior? If so, how much additional complexity is added by trying to support that behavior? What if we simply reject or ignore the filters if they do not match the fields to retrieve? Would it make the implementation simpler or harder? To me, supporting more contents even if the filters and the fields to retrieve are not consistent seems very much optional, and I'm not sure if it is worth it especially if it adds a lot more complexity. What do you think? (TimelineEntityFilters.java) - l.49: typo: "ids'" -> "id's" (also in l.60) - l.62: should be a link for {{TimelineKeyValuesFilter}} - For limit, createdTimeBegin, and createdTimeEnd, we're ensuring they can never be null. In that vein, I think it might make sense to start using {{long}} over {{Long}} as part of the method interface. Thoughts? (TimelineCompareFilter.java) - l.36: Is the default constructor useful at all? It doesn't sound like it if key and value are empty/null. Should we remove it? (TimelineKeyValuesFilter.java) - l.34-36: nit: let's make them final - l.55-56: super-nit: an empty line between the methods would be good - l.68: another super-nit: the C-style equality pattern is not needed/helpful in java; let's just do {{values == null}} (GenericEntityReader.java) - l.90: Do we need to check if {{getFilters()}} returns null? When I check all callers of {{getFilters()}}, some check for null and some don't. It would be good to make it clear and consistent either way. - l.95: See above comment in {{TimelineEntityFilters.java}}. If we switch to {{long}}, this becomes easier to understand (no need to reason about unboxing yielding a NPE). (ApplicationEntityReader.java) - l.89: see above (TimelineReaderWebServicesUtils.java) - l.257: I'm still not sure I understand. Is this a temporary thing until YARN-4447 is addressed? What if the metric value happens to be negative? That would be a non-match, then? (TimelineStorageUtils.java) - l.466: {{equals()}} should be replaced with {{==}} - l.591: same - l.666: a comment here that states we're using the latest value of the metric might be helpful (ColumnHelper.java) - I'm not sure if this is an issue or not, but I remember there were cases where we join an empty string at the end to have the qualifier end with the separator, and that was for a reason. I hope this patch did not change an occurrence of that inadvertently. (unit tests) - I know [~vrushalic] had some thoughts on how to split this monolithic {{TestHBaseTimelineStorage}}. It might be good to come to a consensus on how to split it... 
> Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
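For the {{long}} vs. {{Long}} review comment above, the concern is the standard Java unboxing pitfall; a tiny illustration (not TimelineReader code) follows.

{code}
// Illustration of the long vs. Long point above: unboxing a null Long throws
// a NullPointerException, so once a field can never be null it is simpler to
// expose it as a primitive long in the method interface.
Long createdTimeBegin = null;   // boxed value: may be null
long begin = createdTimeBegin;  // compiles, but throws NullPointerException

// With a primitive there is nothing to reason about:
long createdTimeBeginPrimitive = 0L;  // always has a value
{code}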
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176947#comment-15176947 ] sandflee commented on YARN-4740: thanks for your suggestions, attached a new patch to fix these. > container complete msg may lost while AM restart in race condition > -- > > Key: YARN-4740 > URL: https://issues.apache.org/jira/browse/YARN-4740 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4740.01.patch, YARN-4740.02.patch > > > 1, container completed, and the msg is store in > RMAppAttempt.justFinishedContainers > 2, AM allocate and before allocateResponse came to AM, AM crashed > 3, AM restart and couldn't get the container complete msg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4740: --- Attachment: YARN-4740.02.patch > container complete msg may lost while AM restart in race condition > -- > > Key: YARN-4740 > URL: https://issues.apache.org/jira/browse/YARN-4740 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4740.01.patch, YARN-4740.02.patch > > > 1, container completed, and the msg is store in > RMAppAttempt.justFinishedContainers > 2, AM allocate and before allocateResponse came to AM, AM crashed > 3, AM restart and couldn't get the container complete msg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176899#comment-15176899 ] Sidharta Seethana commented on YARN-4744: - Testing note : This patch makes minor logging changes - I tested the patch manually using distributed shell. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO >
[jira] [Updated] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4686: -- Attachment: YARN-4686.002.patch This patch fixes a race issue between the NMs resyncing with the RM and the NMs stopping via serviceStop. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, > YARN-4686.002.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
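Until the underlying race is fixed, tests typically work around it by polling for the expected cluster state rather than trusting {{MiniYARNCluster.start()}} alone. A sketch of such a wait is below; {{miniCluster}} and {{NUM_NODE_MANAGERS}} are assumed names from the test context, and this is a test-side workaround sketch, not the attached patch.

{code}
// Poll until the RM actually reports all NodeManagers as live before
// asserting on them (assumed imports: org.apache.hadoop.test.GenericTestUtils,
// com.google.common.base.Supplier).
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return miniCluster.getResourceManager().getRMContext()
        .getRMNodes().size() == NUM_NODE_MANAGERS;
  }
}, 100, 60000);  // check every 100 ms, give up after 60 s
{code}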
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176743#comment-15176743 ] Hadoop QA commented on YARN-4744: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 30s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 47s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 34m 29s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791050/YARN-4744.001.patch | | JIRA Issue | YARN-4744 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux bfc907ba6af7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality |
[jira] [Updated] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-4744: Attachment: YARN-4744.001.patch Uploaded a patch that removes 'invalid pid' signal failure logging. [~jlowe], could you please take a look? > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > Attachments: YARN-4744.001.patch > > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
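For illustration, the general shape of the logging change under discussion could look like the sketch below: treat a signal failure on a container process that has already exited as a debug-level event instead of a WARN with a full stack trace. The use of exit code 9 as the "process already gone" indicator and the {{getExitCode()}} accessor are assumptions inferred from the log excerpt above, not the actual patch.

{code}
// Rough sketch only; names (conf, signalOp, LOG) and the meaning of exit
// code 9 are assumed from the surrounding discussion.
try {
  PrivilegedOperationExecutor.getInstance(conf)
      .executePrivilegedOperation(signalOp, false);
} catch (PrivilegedOperationException e) {
  if (e.getExitCode() != null && e.getExitCode() == 9) {
    LOG.debug("Container process already exited; ignoring signal failure", e);
  } else {
    LOG.warn("Signal container failed", e);
  }
}
{code}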
[jira] [Commented] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358
[ https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176665#comment-15176665 ] Hadoop QA commented on YARN-4359: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 9 new + 25 unchanged - 2 fixed = 34 total (was 27) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 21s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 20s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 2m 37s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95 with JDK v1.7.0_95 generated 4 new + 2 unchanged - 0 fixed = 6 total (was 2) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 38s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 7s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 155m 22s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | |
[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176644#comment-15176644 ] Karthik Kambatla commented on YARN-4719: [~leftnoteasy] - thanks for chiming in, you make some valid points. Since we are building a library for node tracking, I would like for us to restrict access to the map/set of nodes tracked only through addNode and removeNode so total_cluster_resources, total_inflated_cluster_resources (for YARN-1011), max_cluster_resources are not affected by other scheduler code. Do you think this is a reasonable goal? At least, as long as it doesn't hurt performance? If yes, we should decide on how to handle cases where the scheduler code needs to iterate through the nodes: (1) we could provide a snapshot copy of the map/set of nodes/nodeIds, or (2) provide a way to do the same with the right locks by adding additional methods or an abstraction (similar to lambdas) that applies to multiple methods. Thoughts? PS: By the way, thanks for pointing out the javadoc for values(). I will clean that up based on the discussion output here. > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
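A minimal sketch of the encapsulation discussed in the YARN-4719 comment above: node add/remove goes through two methods that keep the aggregate resource totals consistent, and iteration is offered either as a snapshot copy (option 1) or as a visitor run under the lock (option 2). All class and method names here are hypothetical illustrations, not the actual ClusterNodeTracker API.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: all nodes are tracked behind one lock so the
// aggregate counters can never drift from the node map.
public class NodeTrackerSketch<N> {
  private final Map<String, N> nodes = new HashMap<>();
  private long totalMemoryMb = 0;   // stands in for total_cluster_resources

  public synchronized void addNode(String nodeId, N node, long memoryMb) {
    nodes.put(nodeId, node);
    totalMemoryMb += memoryMb;
  }

  public synchronized void removeNode(String nodeId, long memoryMb) {
    if (nodes.remove(nodeId) != null) {
      totalMemoryMb -= memoryMb;
    }
  }

  // Option 1: hand out a snapshot copy; callers can iterate without the lock.
  public synchronized List<N> getNodesSnapshot() {
    return new ArrayList<>(nodes.values());
  }

  // Option 2: run the caller's logic under the lock, lambda-style.
  public synchronized void forEachNode(Consumer<N> visitor) {
    nodes.values().forEach(visitor);
  }

  public synchronized long getTotalMemoryMb() {
    return totalMemoryMb;
  }
}
{code}
The trade-off matches the two options in the comment: the snapshot is safe to iterate without holding the lock but can go stale, while the visitor sees consistent state but holds the lock for the duration of the caller's logic.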
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176611#comment-15176611 ] Sangjin Lee commented on YARN-4700: --- It seems that the unit test failures are real. So is the javadoc error. Could you please look into it? Thanks! > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, > YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176609#comment-15176609 ] Sidharta Seethana commented on YARN-4744: - [~jlowe] That was my thinking as well - double logging is better than missed logs. In addition, logging in {{PrivilegedOperationExecutor}} includes information that isn't necessarily available when the exception is propagated. I'll upload a patch soon, thanks. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) >
[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176555#comment-15176555 ] Hadoop QA commented on YARN-4719: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 59s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 3m 59s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95 with JDK v1.7.0_95 generated 1 new + 1 unchanged - 1 fixed = 2 total (was 2) {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 9 new + 278 unchanged - 8 fixed = 287 total (was 286) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s 
{color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 27s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 47s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 156m 28s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore | | |
[jira] [Assigned] (YARN-1547) Prevent DoS of ApplicationMasterProtocol by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola reassigned YARN-1547: -- Assignee: Giovanni Matteo Fumarola > Prevent DoS of ApplicationMasterProtocol by putting in limits > - > > Key: YARN-1547 > URL: https://issues.apache.org/jira/browse/YARN-1547 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Giovanni Matteo Fumarola > > Points of DoS in ApplicationMasterProtocol > - Host and trackingURL in RegisterApplicationMasterRequest > - Diagnostics, final trackingURL in FinishApplicationMasterRequest > - Unlimited number of resourceAsks, containersToBeReleased and > resourceBlacklistRequest in AllocateRequest > -- Unbounded number of priorities and/or resourceRequests in each ask. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron updated YARN-4737: - Attachment: YARN-4737.002.patch I believe all code issues have been addressed > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch, YARN-4737.002.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176498#comment-15176498 ] Jason Lowe commented on YARN-4744: -- As long as we're not logging a bunch of warnings for benign events I'm good. I still think the log-then-throw idiom can be problematic in practice as it tends to lead to double-logging (both by the thrower and by the catcher). I understand the concern about missing logs, and it's safer to double-log than not log at all. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at >
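A small illustration of the trade-off Jason Lowe raises above: if the thrower packs the diagnostic context (exit code, shell output) into the exception instead of logging it, the catcher can log exactly once and decide whether the failure is benign. This is only a sketch with made-up class names, not the actual PrivilegedOperationException API.
{code}
// Hypothetical sketch of carrying diagnostics in the exception rather than
// logging at the throw site, so only the catcher decides whether to log.
class OperationFailedException extends Exception {
  private final int exitCode;
  private final String shellOutput;

  OperationFailedException(int exitCode, String shellOutput) {
    super("operation failed with exit code " + exitCode);
    this.exitCode = exitCode;
    this.shellOutput = shellOutput;
  }

  int getExitCode() { return exitCode; }
  String getShellOutput() { return shellOutput; }
}

class SignalCaller {
  void signal(int pid, int signal) {
    try {
      runPrivilegedOp(pid, signal);
    } catch (OperationFailedException e) {
      // Single log site: the caller knows whether this failure is benign
      // (for example, the process already exited) or worth a warning.
      if (e.getExitCode() != 9 /* hypothetical "invalid pid" code */) {
        System.err.println("Signal failed: " + e.getMessage()
            + "\n" + e.getShellOutput());
      }
    }
  }

  private void runPrivilegedOp(int pid, int signal)
      throws OperationFailedException {
    // Placeholder for the real shell invocation; always fails in this sketch.
    throw new OperationFailedException(9, "main : run as user is yarn");
  }
}
{code}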
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176436#comment-15176436 ] Sidharta Seethana commented on YARN-4744: - [~bibinchundatt], Those two error codes are used differently. INVALID_CONTAINER_PID is used with errno ESRCH. UNABLE_TO_SIGNAL_CONTAINER is used in other cases. See code below:
{code}
int signal_container_as_user(const char *user, int pid, int sig) {
  if(pid <= 0) {
    return INVALID_CONTAINER_PID;
  }
  if (change_user(user_detail->pw_uid, user_detail->pw_gid) != 0) {
    return SETUID_OPER_FAILED;
  }

  //Don't continue if the process-group is not alive anymore.
  int has_group = 1;
  if (kill(-pid,0) < 0) {
    if (kill(pid, 0) < 0) {
      if (errno == ESRCH) {
        return INVALID_CONTAINER_PID;
      }
      fprintf(LOGFILE, "Error signalling container %d with %d - %s\n", pid, sig, strerror(errno));
      return -1;
    } else {
      has_group = 0;
    }
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
    if(errno != ESRCH) {
      fprintf(LOGFILE, "Error signalling process group %d with signal %d - %s\n", -pid, sig, strerror(errno));
      fprintf(stderr, "Error signalling process group %d with signal %d - %s\n", -pid, sig, strerror(errno));
      fflush(LOGFILE);
      return UNABLE_TO_SIGNAL_CONTAINER;
    } else {
      return INVALID_CONTAINER_PID;
    }
  }
  fprintf(LOGFILE, "Killing process %s%d with %d\n", (has_group ? "group " :""), pid, sig);
  return 0;
}
{code}
> Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. 
Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at >
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176416#comment-15176416 ] Sidharta Seethana commented on YARN-4744: - Before {{PrivilegedOperationExecutor}} existed, there were several cases where not enough information was being logged about container-executor failures. Centralizing this provided information like invocation arguments and shell output, which has proved useful for debugging. In all cases except 'invalid pid', an error returned by container-executor is a real error. IMO, we shouldn't remove the error logging. It looks like {{signalContainer}} in {{LinuxContainerExecutor}} ignores the exception for the 'invalid pid' case. We could do something like this:
* Change {{DefaultContainerRuntime}} to ignore the 'invalid pid' error as well.
* Change {{PrivilegedOperationExecutor}} / {{PrivilegedOperation}} to add the notion of 'ignore failures' for certain kinds of operations. Use this only for {{signalContainer}} and let the runtime/executor decide what they want to do.
I'll submit a patch with these changes. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at >
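A rough sketch of the second bullet in Sidharta's proposal above: the operation carries a flag saying whether a non-zero exit should be treated as an error, and the executor consults it before logging. The class names and signatures here are illustrative only, not the real PrivilegedOperation/PrivilegedOperationExecutor code.
{code}
// Hypothetical sketch: an operation can opt out of failure logging so that
// expected races (signalling an already-exited container) stay quiet.
class Operation {
  private final String[] command;
  private final boolean failureIsBenign;

  Operation(String[] command, boolean failureIsBenign) {
    this.command = command;
    this.failureIsBenign = failureIsBenign;
  }

  String[] getCommand() { return command; }
  boolean isFailureBenign() { return failureIsBenign; }
}

class OperationExecutor {
  int execute(Operation op) {
    int exitCode = runShell(op.getCommand()); // assume a shell runner exists
    if (exitCode != 0 && !op.isFailureBenign()) {
      // Only genuinely unexpected failures are logged here.
      System.err.println("Privileged operation failed, exit code " + exitCode);
    }
    return exitCode;
  }

  private int runShell(String[] command) {
    return 0; // placeholder for the real shell invocation
  }
}
{code}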
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176399#comment-15176399 ] Hadoop QA commented on YARN-4700: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 50s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice: patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 46s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 48s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.7.0_95 with JDK v1.7.0_95 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 3m 25s {color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 3m 8s {color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 22m 53s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage | | |
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176397#comment-15176397 ] Bibin A Chundatt commented on YARN-4744: [~jlowe]/[~vinodkv] I'm confused by the {{exit code 9}} as well. From one of the documents I read, below are the exit codes for container-executor:
{noformat}
exit code | NAME                       | Description
---
8         | UNABLE_TO_SIGNAL_CONTAINER | The container-executor could not signal the container it was passed.
9         | INVALID_CONTAINER_PID      | The PID passed to the container-executor was negative or 0.
{noformat}
The exit code returned when the container doesn't exist should have been {{8}}, right? We should recheck the exit code from container-executor; based on the exit code we might also be able to handle the errors. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at >
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176364#comment-15176364 ] Vrushali C commented on YARN-4700: -- Thanks [~Naganarasimha Garla] for the updated patch. Overall it looks good. I have an extremely minor comment, please make the change _only_ if you plan to make another patch, else we can make those changes later. - Lines 195 and 200 in TestFlowDataGenerator are commented out in the patch, we can remove them. + 1 otherwise. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, > YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176357#comment-15176357 ] Vinod Kumar Vavilapalli commented on YARN-4744: --- bq. the NM appears to signal containers that have already exited ( as a part of ContainerLaunch.cleanupContainer() ) This is by design. We did this originally so as to ensure cleaning up of any orphaned child-processes or process-groups - even if the root-process exits. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at >
[jira] [Commented] (YARN-1547) Prevent DoS of ApplicationMasterProtocol by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176343#comment-15176343 ] Vinod Kumar Vavilapalli commented on YARN-1547: --- bq. Thanks for raising this Vinod Kumar Vavilapalli. I was wondering if I might take this up, if you are not actively working on it. Tx [~giovanni.fumarola], please go ahead and assign it to yourself! We can discuss after you have a design, but I wanted to bring up one point of note w.r.t. this ticket and the larger YARN-1545 itself. It is likely that we can solve 60-70% of our use-case of avoiding accidental DoS'ing by well-behaved apps by way of putting limits in the client, but it is imperative that we handle this on the server-side instead of on the client-side, lest an abusive client circumvent any client-side restrictions. > Prevent DoS of ApplicationMasterProtocol by putting in limits > - > > Key: YARN-1547 > URL: https://issues.apache.org/jira/browse/YARN-1547 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli > > Points of DoS in ApplicationMasterProtocol > - Host and trackingURL in RegisterApplicationMasterRequest > - Diagnostics, final trackingURL in FinishApplicationMasterRequest > - Unlimited number of resourceAsks, containersToBeReleased and > resourceBlacklistRequest in AllocateRequest > -- Unbounded number of priorities and/or resourceRequests in each ask. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
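To make the server-side enforcement point above concrete, a request-size check applied in the RM before the scheduler sees an allocate call could look roughly like the sketch below. The limit names and the generic list parameters are placeholders, not the real ApplicationMasterService or AllocateRequest API.
{code}
import java.util.List;

// Hypothetical sketch of server-side enforcement: reject oversized allocate
// requests before they reach the scheduler, regardless of client behaviour.
class AllocateLimitChecker {
  private final int maxAsks;
  private final int maxReleases;

  AllocateLimitChecker(int maxAsks, int maxReleases) {
    this.maxAsks = maxAsks;
    this.maxReleases = maxReleases;
  }

  void check(List<?> asks, List<?> releases) {
    if (asks.size() > maxAsks) {
      throw new IllegalArgumentException(
          "Too many resource asks: " + asks.size() + " > " + maxAsks);
    }
    if (releases.size() > maxReleases) {
      throw new IllegalArgumentException(
          "Too many containers to release: " + releases.size()
          + " > " + maxReleases);
    }
  }
}
{code}
Client-side limits can still catch well-behaved apps early, but only a check like this on the server actually bounds what an arbitrary client can send.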
[jira] [Commented] (YARN-4650) The AM should be launched with its own set of configs instead of using the NM's configs
[ https://issues.apache.org/jira/browse/YARN-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176322#comment-15176322 ] Vinod Kumar Vavilapalli commented on YARN-4650: --- Trying to understand the problem and solution being addressed here. Maybe I am missing something, but I actually don't see a major change from what we already have today. From the beginning of YARN, we've been very careful about apps not relying on server configuration. In theory it is still possible for an app to hard-code and depend on server configuration (via {{ApplicationConstants.Environment.HADOOP_CONF_DIR}} / {{YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH}}), but things like rolling-upgrades (YARN-666) further forced our users to not play such tricks. bq. The AM should be launched with its own set of configs instead of using the NM's configs For most of our apps (MapReduce, Tez, Spark etc), this already doesn't happen by default. MR for example depends on the job configuration {{mapreduce.application.classpath}}. In all these cases, all the configuration needed by AMs is usually supposed to come from the client itself. Only DistributedShell is the corner case that by default depends on NM configuration via {{DEFAULT_YARN_APPLICATION_CLASSPATH}}. bq. There are cases, such as a secure LDAP configuration where the NM may need access to credentials that should not be exposed to the user. As long as the NM and AM share the same configuration files, anything exposed to the NM is also exposed to the AM and hence the users. This is already possible to do right now, *without* breaking most of our well-behaved apps: an admin can simply (a) remove HADOOP_CONF_DIR from the NM white-list and/or (b) change the permissions of the NM's configs to be very restrictive. > The AM should be launched with its own set of configs instead of using the > NM's configs > --- > > Key: YARN-4650 > URL: https://issues.apache.org/jira/browse/YARN-4650 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > There are cases, such as a secure LDAP configuration where the NM may need > access to credentials that should not be exposed to the user. As long as the > NM and AM share the same configuration files, anything exposed to the NM is > also exposed to the AM and hence the users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4700: Attachment: YARN-4700-YARN-2928.v1.003.patch Thanks for the review [~varun_saxena], attaching a patch with the fixes for the review comments. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.v1.003.patch, > YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176278#comment-15176278 ] Wangda Tan commented on YARN-4719: -- [~kasha], bq. Not sure I understand the suggestion. Elaborate? In the ver.2 patch, getAllNodes uses shallowCopy; what I meant is that instead of copying the entire HashMap, you can use a ConcurrentMap. In the ver.3 patch, you removed shallowCopy and return HashMap.values(); if a node is removed while someone is iterating values(), the behavior is undefined. See: [javadoc|https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html#values()] bq. I feel any logic that has to iterate through all nodes should go through ClusterNodeTracker - that way, we don't run into cases where we access the list of nodes without a lock. As I commented above, we can use a ConcurrentMap instead of locking ClusterNodeTracker. Do you need strong consistency for addBlacklistedNodeIdsToList? (Because the node list could be updated while we're updating blacklistedNodes.) bq. Any particular reason you think this doesn't belong here? I would prefer to keep the responsibility of ClusterNodeTracker cleaner; if we add application logic here, we could end up adding any logic related to SchedulerNode to this class as well. This refactoring patch is mainly code cleanup to me; I think it's better to keep it clean from the beginning. > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
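For reference, the difference Wangda points out above: HashMap.values() must not be iterated while another thread mutates the map, whereas a ConcurrentHashMap view is weakly consistent and tolerates concurrent add/remove (at the cost of possibly missing in-flight updates). A small sketch with hypothetical names, not the actual ClusterNodeTracker code:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: tracking nodes in a ConcurrentMap lets callers iterate
// values() without holding the tracker's lock; the view is weakly consistent.
class ConcurrentNodeTracker<N> {
  private final ConcurrentMap<String, N> nodes = new ConcurrentHashMap<>();

  void addNode(String nodeId, N node) {
    nodes.put(nodeId, node);
  }

  void removeNode(String nodeId) {
    nodes.remove(nodeId);
  }

  // Safe to call while other threads add/remove nodes; the iteration reflects
  // the map state at some point during the traversal, not a strict snapshot.
  Iterable<N> getNodes() {
    return nodes.values();
  }
}
{code}
Whether this is acceptable depends on the strong-consistency question raised in the comment: callers such as blacklist updates that need a stable view would still want a snapshot or a lock.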
[jira] [Updated] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358
[ https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishai Menache updated YARN-4359: Attachment: YARN-4359.4.patch > Update LowCost agents logic to take advantage of YARN-4358 > -- > > Key: YARN-4359 > URL: https://issues.apache.org/jira/browse/YARN-4359 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Ishai Menache > Attachments: YARN-4359.0.patch, YARN-4359.3.patch, YARN-4359.4.patch > > > Given the improvements of YARN-4358, the LowCost agent should be improved to > leverage this, and operate on RLESparseResourceAllocation (ideally leveraging > the improvements of YARN-3454 to compute available resources) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176248#comment-15176248 ] Karthik Kambatla commented on YARN-4719: [~rkanter] - this code touches the adaptive max allocation you have worked on. Mind taking a look to make sure I didn't screw up any of that. > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4719: --- Attachment: (was: yarn-4719-3.patch) > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4719: --- Attachment: yarn-4719-3.patch Updated patch should fix the test failures and findbugs warnings. > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176186#comment-15176186 ] Jason Lowe commented on YARN-4744: -- bq. Can we use similar check like LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx). That function is implemented in terms of signalContainer (so we have the same issue), and the process could exit between the check and the subsequent kill attempt. bq. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code There's always going to be a race where a container can exit before it gets killed, and I'm not sure we accomplish much besides alarming users when we log warnings when that occurs. IMHO PrivilegedOperationExecutor should not be the one that decides what should and shouldn't be logged, since it doesn't have any context on whether the error is severe enough to warrant it. Instead I think we should ensure the same data is present in the PrivilegedOperationException and let the code handling that error perform the logging if it is appropriate to do so. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at >
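A rough illustration of the split of responsibilities Jason describes above: the privileged executor only throws with full context, and the caller decides whether to log. All types below are stand-ins, not the real Hadoop classes.
{code}
// Stand-in types only, sketching "executor throws, caller decides to log".
public class SignalHandlingSketch {

  static class PrivilegedOperationException extends Exception {
    private final int exitCode;
    PrivilegedOperationException(String msg, int exitCode) {
      super(msg);
      this.exitCode = exitCode;
    }
    int getExitCode() { return exitCode; }
  }

  interface ContainerState {
    boolean isStillRunning(String containerId);
  }

  // Placeholder for the container-executor invocation; always fails here.
  static void signalContainer(String containerId, int signal)
      throws PrivilegedOperationException {
    throw new PrivilegedOperationException("signal failed", 9);
  }

  static void killContainer(String containerId, int signal, ContainerState state) {
    try {
      signalContainer(containerId, signal);
    } catch (PrivilegedOperationException e) {
      if (state.isStillRunning(containerId)) {
        System.err.println("WARN: signal " + signal + " to " + containerId
            + " failed with exit code " + e.getExitCode());
      } else {
        // Expected race: the container exited before the signal was delivered.
        System.out.println("DEBUG: " + containerId + " already exited, ignoring");
      }
    }
  }
}
{code}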
[jira] [Commented] (YARN-4634) Scheduler UI/Metrics need to consider cases like non-queue label mappings
[ https://issues.apache.org/jira/browse/YARN-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176164#comment-15176164 ] Hadoop QA commented on YARN-4634: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 25s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 54s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 158m 47s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12790948/0004-YARN-4634.patch | | JIRA
[jira] [Updated] (YARN-4719) Add a helper library to maintain node state and allows common queries
[ https://issues.apache.org/jira/browse/YARN-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4719: --- Attachment: yarn-4719-3.patch > Add a helper library to maintain node state and allows common queries > - > > Key: YARN-4719 > URL: https://issues.apache.org/jira/browse/YARN-4719 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4719-1.patch, yarn-4719-2.patch, yarn-4719-3.patch > > > The scheduler could use a helper library to maintain node state and allowing > matching/sorting queries. Several reasons for this: > # Today, a lot of the node state management is done separately in each > scheduler. Having a single library will take us that much closer to reducing > duplication among schedulers. > # Adding a filtering/matching API would simplify node labels and locality > significantly. > # An API that returns a sorted list for a custom comparator would help > YARN-1011 where we want to sort by allocation and utilization for > continuous/asynchronous and opportunistic scheduling respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175997#comment-15175997 ] Varun Saxena commented on YARN-4712: bq. it makes the cpu metric either 0 or 1, which is not expected here? Are you saying this because you are expecting cpuUsageTotalCoresPercentage to be in the range 0-1? I was thinking the same initially, and hence in the initial patch we were multiplying this value by 100. But that doesn't seem to be the case. Upon testing, I found that this value is not between 0 and 1. If there are 4 cores and 2 cores are fully used, this value will be 50. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
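A tiny worked example of the scale Varun is describing; the numbers below are assumed, purely for illustration:
{code}
public class CpuUsageScaleExample {
  public static void main(String[] args) {
    // Assumed numbers, only to show the scale of cpuUsageTotalCoresPercentage.
    // getCpuUsagePercent() is per-core based: 2 fully busy cores report 200.
    float cpuUsagePercentPerCore = 200f;  // two cores at 100% each
    int numProcessors = 4;                // total cores on the node
    float cpuUsageTotalCoresPercentage =
        cpuUsagePercentPerCore / numProcessors;
    System.out.println(cpuUsageTotalCoresPercentage);  // 50.0, i.e. a 0-100 scale
  }
}
{code}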
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175980#comment-15175980 ] Junping Du commented on YARN-4712: -- Back on this patch, it seems more things need to be fixed for the UNAVAILABLE case, like the code below: {code} // Multiply by 1000 to avoid losing data when converting to int int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000 * maxVCoresAllottedForContainers /nodeCpuPercentageForYARN); {code} It sounds weird if cpuUsageTotalCoresPercentage is -1 in the UNAVAILABLE case. In addition, the code below does not sound right: {code} +cpuMetric.addValue(currentTimeMillis, +(long) Math.round(cpuUsageTotalCoresPercentage)); {code} it makes the cpu metric either 0 or 1, which is not expected here? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
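A hedged sketch of the kind of guard being asked for here; variable names follow the quoted snippet, but the method shape is illustrative, not the actual NodeManager code:
{code}
// Illustrative sketch only (not the patch): guard against UNAVAILABLE before
// deriving any values, instead of letting -1 flow into the calculations.
public class ContainerCpuSketch {
  // Mirrors ResourceCalculatorProcessTree.UNAVAILABLE.
  static final int UNAVAILABLE = -1;

  static long computeMilliVcoresUsed(float cpuUsagePercentPerCore,
      int numProcessors, long maxVCoresAllottedForContainers,
      int nodeCpuPercentageForYARN) {
    if (cpuUsagePercentPerCore == UNAVAILABLE) {
      // Let the caller skip publishing the metric entirely.
      return UNAVAILABLE;
    }
    float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore / numProcessors;
    // Multiply by 1000 to avoid losing data when converting to a whole number.
    return (long) (cpuUsageTotalCoresPercentage * 1000
        * maxVCoresAllottedForContainers / nodeCpuPercentageForYARN);
  }
}
{code}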
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175973#comment-15175973 ] Junping Du commented on YARN-4712: -- That's correct. We shouldn't let Eclipse's bug affect our code convention. My practice is to show the print margin at 80 chars and address it myself ahead of time. Or you can run a local checkstyle tool before submitting the patch. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175878#comment-15175878 ] Varun Saxena commented on YARN-4712: [~djp], the checkstyle issues I was referring to were line > 80 characters, the ones fixed in last patch. Naga was telling me offline that he uses eclipse formatter given on Hadoop Wiki page which does not always take care of > 80 characters issue especially if its just 5-6 characters extra. And as he uses it, these checkstyle issues(for > 80 chars) keep on cropping up in every patch. So for this branch, we need to fix line > 80 characters issue. Right ? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175864#comment-15175864 ] Junping Du commented on YARN-4712: -- Regarding checkstyle, I think the YARN-2928 dev branch should follow the same standard/criteria as the trunk branch or it will have trouble when merging back. The common practice for trunk on checkstyle issues is that we need to fix them as much as we can. However, for some annoying warnings like "method too long" (like this case), "too many method parameters", etc., we don't need to worry about them unless there is strong justification for refactoring the code. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175841#comment-15175841 ] Varun Saxena commented on YARN-4712: Regarding checkstyle, you can fix them for now. We can confirm with [~sjlee0] in tomorrow's meeting if for this branch we need to follow checkstyle or do not consider checkstyle issues which appear despite using the eclipse formatter given on Hadoop Wiki. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175838#comment-15175838 ] Varun Saxena commented on YARN-4712: Thanks [~Naganarasimha] for the patch. Looks good overall. A couple of nits. # In {{NMTimelinePublisher}}, cast to long is not required as Math#round returns an int/long depending on input value. # TestNMTimelinePublisher line 84, use ResourceCalculatorProcessTree.UNAVAILABLE instead of -1. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4712-YARN-2928.v1.001.patch, > YARN-4712-YARN-2928.v1.002.patch > > > There are 2 issues with CPU usage collection > * I was able to observe that that many times CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE(i.e. -1) but ContainersMonitor do > the calculation i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}} because of which UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. so proper checks needs to be handled > * {{EntityColumnPrefix.METRIC}} uses always LongConverter but > ContainerMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175802#comment-15175802 ] Varun Saxena commented on YARN-4700: [~Naganarasimha], had a glance at the patch. It looks good to me in general. # Changes in TestTimelineReaderWebServicesHBaseStorage l.801 are not required. # I think javadoc should be fixable. # In FlowActivityEntityRowKey#getRowKey, the javadoc says we are passing top of the day timestamp. But we are not. We are calculating it inside. We can change the param name and description(say to something like event timestamp). # Although created time should be fine but should we use event timestamp at both the places ? Just for consistency. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
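On point 3 above, a small illustration of why the parameter is really an event timestamp: the day boundary can be derived inside the row-key builder, roughly like this (illustrative only, not the actual FlowActivityEntityRowKey code):
{code}
import java.util.concurrent.TimeUnit;

public class TopOfDayTimestampExample {
  // The caller passes a plain event timestamp; the truncation to the top of
  // the (UTC) day happens here, so the javadoc/param name should say so.
  static long getTopOfTheDayTimestamp(long eventTimestampMs) {
    long dayMs = TimeUnit.DAYS.toMillis(1);
    return eventTimestampMs - (eventTimestampMs % dayMs);
  }

  public static void main(String[] args) {
    long eventTs = 1456934567890L;  // some arbitrary event time
    System.out.println(getTopOfTheDayTimestamp(eventTs));
  }
}
{code}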
[jira] [Commented] (YARN-4484) Available Resource calculation for a queue is not correct when used with labels
[ https://issues.apache.org/jira/browse/YARN-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175803#comment-15175803 ] Sunil G commented on YARN-4484: --- Hi [~leftnoteasy], Could you pls help to check the patch. > Available Resource calculation for a queue is not correct when used with > labels > --- > > Key: YARN-4484 > URL: https://issues.apache.org/jira/browse/YARN-4484 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4484.patch, 0002-YARN-4484.patch, > 0003-YARN-4484.patch > > > To calculate available resource for a queue, we have to get the total > resource allocated for all labels in queue compare to its usage. > Also address the comments given in > [YARN-4304-comments|https://issues.apache.org/jira/browse/YARN-4304?focusedCommentId=15064874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15064874 > ] given by [~leftnoteasy] for same. > ClusterMetrics related issues will also get handled once we fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4634) Scheduler UI/Metrics need to consider cases like non-queue label mappings
[ https://issues.apache.org/jira/browse/YARN-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4634: -- Attachment: 0004-YARN-4634.patch Updating patch correcting findbugs warning. > Scheduler UI/Metrics need to consider cases like non-queue label mappings > - > > Key: YARN-4634 > URL: https://issues.apache.org/jira/browse/YARN-4634 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4634.patch, 0002-YARN-4634.patch, > 0003-YARN-4634.patch, 0004-YARN-4634.patch > > > Currently when label-queue mappings are not available, there are few > assumptions taken in UI and in metrics. > In above case where labels are enabled and available in cluster but without > any queue mappings, UI displays queues under labels. This is not correct. > Currently labels enabled check and availability of labels are considered to > render scheduler UI. Henceforth we also need to check whether > - queue-mappings are available > - nodes are mapped with labels with proper exclusivity flags on > This ticket also will try to see the default configurations in queue when > labels are not mapped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4680) TimerTasks leak in ATS V1.5 Writer
[ https://issues.apache.org/jira/browse/YARN-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175792#comment-15175792 ] Jakob Stengård commented on YARN-4680: -- Hi. What are the symptoms of this issue? I'm having a problem with hiveserver2, which is creating a lot of timer tasks named "LogFDsCachecleanInActiveFDsTimer" and "LogFDsCacheFlushTimer". Eventually, hiveserver2 crashes. Could this be related to this bug? > TimerTasks leak in ATS V1.5 Writer > -- > > Key: YARN-4680 > URL: https://issues.apache.org/jira/browse/YARN-4680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-4680.1.patch, YARN-4680.20160108.patch, > YARN-4680.20160109.patch, YARN-4680.20160222.patch > > > We have seen TimerTasks leak which could cause application server done (such > as oozie server done due to too many active threads) > Although we have fixed some potentially leak situations in upper application > level, such as > https://issues.apache.org/jira/browse/MAPREDUCE-6618 > https://issues.apache.org/jira/browse/MAPREDUCE-6621, we still can not > guarantee that we fixed the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175772#comment-15175772 ] Sunil G commented on YARN-4755: --- {{appACLsUpdated}} is invoked in {{createAndPopulateNewRMApp}}. But {{appCreated}} will be sent only when the RMApp is created and the START event is fired. So apps which go through the secure path, or rejected apps, won't get appACLsUpdated if we club these 2 together. I am not very sure whether this is needed. [~rohithsharma], do you recollect anything along these lines? > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175759#comment-15175759 ] Jonathan Maron commented on YARN-4737: -- Enabling CSRF w/o auth will require the inclusion of the custom header for all invocations, regardless of whether they are secure invocations or not. I don't believe that is the expected usage model for the filter. As far as identifying auth mechanisms - I'm trying to find instances that would show the use of custom auth filters but I'm not really finding any. One theory I have is that looking up a value other than "Simple" for "hadoop.http.authentication.type" might provide a more general indicator of auth being enabled? Does that seem correct? POST requests from java clients should not be an issue - the filter only executes when a browser user agent is detected. BTW, the license issues (asflicense) don't appear even remotely related to this patch. > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
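A rough sketch of the check Jonathan is floating above; whether "simple" is the right sentinel, and whether this covers custom auth filters, is exactly the open question:
{code}
import org.apache.hadoop.conf.Configuration;

public class CsrfEnableCheckSketch {
  // Treat any configured hadoop.http.authentication.type other than "simple"
  // as "authentication is enabled" and only then wire up the CSRF filter.
  static boolean isAuthenticationEnabled(Configuration conf) {
    String authType = conf.get("hadoop.http.authentication.type", "simple");
    return !"simple".equalsIgnoreCase(authType);
  }
}
{code}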
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175752#comment-15175752 ] Naganarasimha G R commented on YARN-4754: - Initially I also suspected the same and then realised that if the ClientResponse is read, then the stream is closed. So I am not sure what is leaking here.. [~rohithsharma], any other error logs while processing the events? > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175747#comment-15175747 ] Naganarasimha G R commented on YARN-4755: - bq. I think appACLsUpdated cannot go with appCreated as it is a little early. So if appCreated can be delayed, this can be accommodated. But it comes at the cost of delayed notification to the timeline server. Sorry, I didn't get this, why is it early? The ACL is actually obtained from the app submission context, so there is no point in sending it as part of another event and not in AppCreated, right? Correct me if my understanding is wrong: {code} String appViewACLs = submissionContext.getAMContainerSpec() .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); rmContext.getSystemMetricsPublisher().appACLsUpdated( application, appViewACLs, System.currentTimeMillis()); {code} > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4754: --- Assignee: (was: Varun Saxena) > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175734#comment-15175734 ] Varun Vasudev commented on YARN-4737: - bq. Is the ATS leveraging another auth mechanism (or not using WebApps to construct the endpoint)? I took a look and it looks like the ATS doesn't use WebApps.Builder. Can you take a look at the startWebApp function in ApplicationHistoryServer.java? It handles the server setup. The impact of enabling CSRF on the ATS will have to be evaluated though - the RM and the Tez AM write to it via POST requests. bq. Is there another auth mechanism that can be enabled independent of API calls to WebApps.Builder? Admins can set up custom web authentication filters. You can look at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/HttpAuthentication.html for more details. What's the impact of enabling CSRF with no authentication? > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175731#comment-15175731 ] Sunil G commented on YARN-4755: --- Yes [~Naganarasimha Garla], those discussions and conclusions are perfectly fine. However, seeing this scale of events, I got the doubt again, as it's a tradeoff. In the ideal case, clubbing whichever events can be clubbed while recovering finished apps would be perfectly fine. I think appACLsUpdated cannot go with appCreated as it is a little early. So if appCreated can be delayed, this can be accommodated. But it comes at the cost of delayed notification to the timeline server. I think for this case, appACLsUpdated can be a part of appFinished. But we need to see how the existing code can be retained without impact. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175711#comment-15175711 ] Varun Saxena commented on YARN-4754: The relevant code in {{TimelineWriter#putEntities}}. close will close the underlying input stream. {code} public TimelinePutResponse putEntities( TimelineEntity... entities) throws IOException, YarnException { . ClientResponse resp = doPosting(entitiesContainer, null); return resp.getEntity(TimelinePutResponse.class); // ClientResponse object is not closed here. } {code} > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Varun Saxena >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
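A sketch of the kind of fix implied above (not the committed patch): keep the method shape from the quoted snippet but make sure the Jersey {{ClientResponse}} is closed so the connection is released even if reading the entity fails.
{code}
public TimelinePutResponse putEntities(TimelineEntity... entities)
    throws IOException, YarnException {
  // ... build entitiesContainer from the entities, as in the original method ...
  ClientResponse resp = doPosting(entitiesContainer, null);
  try {
    return resp.getEntity(TimelinePutResponse.class);
  } finally {
    resp.close();  // releases the underlying connection back to the client
  }
}
{code}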
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175697#comment-15175697 ] Jonathan Maron commented on YARN-4737: -- 1) Will do. 2) Will perform the renaming. As for the ATS - the only three WebApps instances I identified that have an authentication mechanism enabled were the three I modified. Is the ATS leveraging another auth mechanism (or not using WebApps to construct the endpoint)? 3) The CSRF protection doesn't make sense in the context of no auth mechanism, and the only auth mechanism I see enabled with WebApps is SPNEGO. Is there another auth mechanism that can be enabled independent of API calls to WebApps.Builder? > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175693#comment-15175693 ] Naganarasimha G R commented on YARN-4755: - I think we discussed these topics in the following JIRAs, YARN-3127 and YARN-4392, and the conclusion was that we were ok with republishing the events with exact data rather than not publishing at all, because it is not guaranteed that ATS events for apps in the state store were successfully published. Actually, *appACLsUpdated* need not be published separately; we can directly publish this information along with the appCreated event, which avoids one additional entity processing. We need to check what would be the ideal place to hold the ACL information and also ensure it is compatible with the current code. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
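Purely as an illustration of the clubbing idea above; the combined publisher call is a hypothetical signature used only for this sketch, not the existing SystemMetricsPublisher API:
{code}
// Hypothetical sketch: hand the view ACLs to the app-created notification so
// no separate appACLsUpdated event (and extra entity write) is generated.
String appViewACLs = submissionContext.getAMContainerSpec()
    .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
rmContext.getSystemMetricsPublisher().appCreated(
    application, appViewACLs, System.currentTimeMillis());
{code}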
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175689#comment-15175689 ] Naganarasimha G R commented on YARN-4700: - As [~varun_saxena] pointed out offline, and as the test results show, some HDFS modifications required to run my mini HBase cluster got into the patch; I will re-upload the patch without these changes ... > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-4700-YARN-2928.v1.001.patch, > YARN-4700-YARN-2928.v1.002.patch, YARN-4700-YARN-2928.wip.patch > > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175690#comment-15175690 ] Varun Saxena commented on YARN-4754: I think this is happening because we are not calling {{ClientResponse#close}}. This should be a problem in trunk too. > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Varun Saxena >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-4754: -- Assignee: Varun Saxena > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Varun Saxena >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175666#comment-15175666 ] Sunil G commented on YARN-4755: --- Hi [~Naganarasimha Garla], does this mean that the event will be raised to the timeline server only once for a completed app, as the AppFinished event? Is that the idea? > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175623#comment-15175623 ] Naganarasimha G R commented on YARN-4755: - Thanks [~rohithsharma], assigning ! > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-4755: --- Assignee: Naganarasimha G R > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S >Assignee: Naganarasimha G R > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175597#comment-15175597 ] Rohith Sharma K S commented on YARN-4755: - Approach sounds good. You can take up this JIRA!! > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175578#comment-15175578 ] Steve Loughran commented on YARN-4696: -- The findbugs warning is in the code that determines the scheme. It's correct; that code is no longer needed once {{FileSystem.newInstance()}} is used to instantiate a new FS instance. > EntityGroupFSTimelineStore to work in the absence of an RM > -- > > Key: YARN-4696 > URL: https://issues.apache.org/jira/browse/YARN-4696 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-4696-001.patch, YARN-4696-002.patch, > YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, > YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch, > YARN-4696-010.patch > > > {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the > configuration pointing to it. This is a new change, and impacts testing where > you have historically been able to test without an RM running. > The sole purpose of the probe is to automatically determine if an app is > running; it falls back to "unknown" if not. If the RM connection was > optional, the "unknown" codepath could be called directly, relying on age of > file as a metric of completion > Options > # add a flag to disable RM connect > # skip automatically if RM not defined/set to 0.0.0.0 > # disable retries on yarn client IPC; if it fails, tag app as unknown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
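For reference, a minimal sketch of the pattern the comment refers to: {{FileSystem.newInstance()}} hands back a private, non-cached FileSystem for the path's own scheme, so the store no longer needs its own scheme-determination code and can safely close the instance when done. The class and method names here are illustrative; only the {{FileSystem}} calls are real API.
{code}
// Hedged sketch: obtain a private FS instance for whatever scheme the path uses.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FsFactory {
  private FsFactory() {}

  public static FileSystem newFsFor(Path activeRoot, Configuration conf)
      throws IOException {
    // newInstance() bypasses the shared FileSystem cache, so closing this
    // instance later cannot break other users of the same scheme.
    return FileSystem.newInstance(activeRoot.toUri(), conf);
  }
}
{code}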
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175572#comment-15175572 ] Varun Vasudev commented on YARN-4744: - Actually, there are two warn statements that are logged. One is in executePrivilegedOperation() in PrivilegedOperationExecutor and the second one is in signalContainer() in DefaultLinuxContainerRuntime. I'm unsure of how to handle this. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of the error code but that the DefaultLinuxContainerRuntime shouldn't log the warning for invalid pids(similar to what LinuxContainerExecutor used to do before the refactoring). [~jlowe], [~vinodkv], [~rohithsharma] - what do you think? > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at
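A hedged sketch of the behaviour discussed above, not the committed fix: downgrade the "signal container failed" WARN when the exit code indicates the target process is already gone, while leaving other failures noisy. The helper class, method, and the meaning attached to exit code 9 are assumptions based on the log output above.
{code}
// Hedged sketch: log invalid-pid signal failures at DEBUG, everything else at WARN.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class SignalLogging {
  private static final Logger LOG = LoggerFactory.getLogger(SignalLogging.class);
  // Assumed meaning of container-executor exit code 9: process/pid not found.
  private static final int EXIT_CODE_INVALID_PID = 9;

  public static void logSignalFailure(String containerId, int exitCode, Exception cause) {
    if (exitCode == EXIT_CODE_INVALID_PID) {
      // The container process already exited; expected during cleanup races.
      LOG.debug("Signal to container {} skipped, process already gone", containerId, cause);
    } else {
      LOG.warn("Signal container {} failed, exit code {}", containerId, exitCode, cause);
    }
  }
}
{code}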
[jira] [Comment Edited] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175562#comment-15175562 ] Varun Vasudev edited comment on YARN-4737 at 3/2/16 1:11 PM: - Thanks for the patch [~jmaron]. 1) Can you please address the checkstyle, javadoc, and ASF license warnings in the pre-commit build? 2) Rename "yarn.resourcemanager.rest-csrf.\*" to "yarn.resourcemanager.webapp.rest-csrf.\*". Similar changes for nodemanager and JHS as well. I also noticed that you haven't added CSRF protection for the ATS. Is that going to be done in a follow up patch? 3) Currently the CSRF protection is enabled by {code} +if (hasSpnegoConf && hasCSRFEnabled(params)) { + String restCsrfClassName = RestCsrfPreventionFilter.class.getName(); + HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName, + restCsrfClassName, params, new String[] {"/*"}); +} {code} which means that users with custom web auth cannot use the filter. Can we remove the hasSpnegoConf check? was (Author: vvasudev): Thanks for the patch [~jmaron]. 1) Can you please address the checkstyle, javadoc, and ASF license warnings in the pre-commit build? 2) Rename "yarn.resourcemanager.rest-csrf.*" to "yarn.resourcemanager.webapp.rest-csrf.*". Similar changes for nodemanager and JHS as well. I also noticed that you haven't added CSRF protection for the ATS. Is that going to be done in a follow up patch? 3) Currently the CSRF protection is enabled by {code} +if (hasSpnegoConf && hasCSRFEnabled(params)) { + String restCsrfClassName = RestCsrfPreventionFilter.class.getName(); + HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName, + restCsrfClassName, params, new String[] {"/*"}); +} {code} which means that users with custom web auth cannot use the filter. Can we remove the hasSpnegoConf check? > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4737) Use CSRF Filter in YARN
[ https://issues.apache.org/jira/browse/YARN-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175562#comment-15175562 ] Varun Vasudev commented on YARN-4737: - Thanks for the patch [~jmaron]. 1) Can you please address the checkstyle, javadoc, and ASF license warnings in the pre-commit build? 2) Rename "yarn.resourcemanager.rest-csrf.*" to "yarn.resourcemanager.webapp.rest-csrf.*". Similar changes for nodemanager and JHS as well. I also noticed that you haven't added CSRF protection for the ATS. Is that going to be done in a follow up patch? 3) Currently the CSRF protection is enabled by {code} +if (hasSpnegoConf && hasCSRFEnabled(params)) { + String restCsrfClassName = RestCsrfPreventionFilter.class.getName(); + HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName, + restCsrfClassName, params, new String[] {"/*"}); +} {code} which means that users with custom web auth cannot use the filter. Can we remove the hasSpnegoConf check? > Use CSRF Filter in YARN > --- > > Key: YARN-4737 > URL: https://issues.apache.org/jira/browse/YARN-4737 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, webapp >Reporter: Jonathan Maron >Assignee: Jonathan Maron > Attachments: YARN-4737.001.patch > > > A CSRF filter was added to hadoop common > (https://issues.apache.org/jira/browse/HADOOP-12691). The aim of this JIRA > is to come up with a mechanism to integrate this filter into the webapps for > which it is applicable (web apps that may establish an authenticated > identity). That includes the RM, NM, and mapreduce jobhistory web app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
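A minimal sketch of what dropping the {{hasSpnegoConf}} guard might look like, following the shape of the snippet quoted in the comment. {{hasCSRFEnabled}}, the parameter key, and the helper class name are stand-ins for whatever the patch under review actually defines; only the filter-definition call mirrors the quoted code.
{code}
// Hedged sketch: define the CSRF filter whenever it is configured, without
// requiring SPNEGO, so deployments with custom web auth can still use it.
import java.util.Map;
import org.apache.hadoop.http.HttpServer2;
import org.apache.hadoop.security.http.RestCsrfPreventionFilter;

public final class CsrfSetup {
  private CsrfSetup() {}

  static void addCsrfFilterIfEnabled(HttpServer2 server, Map<String, String> params) {
    if (!hasCSRFEnabled(params)) {
      return; // filter not requested; SPNEGO is no longer a precondition
    }
    String restCsrfClassName = RestCsrfPreventionFilter.class.getName();
    HttpServer2.defineFilter(server.getWebAppContext(), restCsrfClassName,
        restCsrfClassName, params, new String[] {"/*"});
  }

  private static boolean hasCSRFEnabled(Map<String, String> params) {
    // Assumed parameter name; the patch's *.webapp.rest-csrf.enabled property
    // would be mapped onto whatever key it actually uses here.
    return "true".equals(params.get("rest-csrf.enabled"));
  }
}
{code}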
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175537#comment-15175537 ] Naganarasimha G R commented on YARN-4754: - [~rohithsharma], is this 2.7.2 version ? > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175384#comment-15175384 ] Naganarasimha G R commented on YARN-4755: - Hi [~rohithsharma], i was planning to add this as part of App itself so that new event is not required for the same. Thoughts ? If ok i can take this issue up. > Optimize sending appACLsUpdated event to TimelineServer while recovering > completed applications > --- > > Key: YARN-4755 > URL: https://issues.apache.org/jira/browse/YARN-4755 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rohith Sharma K S > > In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent > to timelineserver for every application that get created. > {code} > private RMAppImpl createAndPopulateNewRMApp( > ApplicationSubmissionContext submissionContext, long submitTime, > String user, boolean isRecovery) throws YarnException { > // > // > String appViewACLs = submissionContext.getAMContainerSpec() > .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); > rmContext.getSystemMetricsPublisher().appACLsUpdated( > application, appViewACLs, System.currentTimeMillis()); > return application; > } > {code} > Say if we have 10K completed applications to recover, 30K events will be > generated i.e app_created, app_finished and app_acl_updated. For completed > applications, I think need not to send app-acl-updated event with which > gradually reduce load on the dispatcher. > Eventhough MultiDispatcher is used to publish timeline events, it is bottle > neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
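One way to read the proposal of carrying the ACLs "as part of App itself" is to attach the view ACLs to the timeline entity published at app creation, so no separate appACLsUpdated event is needed. A hedged sketch follows; the other-info key and the builder class are illustrative, not the keys SystemMetricsPublisher actually uses.
{code}
// Hedged sketch: fold the view ACLs into the application's own timeline entity.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

public final class AppEntityBuilder {
  private AppEntityBuilder() {}

  static TimelineEntity appCreatedEntity(ApplicationId appId, String viewAcls,
      long createdTime) {
    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("YARN_APPLICATION");
    entity.setEntityId(appId.toString());
    entity.setStartTime(createdTime);
    // Carrying the ACLs on the same entity removes the need for a separate event.
    entity.addOtherInfo("YARN_APPLICATION_VIEW_ACLS", viewAcls == null ? "" : viewAcls);
    return entity;
  }
}
{code}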
[jira] [Created] (YARN-4755) Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications
Rohith Sharma K S created YARN-4755: --- Summary: Optimize sending appACLsUpdated event to TimelineServer while recovering completed applications Key: YARN-4755 URL: https://issues.apache.org/jira/browse/YARN-4755 Project: Hadoop YARN Issue Type: Improvement Reporter: Rohith Sharma K S In method {{RMAppManager#createAndPopulateNewRMApp}}, appACLsUpdated is sent to timelineserver for every application that get created. {code} private RMAppImpl createAndPopulateNewRMApp( ApplicationSubmissionContext submissionContext, long submitTime, String user, boolean isRecovery) throws YarnException { // // String appViewACLs = submissionContext.getAMContainerSpec() .getApplicationACLs().get(ApplicationAccessType.VIEW_APP); rmContext.getSystemMetricsPublisher().appACLsUpdated( application, appViewACLs, System.currentTimeMillis()); return application; } {code} Say if we have 10K completed applications to recover, 30K events will be generated i.e app_created, app_finished and app_acl_updated. For completed applications, I think need not to send app-acl-updated event with which gradually reduce load on the dispatcher. Eventhough MultiDispatcher is used to publish timeline events, it is bottle neck when max-completed is configured very high value may be 100K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
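One possible shape of the optimization described above, reusing the names from the quoted snippet: only publish appACLsUpdated for newly submitted applications, not while recovering ones that have already completed. This is a fragment, not the actual patch; whether recovery alone is a sufficient guard or the app's final state must also be checked is left open.
{code}
private RMAppImpl createAndPopulateNewRMApp(
    ApplicationSubmissionContext submissionContext, long submitTime,
    String user, boolean isRecovery) throws YarnException {
  // ... application is created and registered as before ...
  if (!isRecovery) {
    // Skip the extra event for recovered apps; they already carried their ACLs once.
    String appViewACLs = submissionContext.getAMContainerSpec()
        .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
    rmContext.getSystemMetricsPublisher().appACLsUpdated(
        application, appViewACLs, System.currentTimeMillis());
  }
  return application;
}
{code}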
[jira] [Updated] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4754: Attachment: ConnectionLeak.rar > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: ConnectionLeak.rar > > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175325#comment-15175325 ] Rohith Sharma K S commented on YARN-4754: - As a result of above sometimes RM itself wont get resources to publish which causes entity publish fails. Exception trace- {noformat} 2016-03-01 11:34:34,325 ERROR org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: Error when publishing entity [YARN_APPLICATION,application_1456545891178_0950] com.sun.jersey.api.client.ClientHandlerException: java.net.SocketException: Too many open files at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:235) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:184) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:246) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:481) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:324) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:321) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1711) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:306) at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:456) at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationACLsUpdatedEvent(SystemMetricsPublisher.java:320) at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:232) at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:473) at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:468) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:189) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:117) at java.lang.Thread.run(Thread.java:745) {noformat} > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. 
This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188
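The trace above goes through Jersey's URLConnectionClientHandler; with that stack, a common source of client-side CLOSE_WAIT and "Too many open files" is a ClientResponse whose entity is never consumed or closed. A hedged sketch of the defensive pattern follows, not claimed to be where YARN-4754's leak actually sits; the class and method names are illustrative, the Jersey 1.x calls are real.
{code}
// Hedged sketch: always close the Jersey ClientResponse so the connection is released.
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;
import javax.ws.rs.core.MediaType;

public final class ClosingPoster {
  private ClosingPoster() {}

  static int postAndRelease(WebResource resource, Object entities) {
    ClientResponse response = resource
        .type(MediaType.APPLICATION_JSON)
        .post(ClientResponse.class, entities);
    try {
      return response.getStatus();
    } finally {
      response.close();   // releases the pooled HTTP connection even on errors
    }
  }
}
{code}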
[jira] [Moved] (YARN-4754) Too many connection opened to TimelineServer while publishing entities
[ https://issues.apache.org/jira/browse/YARN-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S moved HADOOP-12863 to YARN-4754: -- Key: YARN-4754 (was: HADOOP-12863) Project: Hadoop YARN (was: Hadoop Common) > Too many connection opened to TimelineServer while publishing entities > -- > > Key: YARN-4754 > URL: https://issues.apache.org/jira/browse/YARN-4754 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > It is observed that there are too many connections are kept opened to > TimelineServer while publishing entities via SystemMetricsPublisher. This > cause sometimes resource shortage for other process or RM itself > {noformat} > tcp0 0 10.18.99.110:3999 10.18.214.60:59265 > ESTABLISHED 115302/java > tcp0 0 10.18.99.110:25001 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25002 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25003 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25004 :::*LISTEN > 115302/java > tcp0 0 10.18.99.110:25005 :::*LISTEN > 115302/java > tcp1 0 10.18.99.110:48866 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48137 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47553 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48424 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48139 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:48096 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:47558 10.18.99.110:8188 > CLOSE_WAIT 115302/java > tcp1 0 10.18.99.110:49270 10.18.99.110:8188 > CLOSE_WAIT 115302/java > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175284#comment-15175284 ] Bibin A Chundatt commented on YARN-4744: [~sidharta-s] Can we use similar check like {{LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx)}}. > Too many signal to container failure in case of LCE > --- > > Key: YARN-4744 > URL: https://issues.apache.org/jira/browse/YARN-4744 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Sidharta Seethana > > Install HA cluster in secure mode > Enable LCE with cgroups > Start server with dsperf user > Submit mapreduce application terasort/teragen with user yarn/dsperf > Too many signal to container failure > Submit with user the exception is thrown > {noformat} > 2014-03-02 09:20:38,689 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for testing (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB > 2014-03-02 09:20:40,158 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Event EventType: KILL_CONTAINER sent to absent container > container_e02_1393731146548_0001_01_13 > 2014-03-02 09:20:43,071 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container container_e02_1393731146548_0001_01_09 succeeded > 2014-03-02 09:20:43,072 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_e02_1393731146548_0001_01_09 transitioned from > RUNNING to EXITED_WITH_SUCCESS > 2014-03-02 09:20:43,073 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_e02_1393731146548_0001_01_09 > 2014-03-02 09:20:43,075 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: > Using container runtime: DefaultLinuxContainerRuntime > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 9. Privileged Execution Operation Output: > main : command provided 2 > main : run as user is yarn > main : requested yarn user is yarn > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > yarn, yarn, 2, 9370, 15] > 2014-03-02 09:20:43,081 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: > Signal container failed. 
Exception: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=9: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=9: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 9 more > 2014-03-02 09:20:43,113 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn > OPERATION=Container
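A hedged sketch of the check suggested above: probe container liveness before sending the cleanup signal, so a process that has already exited does not surface as a failed privileged operation. The wiring into ContainerLaunch, the helper class, and the exact builder setters are assumptions; only {{isContainerAlive(ContainerLivenessContext)}} and {{signalContainer}} are taken from the discussion.
{code}
// Hedged sketch: only signal the container if the executor still considers it alive.
import java.io.IOException;
import org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor;
import org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.Signal;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;
import org.apache.hadoop.yarn.server.nodemanager.executor.ContainerLivenessContext;
import org.apache.hadoop.yarn.server.nodemanager.executor.ContainerSignalContext;

final class CleanupHelper {
  static void signalIfAlive(ContainerExecutor exec, Container container,
      String user, String pid, Signal signal) throws IOException {
    boolean alive = exec.isContainerAlive(new ContainerLivenessContext.Builder()
        .setContainer(container)
        .setUser(user)
        .setPid(pid)
        .build());
    if (!alive) {
      return;   // nothing to signal; avoids the spurious WARN seen in the logs above
    }
    exec.signalContainer(new ContainerSignalContext.Builder()
        .setContainer(container)
        .setUser(user)
        .setPid(pid)
        .setSignal(signal)
        .build());
  }
}
{code}
Note that the liveness probe and the signal are still racy: the process can exit in between, so the signal path would still need to tolerate an invalid-pid result quietly.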