[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005667#comment-14005667 ] Tsuyoshi OZAWA commented on YARN-2017: -- Good job! Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Fix For: 2.5.0 Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, YARN-2017.6.patch, YARN-2017.7.patch A bunch of the same code is repeated among schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
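A minimal sketch of the common-base idea, for illustration; the SchedulerNode name and members here are assumptions for this example, not necessarily what the attached patches do:
{code}
// Hypothetical shared base class; FiCaSchedulerNode and FSSchedulerNode
// would extend it and keep only their scheduler-specific logic.
public abstract class SchedulerNode {
  private Resource availableResource = Resource.newInstance(0, 0);
  private Resource usedResource = Resource.newInstance(0, 0);

  public Resource getAvailableResource() {
    return availableResource;
  }

  public Resource getUsedResource() {
    return usedResource;
  }
}
{code}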
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005674#comment-14005674 ] Tsuyoshi OZAWA commented on YARN-1474: -- I'm rebasing a patch on YARN-2017. Please wait a moment. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize method but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
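A rough sketch of what fitting a scheduler into the service model could look like, assuming the scheduler extends AbstractService; the helper methods are placeholders, not code from the attached patches:
{code}
public class FairScheduler extends AbstractService implements ResourceScheduler {
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    initScheduler(conf);     // hypothetical: one-time setup formerly in reinitialize()
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    startSchedulerThreads(); // hypothetical: e.g. background update threads
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    stopSchedulerThreads();  // hypothetical
    super.serviceStop();
  }
}
{code}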
[jira] [Assigned] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-1801: - Assignee: Hong Zhiguo NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical While investigating YARN-1800, I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-1801: -- Attachment: YARN-1801.patch {code}Path local = completed.get();{code} may throw ExecutionException, and assoc may be null. When both happen, we get an NPE in {code}LOG.info("Failed to download rsrc " + assoc.getResource(), e.getCause());{code} And this is exactly line ResourceLocalizationService.java:712 of commit dd9c059 (2013-10-05, YARN-1254). NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800, I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
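To make the failure mode concrete, here is a sketch of the pattern (variable names follow the discussion above; the guarded log call is one possible shape of a fix, not necessarily the attached patch):
{code}
Future<Path> completed = queue.take();
LocalizerResourceRequestEvent assoc = pending.remove(completed);
try {
  Path local = completed.get();   // may throw ExecutionException
  // ... publish the localized path ...
} catch (ExecutionException e) {
  // assoc can be null here; guard the dereference that used to NPE:
  LOG.info("Failed to download rsrc "
      + (assoc != null ? assoc.getResource() : "(unknown)"), e.getCause());
}
{code}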
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005700#comment-14005700 ] Hadoop QA commented on YARN-2049: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646158/YARN-2049.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3788//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3788//console This message is automatically generated. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005718#comment-14005718 ] Steve Loughran commented on YARN-2092: -- This seems to stem from the HADOOP-10104 patch, which went in because the 2.2+ version of jackson was so out of date it was breaking other things. I'm not sure it's so much incompatible as that TEZ is trying to push in its own version of jackson, which is then leading to classpath mixing problems. Even if you try to push in one set of the JARs ahead of the other, things are going to break. I know, I've tried. jackson 1.x should be compatible at run time with code built for previous versions. If there's a link problem there, then it's something we can take up with the Jackson team. Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT Key: YARN-2092 URL: https://issues.apache.org/jira/browse/YARN-2092 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Came across this when trying to integrate with the timeline server. Using a 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005722#comment-14005722 ] Binglin Chang commented on YARN-2088: - Hi Zhiguo, thanks for the comments; nice catch. Those two lines are used in every record class... so deleting them in a single place actually breaks the code convention, and it's not related to this JIRA. We may discuss whether to delete them all in another JIRA. Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2088.v1.patch Some fields (set, list) are added to proto builders multiple times; we need to clear those fields before adding, otherwise the resulting proto contains extra contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
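For reference, the clear-before-add pattern the description calls for looks roughly like this; the applicationTypes field name is illustrative, not necessarily the field fixed by the attached patch:
{code}
private void mergeLocalToBuilder() {
  if (applicationTypes != null && !applicationTypes.isEmpty()) {
    builder.clearApplicationTypes();   // avoid appending duplicates on re-merge
    builder.addAllApplicationTypes(applicationTypes);
  }
}
{code}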
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005753#comment-14005753 ] Binglin Chang commented on YARN-2030: - Hi Jian He, thanks for the comments. Looks like the PBImpl already has ProtoBase as its superclass, so we can't change the interface to an abstract class: {code}public class ApplicationAttemptStateDataPBImpl extends ProtoBase<ApplicationAttemptStateDataProto> implements ApplicationAttemptStateData {{code} Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Binglin Chang Attachments: YARN-2030.v1.patch, YARN-2030.v2.patch Now the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
{code}
This not only confuses people but also leads to mistakes easily. We may leverage a state machine to simplify this, even if there are no state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
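A minimal sketch of the state-machine direction, assuming Hadoop's StateMachineFactory; the single ACTIVE state and the transition class names are illustrative, not the attached patches:
{code}
private static final StateMachineFactory<RMStateStore, RMStateStoreState,
    RMStateStoreEventType, RMStateStoreEvent> stateMachineFactory =
  new StateMachineFactory<RMStateStore, RMStateStoreState,
      RMStateStoreEventType, RMStateStoreEvent>(RMStateStoreState.ACTIVE)
    .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
        RMStateStoreEventType.STORE_APP, new StoreAppTransition())
    .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
        RMStateStoreEventType.UPDATE_APP, new UpdateAppTransition())
    .installTopology();

private static class StoreAppTransition
    implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
  @Override
  public void transition(RMStateStore store, RMStateStoreEvent event) {
    // logic that used to live in one branch of handleStoreEvent()
  }
}
{code}
Each branch of the old if/else chain becomes its own transition class, so adding a new event type no longer touches a shared dispatch method.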
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005804#comment-14005804 ] Steve Loughran commented on YARN-2092: -- I should add that the underlying issue is that the AM gets the entire CP from the {{yarn.lib.classpath}}. That's mandatory to pick up a version of the hadoop binaries (and -site.xml files) compatible with the rest of the cluster. But it brings in all the other dependencies which hadoop itself relies on. As hadoop evolves, this problem will only continue. The only viable long-term solution is to somehow support OSGi-launched AMs, so the AM only gets the org.apache.hadoop classes from the hadoop JARs, and has to explicitly add everything itself. See HADOOP-7977 for this - maybe it's something we could target for Hadoop 3.0, driven by the needs of AMs. Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT Key: YARN-2092 URL: https://issues.apache.org/jira/browse/YARN-2092 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Came across this when trying to integrate with the timeline server. Using a 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005863#comment-14005863 ] Rohith commented on YARN-1366: -- bq. I mean, what will go wrong if we allow unregister without register? Is it fundamentally wrong? Allowing unregister without register moves the application to the FINISHED state (after handling the unregistered event at LAUNCHED), which is supposed to be the FAILED state. If that is acceptable, then it's fine to go ahead. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
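For context, the AM-side resync described here would look roughly like the following sketch; buildFullOutstandingRequest is a hypothetical helper, not an API in the patches:
{code}
AllocateResponse response = resourceManager.allocate(allocateRequest);
if (response.getAMCommand() == AMCommand.AM_RESYNC) {
  lastResponseId = 0;                              // reset the allocate RPC sequence number
  allocateRequest = buildFullOutstandingRequest(); // re-send all outstanding asks/releases
}
{code}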
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006066#comment-14006066 ] Eric Payne commented on YARN-415: - The Generic Application History Server stores all of the information about containers that is needed to calculate memory-seconds and vcore-seconds. Right now, since the Generic Application History Server is tied closely to the Timeline Server, this does not work on a secured cluster. Also, the information is only available via the REST API right now, and there would need to be some scripting and parsing of the REST APIs to roll up metrics for each app. So, I think this JIRA would still be very helpful and useful. FYI, on an unsecured cluster with the Generic Application History Server and the Timeline Server configured and running, the following REST APIs will give enough information about an app to calculate memory-seconds and vcore-seconds: {panel:title=Get list of app attempts for a specified appID|titleBGColor=#F7D6C1} curl --compressed -H "Accept: application/json" -X GET http://hostname:port/ws/v1/applicationhistory/apps/appID/appattempts {panel} {panel:title=For each app attempt, get all container info|titleBGColor=#F7D6C1} curl --compressed -H "Accept: application/json" -X GET http://hostname:port/ws/v1/applicationhistory/apps/appID/appattempts/appAttemptID/containers {panel} Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
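The roll-up itself is simple once the container records are in hand; a sketch over ContainerReport records such as a history-server client can return, assuming memory is reported in MB and times in milliseconds:
{code}
long memorySeconds = 0;
long vcoreSeconds = 0;
for (ContainerReport c : containers) {
  long lifetimeSec = (c.getFinishTime() - c.getCreationTime()) / 1000;
  memorySeconds += (long) c.getAllocatedResource().getMemory() * lifetimeSec;
  vcoreSeconds += (long) c.getAllocatedResource().getVirtualCores() * lifetimeSec;
}
{code}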
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.16.patch Rebased on trunk. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006222#comment-14006222 ] Abin Shahab commented on YARN-1964: --- Do others have comments on it, [~acmurthy]? Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In the context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with the requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006248#comment-14006248 ] Wei Yan commented on YARN-596: -- Hey, [~sandyr], sorry for the late reply. Still confused here. So as you said, a queue is safe and doesn't allow preemption only if it satisfies the condition (usage.memory <= fairshare.memory) && (usage.vcores <= fairshare.vcores). This condition works fine for DRF. But for FairSharePolicy, since fairshare.vcores is always 0 (except for root), this condition cannot be satisfied and all queues always allow preemption. In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006453#comment-14006453 ] Vinod Kumar Vavilapalli commented on YARN-2049: --- Thanks for working on this, Zhijie! Some comments on the patch: TimelineKerberosAuthenticator - Not clear what TimelineDelegationTokenResponse validateAndParseResponse() is doing with class loading, construction etc. Can you explain and maybe also add code comments? TimelineAuthenticationFilter - Explain what getConfiguration() overrides and add a code comment? TimelineKerberosAuthenticationHandler - This borrows a lot of code from HttpFSKerberosAuthenticationHandler.java. We should refactor either here or in a separate JIRA. Nits - TestDistributedShell change is unnecessary - TimelineDelegationTokenSelector: Wrap the debug logging in debugEnabled checks. - ApplicationHistoryServer.java -- Forced config setting of the filter: What happens if the cluster has another authentication filter? Is the guideline to override it (which is what the patch is doing)? h4. Source code refactor TimelineKerberosAuthenticationHandler - Rename to TimelineClientAuthenticationService? TimelineKerberosAuthenticator - It seems like TimelineKerberosAuthenticator is completely client side code and so should be moved to the client module - To do that we will extract some of the constants and the DelegationTokenOperation enum as top level entities into the common module. TimelineAuthenticationFilterInitializer - This is almost the same as the common AuthenticationFilterInitializer.java. Let's just refactor AuthenticationFilterInitializer.java and extend it to only change class names. Similar to how TimelineAuthenticationFilter extends AuthenticationFilter. TimelineDelegationTokenSecretManagerService: - We are sharing the configs for update/renewal etc with the ResourceManager. That seems fine for now - logically you want both the tokens to follow similar expiry and life-cycle - This also shares a bunch of code with org/apache/hadoop/lib/service/security/DelegationTokenManagerService. We may or may not want to reuse some code - just throwing it out there. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006459#comment-14006459 ] Mayank Bansal commented on YARN-2074: - Thanks [~jianhe] for the patch. Overall looks good. Some nits: {code}maxAppAttempts = attempts.size(){code} Can we use this instead? {code}maxAppAttempts == getAttemptFailureCount(){code} {code}public boolean isPreempted() { return getDiagnostics().contains(SchedulerUtils.PREEMPTED_CONTAINER); }{code} I think we need to compare the exit status (-102) instead of relying on the string message. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
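The exit-status variant of the check would look roughly like this; getFinishedStatus is a hypothetical accessor, while ContainerExitStatus.PREEMPTED is the -102 constant in the YARN API:
{code}
public boolean isPreempted() {
  ContainerStatus status = getFinishedStatus();  // hypothetical accessor
  return status != null
      && status.getExitStatus() == ContainerExitStatus.PREEMPTED;
}
{code}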
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006465#comment-14006465 ] Jian He commented on YARN-1408: --- Hi [~sunilg], agree that we should remove the container from newlyAllocatedContainers when preemption happens. As per the race condition you mentioned, we may also preempt an ACQUIRED container? In fact, I think the best containers to be preempted are the ALLOCATED containers, as these containers are not yet alive from the user's perspective. As per the race condition that [RM lost the resource request], today the resource request is decremented when the container is allocated. We may change it to decrement the resource request only when the container is pulled by the AM? We can do this separately if it makes sense. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
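A sketch of the first idea above, pulling a still-ALLOCATED container back out of the newly-allocated list on kill; removeNewlyAllocatedContainer is a hypothetical helper, not an existing API:
{code}
if (rmContainer.getState() == RMContainerState.ALLOCATED) {
  // Never hand a dead container to the AM, avoiding ACQUIRED at KILLED.
  application.removeNewlyAllocatedContainer(rmContainer);
}
{code}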
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.3.patch Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006525#comment-14006525 ] Jian He commented on YARN-2074: --- Thanks Xuan and Mayank for the review! bq. maxAppAttempts == getAttemptFailureCount() Good point. Fixed the check to compare against the exit status to determine preempted or not. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.6.patch Updated the patch accordingly. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006542#comment-14006542 ] Zhijie Shen commented on YARN-2049: --- Thanks for the review, Vinod and Varun! Please see the responses below: bq. 1. In the function managementOperation, should there be a null check for token? There's the following code before processing each dtOp:
{code}
if (dtOp.requiresKerberosCredentials() && token == null) {
  response.sendError(HttpServletResponse.SC_UNAUTHORIZED,
      MessageFormat.format(
          "Operation [{0}] requires SPNEGO authentication established", dtOp));
  requestContinues = false;
}
{code}
Get and renew both require kerberos credentials, such that if token == null, the code will fall into this part. Cancel didn't require credentials before, referring to HttpFS's code. However, I think we should enforce kerberos credentials for cancel as well. After that, the NPE risk is gone. bq. In the function managementOperation, you call secretManager.cancelToken(dt, UserGroupInformation.getCurrentUser().getUserName()) - should you use getCurrentUser().getUserName? or ownerUGI.getUserName()? Good catch, we should use token.getUserName here as well. bq. TimelineKerberosAuthenticator Some errors may cause TimelineAuthenticator to not get the correct response. If the status code is not 200, the json content may contain the exception information from the server; we can use that information to recover the exception object. This is inspired by HttpFSUtils.validateResponse, but I changed to use Jackson to parse the json content here. bq. TimelineAuthenticationFilter In the configuration we can simply set the authentication type to kerberos, but in the timeline server, we want to replace it with the class name of the customized authentication service. Otherwise, the standard authentication handler will be used instead. I added the code comments there. bq. TimelineKerberosAuthenticationHandler bq. TimelineDelegationTokenSecretManagerService. Yeah, we need to look into how to reuse the existing code, but how about postponing it? I'm going to file a separate JIRA for code refactoring. bq. TestDistributedShell change is unnecessary Removed. bq. TimelineDelegationTokenSelector: Wrap the debug logging in debugEnabled checks. Added the debugEnabled checks. bq. ApplicationHistoryServer.java Actually it will not override the other initializers. Instead, I just append a TimelineAuthenticationFilterInitializer. Anyway, I enhanced the condition here: not only should security be enabled, but kerberos authentication should also be desired. bq. TimelineKerberosAuthenticationHandler Done. bq. TimelineKerberosAuthenticator. Good suggestion. I split the code accordingly. bq. TimelineAuthenticationFilterInitializer AuthenticationFilterInitializer has a single method to do everything, and the prefix is a static variable, which makes it a bit difficult for me to override part of the code without changing AuthenticationFilterInitializer. Another issue is that AuthenticationFilterInitializer requires the user to supply a secret file, which is not actually required by AuthenticationFilter (HADOOP-10600). Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006548#comment-14006548 ] Mayank Bansal commented on YARN-1408: - I agree with [~jianhe] and [~devaraj.k]. We should be able to preempt the container in the ALLOCATED state. bq. today the resource request is decremented when the container is allocated. we may change it to decrement the resource request only when the container is pulled by the AM? I am not sure if that's the right thing, as you don't want to run into other race conditions where the container has been allocated but the capacity is given to some other AMs. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
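One plausible shape for the proposed cap, checked before launching an AM container; maxAMShare and amResourceUsage are assumed names for this sketch, not necessarily those in the attached patch:
{code}
public boolean canRunAppAM(Resource amResource) {
  // Cap the queue's aggregate AM resource at a fraction of its fair share.
  Resource maxAMResource = Resources.multiply(getFairShare(), maxAMShare);
  Resource ifRunAMResource = Resources.add(amResourceUsage, amResource);
  return Resources.fitsIn(ifRunAMResource, maxAMResource);
}
{code}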
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006609#comment-14006609 ] Sandy Ryza commented on YARN-596: - Ah, I see what you're saying. Good point. In that case we'll probably need to push that check into the SchedulingPolicy and call it inside the loop in preemptContainer(). In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006613#comment-14006613 ] Jian He commented on YARN-1408: --- There seem to be more problems with the approach I mentioned: if the request is not updated at the time the container is allocated, and the AM doesn't do the following allocate, more containers will be allocated for the same request when NMs heartbeat. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity A jobA task which uses queue b capacity has been preempted and killed. This caused the below problem: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately as per preemption. Here the ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30 minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006615#comment-14006615 ] Wei Yan commented on YARN-596: -- Yes, we can check the queue's policy in the preCheck function. If DRF, we use Resources.fitsIn(); if Fair, we use DEFAULT_CALCULATOR. Sounds good? In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2012: - Description: Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). was: Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule as terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006641#comment-14006641 ] Sandy Ryza commented on YARN-596: - Sounds good In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006645#comment-14006645 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646401/YARN-2049.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3789//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3789//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3789//console This message is automatically generated. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006676#comment-14006676 ] Sandy Ryza commented on YARN-596: - The current patch uses the queue's own policy in preemptContainerPreCheck. We should use the parent's policy. (Consider the case of a leaf queue with FIFO under a parent queue with DRF - we should use DRF to decide whether we should skip the leaf queue.) Also, we should add a new method to SchedulingPolicy instead of checking with instanceof.
{code}
+ if (Resources.fitsIn(getResourceUsage(), getFairShare())) {
+   return false;
+ } else {
+   return true;
+ }
{code}
Can just use return !Resources.fitsIn(getResourceUsage(), getFairShare()). In fair scheduler, intra-application container priorities affect inter-application preemption decisions --- Key: YARN-596 URL: https://issues.apache.org/jira/browse/YARN-596 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
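The SchedulingPolicy hook could look like the sketch below; the method name is illustrative. DRF would compare every resource dimension, while the plain fair policy compares memory only:
{code}
public abstract class SchedulingPolicy {
  /** @return whether usage is over this policy's notion of fair share */
  public abstract boolean checkIfUsageOverFairShare(
      Resource usage, Resource fairShare);
}

// DominantResourceFairnessPolicy: any dimension over fair share counts.
//   return !Resources.fitsIn(usage, fairShare);
// FairSharePolicy: memory is the only dimension with a meaningful fair share.
//   return usage.getMemory() > fairShare.getMemory();
{code}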
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006682#comment-14006682 ] Sandy Ryza commented on YARN-2073: -- {code}+ /** Preemption related variables */{code} Nit: use // like the other comments. Can you add the new property to the Fair Scheduler doc? {code}+ updateRootQueueMetrics();{code} My understanding is that this shouldn't be needed in shouldAttemptPreemption. Have you observed otherwise? Would it be possible to move the TestFairScheduler refactoring to a separate JIRA? If it's too difficult to disentangle at this point, I'm ok with it. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
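For reference, a sketch of the utilization-threshold check this JIRA discusses; preemptionUtilizationThreshold and the surrounding field names are assumptions for this example:
{code}
private boolean shouldAttemptPreemption() {
  if (preemptionEnabled) {
    // Preempt only when either memory or vcore utilization crosses the threshold.
    return preemptionUtilizationThreshold < Math.max(
        (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(),
        (float) rootMetrics.getAllocatedVirtualCores()
            / clusterResource.getVirtualCores());
  }
  return false;
}
{code}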
[jira] [Resolved] (YARN-2095) Large MapReduce Job stops responding
[ https://issues.apache.org/jira/browse/YARN-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2095. --- Resolution: Invalid [~sunliners81], we have run much bigger jobs (100K maps) and jobs that run for a long time without any issues. There is only one limitation that I know of - in secure clusters, tokens expire after 7 days. In any case, please pursue this on the user mailing lists and create a bug when you are sure there is one. Closing this as invalid for now; please reopen if you disagree. Large MapReduce Job stops responding Key: YARN-2095 URL: https://issues.apache.org/jira/browse/YARN-2095 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.3 (x86_64) on vmware 10 running HDP-2.0.6 Reporter: Clay McDonald Priority: Blocker Very large jobs (7,455 Mappers and 999 Reducers) hang. Jobs run well, but logging to container logs stops after running for 33 hours. The job appears to be hung. The status of the job is RUNNING. No error messages found in logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006705#comment-14006705 ] Hadoop QA commented on YARN-2073: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646425/yarn-2073-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3790//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3790//console This message is automatically generated. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006735#comment-14006735 ] Mayank Bansal commented on YARN-2074: - +1 LGTM Thanks, Mayank Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.
[ https://issues.apache.org/jira/browse/YARN-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2024: -- Issue Type: Sub-task (was: Bug) Parent: YARN-431 IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state. Key: YARN-2024 URL: https://issues.apache.org/jira/browse/YARN-2024 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 0.23.10, 2.4.0 Reporter: Eric Payne Priority: Critical Multiple issues were encountered when AppLogAggregatorImpl hit an IOException in AppLogAggregatorImpl#uploadLogsForContainer while aggregating yarn-logs for an application that had very large (150G each) error logs. - An IOException was encountered during the LogWriter#append call, and a message was printed, but no stacktrace was provided. Message: ERROR: Couldn't upload logs for container_n_nnn_nn_nn. Skipping this container. - After the IOException, the TFile is in a bad state, so subsequent calls to LogWriter#append fail with the following stacktrace: 2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LogAggregationService #17907,5,main] threw an Exception. java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE at org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164) ... - At this point, the yarn-logs cleaner still thinks the thread is aggregating, so the huge yarn-logs never get cleaned up for that application. -- This message was sent by Atlassian JIRA (v6.2#6252)
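The first problem above is simply that the exception object is not handed to the logger. A minimal sketch of the logging side of a fix, under the assumption that the aggregator can abandon the writer once the TFile has thrown (the Appender interface and all names here are illustrative, not the actual AppLogAggregatorImpl code):
{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class LogUploadSketch {
  private static final Log LOG = LogFactory.getLog(LogUploadSketch.class);

  /** Stand-in for LogWriter#append; illustrative only. */
  interface Appender {
    void append(String containerId) throws IOException;
  }

  /** Returns false once the writer has failed, so callers stop reusing it. */
  static boolean tryUpload(Appender writer, String containerId) {
    try {
      writer.append(containerId);
      return true;
    } catch (IOException e) {
      // Pass the Throwable to the logger so the full stacktrace is kept,
      // rather than printing only the one-line message reported above.
      LOG.error("Couldn't upload logs for " + containerId
          + ". Skipping this container.", e);
      // After an IOException the underlying TFile is in a bad state
      // (IllegalStateException on the next append), so the writer must
      // not be reused for later containers.
      return false;
    }
  }
}
{code}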
[jira] [Commented] (YARN-2082) Support for alternative log aggregation mechanism
[ https://issues.apache.org/jira/browse/YARN-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006753#comment-14006753 ] Vinod Kumar Vavilapalli commented on YARN-2082: --- We should also consider some scalable solutions on HDFS itself - post-processing the logs automatically to reduce the file count, and maybe NMs forming a tree of aggregation (with network copy of logs) before hitting HDFS. IAC, the pluggability is sort of a dup of the proposal at YARN-1440 (albeit for a different reason)? Support for alternative log aggregation mechanism - Key: YARN-2082 URL: https://issues.apache.org/jira/browse/YARN-2082 Project: Hadoop YARN Issue Type: New Feature Reporter: Ming Ma I will post a more detailed design later. Here is a brief summary; I would like to get early feedback. Problem Statement: The current implementation of log aggregation creates one HDFS file for each {application, nodemanager} pair. These files are relatively small, in the range of 1-2 MB. In a large cluster with lots of applications and many nodemanagers, this ends up creating lots of small files in HDFS, which puts pressure on the HDFS NN in the following ways. 1. It increases NN memory usage. This is mitigated by having the history server delete old log files in HDFS. 2. Runtime RPC load on HDFS. Each log aggregation file introduces several NN RPCs such as create, getAdditionalBlock, complete, and rename. When the cluster is busy, this RPC load has an impact on NN performance. In addition, to support non-MR applications on YARN, we might need to support aggregation for long-running applications. Design choices: 1. Don't aggregate all the logs, as in YARN-221. 2. Create a dedicated HDFS namespace used only for log aggregation. 3. Write logs to some key-value store like HBase. HBase's RPC load on the NN will be much lower. 4. Decentralize the application-level log aggregation to NMs. All logs for a given application are aggregated first by a dedicated NM before they are pushed to HDFS. 5. Have NMs aggregate logs on a regular basis; each of these log files will have data from different applications, and there needs to be some index for quick lookup. Proposal: 1. Make YARN log aggregation pluggable for both the read and write paths. Note that Hadoop FileSystem provides an abstraction, and we could ask alternative log aggregators to implement a compatible FileSystem, but that seems to be overkill. 2. Provide a log aggregation plugin that writes to HBase. The schema design needs to support efficient reads on a per-application as well as per-application+container basis; in addition, it shouldn't create hotspots in a cluster where certain users might create more jobs than others. For example, we can use hash($user + $applicationId) + containerId as the row key. -- This message was sent by Atlassian JIRA (v6.2#6252)
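For the row-key proposal in point 2, a minimal sketch of what hash($user + $applicationId) + containerId could look like (MD5 is an arbitrary choice here; any uniformly distributed hash would do):
{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class LogRowKeySketch {
  /**
   * Illustrative row key: a fixed-length hash of user + applicationId,
   * followed by the containerId. The hash prefix spreads heavy users
   * across HBase regions (avoiding hotspots) while keeping all rows of
   * one application under a common prefix for per-application scans;
   * the containerId suffix enables per-application+container reads.
   */
  static byte[] rowKey(String user, String applicationId, String containerId)
      throws NoSuchAlgorithmException {
    byte[] prefix = MessageDigest.getInstance("MD5")
        .digest((user + applicationId).getBytes(StandardCharsets.UTF_8));
    byte[] suffix = containerId.getBytes(StandardCharsets.UTF_8);
    byte[] key = new byte[prefix.length + suffix.length];
    System.arraycopy(prefix, 0, key, 0, prefix.length);
    System.arraycopy(suffix, 0, key, prefix.length, suffix.length);
    return key;
  }
}
{code}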
[jira] [Commented] (YARN-1545) [Umbrella] Prevent DoS of YARN components by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006754#comment-14006754 ] Hong Zhiguo commented on YARN-1545: --- You mean we should define upper bounds on the number or length of fields inside the messages? Should these bounds be configurable, or pre-defined as constants? And what about the rate of messages? For example, a bad client could issue getApplications queries at its full speed. [Umbrella] Prevent DoS of YARN components by putting in limits -- Key: YARN-1545 URL: https://issues.apache.org/jira/browse/YARN-1545 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli I did a pass and found many places that can cause DoS on various YARN services. Need to fix them. -- This message was sent by Atlassian JIRA (v6.2#6252)
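To make the configurable-vs-constant question concrete, here is a sketch of one bound done both ways: a constant default that a Configuration key can override (both the key name and the default are made up for illustration; nothing in YARN-1545 defines them yet):
{code}
import org.apache.hadoop.conf.Configuration;

public final class RequestLimitSketch {
  // Hypothetical names; YARN has not settled on any such key or default.
  static final String MAX_FIELD_LENGTH_KEY =
      "yarn.resourcemanager.max-request-field-length";
  static final int MAX_FIELD_LENGTH_DEFAULT = 1024;

  /** Reject over-long request fields before the RM does any work on them. */
  static void checkFieldLength(Configuration conf, String field) {
    int max = conf.getInt(MAX_FIELD_LENGTH_KEY, MAX_FIELD_LENGTH_DEFAULT);
    if (field != null && field.length() > max) {
      throw new IllegalArgumentException(
          "Request field length " + field.length()
          + " exceeds the configured limit of " + max);
    }
  }
}
{code}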
[jira] [Updated] (YARN-2073) FairScheduler starts preempting resources even with free resources on the cluster
[ https://issues.apache.org/jira/browse/YARN-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2073: --- Attachment: yarn-2073-4.patch Thanks for the review, Sandy. Updated the patch to reflect your suggestions, except for the test refactoring. For the tests, it was easier to split them, and I think that is the right direction going forward. If you don't mind, I would like to leave the patch as is. FairScheduler starts preempting resources even with free resources on the cluster - Key: YARN-2073 URL: https://issues.apache.org/jira/browse/YARN-2073 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-2073-0.patch, yarn-2073-1.patch, yarn-2073-2.patch, yarn-2073-3.patch, yarn-2073-4.patch Preemption should kick in only when the currently available slots don't match the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006774#comment-14006774 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646431/YARN-2049.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3791//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3791//console This message is automatically generated. Delegation token stuff for the timeline sever - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, YARN-2049.4.patch, YARN-2049.5.patch, YARN-2049.6.patch, YARN-2049.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006776#comment-14006776 ] Zhijie Shen commented on YARN-1936: --- Vinod, thanks for the review. See my responses below: bq. Make the event-put as one of the options -put Good point. I made use of CommandLine to build a simple CLI. bq. Add delegation token only if timeline-service is enabled. Added the check. bq. Also move this main to TimelineClientImpl Moved. bq. selectToken() can use a TimelineDelegationTokenSelector to find the token? Used the selector instead, and did the required refactoring. bq. Can we add a simple test to validate the addition of the Delegation Token to the client credentials? Added a test case. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch, YARN-1936.3.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
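On the selectToken() point, the standard Hadoop TokenSelector contract is to scan the credentials for a token whose kind and service match the target. A minimal stand-in sketch (the KIND text is an assumption; the real TimelineDelegationTokenSelector defines its own token kind):
{code}
import java.util.Collection;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public final class TimelineTokenSelectSketch {
  // Illustrative token kind; the actual selector knows the real one.
  static final Text KIND = new Text("TIMELINE_DELEGATION_TOKEN");

  /** Pick the token matching the timeline server's kind and service. */
  static Token<? extends TokenIdentifier> selectToken(
      Text service, Collection<Token<? extends TokenIdentifier>> tokens) {
    for (Token<? extends TokenIdentifier> token : tokens) {
      if (KIND.equals(token.getKind()) && service.equals(token.getService())) {
        return token;
      }
    }
    return null;
  }
}
{code}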
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Attachment: YARN-1936.3.patch Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch, YARN-1936.3.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1545) [Umbrella] Prevent DoS of YARN components by putting in limits
[ https://issues.apache.org/jira/browse/YARN-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006777#comment-14006777 ] Vinod Kumar Vavilapalli commented on YARN-1545: --- I covered the details on the individual tickets - it's mostly about bounding buffers, lists, etc. When I filed this, I was only focusing on application-level stuff. A bad client firing off RPCs in rapid succession can and should be addressed in the RPC layer itself, IMO. [Umbrella] Prevent DoS of YARN components by putting in limits -- Key: YARN-1545 URL: https://issues.apache.org/jira/browse/YARN-1545 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli I did a pass and found many places that can cause DoS on various YARN services. Need to fix them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006778#comment-14006778 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], could you kick the Jenkins and check the latest patch? Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006801#comment-14006801 ] Hudson commented on YARN-1938: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-1938. Added kerberos login for the Timeline Server. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596710) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.5.0 Attachments: YARN-1938.1.patch, YARN-1938.2.patch, YARN-1938.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006796#comment-14006796 ] Hudson commented on YARN-2089: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2089. FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations. (Zhihai Xu via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596765) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations --- Key: YARN-2089 URL: https://issues.apache.org/jira/browse/YARN-2089 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: zhihai xu Labels: newbie Fix For: 2.5.0 Attachments: yarn-2089.patch We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
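The change itself is small: both classes get Hadoop's standard audience and stability annotations. A sketch of what the annotated class looks like (class body elided):
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

// @Private/@Unstable marks the class as internal to YARN: downstream
// code should not depend on it, and its API may change without notice.
@Private
@Unstable
public abstract class QueuePlacementRule {
  // ... existing rule implementation ...
}
{code}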
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006797#comment-14006797 ] Hudson commented on YARN-2017: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2017. Merged some of the common scheduler code. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596753) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java *
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006798#comment-14006798 ] Hudson commented on YARN-1962: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2081. Fixed TestDistributedShell failure after YARN-1962. Contributed by Zhiguo Hong. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596724) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Fix For: 2.4.1 Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since the Timeline server is not yet mature and secure, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for the distributed shell example related to connection-refused errors. Btw, we didn't run the TS because it is not secured yet, although it is possible to explicitly turn it off through the yarn-site config. In my opinion, this extra change for this new service is not worthwhile at this point. This JIRA is to turn it off by default. If there is agreement, I can put up a simple patch for this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ... (the same ClientHandlerException is then logged again, verbatim) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
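Until the default flips, the service can be disabled explicitly. A sketch of the programmatic equivalent of setting yarn.timeline-service.enabled=false in yarn-site.xml (assuming the YarnConfiguration constant maps to that key):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class DisableTimelineSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Same effect as yarn.timeline-service.enabled=false in yarn-site.xml:
    // clients such as the distributed shell AM then skip publishing
    // timeline events instead of retrying a refused connection.
    conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, false);
  }
}
{code}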
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006794#comment-14006794 ] Hudson commented on YARN-2050: -- FAILURE: Integrated in Hadoop-Yarn-trunk #563 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/563/]) YARN-2050. Fix LogCLIHelpers to create the correct FileContext. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1596310) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-2050-2.patch, YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
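The direction of the fix is visible from the description: derive the FileContext from the remote log directory's own URI rather than the process default. A sketch (the method and parameter names are illustrative, not the committed patch):
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UnsupportedFileSystemException;

public final class RemoteLogFileContextSketch {
  /**
   * FileContext.getFileContext() with no arguments binds to the default
   * filesystem, which need not be the one holding the aggregated logs.
   * Qualifying by the remote log dir's URI picks the right filesystem.
   */
  static FileContext remoteLogContext(Path remoteAppLogDir, Configuration conf)
      throws UnsupportedFileSystemException {
    URI remoteUri = remoteAppLogDir.toUri();
    return FileContext.getFileContext(remoteUri, conf);
  }
}
{code}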