[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1512#comment-1512 ] Hadoop QA commented on YARN-3367:
---------------------------------
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| 0 | mvndep | 1m 9s | Maven dependency ordering for branch |
| +1 | mvninstall | 9m 48s | YARN-2928 passed |
| +1 | compile | 7m 26s | YARN-2928 passed with JDK v1.8.0_66 |
| +1 | compile | 8m 6s | YARN-2928 passed with JDK v1.7.0_91 |
| +1 | checkstyle | 1m 29s | YARN-2928 passed |
| +1 | mvnsite | 3m 38s | YARN-2928 passed |
| +1 | mvneclipse | 1m 49s | YARN-2928 passed |
| +1 | findbugs | 6m 28s | YARN-2928 passed |
| +1 | javadoc | 2m 40s | YARN-2928 passed with JDK v1.8.0_66 |
| +1 | javadoc | 5m 18s | YARN-2928 passed with JDK v1.7.0_91 |
| 0 | mvndep | 1m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 3m 7s | the patch passed |
| +1 | compile | 7m 53s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 7m 53s | the patch passed |
| +1 | compile | 8m 10s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 8m 10s | the patch passed |
| -1 | checkstyle | 1m 23s | root: patch generated 9 new + 712 unchanged - 11 fixed = 721 total (was 723) |
| +1 | mvnsite | 3m 38s | the patch passed |
| +1 | mvneclipse | 1m 50s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 8m 16s | the patch passed |
| +1 | javadoc | 2m 50s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 5m 13s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 2m 2s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 8m 40s | hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. |
| -1 | unit | 64m 26s | hadoop-yarn-client in the patch failed with JDK v1.8.0_66. |
| -1 | unit | 3m 55s |
[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130206#comment-15130206 ] Jian He commented on YARN-4138:
-------------------------------
Patch looks good to me overall. One question about this test case: after step 6, rmContainer.getLastConfirmedResource() will return 3G; when the expire event gets triggered, won't it reset the container back to 3G?
{code}
/**
 * 1. Allocate 1 container: containerId2 (1G)
 * 2. Increase resource of containerId2: 1G -> 3G
 * 3. AM acquires the token
 * 4. Increase resource of containerId2 again: 3G -> 6G
 * 5. AM acquires the token
 * 6. AM uses the 1st token to increase the container in NM to 3G
 * 7. AM does NOT use the second token
 * 8. Verify containerId2 eventually uses 1G after token expires
 */
{code}
- I think RMContainerImpl will no longer receive the EXPIRE event in the RUNNING state after this patch? If so, we can remove this:
{code}
.addTransition(RMContainerState.RUNNING, RMContainerState.RUNNING,
    RMContainerEventType.EXPIRE)
{code}

> Roll back container resource allocation after resource increase token expires
> -----------------------------------------------------------------------------
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, nodemanager, resourcemanager
> Reporter: MENG DING
> Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch
>
> In YARN-1651, after the container resource increase token expires, the running container is killed.
> This ticket will change the behavior such that when a container resource increase token expires, the resource allocation of the container will be reverted back to the value before the increase.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130015#comment-15130015 ] Varun Saxena commented on YARN-3367:
------------------------------------
testSyncCall is failing again. I think instead of a fixed sleep period we can sleep in a loop (until the condition is met) and put an overall timeout on the test case. I will check the patch in detail.

> Replace starting a separate thread for post entity with event loop in TimelineClient
> ------------------------------------------------------------------------------------
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Junping Du
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-3367-YARN-2928.v1.005.patch, YARN-3367-YARN-2928.v1.006.patch, YARN-3367-YARN-2928.v1.007.patch, YARN-3367-YARN-2928.v1.008.patch, YARN-3367-YARN-2928.v1.009.patch, YARN-3367-feature-YARN-2928.003.patch, YARN-3367-feature-YARN-2928.v1.002.patch, YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch, sjlee-suggestion.patch
>
> Since YARN-3039, we added a loop in TimelineClient to wait for the collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least 3 major defects:
> 1. The consumer needs additional code to wrap a thread before calling putEntities() in TimelineClient.
> 2. It costs many thread resources unnecessarily.
> 3. The sequence of events could be out of order, because each posting thread gets out of the waiting loop randomly.
> We should have something like an event loop on the TimelineClient side: putEntities() only puts the related entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls.
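The fix suggested above, sleeping in a loop until the condition is met with an overall timeout, can be sketched as follows. This is a minimal illustration; `waitForCondition` is a hypothetical helper, not the actual TestTimelineClientV2Impl code:

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper: instead of one fixed Thread.sleep(), poll the
// condition at a short interval and stop after an overall timeout, so the
// test neither flakes on slow machines nor hangs forever.
class WaitUtil {
  static boolean waitForCondition(BooleanSupplier condition,
      long pollMillis, long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;              // condition met before the timeout
      }
      try {
        Thread.sleep(pollMillis); // short poll, not one long fixed sleep
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;                    // give up promptly if interrupted
      }
    }
    return condition.getAsBoolean(); // one last check at the deadline
  }
}
```

A test would then assert `waitForCondition(...)` returned true instead of sleeping a fixed period and hoping the dispatcher has caught up.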
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130109#comment-15130109 ] Jun Gong commented on YARN-3998:
--------------------------------
[~vvasudev], I just attached a new patch to address the above problems. Thanks for the review.
1) When finding a container's previous working directory and log directory, only locate the corresponding files in good directories, i.e. directories that are readable/writable and not full.
2) Limit the diagnostic message to 1 bytes. If the length is greater than that, delete the first line (whose separator is "\n").
3) After some container retries, the env variable *MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX* (DEFAULT_NM_ADMIN_USER_ENV) will be expanded to *MALLOC_ARENA_MAX=::* (a lot of ":"). I fixed it in *Apps#addToEnvironment*.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it can specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. exit code is not 0).
> It will save a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and local files in the container's working directory will be left for re-use. (If the container has downloaded some big files, it does not need to re-download them when running again.)
> We find this useful in systems like Storm.
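Point 2) above, dropping the oldest line whenever the diagnostics exceed the limit, can be sketched like this. The class and method names are illustrative, not the patch itself, and for simplicity the sketch treats chars as bytes (which only holds for ASCII):

```java
// Illustrative sketch: cap a diagnostics string at a size limit by
// repeatedly deleting the first line (everything up to and including the
// first "\n" separator), so the most recent diagnostics are kept.
class DiagnosticsTrimmer {
  static String trim(String diagnostics, int limit) {
    while (diagnostics.length() > limit) {
      int nl = diagnostics.indexOf('\n');
      if (nl < 0) {
        // a single oversized line: keep only the trailing 'limit' chars
        return diagnostics.substring(diagnostics.length() - limit);
      }
      diagnostics = diagnostics.substring(nl + 1); // drop the first line
    }
    return diagnostics;
  }
}
```

The design choice here is recency: when a container is retried many times, the latest failure's output is the most useful, so old lines are discarded first.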
[jira] [Commented] (YARN-4625) Make ApplicationSubmissionContext and ApplicationSubmissionContextInfo more consistent
[ https://issues.apache.org/jira/browse/YARN-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130246#comment-15130246 ] Hudson commented on YARN-4625:
------------------------------
FAILURE: Integrated in Hadoop-trunk-Commit #9237 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9237/])
YARN-4625. Make ApplicationSubmissionContext and (vvasudev: rev 1adb64e09bd453f97e83d31b1587079e30b4b274)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LogAggregationContextInfo.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AMBlackListingRequestInfo.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md
* hadoop-yarn-project/CHANGES.txt

> Make ApplicationSubmissionContext and ApplicationSubmissionContextInfo more consistent
> --------------------------------------------------------------------------------------
>
> Key: YARN-4625
> URL: https://issues.apache.org/jira/browse/YARN-4625
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Fix For: 2.9.0
>
> Attachments: YARN-4625.2.patch, YARN-4625.20160121.1.patch, YARN-4625.3.patch
>
> There are some differences between ApplicationSubmissionContext and ApplicationSubmissionContextInfo; for example, we cannot submit an application with logAggregationContext specified through the RM web service. We could make them more consistent.
[jira] [Updated] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3998:
---------------------------
Attachment: YARN-3998.06.patch

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it can specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. exit code is not 0).
> It will save a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and local files in the container's working directory will be left for re-use. (If the container has downloaded some big files, it does not need to re-download them when running again.)
> We find this useful in systems like Storm.
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130229#comment-15130229 ] Hadoop QA commented on YARN-3998:
---------------------------------
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 5 new or modified test files. |
| 0 | mvndep | 0m 37s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 38s | trunk passed |
| +1 | compile | 2m 27s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 2m 23s | trunk passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 38s | trunk passed |
| +1 | mvnsite | 1m 58s | trunk passed |
| +1 | mvneclipse | 0m 52s | trunk passed |
| +1 | findbugs | 4m 27s | trunk passed |
| +1 | javadoc | 1m 50s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 4m 24s | trunk passed with JDK v1.7.0_91 |
| 0 | mvndep | 0m 33s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 36s | the patch passed |
| +1 | compile | 2m 20s | the patch passed with JDK v1.8.0_66 |
| +1 | cc | 2m 20s | the patch passed |
| +1 | javac | 2m 20s | the patch passed |
| +1 | compile | 2m 30s | the patch passed with JDK v1.7.0_91 |
| +1 | cc | 2m 30s | the patch passed |
| +1 | javac | 2m 30s | the patch passed |
| -1 | checkstyle | 0m 40s | hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 522 unchanged - 4 fixed = 526 total (was 526) |
| +1 | mvnsite | 2m 2s | the patch passed |
| +1 | mvneclipse | 0m 48s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | findbugs | 5m 22s | the patch passed |
| +1 | javadoc | 1m 45s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 4m 23s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 24s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 2m 19s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 8m 47s | hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 6m 40s |
[jira] [Updated] (YARN-4446) Refactor reader API for better extensibility
[ https://issues.apache.org/jira/browse/YARN-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4446:
-------------------------------
Attachment: (was: YARN-4446-YARN-2928.03.patch)

> Refactor reader API for better extensibility
> --------------------------------------------
>
> Key: YARN-4446
> URL: https://issues.apache.org/jira/browse/YARN-4446
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Varun Saxena
> Assignee: Varun Saxena
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-4446-YARN-2928.01.patch, YARN-4446-YARN-2928.02.patch, YARN-4446-YARN-2928.03.patch
[jira] [Commented] (YARN-4435) Add RM Delegation Token DtFetcher Implementation for DtUtil
[ https://issues.apache.org/jira/browse/YARN-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130380#comment-15130380 ] Steve Loughran commented on YARN-4435:
--------------------------------------
You'll need to add one for the timeline delegation token too; without that you can't submit work to a cluster which has ATS enabled. Again, this is a YARN service and follows the same lifecycle.

> Add RM Delegation Token DtFetcher Implementation for DtUtil
> -----------------------------------------------------------
>
> Key: YARN-4435
> URL: https://issues.apache.org/jira/browse/YARN-4435
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Matthew Paduano
> Assignee: Matthew Paduano
> Attachments: proposed_solution
>
> Add a class to the yarn project that implements the DtFetcher interface to return an RM delegation token object.
> I attached a proposed class implementation that does this, but it cannot be added as a patch until the interface is merged in HADOOP-12563
[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130272#comment-15130272 ] Junping Du commented on YARN-4635:
----------------------------------
Thanks [~jianhe] for the review and comments. First, I would like to state an assumption: the blacklist mechanism for AM launching is not for tracking nodes that do not work at all (unhealthy), but for tracking nodes suspected of failing the AM container based on previous failures. We already have the unhealthy-report mechanism to report serious NM issues, so this one should have a higher bar based on history (in some sense, the AM container is more important than other containers). My responses below are based on that assumption.
bq. why should the below container exit status blacklist the node?
This container failure could be due to resource congestion (like KILLED_EXCEEDED_PMEM) or an unknown reason (ABORTED, INVALID) that makes this NM more suspect than normal nodes.
bq. For DISKS_FAILED, which is considered a global blacklist node in this jira, I think the node will report as unhealthy and RM should remove the node already.
Some DISKS_FAILED cases happen because a failed container fills a disk. The node could still have other directories available to use, so it can still launch normal containers, but it is not suitable to risk an AM container on.
bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do you think it's ok to just use the threshold number only? 0 means disabled, and numbers larger than 0 means enabled?
If so, the job submitter would have to understand how many nodes the current cluster has, and the job parameter would need to be updated if the job is submitted to a different cluster (with a different number of nodes). IMO, that sounds like more complexity for users.

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
> We need a global blacklist, in addition to each app's blacklist, to track AM container failures with global effect. That means we need to differentiate whether a non-succeeded ContainerExitStatus is caused by the NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130336#comment-15130336 ] Naganarasimha G R commented on YARN-3367:
-----------------------------------------
Thanks [~varun_saxena]. True, will update the patch accordingly in a short while.

> Replace starting a separate thread for post entity with event loop in TimelineClient
> ------------------------------------------------------------------------------------
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Junping Du
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-3367-YARN-2928.v1.005.patch, YARN-3367-YARN-2928.v1.006.patch, YARN-3367-YARN-2928.v1.007.patch, YARN-3367-YARN-2928.v1.008.patch, YARN-3367-YARN-2928.v1.009.patch, YARN-3367-feature-YARN-2928.003.patch, YARN-3367-feature-YARN-2928.v1.002.patch, YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch, sjlee-suggestion.patch
>
> Since YARN-3039, we added a loop in TimelineClient to wait for the collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least 3 major defects:
> 1. The consumer needs additional code to wrap a thread before calling putEntities() in TimelineClient.
> 2. It costs many thread resources unnecessarily.
> 3. The sequence of events could be out of order, because each posting thread gets out of the waiting loop randomly.
> We should have something like an event loop on the TimelineClient side: putEntities() only puts the related entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls.
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130065#comment-15130065 ] Varun Saxena commented on YARN-3367:
------------------------------------
A couple of comments.
# When we are stopping the dispatcher, the log says we are draining it, but we are not really doing so. I think we can try to drain the queue on stop and process the async events, or any sync event sitting in the queue. We would need to do this before we call shutdownNow, as that will interrupt the thread.
# nit: In TestTimelineClientV2Impl#testSyncCall, we have made an extra call to {{client.setSleepBeforeReturn(true);}} which is not required.

> Replace starting a separate thread for post entity with event loop in TimelineClient
> ------------------------------------------------------------------------------------
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Junping Du
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-3367-YARN-2928.v1.005.patch, YARN-3367-YARN-2928.v1.006.patch, YARN-3367-YARN-2928.v1.007.patch, YARN-3367-YARN-2928.v1.008.patch, YARN-3367-YARN-2928.v1.009.patch, YARN-3367-feature-YARN-2928.003.patch, YARN-3367-feature-YARN-2928.v1.002.patch, YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch, sjlee-suggestion.patch
>
> Since YARN-3039, we added a loop in TimelineClient to wait for the collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least 3 major defects:
> 1. The consumer needs additional code to wrap a thread before calling putEntities() in TimelineClient.
> 2. It costs many thread resources unnecessarily.
> 3. The sequence of events could be out of order, because each posting thread gets out of the waiting loop randomly.
> We should have something like an event loop on the TimelineClient side: putEntities() only puts the related entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls.
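The drain-on-stop idea in comment #1, moving queued events out and handling them before interrupting the dispatcher thread, could look roughly like this. This is a sketch with illustrative names, not the actual TimelineClient internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: on stop, drain the event queue in one non-blocking call and
// process what was pending, so the "draining" log message matches what
// actually happens. Only after this should shutdownNow() interrupt the
// dispatcher thread (shutdownNow would otherwise discard queued events).
class EventDrainer {
  static <E> List<E> drainPending(BlockingQueue<E> queue) {
    List<E> pending = new ArrayList<>();
    queue.drainTo(pending); // empties the queue without blocking
    // ... deliver each pending (async or sync) event here ...
    return pending;
  }
}
```

Used in a stop() method, this would run before `executor.shutdownNow()`, guaranteeing queued entities are posted rather than silently dropped.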
[jira] [Commented] (YARN-4307) Blacklisted nodes for AM container is not getting displayed in the Web UI
[ https://issues.apache.org/jira/browse/YARN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130247#comment-15130247 ] Varun Vasudev commented on YARN-4307:
-------------------------------------
+1 for the latest patch. I'll commit this tomorrow if no one objects.

> Blacklisted nodes for AM container is not getting displayed in the Web UI
> --------------------------------------------------------------------------
>
> Key: YARN-4307
> URL: https://issues.apache.org/jira/browse/YARN-4307
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, webapp
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Attachments: AppInfoPage.png, RMappAttempt.png, YARN-4307.v1.001.patch, YARN-4307.v1.002.patch, YARN-4307.v1.003.patch, YARN-4307.v1.004.patch, YARN-4307.v1.005.patch, webpage.png, yarn-capacity-scheduler-debug.log
>
> In a pseudo cluster I had 2 NMs and launched an app with an incorrect configuration: *./hadoop org.apache.hadoop.mapreduce.SleepJob -Dmapreduce.job.node-label-expression=labelX -Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 5 -mt 1200*.
> The first attempt failed and a 2nd attempt was launched, but the application hung. The scheduler logs showed that localhost was blacklisted, but in the UI (app & apps listing page) the count was shown as zero, and no hosts were listed on the app page.
[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131201#comment-15131201 ] MENG DING commented on YARN-4138:
---------------------------------
The failed tests are not related.

> Roll back container resource allocation after resource increase token expires
> -----------------------------------------------------------------------------
>
> Key: YARN-4138
> URL: https://issues.apache.org/jira/browse/YARN-4138
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, nodemanager, resourcemanager
> Reporter: MENG DING
> Assignee: MENG DING
> Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, YARN-4138.5.patch
>
> In YARN-1651, after the container resource increase token expires, the running container is killed.
> This ticket will change the behavior such that when a container resource increase token expires, the resource allocation of the container will be reverted back to the value before the increase.
[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131124#comment-15131124 ] Colin Patrick McCabe commented on YARN-4594:
--------------------------------------------
Thanks, [~jlowe].

> container-executor fails to remove directory tree when chmod required
> ---------------------------------------------------------------------
>
> Key: YARN-4594
> URL: https://issues.apache.org/jira/browse/YARN-4594
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Fix For: 2.9.0
>
> Attachments: YARN-4594.001.patch, YARN-4594.002.patch, YARN-4594.003.patch, YARN-4594.004.patch
>
> test-container-executor.c doesn't work:
> * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually /usr/bin/ls on many systems.
> * The recursive delete logic in container-executor.c fails: nftw does the wrong thing when confronted with directories with the wrong mode (permission bits), leading to an attempt to run rmdir on a non-empty directory.
[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131157#comment-15131157 ] Hadoop QA commented on YARN-4138: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 8s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 234 unchanged - 2 fixed = 235 total (was 236) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 52s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 144m 29s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | JDK v1.7.0_91 Failed junit tests |
[jira] [Commented] (YARN-4667) RM Admin CLI for refreshNodesResources throws NPE when nothing is configured
[ https://issues.apache.org/jira/browse/YARN-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131216#comment-15131216 ] Hadoop QA commented on YARN-4667: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 7s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 154m 30s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem ||
[jira] [Updated] (YARN-4386) refreshNodesGracefully() looks at active RMNode list for recommissioning decommissioned nodes
[ https://issues.apache.org/jira/browse/YARN-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-4386: -- Attachment: YARN-4386-v2.patch Updating patch with a test to check whether a decommissioned node can ever transition to the RUNNING state through the graceful decommissioning process. The test TestRMNodeTransitions#testRecommissionNode covers the other case, where a node can be recommissioned after being in the DECOMMISSIONING state. Since we know that only inactiveRMNodes will contain the decommissioned node, checking for such a node in the active list is not useful. > refreshNodesGracefully() looks at active RMNode list for recommissioning > decommissioned nodes > - > > Key: YARN-4386 > URL: https://issues.apache.org/jira/browse/YARN-4386 > Project: Hadoop YARN > Issue Type: Bug > Components: graceful >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: YARN-4386-v1.patch, YARN-4386-v2.patch > > > In refreshNodesGracefully(), during recommissioning, the entryset from > getRMNodes(), which has only active nodes (RUNNING, DECOMMISSIONING etc.), is > used for checking 'decommissioned' nodes, which are present in the > getInactiveRMNodes() map alone. > {code} > for (Entry<NodeId, RMNode> entry : rmContext.getRMNodes().entrySet()) { > . > // Recommissioning the nodes > if (entry.getValue().getState() == NodeState.DECOMMISSIONING > || entry.getValue().getState() == NodeState.DECOMMISSIONED) { > this.rmContext.getDispatcher().getEventHandler() > .handle(new RMNodeEvent(nodeId, RMNodeEventType.RECOMMISSION)); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131146#comment-15131146 ] Eric Payne commented on YARN-3769: -- Thanks [~djp]. I will look into it. > Consider user limit when calculating total pending resource for preemption > policy in Capacity Scheduler > --- > > Key: YARN-3769 > URL: https://issues.apache.org/jira/browse/YARN-3769 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0, 2.7.0, 2.8.0 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.7.3 > > Attachments: YARN-3769-branch-2.002.patch, > YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, > YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, > YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, > YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, > YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch > > > We are seeing the preemption monitor preempting containers from queue A and > then seeing the capacity scheduler giving them immediately back to queue A. > This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131212#comment-15131212 ] Hadoop QA commented on YARN-3367: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 37s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 31s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 22s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 54s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 21s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 33s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 48s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 22s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 35s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_66 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 12s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 11s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 55s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 18s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 4s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 4s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 35s {color} | {color:red} root: patch generated 9 new + 711 unchanged - 11 fixed = 720 total (was 722) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 0s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 35s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 11s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 58s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 54s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 35s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 43s {color} |
[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130555#comment-15130555 ] Jason Lowe commented on YARN-4594: -- Thanks for updating the patch! There are just a couple of remaining bugs, both related to remnants from when error codes were negated: {code} ret = recursive_unlink_children(full_path); if (ret == ENOENT) { return 0; } if (ret != 0) { fprintf(LOGFILE, "Error while deleting %s: %d (%s)\n", full_path, -ret, strerror(-ret)); {code} The fprintf call negates ret when it shouldn't. Same thing for the following instance: {code} if (rmdir(full_path) != 0) { ret = errno; if (ret != ENOENT) { fprintf(LOGFILE, "Couldn't delete directory %s - %s\n", full_path, strerror(-ret)); {code} It would also be nice to clean up the whitespace nits, although it's no trouble cleaning those up as part of the commit. > container-executor fails to remove directory tree when chmod required > - > > Key: YARN-4594 > URL: https://issues.apache.org/jira/browse/YARN-4594 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: YARN-4594.001.patch, YARN-4594.002.patch, > YARN-4594.003.patch > > > test-container-executor.c doesn't work: > * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually > /usr/bin/ls on many systems. > * The recursive delete logic in container-executor.c fails -- nftw does the > wrong thing when confronted with directories with the wrong mode (permission > bits), leading to an attempt to run rmdir on a non-empty directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130520#comment-15130520 ] Jason Lowe commented on YARN-4665: -- Wouldn't a REST interface follow the same principle? I haven't looked at the REST API lately, but I'd expect the submission logic to be a POST followed by GET polling until the state is ACCEPTED or later. If the GET results in a no-such-app error then the client retries the POST and continues polling. Yes, this is not the most ideal REST interface design, but unless I'm missing something it should be functionally equivalent to the RPC path. In either case the client is going to have to do some kind of retry to handle failovers. Even with a synchronous interface we can end up with submissions that appear to fail from the client's perspective but actually succeed (because it was successfully recorded in the state store before failing to deliver a response to the client), so it's not just fire-and-forget from the client's perspective. > Asynch submit can lose application submissions > -- > > Key: YARN-4665 > URL: https://issues.apache.org/jira/browse/YARN-4665 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > The change introduced in YARN-514 opens up a hole into which applications can > fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did > not complete until the application state was persisted to the state store. > After YARN-514, the {{submitApplication()}} call is asynchronous, with the > application state being saved later. > If the state store is slow or unresponsive, it may be that an application's > state may not be persisted for quite a while. During that time, if the RM > fails (over), all applications that have not yet been persisted to the state > store will be lost. 
If the active RM loses ZK connectivity, a significant > number of job submissions can pile up before the ZK connection times out, > resulting in a large pile of client failures when it finally does. > This issue is inherent in the design of YARN-514. I see three solutions: > 1. Add a WAL to the state store. HBase does it, so we know how to do it. It > seems like a heavy solution to the original problem, however. It's certainly > not a trivial change. > 2. Revert YARN-514 and update the RPC layer to allow a connection to be > parked if it's doing something that may take a while. This is a generally > useful feature but could be a deep rabbit hole. > 3. Revert YARN-514 and add back-pressure to the job submission. For example, > we set a maximum number of threads that can simultaneously be assigned to > handle job submissions. When that threshold is reached, new job submissions > get a try-again-later response. This is also a generally useful feature and > should be a fairly constrained set of changes. > I think the third option is the most approachable. It's the smallest change, > and it adds useful behavior beyond solving the original issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3367: Attachment: YARN-3367-YARN-2928.v1.010.patch Thanks [~varun_saxena] for the comments. bq. I think we can try to drain the queue on stop and process the async events or some sync event sitting in the queue. We would need to do this before we call shutdownNow as that will interrupt the thread. I had earlier tried to take care of draining but missed it in later patches. IMO, though, we should not wait indefinitely, as the server might be down. What I have done in the patch is use shutdown so that the live workers are not stopped; it waits for 10 seconds and then exits. Alternatively, if we want a more sophisticated approach, we need to introduce some additional logic so that it drains everything without getting blocked. Thoughts? > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-3367-YARN-2928.v1.005.patch, > YARN-3367-YARN-2928.v1.006.patch, YARN-3367-YARN-2928.v1.007.patch, > YARN-3367-YARN-2928.v1.008.patch, YARN-3367-YARN-2928.v1.009.patch, > YARN-3367-YARN-2928.v1.010.patch, YARN-3367-feature-YARN-2928.003.patch, > YARN-3367-feature-YARN-2928.v1.002.patch, > YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch, > sjlee-suggestion.patch > > > Since YARN-3039, we add a loop in TimelineClient to wait for > collectorServiceAddress to be ready before posting any entity. In consumers of > TimelineClient (like the AM), we are starting a new thread for each call to get > rid of a potential deadlock in the main thread. This approach has at least 3 major > defects: > 1. 
The consumer needs some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It costs many thread resources, which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread gets out of the waiting loop randomly. > We should have something like an event loop on the TimelineClient side: > putEntities() only puts related entities into a queue of entities, and a > separate thread handles delivering entities in the queue to the collector via REST > calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4669) Fix logging statements in resource manager's Application class
[ https://issues.apache.org/jira/browse/YARN-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-4669: Attachment: YARN-4669.001.patch uploaded patch with logging fix. > Fix logging statements in resource manager's Application class > -- > > Key: YARN-4669 > URL: https://issues.apache.org/jira/browse/YARN-4669 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana >Priority: Trivial > Attachments: YARN-4669.001.patch > > > There seem to be a couple of System.out.println() calls that should be > replaced by info/debug logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Enhance filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131470#comment-15131470 ] Sangjin Lee commented on YARN-3863: --- This needs to be redone after YARN-4446, correct? > Enhance filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131476#comment-15131476 ] Sangjin Lee commented on YARN-2005: --- Would this be a good candidate for backporting to 2.6.x and 2.7.x? [~adhoot], thoughts? > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch, YARN-2005.009.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4669) Fix logging statements in resource manager's Application class
Sidharta Seethana created YARN-4669: --- Summary: Fix logging statements in resource manager's Application class Key: YARN-4669 URL: https://issues.apache.org/jira/browse/YARN-4669 Project: Hadoop YARN Issue Type: Bug Reporter: Sidharta Seethana Assignee: Sidharta Seethana Priority: Trivial There seem to be a couple of System.out.println() calls that should be replaced by info/debug logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4386) refreshNodesGracefully() looks at active RMNode list for recommissioning decommissioned nodes
[ https://issues.apache.org/jira/browse/YARN-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131390#comment-15131390 ] Hadoop QA commented on YARN-4386: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 43s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 50s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 21s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 152m 44s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem ||
[jira] [Commented] (YARN-3669) Attempt-failures validity interval should have a global admin configurable lower limit
[ https://issues.apache.org/jira/browse/YARN-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131361#comment-15131361 ] Hadoop QA commented on YARN-3669: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 37s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc 
{color} | {color:green} 1m 35s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 25s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 0s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 4s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 38s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green}
[jira] [Commented] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131464#comment-15131464 ] Sangjin Lee commented on YARN-4409: --- Hi [~varun_saxena], could you refresh this patch to apply cleanly on the branch? Thanks. > Fix javadoc and checkstyle issues in timelineservice code > - > > Key: YARN-4409 > URL: https://issues.apache.org/jira/browse/YARN-4409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4409-YARN-2928.wip.01.patch > > > There are a large number of javadoc and checkstyle issues currently open in > timelineservice code. We need to fix them before we merge it into trunk. > Refer to > https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 > We still have 94 open checkstyle issues and javadocs failing for Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4670) add logging when a node is AM-blacklisted
Sangjin Lee created YARN-4670: - Summary: add logging when a node is AM-blacklisted Key: YARN-4670 URL: https://issues.apache.org/jira/browse/YARN-4670 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.8.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Trivial Today there is not much logging happening when a node is blacklisted for an AM (see YARN-2005). We can add a little more logging to see this activity easily from the RM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4502) Fix two AM containers get allocated when AM restart
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4502: -- Target Version/s: 2.7.3, 2.6.5 (was: 2.6.5) > Fix two AM containers get allocated when AM restart > --- > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, the 2nd AM container was supposed to be started on > container_e12_1450825622869_0001_02_01. But the AM was not launched on > container_e12_1450825622869_0001_02_01. It was in ACQUIRED state. > On the other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: RM should not start 2 containers for starting the AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4386) refreshNodesGracefully() looks at active RMNode list for recommissioning decommissioned nodes
[ https://issues.apache.org/jira/browse/YARN-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131526#comment-15131526 ] Kuhu Shukla commented on YARN-4386: --- bq. Updating patch with a test to check if a decommissioned node can ever transition to running state by the graceful decommissioning process. The test TestRMNodeTransitions#testRecommissionNode covers the other case, where a node can be recommissioned after being in decommissioning state. Since we know that only inactiveRMNodes will contain the decommissioned node, the check for such a node in the active list is not useful. [~djp], [~sunilg] Request for comments/review. Thanks a lot! > refreshNodesGracefully() looks at active RMNode list for recommissioning > decommissioned nodes > - > > Key: YARN-4386 > URL: https://issues.apache.org/jira/browse/YARN-4386 > Project: Hadoop YARN > Issue Type: Bug > Components: graceful >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: YARN-4386-v1.patch, YARN-4386-v2.patch > > > In refreshNodesGracefully(), during recommissioning, the entry set from > getRMNodes(), which has only active nodes (RUNNING, DECOMMISSIONING, etc.), is > used for checking 'decommissioned' nodes, which are present in the > getInactiveRMNodes() map alone.
> {code}
> for (Entry<NodeId, RMNode> entry : rmContext.getRMNodes().entrySet()) {
>   ...
>   // Recommissioning the nodes
>   if (entry.getValue().getState() == NodeState.DECOMMISSIONING
>       || entry.getValue().getState() == NodeState.DECOMMISSIONED) {
>     this.rmContext.getDispatcher().getEventHandler()
>         .handle(new RMNodeEvent(nodeId, RMNodeEventType.RECOMMISSION));
>   }
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
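The mismatch described in YARN-4386 can be sketched outside of YARN: a loop over the active-node map can never see DECOMMISSIONED entries, because those live only in the inactive map. The following is a minimal, self-contained Java sketch of that idea — the maps, states, and method names here are simplified stand-ins, not the real RMContext/RMNode API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

public class RecommissionSketch {
    // Stand-ins for rmContext.getRMNodes() / rmContext.getInactiveRMNodes()
    static final Map<String, NodeState> activeNodes = new HashMap<>();
    static final Map<String, NodeState> inactiveNodes = new HashMap<>();

    // Collects the nodes that should receive a RECOMMISSION event.
    static List<String> nodesToRecommission() {
        List<String> result = new ArrayList<>();
        // DECOMMISSIONING nodes are still in the active map, so a loop
        // over active nodes finds them...
        for (Map.Entry<String, NodeState> e : activeNodes.entrySet()) {
            if (e.getValue() == NodeState.DECOMMISSIONING) {
                result.add(e.getKey());
            }
        }
        // ...but DECOMMISSIONED nodes exist only in the inactive map, so a
        // DECOMMISSIONED check against the active map can never match.
        for (Map.Entry<String, NodeState> e : inactiveNodes.entrySet()) {
            if (e.getValue() == NodeState.DECOMMISSIONED) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        activeNodes.put("host1", NodeState.DECOMMISSIONING);
        inactiveNodes.put("host2", NodeState.DECOMMISSIONED);
        System.out.println(nodesToRecommission()); // [host1, host2]
    }
}
```

The test in the patch checks exactly this boundary: a DECOMMISSIONED node is only reachable through the inactive map.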
[jira] [Commented] (YARN-4670) add logging when a node is AM-blacklisted
[ https://issues.apache.org/jira/browse/YARN-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131643#comment-15131643 ] Naganarasimha G R commented on YARN-4670: - Hi [~sjlee0], to an extent we will now be able to find it out after YARN-4307 and YARN-3946, and also in the trunk code I am able to see a debug log for the same in {{SchedulerAppUtils.isBlackListed}}. Is anything more planned for this? > add logging when a node is AM-blacklisted > - > > Key: YARN-4670 > URL: https://issues.apache.org/jira/browse/YARN-4670 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Trivial > > Today there is not much logging happening when a node is blacklisted for an > AM (see YARN-2005). We can add a little more logging to see this activity > easily from the RM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4669) Fix logging statements in resource manager's Application class
[ https://issues.apache.org/jira/browse/YARN-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131600#comment-15131600 ] Sidharta Seethana commented on YARN-4669: - Test failures seem unrelated. > Fix logging statements in resource manager's Application class > -- > > Key: YARN-4669 > URL: https://issues.apache.org/jira/browse/YARN-4669 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana >Priority: Trivial > Attachments: YARN-4669.001.patch > > > There seem to be a couple of System.out.println() calls that should be > replaced by info/debug logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4667) RM Admin CLI for refreshNodesResources throws NPE when nothing is configured
[ https://issues.apache.org/jira/browse/YARN-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131654#comment-15131654 ] Naganarasimha G R commented on YARN-4667: - {{TestClientRMTokens}} and {{TestAMAuthorization}} are already tracked in other jiras... > RM Admin CLI for refreshNodesResources throws NPE when nothing is configured > > > Key: YARN-4667 > URL: https://issues.apache.org/jira/browse/YARN-4667 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4667.v1.001.patch > > > {quote}
> $ ./yarn rmadmin -refreshNodesResources
> 16/02/03 10:54:27 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033
> refreshNodesResources: java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshNodesResources(AdminService.java:655)
> at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshNodesResources(ResourceManagerAdministrationProtocolPBServiceImpl.java:246)
> at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:287)
> {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
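The NPE in YARN-4667 is the classic "nothing configured, nothing guarded" pattern: a configuration lookup returns null when no dynamic-resource settings exist, and the refresh path dereferences it anyway. As an illustration only — the method and map names below are hypothetical stand-ins, not the actual AdminService code — the usual guard substitutes an empty configuration:

```java
import java.util.Collections;
import java.util.Map;

public class RefreshGuardSketch {
    // Stand-in for reading the dynamic-resources configuration; returns
    // null when nothing is configured, as in the reported scenario.
    static Map<String, Integer> loadConfiguredNodeResources() {
        return null;
    }

    // Guarded refresh: "nothing configured" becomes an empty override map
    // instead of a NullPointerException on first use.
    static int refreshNodesResources() {
        Map<String, Integer> configured = loadConfiguredNodeResources();
        if (configured == null) {
            configured = Collections.emptyMap();
        }
        return configured.size(); // safe even when unconfigured
    }

    public static void main(String[] args) {
        System.out.println(refreshNodesResources()); // 0, and no NPE
    }
}
```

Whether the real fix guards in AdminService or earlier in the config loader is up to the patch; the sketch only shows the shape of the bug.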
[jira] [Commented] (YARN-4669) Fix logging statements in resource manager's Application class
[ https://issues.apache.org/jira/browse/YARN-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131525#comment-15131525 ] Hadoop QA commented on YARN-4669: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 9s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 47s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 142m 23s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem ||
[jira] [Updated] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4409: --- Attachment: YARN-4409-YARN-2928.01.patch > Fix javadoc and checkstyle issues in timelineservice code > - > > Key: YARN-4409 > URL: https://issues.apache.org/jira/browse/YARN-4409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4409-YARN-2928.01.patch, > YARN-4409-YARN-2928.wip.01.patch > > > There are a large number of javadoc and checkstyle issues currently open in > timelineservice code. We need to fix them before we merge it into trunk. > Refer to > https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 > We still have 94 open checkstyle issues and javadocs failing for Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4446) Refactor reader API for better extensibility
[ https://issues.apache.org/jira/browse/YARN-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131564#comment-15131564 ] Varun Saxena commented on YARN-4446: Thanks [~sjlee0] for the review and commit. > Refactor reader API for better extensibility > > > Key: YARN-4446 > URL: https://issues.apache.org/jira/browse/YARN-4446 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Fix For: YARN-2928 > > Attachments: YARN-4446-YARN-2928.01.patch, > YARN-4446-YARN-2928.02.patch, YARN-4446-YARN-2928.03.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131568#comment-15131568 ] Varun Saxena commented on YARN-4409: Yes, will do so. I have the patch ready. > Fix javadoc and checkstyle issues in timelineservice code > - > > Key: YARN-4409 > URL: https://issues.apache.org/jira/browse/YARN-4409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4409-YARN-2928.wip.01.patch > > > There are a large number of javadoc and checkstyle issues currently open in > timelineservice code. We need to fix them before we merge it into trunk. > Refer to > https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 > We still have 94 open checkstyle issues and javadocs failing for Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131632#comment-15131632 ] Jian He commented on YARN-4635: --- bq. Some DISKS_FAILED exits could happen because the failed container wrote a disk to full. But the node could still have other directories available to use. It could still launch normal containers, though it is not suitable to risk an AM container on it. In the current code, the DISKS_FAILED status is set when this condition is true:
{code}
if (!dirsHandler.areDisksHealthy()) {
  ret = ContainerExitStatus.DISKS_FAILED;
  throw new IOException("Most of the disks failed. "
      + dirsHandler.getDisksHealthReport(false));
}
{code}
The same check {{dirsHandler.areDisksHealthy}} is used by the disk health monitor:
{code}
boolean isHealthy() {
  boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true :
      nodeHealthScriptRunner.isHealthy();
  return scriptHealthStatus && dirsHandler.areDisksHealthy();
}
{code}
Essentially, if this condition is false, the node will be reported as unhealthy in the first place, which makes the RM remove the node. The global blacklist then becomes not useful in practice because the node is already removed. Maybe I missed something; a unit test can prove this. bq. If so, it means the job submitter has to understand how many nodes the current cluster has Sorry, I don't understand why the job submitter needs to know the number of nodes. What I meant is that right now a boolean flag (false) is used to indicate that this feature is disabled; alternatively, a 0 threshold can achieve the same result (with a logic change on the RM side). I said this because I feel the API may look simpler that way and we don't need a separate nested AMBlackListingRequest class. Having the threshold set in the submissionContext will be enough. But I don't have a strong opinion on this. The current way is ok too. > Add global blacklist tracking for AM container failure. 
> --- > > Key: YARN-4635 > URL: https://issues.apache.org/jira/browse/YARN-4635 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4635-v2.patch, YARN-4635.patch > > > We need a global blacklist in addition to each app's blacklist to track AM > container failures with global effect. That means we need to differentiate > whether a non-succeeded ContainerExitStatus is caused by the NM or is more > related to the app. > For more details, please refer to the document in YARN-4576. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
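Jian He's argument in the YARN-4635 thread reduces to a shared predicate: the same {{dirsHandler.areDisksHealthy()}} result drives both the DISKS_FAILED container exit status and the node health report, so a node capable of producing DISKS_FAILED is already being reported unhealthy and removed by the RM. A toy Java sketch of that implication — the field and method names are simplified from the quoted snippets, not the real NodeManager classes:

```java
public class DiskHealthSketch {
    // Simplified stand-ins for dirsHandler.areDisksHealthy() and the
    // health-script result used in isHealthy().
    static boolean disksHealthy = true;
    static boolean scriptHealthy = true;

    // Mirrors the launch-time check: a container exit is marked
    // DISKS_FAILED exactly when the disks are reported unhealthy.
    static boolean launchExitsWithDisksFailed() {
        return !disksHealthy;
    }

    // Mirrors isHealthy(): node health depends on the same disk predicate.
    static boolean nodeIsHealthy() {
        return scriptHealthy && disksHealthy;
    }

    public static void main(String[] args) {
        disksHealthy = false;
        // Whenever DISKS_FAILED can be raised, the node already reports
        // unhealthy, so the RM removes it before a global blacklist matters.
        System.out.println(launchExitsWithDisksFailed()); // true
        System.out.println(nodeIsHealthy());              // false
    }
}
```

This is why the comment suggests a unit test: if the two checks ever diverge (e.g. DISKS_FAILED raised on a partially degraded node that still counts as healthy), the global blacklist would regain value.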
[jira] [Updated] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4594: --- Attachment: YARN-4594.004.patch Sigh. The negation is a really hard habit to break... it's the pattern for how errors are handled in the kernel. This should fix it. I also changed it to use "fullpath" when printing error messages, to make it easier to figure out which file had a problem. Fixed the whitespace nits as well. Thanks > container-executor fails to remove directory tree when chmod required > - > > Key: YARN-4594 > URL: https://issues.apache.org/jira/browse/YARN-4594 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: YARN-4594.001.patch, YARN-4594.002.patch, > YARN-4594.003.patch, YARN-4594.004.patch > > > test-container-executor.c doesn't work: > * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually > /usr/bin/ls on many systems. > * The recursive delete logic in container-executor.c fails -- nftw does the > wrong thing when confronted with directories with the wrong mode (permission > bits), leading to an attempt to run rmdir on a non-empty directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
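The "negation habit" in the YARN-4594 comment refers to the kernel error-return convention (0 on success, negative errno on failure), versus code like container-executor that returns positive error codes. As an illustration only — written in Java rather than the executor's C, with invented names — the failure mode is a caller checking `ret < 0` against a positive-error-code function:

```java
public class ErrnoConventionSketch {
    static final int EINVAL = 22; // illustrative errno value

    // Kernel convention: 0 on success, negative errno on failure.
    static int kernelStyle(boolean fail) {
        return fail ? -EINVAL : 0;
    }

    // Positive-code convention (container-executor-like): 0 on success,
    // positive error code on failure.
    static int userStyle(boolean fail) {
        return fail ? EINVAL : 0;
    }

    public static void main(String[] args) {
        // `ret < 0` catches kernel-style failures...
        System.out.println(kernelStyle(true) < 0);  // true
        // ...but silently misses positive-code failures; `ret != 0` is the
        // check that works under both conventions.
        System.out.println(userStyle(true) < 0);    // false
        System.out.println(userStyle(true) != 0);   // true
    }
}
```

Mixing the two conventions in one codebase is exactly the habit the patch author describes breaking.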
[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130640#comment-15130640 ] Hadoop QA commented on YARN-4594: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 8s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 7s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | 
{color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 48s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 17s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 29m 6s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12786037/YARN-4594.004.patch | | JIRA Issue | YARN-4594 | | Optional Tests | asflicense compile cc mvnsite javac unit | | uname | Linux 5021c6c9ce40 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 1adb64e | | Default Java | 1.7.0_91 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91 | | JDK v1.7.0_91 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10482/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Max memory used | 77MB | | Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10482/console | This message was automatically generated. > container-executor
[jira] [Updated] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4409: --- Attachment: (was: YARN-4409-YARN-2928.wip.01.patch) > Fix javadoc and checkstyle issues in timelineservice code > - > > Key: YARN-4409 > URL: https://issues.apache.org/jira/browse/YARN-4409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4409-YARN-2928.01.patch > > > There are a large number of javadoc and checkstyle issues currently open in > timelineservice code. We need to fix them before we merge it into trunk. > Refer to > https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 > We still have 94 open checkstyle issues and javadocs failing for Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4307) Display blacklisted nodes for AM container in the RM web UI
[ https://issues.apache.org/jira/browse/YARN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4307: Summary: Display blacklisted nodes for AM container in the RM web UI (was: Blacklisted nodes for AM container is not getting displayed in the Web UI) > Display blacklisted nodes for AM container in the RM web UI > --- > > Key: YARN-4307 > URL: https://issues.apache.org/jira/browse/YARN-4307 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: AppInfoPage.png, RMappAttempt.png, > YARN-4307.v1.001.patch, YARN-4307.v1.002.patch, YARN-4307.v1.003.patch, > YARN-4307.v1.004.patch, YARN-4307.v1.005.patch, webpage.png, > yarn-capacity-scheduler-debug.log > > > In pseudo cluster had 2 NM's and had launched app with incorrect > configuration *./hadoop org.apache.hadoop.mapreduce.SleepJob > -Dmapreduce.job.node-label-expression=labelX > -Dyarn.app.mapreduce.am.env=JAVA_HOME=/no/jvm/here -m 5 -mt 1200*. > First attempt failed and 2nd attempt was launched, but the application was > hung. In the scheduler logs found that localhost was blacklisted but in the > UI (app& apps listing page) count was shown as zero and as well no hosts > listed in the app page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131866#comment-15131866 ] Hadoop QA commented on YARN-4409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 10s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 5s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 17s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 58s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 45s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 1m 43s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 36s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 54s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 54s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 20s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 34s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 2 new + 36 unchanged - 346 fixed = 38 total (was 382) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 36s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 3m 17s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-api-jdk1.8.0_66 with JDK v1.8.0_66 generated 6 new + 94 unchanged - 6 fixed = 100 total (was 100) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 57s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 31s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 15s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK
[jira] [Commented] (YARN-4670) add logging when a node is AM-blacklisted
[ https://issues.apache.org/jira/browse/YARN-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131820#comment-15131820 ] Naganarasimha G R commented on YARN-4670: - YARN-3946 reports when the app's AM is stuck and when it fails to schedule on a node because that node is blacklisted for launching this app's AMs. bq. Regarding logging in SchedulerAppUtils.isBlackListed(), does that get used for the AM blacklisting too? It's not obvious to me. *SchedulerAppUtils.isBlackListed* -> *SchedulerApplicationAttempt.isBlackListed* -> *AppSchedulingInfo.isBlackListed*, and the last call in that chain checks the AM blacklist, so in a way the logging is there, but as you said it's not obvious. The ??volume of logging?? depends on how many nodes are free, how many are blacklisted for the application, and how long the other nodes stay occupied, but IMO it would be a rare scenario for the volume to explode, so we can add logging in *AppSchedulingInfo.isBlackListed*. Thoughts? > add logging when a node is AM-blacklisted > - > > Key: YARN-4670 > URL: https://issues.apache.org/jira/browse/YARN-4670 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Trivial > > Today there is not much logging happening when a node is blacklisted for an > AM (see YARN-2005). We can add a little more logging to see this activity > easily from the RM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
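The call chain above ends in a single membership check, so the logging being proposed is essentially a one-line addition at the end of that chain. A rough, hypothetical sketch of the idea (the class and method names below are simplified stand-ins, not the actual YARN classes or patch):

```java
import java.util.HashSet;
import java.util.Set;

// Editor's illustration of the proposed logging: the final blacklist check
// logs at INFO when a node is skipped for an AM container. "BlacklistCheckSketch"
// is a toy stand-in for AppSchedulingInfo.
public class BlacklistCheckSketch {
    private final Set<String> amBlacklist = new HashSet<>();

    void blacklistForAM(String node) {
        amBlacklist.add(node);
    }

    boolean isBlackListed(String appId, String node) {
        boolean blacklisted = amBlacklist.contains(node);
        if (blacklisted) {
            // The single log line the discussion is asking for.
            System.out.println("INFO: skipping node " + node
                + " for AM container of " + appId + " (AM-blacklisted)");
        }
        return blacklisted;
    }

    public static void main(String[] args) {
        BlacklistCheckSketch sketch = new BlacklistCheckSketch();
        sketch.blacklistForAM("node1");
        sketch.isBlackListed("application_1", "node1");  // logs the INFO line
        sketch.isBlackListed("application_1", "node2");  // logs nothing
    }
}
```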
[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131755#comment-15131755 ] Jian He commented on YARN-4138: --- bq. We only confirm resource when NM reported resource is the same as RM resource. Thanks for the explanation. I wonder why the decision was made to reset to the initial resource. In this case, the first increase happened successfully from the app's point of view; won't it confuse the app if the resource somehow decreases back to the initial value? > Roll back container resource allocation after resource increase token expires > - > > Key: YARN-4138 > URL: https://issues.apache.org/jira/browse/YARN-4138 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-4138-YARN-1197.1.patch, > YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, > YARN-4138.5.patch > > > In YARN-1651, after container resource increase token expires, the running > container is killed. > This ticket will change the behavior such that when a container resource > increase token expires, the resource allocation of the container will be > reverted back to the value before the increase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4662) Document some newly added metrics
[ https://issues.apache.org/jira/browse/YARN-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131688#comment-15131688 ] Xuan Gong commented on YARN-4662: - +1 LGTM. Checking this in > Document some newly added metrics > - > > Key: YARN-4662 > URL: https://issues.apache.org/jira/browse/YARN-4662 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-4662.1.patch, YARN-4662.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4662) Document some newly added metrics
[ https://issues.apache.org/jira/browse/YARN-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131699#comment-15131699 ] Xuan Gong commented on YARN-4662: - Committed into trunk/branch-2/branch-2.8. Thanks, Jian ! > Document some newly added metrics > - > > Key: YARN-4662 > URL: https://issues.apache.org/jira/browse/YARN-4662 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Fix For: 2.8.0 > > Attachments: YARN-4662.1.patch, YARN-4662.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4662) Document some newly added metrics
[ https://issues.apache.org/jira/browse/YARN-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131707#comment-15131707 ] Hudson commented on YARN-4662: -- FAILURE: Integrated in Hadoop-trunk-Commit #9243 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9243/]) YARN-4662. Document some newly added metrics. Contributed by Jian He (xgong: rev 63c63e298cf9ff252532297deedde15e77323809) * hadoop-common-project/hadoop-common/src/site/markdown/Metrics.md * hadoop-yarn-project/CHANGES.txt > Document some newly added metrics > - > > Key: YARN-4662 > URL: https://issues.apache.org/jira/browse/YARN-4662 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Fix For: 2.8.0 > > Attachments: YARN-4662.1.patch, YARN-4662.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4670) add logging when a node is AM-blacklisted
[ https://issues.apache.org/jira/browse/YARN-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131726#comment-15131726 ] Sangjin Lee commented on YARN-4670: --- Thanks for the info. I missed YARN-4307. Regarding logging in SchedulerAppUtils.isBlackListed(), does that get used for the *AM* blacklisting too? It's not obvious to me. I was looking more at RMAppAttemptImpl.sendAMContainerToNM(). Also, it would be good if this could be logged at the INFO level; I don't think the volume of this logging will be too high, and logging it during normal operation would be useful. > add logging when a node is AM-blacklisted > - > > Key: YARN-4670 > URL: https://issues.apache.org/jira/browse/YARN-4670 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Trivial > > Today there is not much logging happening when a node is blacklisted for an > AM (see YARN-2005). We can add a little more logging to see this activity > easily from the RM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130324#comment-15130324 ] Sunil G commented on YARN-4635: --- Thanks [~jianhe] for the comments and thanks [~djp] for the clarifications. bq.Do you think it’s ok to just use the threshold number only ? 0 means disabled, and numbers larger than 0 means enabled One more minor consideration on using only a threshold: if the app specifies the {{AMBlackListingRequest}} flag as false, then global blacklisting will not be applicable for that app. Such per-app control is easier with a flag, I think; how do you feel? > Add global blacklist tracking for AM container failure. > --- > > Key: YARN-4635 > URL: https://issues.apache.org/jira/browse/YARN-4635 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4635-v2.patch, YARN-4635.patch > > > We need a global blacklist in addition to each app’s blacklist to track AM > container failures in global > affection. That means we need to differentiate the non-succeed > ContainerExitStatus reasoning from > NM or more related to App. > For more details, please refer the document in YARN-4576. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130701#comment-15130701 ] Varun Vasudev commented on YARN-4665: - Jason's understanding of the REST API is correct - the user submits the app using POST and polls using GET. Internally the functionality uses the same code flow as the RPC path - all calls flow through ClientRMService#submitApplication. The RMAppManager has a check - {code} if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) {code} so subsequent re-submits should not result in anything destructive. > Asynch submit can lose application submissions > -- > > Key: YARN-4665 > URL: https://issues.apache.org/jira/browse/YARN-4665 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > The change introduced in YARN-514 opens up a hole into which applications can > fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did > not complete until the application state was persisted to the state store. > After YARN-514, the {{submitApplication()}} call is asynchronous, with the > application state being saved later. > If the state store is slow or unresponsive, it may be that an application's > state may not be persisted for quite a while. During that time, if the RM > fails (over), all applications that have not yet been persisted to the state > store will be lost. If the active RM loses ZK connectivity, a significant > number of job submissions can pile up before the ZK connection times out, > resulting in a large pile of client failures when it finally does. > This issue is inherent in the design of YARN-514. I see three solutions: > 1. Add a WAL to the state store. HBase does it, so we know how to do it. It > seems like a heavy solution to the original problem, however. It's certainly > not a trivial change. > 2.
Revert YARN-514 and update the RPC layer to allow a connection to be > parked if it's doing something that may take a while. This is a generally > useful feature but could be a deep rabbit hole. > 3. Revert YARN-514 and add back-pressure to the job submission. For example, > we set a maximum number of threads that can simultaneously be assigned to > handle job submissions. When that threshold is reached, new job submissions > get a try-again-later response. This is also a generally useful feature and > should be a fairly constrained set of changes. > I think the third option is the most approachable. It's the smallest change, > and it adds useful behavior beyond solving the original issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
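The {{putIfAbsent}} check Varun points to is what makes a duplicate submit harmless: only the first registration for an application id wins, and later attempts become no-ops. A minimal, self-contained sketch of that idempotency property (illustrative names only, not the RMAppManager code itself):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Editor's toy model of idempotent submission via ConcurrentMap.putIfAbsent.
public class SubmitIdempotency {
    static final ConcurrentMap<String, String> apps = new ConcurrentHashMap<>();

    // Returns true if this call registered the application, false if it was
    // already present (a duplicate submit, which does nothing destructive).
    static boolean register(String appId, String app) {
        return apps.putIfAbsent(appId, app) == null;
    }

    public static void main(String[] args) {
        System.out.println(register("application_1", "appA")); // true: first submit wins
        System.out.println(register("application_1", "appB")); // false: re-submit is a no-op
        System.out.println(apps.get("application_1"));          // appA: original is untouched
    }
}
```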
[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130733#comment-15130733 ] Hudson commented on YARN-4594: -- FAILURE: Integrated in Hadoop-trunk-Commit #9239 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9239/]) YARN-4594. container-executor fails to remove directory tree when chmod (jlowe: rev fa328e2d39eda1c479389b99a5c121e640a1e0ad) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c > container-executor fails to remove directory tree when chmod required > - > > Key: YARN-4594 > URL: https://issues.apache.org/jira/browse/YARN-4594 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Fix For: 2.9.0 > > Attachments: YARN-4594.001.patch, YARN-4594.002.patch, > YARN-4594.003.patch, YARN-4594.004.patch > > > test-container-executor.c doesn't work: > * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually > /usr/bin/ls on many systems. > * The recursive delete logic in container-executor.c fails -- nftw does the > wrong thing when confronted with directories with the wrong mode (permission > bits), leading to an attempt to run rmdir on a non-empty directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130971#comment-15130971 ] Sangjin Lee commented on YARN-3367: --- I agree it might be slightly better to try to drain the queue when it's shutting down. But we need to be clear that is still on a best-effort basis. Also, let's not increase the wait time. It might add to the stop time of things unnecessarily. I think there are ways to do it, but given the structure of the dispatcher code, it might be more practical to use a finally clause (outside the while loop). Note that the shutdown will come to this thread in the form of an interrupt. Otherwise, more restructuring of that code is needed. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-3367-YARN-2928.v1.005.patch, > YARN-3367-YARN-2928.v1.006.patch, YARN-3367-YARN-2928.v1.007.patch, > YARN-3367-YARN-2928.v1.008.patch, YARN-3367-YARN-2928.v1.009.patch, > YARN-3367-YARN-2928.v1.010.patch, YARN-3367-feature-YARN-2928.003.patch, > YARN-3367-feature-YARN-2928.v1.002.patch, > YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch, > sjlee-suggestion.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. 
The sequence of events could be out of order because each posting > operation thread gets out of the waiting loop at a random time. > We should have something like an event loop on the TimelineClient side: > putEntities() only puts the related entities into a queue, and a > separate thread handles delivering the queued entities to the collector via REST > calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
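The event-loop design described in this issue, combined with Sangjin's suggestion above (shutdown arrives as an interrupt; a finally clause outside the while loop drains the queue on a best-effort basis), can be sketched roughly as follows. All names are illustrative, not the actual patch; the REST call to the collector is replaced by a list append.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Editor's sketch of the dispatcher pattern: producers enqueue entities and a
// single thread delivers them in order; on interrupt, the finally block drains
// whatever remains (best-effort, as discussed in the thread).
public class EntityDispatcher {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> delivered = new ArrayList<>();

    private final Thread loop = new Thread(() -> {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                deliver(queue.take());   // blocks until an entity is queued
            }
        } catch (InterruptedException e) {
            // stop() interrupted us; fall through to the best-effort drain
        } finally {
            String leftover;
            while ((leftover = queue.poll()) != null) {
                deliver(leftover);       // drain anything still queued
            }
        }
    });

    private void deliver(String entity) {
        // Stand-in for the REST call to the collector.
        synchronized (delivered) { delivered.add(entity); }
    }

    void start() { loop.start(); }

    // Callers of putEntities() would just enqueue; ordering is preserved
    // because one thread performs all deliveries.
    void putEntity(String entity) { queue.add(entity); }

    void stop() {
        loop.interrupt();
        try { loop.join(); } catch (InterruptedException ignored) { }
    }
}
```

Because `take()` either returns an element or throws without removing one, no entity is lost between the loop and the drain.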
[jira] [Commented] (YARN-4446) Refactor reader API for better extensibility
[ https://issues.apache.org/jira/browse/YARN-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130974#comment-15130974 ] Sangjin Lee commented on YARN-4446: --- +1. I'll commit it soon. Please let me know now if you have any additional feedback. > Refactor reader API for better extensibility > > > Key: YARN-4446 > URL: https://issues.apache.org/jira/browse/YARN-4446 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4446-YARN-2928.01.patch, > YARN-4446-YARN-2928.02.patch, YARN-4446-YARN-2928.03.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token
[ https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130988#comment-15130988 ] Sangjin Lee commented on YARN-4183: --- I am +1 with the latest patch, but I'd wait until Mit and/or Jon chime in. [~mitdesai], [~jeagles], what are your thoughts? Is the conclusion here an acceptable conclusion for you guys? > Enabling generic application history forces every job to get a timeline > service delegation token > > > Key: YARN-4183 > URL: https://issues.apache.org/jira/browse/YARN-4183 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Naganarasimha G R > Attachments: YARN-4183.1.patch, YARN-4183.v1.001.patch, > YARN-4183.v1.002.patch > > > When enabling just the Generic History Server and not the timeline server, > the system metrics publisher will not publish the events to the timeline > store as it checks if the timeline server and system metrics publisher are > enabled before creating a timeline client. > To make it work, if the timeline service flag is turned on, it will force > every yarn application to get a delegation token. > Instead of checking if timeline service is enabled, we should be checking if > application history server is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required
[ https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130708#comment-15130708 ] Jason Lowe commented on YARN-4594: -- +1 lgtm. Committing this. > container-executor fails to remove directory tree when chmod required > - > > Key: YARN-4594 > URL: https://issues.apache.org/jira/browse/YARN-4594 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: YARN-4594.001.patch, YARN-4594.002.patch, > YARN-4594.003.patch, YARN-4594.004.patch > > > test-container-executor.c doesn't work: > * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually > /usr/bin/ls on many systems. > * The recursive delete logic in container-executor.c fails -- nftw does the > wrong thing when confronted with directories with the wrong mode (permission > bits), leading to an attempt to run rmdir on a non-empty directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-4138: Attachment: YARN-4138.5.patch Hi, [~jianhe] bq. After step 6, rmContainer.getLastConfirmedResource() will return 3G, when the expire event gets triggered, won't it reset it back to 3G? No, it won't reset it back to 3G. rmContainer.getLastConfirmedResource() will not return 3G after step 6; it is still 1G. We only confirm resource when the NM reported resource is the same as the RM resource. In this test case, the NM reported resource is 3G, but the RM allocated resource is 6G, so 3G is NOT confirmed. This issue was discussed in this thread a while ago: https://issues.apache.org/jira/browse/YARN-4138?focusedCommentId=14737229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737229 bq. I think RMContainerImpl will not receive EXPIRE event at RUNNING state after this patch ? if so, we can remove this. You are right, we can remove this. Attaching the latest patch that removes it. > Roll back container resource allocation after resource increase token expires > - > > Key: YARN-4138 > URL: https://issues.apache.org/jira/browse/YARN-4138 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-4138-YARN-1197.1.patch, > YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, > YARN-4138.5.patch > > > In YARN-1651, after container resource increase token expires, the running > container is killed. > This ticket will change the behavior such that when a container resource > increase token expires, the resource allocation of the container will be > reverted back to the value before the increase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
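The confirmation rule MENG DING describes (a size is only "confirmed" when the NM-reported value matches the RM-allocated value, so a stale NM report confirms nothing) can be illustrated with a toy model. The names and GB values below are illustrative only, not the actual RMContainerImpl code:

```java
// Editor's toy model of the last-confirmed-resource rule discussed above.
public class ConfirmRule {
    static int lastConfirmed = 1;  // container started at 1G
    static int rmAllocated = 1;

    // RM grants an increase; the NM has not necessarily caught up yet.
    static void increase(int to) { rmAllocated = to; }

    // NM heartbeat: confirm only when both sides agree on the size.
    static void nmReport(int reported) {
        if (reported == rmAllocated) {
            lastConfirmed = reported;
        }
    }

    public static void main(String[] args) {
        increase(3);   // RM grants 1G -> 3G
        increase(6);   // before the NM catches up, RM grants 3G -> 6G
        nmReport(3);   // NM reports 3G but RM is at 6G: NOT confirmed
        // lastConfirmed is still 1, so an expired increase rolls back to 1G,
        // matching the scenario in the comment above.
        System.out.println(lastConfirmed);
    }
}
```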
[jira] [Assigned] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-4665: --- Assignee: Naganarasimha G R (was: Daniel Templeton) > Asynch submit can lose application submissions > -- > > Key: YARN-4665 > URL: https://issues.apache.org/jira/browse/YARN-4665 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Daniel Templeton >Assignee: Naganarasimha G R > > The change introduced in YARN-514 opens up a hole into which applications can > fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did > not complete until the application state was persisted to the state store. > After YARN-514, the {{submitApplication()}} call is asynchronous, with the > application state being saved later. > If the state store is slow or unresponsive, it may be that an application's > state may not be persisted for quite a while. During that time, if the RM > fails (over), all applications that have not yet been persisted to the state > store will be lost. If the active RM loses ZK connectivity, a significant > number of job submissions can pile up before the ZK connection times out, > resulting in a large pile of client failures when it finally does. > This issue is inherent in the design of YARN-514. I see three solutions: > 1. Add a WAL to the state store. HBase does it, so we know how to do it. It > seems like a heavy solution to the original problem, however. It's certainly > not a trivial change. > 2. Revert YARN-514 and update the RPC layer to allow a connection to be > parked if it's doing something that may take a while. This is a generally > useful feature but could be a deep rabbit hole. > 3. Revert YARN-514 and add back-pressure to the job submission. For example, > we set a maximum number of threads that can simultaneously be assigned to > handle job submissions. When that threshold is reached, new job submissions > get a try-again-later response. 
This is also a generally useful feature and > should be a fairly constrained set of changes. > I think the third option is the most approachable. It's the smallest change, > and it adds useful behavior beyond solving the original issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
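Option 3 above (a bounded submission pool that answers "try again later" once saturated) can be sketched with a standard ThreadPoolExecutor: a small fixed pool, a bounded queue, and a rejection policy instead of blocking. This is an editor's illustration of the general technique under those assumptions, not a proposed patch:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Editor's sketch of back-pressure on submissions: 2 concurrent handlers,
// 1 queued submission, everything beyond that is rejected immediately.
public class BackPressureDemo {
    // Runs 10 submissions against the bounded pool; returns {accepted, rejected}.
    static int[] run() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1),            // at most one queued submission
            new ThreadPoolExecutor.AbortPolicy());  // reject rather than block
        CountDownLatch release = new CountDownLatch(1);
        int accepted = 0, rejected = 0;
        for (int i = 0; i < 10; i++) {
            try {
                pool.execute(() -> {
                    // Simulate a slow state-store write: hold the worker until released.
                    try { release.await(); } catch (InterruptedException ignored) { }
                });
                accepted++;
            } catch (RejectedExecutionException e) {
                rejected++;  // would map to a "try again later" response to the client
            }
        }
        release.countDown();
        pool.shutdown();
        return new int[] { accepted, rejected };
    }

    public static void main(String[] args) {
        int[] r = run();
        // 2 running + 1 queued are accepted; the other 7 are rejected.
        System.out.println(r[0] + " accepted, " + r[1] + " rejected");
    }
}
```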
[jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130906#comment-15130906 ] Naganarasimha G R commented on YARN-4665: - In that case, would it be helpful to have retry logic in {{RMWebServices.submitApplication}}, so that the submission either succeeds or fails cleanly during RM failover? > Asynch submit can lose application submissions > -- > > Key: YARN-4665 > URL: https://issues.apache.org/jira/browse/YARN-4665 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Daniel Templeton >Assignee: Naganarasimha G R > > The change introduced in YARN-514 opens up a hole into which applications can > fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did > not complete until the application state was persisted to the state store. > After YARN-514, the {{submitApplication()}} call is asynchronous, with the > application state being saved later. > If the state store is slow or unresponsive, it may be that an application's > state may not be persisted for quite a while. During that time, if the RM > fails (over), all applications that have not yet been persisted to the state > store will be lost. If the active RM loses ZK connectivity, a significant > number of job submissions can pile up before the ZK connection times out, > resulting in a large pile of client failures when it finally does. > This issue is inherent in the design of YARN-514. I see three solutions: > 1. Add a WAL to the state store. HBase does it, so we know how to do it. It > seems like a heavy solution to the original problem, however. It's certainly > not a trivial change. > 2. Revert YARN-514 and update the RPC layer to allow a connection to be > parked if it's doing something that may take a while. This is a generally > useful feature but could be a deep rabbit hole. > 3. Revert YARN-514 and add back-pressure to the job submission.
For example, > we set a maximum number of threads that can simultaneously be assigned to > handle job submissions. When that threshold is reached, new job submissions > get a try-again-later response. This is also a generally useful feature and > should be a fairly constrained set of changes. > I think the third option is the most approachable. It's the smallest change, > and it adds useful behavior beyond solving the original issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
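Option 3 above can be sketched as a bounded executor that rejects work instead of queueing it without limit. This is a hypothetical illustration of the back-pressure idea, not actual ResourceManager code; the class and method names are made up for the sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Cap the threads handling submissions and answer "try again later" when
// they are all busy, instead of letting requests pile up unboundedly.
public class SubmissionGate {
    private final ExecutorService pool;

    public SubmissionGate(int maxThreads) {
        // A zero-capacity handoff queue: a task is accepted only when a
        // worker is free; otherwise execute() rejects it immediately.
        pool = new ThreadPoolExecutor(maxThreads, maxThreads,
                0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>(),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Returns false to signal the caller to retry later (back-pressure). */
    public boolean trySubmit(Runnable persistAndSubmit) {
        try {
            pool.execute(persistAndSubmit);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

A caller that receives {{false}} would map it to a try-again-later response to the client, which keeps the submit path synchronous again without unbounded queueing.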
[jira] [Updated] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4665: Assignee: Daniel Templeton (was: Naganarasimha G R)
[jira] [Updated] (YARN-4616) Default RM retry interval (30s) is too long
[ https://issues.apache.org/jira/browse/YARN-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4616: -- Fix Version/s: (was: 2.8.0)
> Default RM retry interval (30s) is too long
> ---
>
> Key: YARN-4616
> URL: https://issues.apache.org/jira/browse/YARN-4616
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jian He
> Assignee: Jian He
>
> I think the default 30s for the RM retry interval is too long.
> The default node-heartbeat-interval is only 1s.
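For context, the client-side RM connect retry interval this issue appears to refer to is overridable in yarn-site.xml. The property name below should be verified against your Hadoop version, and the value shown is only an illustrative example, not a recommendation:

```xml
<property>
  <!-- Interval between client attempts to reconnect to the RM.
       The default in this era of Hadoop is 30000 ms; 2000 ms is
       shown purely as an example of lowering it. -->
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>2000</value>
</property>
```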
[jira] [Updated] (YARN-4667) RM Admin CLI for refreshNodesResources throws NPE when nothing is configured
[ https://issues.apache.org/jira/browse/YARN-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4667: Attachment: YARN-4667.v1.001.patch Attaching a patch to fix this issue. [~rohithsharma]/[~devaraj.k], could one of you review this simple fix?
> RM Admin CLI for refreshNodesResources throws NPE when nothing is configured
>
> Key: YARN-4667
> URL: https://issues.apache.org/jira/browse/YARN-4667
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Attachments: YARN-4667.v1.001.patch
>
> {quote}
> $ ./yarn rmadmin -refreshNodesResources
> 16/02/03 10:54:27 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033
> refreshNodesResources: java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshNodesResources(AdminService.java:655)
> at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshNodesResources(ResourceManagerAdministrationProtocolPBServiceImpl.java:246)
> at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:287)
> {quote}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
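The general class of fix for an NPE like this is a guard that substitutes an empty result when nothing is configured. The sketch below is hypothetical and is not the actual AdminService patch; the class, method, and key are made up to illustrate the pattern:

```java
import java.util.Collections;
import java.util.Map;

public class RefreshGuard {
    /**
     * When no dynamic resource configuration has been provided, fall back
     * to an empty map instead of letting a null flow into later code and
     * surface as a NullPointerException.
     */
    public static Map<String, Integer> nodeResources(Map<String, Integer> configured) {
        return (configured != null) ? configured : Collections.<String, Integer>emptyMap();
    }
}
```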
[jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130825#comment-15130825 ] Naganarasimha G R commented on YARN-4665: - Thanks for the clarification, [~vvasudev] & [~jlowe].
bq. but I'd expect the submission logic to be a POST followed by GET polling until the state is ACCEPTED or later. If the GET results in a no-such-app error then the client retries the POST and continues polling.
IIUC, a REST API user *needs to take care of this explicitly* in the way described above for the submission to go through reliably. If so, we should capture it in the documentation, since nothing about this is mentioned in the 2.7.2 docs. Or correct me if I am missing something.
[~vvasudev],
bq. Internally the functionality uses the same code flow as the RPC path - all calls flow through ClientRMService#submitApplication.
IIUC, the concern here is: because app submission is asynchronous, the submit call might return successfully while the state-store operation fails, so on RM failover the submitted app is lost. With {{YarnClient}}, the client takes care of re-requesting until the app state is appropriate, but with REST, the caller/user needs to take care of calling GET on the apps endpoint after POSTing an app submission. ??subsequent re-submits?? are handled on the server side, but the client needs to retry until it no longer gets a no-such-app error, right?
[jira] [Commented] (YARN-4665) Asynch submit can lose application submissions
[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130847#comment-15130847 ] Varun Vasudev commented on YARN-4665: - {quote} IIUC here the concern is, as the app submission is asynchronous so the submit call might return successfully but the statestore operation fails so on RM failover the submitted app is lost. In case of YarnClient, client takes care of re-requesting till the app state is appropriate but in case of REST, caller/user needs to take care of calling GET apps after doing a POST submission of a app. subsequent re-submits is handled in the server side but client needs to retry until it doesn't get a no-such-app error, right ? {quote} Yes. In the REST case, the submit call will return a 202 Accepted. It's the responsibility of the REST client to poll to figure out the state and re-submit if necessary.
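The POST-then-poll contract discussed in the comments above can be sketched as a small client-side loop. The helper below is illustrative, not part of any YARN client library: it abstracts the HTTP calls behind functional interfaces so the retry logic itself is testable; in real code, {{post}} would hit POST /ws/v1/cluster/apps (which answers 202 Accepted) and {{getStatus}} would be the response code of GET /ws/v1/cluster/apps/{appid}.

```java
import java.util.function.IntSupplier;

// POST the app, poll GET for it, and re-POST if the app never becomes
// visible (e.g. a no-such-app 404 after the submission was lost in an
// RM failover before the state-store write completed).
public class SubmitRetry {
    /**
     * Returns the number of POST rounds it took until the app was
     * visible (HTTP 200), or -1 if maxRounds were exhausted.
     */
    public static int submit(Runnable post, IntSupplier getStatus,
                             int pollsPerRound, int maxRounds) {
        for (int round = 1; round <= maxRounds; round++) {
            post.run();                              // submission is asynchronous
            for (int i = 0; i < pollsPerRound; i++) {
                if (getStatus.getAsInt() == 200) {   // app visible in the RM
                    return round;
                }
                // Otherwise (e.g. 404 no-such-app) keep polling; if the
                // whole round comes up empty, fall through and re-POST.
            }
        }
        return -1;                                   // caller surfaces the failure
    }
}
```

A real client would also sleep between polls and treat 5xx responses differently from 404, but the shape of the loop is the point: the POST alone proves nothing until a GET confirms the app exists.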
[jira] [Updated] (YARN-3669) Attempt-failures validity interval should have a global admin configurable lower limit
[ https://issues.apache.org/jira/browse/YARN-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3669: Attachment: YARN-3669.2.patch
> Attempt-failures validity interval should have a global admin configurable lower limit
> ---
>
> Key: YARN-3669
> URL: https://issues.apache.org/jira/browse/YARN-3669
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Xuan Gong
> Labels: newbie
> Attachments: YARN-3669.1.patch, YARN-3669.2.patch
>
> Found this while reviewing YARN-3480.
> bq. When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts might be very large. So we need to delete some attempts stored in the RMStateStore.
> I think we need to have a lower limit on the failure-validity interval to avoid situations like this.
> Having this will avoid pardoning too many failures in too short a duration.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
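The proposed lower limit amounts to clamping whatever validity interval an application requests to an admin-configured floor. This is a minimal sketch of the idea only; the class and method names are hypothetical and this is not the YARN-3669 patch:

```java
// Clamp the per-app attempt-failures validity interval to an
// admin-configured floor, so a tiny requested interval cannot cause
// unbounded attempt retention or pardon too many failures too quickly.
public class ValidityInterval {
    /** Effective interval: never shorter than the admin-configured floor. */
    public static long effective(long requestedMs, long adminFloorMs) {
        return Math.max(requestedMs, adminFloorMs);
    }
}
```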