[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560663#comment-14560663 ] Hong Zhiguo commented on YARN-3678: --- First, stopping a container happens frequently. Second, the pid recycling doesn't need to complete a whole round in 250ms; it only needs one or more rounds during the container lifetime. If we have 100 stop-container events on one node per day, we have 100/32768, about a 0.3% chance per node per day. That's not very low, especially when we have 5000 nodes. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
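A rough back-of-the-envelope check of the rate claimed above, assuming the default pid_max of 32768 and that each stop-container event has an independent 1/32768 chance of landing on a recycled pid:
{noformat}
per node, per day:   100 stops x 1/32768                 ~ 0.3%
5000-node cluster:   1 - (1 - 0.003)^5000 ~ 1 - e^-15    (near-certain)
                     i.e. roughly 15 expected collisions per day
{noformat}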
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560607#comment-14560607 ] Varun Vasudev commented on YARN-3678: - [~zhiguohong] thanks for the detailed explanation! When you say your fix reduced the rate to nearly zero, do you know why the accidental kill continued to happen? DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3719) Improve Solaris support in YARN
[ https://issues.apache.org/jira/browse/YARN-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560760#comment-14560760 ] Alan Burlison commented on YARN-3719: - It appears that Solaris will have the same setsid-related issue as BSD: YARN-3066 Hadoop leaves orphaned tasks running after job is killed Improve Solaris support in YARN --- Key: YARN-3719 URL: https://issues.apache.org/jira/browse/YARN-3719 Project: Hadoop YARN Issue Type: New Feature Components: build Affects Versions: 2.7.0 Environment: Solaris x86, Solaris sparc Reporter: Alan Burlison At present the YARN native components aren't fully supported on Solaris primarily due to differences between Linux and Solaris. This top-level task will be used to group together both existing and new issues related to this work. A second goal is to improve YARN performance and functionality on Solaris wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560808#comment-14560808 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Yarn-trunk #940 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/940/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for a queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
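A minimal sketch of the normalization this fix calls for, with hypothetical names (the real change lives in CapacitySchedulerConfiguration and the PBImpl classes listed above):
{code}
// Hypothetical helper mirroring the fix's idea: normalize a configured
// default node label expression so " labelX " behaves like "labelX".
class NodeLabelTrimSketch {
  static String normalize(String configured) {
    return configured == null ? null : configured.trim();
  }

  public static void main(String[] args) {
    System.out.println("[" + normalize(" labelX ") + "]"); // prints [labelX]
  }
}
{code}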
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560812#comment-14560812 ] Hudson commented on YARN-3632: -- FAILURE: Integrated in Hadoop-Yarn-trunk #940 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/940/]) YARN-3632. Ordering policy should be allowed to reorder an application when demand changes. Contributed by Craig Welch (jianhe: rev 10732d515f62258309f98e4d7d23249f80b1847d)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java
Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch, YARN-3632.6.patch, YARN-3632.7.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison; this needs to be made available (and used by the FairOrderingPolicy when sizeBasedWeight is true). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
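The core mechanics behind such a reorder, sketched with hypothetical names (not the actual OrderingPolicy API): a comparator-backed collection like the one AbstractComparatorOrderingPolicy maintains must remove and re-add an entry around any change to its comparison key, or the set's ordering invariant silently breaks.
{code}
import java.util.Comparator;
import java.util.TreeSet;

// Illustrative sketch only; names are hypothetical.
class DemandOrderingSketch {
  static class App {
    final String id;
    long demand;
    App(String id, long demand) { this.id = id; this.demand = demand; }
  }

  private final TreeSet<App> order = new TreeSet<>(
      Comparator.comparingLong((App a) -> a.demand).thenComparing((App a) -> a.id));

  void add(App a) { order.add(a); }

  void updateDemand(App a, long newDemand) {
    order.remove(a);      // remove under the old comparison key
    a.demand = newDemand; // mutate only while the entry is out of the set
    order.add(a);         // re-insert under the new key
  }

  public static void main(String[] args) {
    DemandOrderingSketch s = new DemandOrderingSketch();
    App a = new App("app-1", 10);
    App b = new App("app-2", 20);
    s.add(a);
    s.add(b);
    s.updateDemand(a, 30);                  // app-1 now sorts after app-2
    System.out.println(s.order.first().id); // prints app-2
  }
}
{code}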
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560810#comment-14560810 ] Hudson commented on YARN-160: - FAILURE: Integrated in Hadoop-Yarn-trunk #940 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/940/]) YARN-160. Enhanced NodeManager to automatically obtain cpu/memory values from underlying OS when configured to do so. Contributed by Varun Vasudev. (vinodkv: rev 500a1d9c76ec612b4e737888f4be79951c11591d)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java
* hadoop-tools/hadoop-gridmix/src/test/java/org/apache/hadoop/mapred/gridmix/DummyResourceCalculatorPlugin.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java
nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, YARN-160.008.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values come from the NM's config; we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo).
As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (an amount of mem/cpu not to be available as a YARN resource); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
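For illustration, a self-contained sketch (not the actual plugin code) of reading the node's physical memory from /proc/meminfo, the Linux source the description points at:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class MemInfoSketch {
  // Returns the MemTotal value in kB, or -1 if the field is absent.
  static long physicalMemoryKb() throws IOException {
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
      if (line.startsWith("MemTotal:")) {
        // Format: "MemTotal:       16314828 kB"
        String[] parts = line.split("\\s+");
        return Long.parseLong(parts[1]);
      }
    }
    return -1;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("MemTotal (kB): " + physicalMemoryKb());
  }
}
{code}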
[jira] [Updated] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3714: --- Attachment: YARN-3714.001.patch The attached patch makes WebAppUtils#getProxyHostsAndPortsForAmFilter get the RM webapp addresses from {{yarn.resourcemanager.hostname._rm-id_}} and the default port number if {{yarn.resourcemanager.webapp.(https.)address._rm-id_}} are not set. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch The default proxy address cannot be obtained without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
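A hedged sketch of the fallback the patch describes (hypothetical helper, not the actual WebAppUtils code): if yarn.resourcemanager.webapp.address.<rm-id> is unset, derive the address from yarn.resourcemanager.hostname.<rm-id> plus the default webapp port.
{code}
import java.util.HashMap;
import java.util.Map;

class RmWebAppAddressSketch {
  static final int DEFAULT_RM_WEBAPP_PORT = 8088; // HTTP default

  static String webAppAddressForRm(Map<String, String> conf, String rmId) {
    String addr = conf.get("yarn.resourcemanager.webapp.address." + rmId);
    if (addr != null) {
      return addr; // explicit setting wins
    }
    // Fall back to the per-RM hostname plus the default port.
    return conf.get("yarn.resourcemanager.hostname." + rmId)
        + ":" + DEFAULT_RM_WEBAPP_PORT;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("yarn.resourcemanager.hostname.rm1", "rm1.example.com");
    System.out.println(webAppAddressForRm(conf, "rm1")); // rm1.example.com:8088
  }
}
{code}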
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560735#comment-14560735 ] Varun Saxena commented on YARN-3678: Yeah, that's why I said that if we can increase the value of {{pid_max}} on a 64-bit machine to the highest value it can take, i.e. 2^22, that should mitigate the risk of this happening. But anyway, as I mentioned above, we can fix this. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
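An illustrative check of the mitigation discussed above: on 64-bit Linux, kernel.pid_max can be raised to 2^22 = 4194304 (e.g. via "sysctl -w kernel.pid_max=4194304"), which widens the pid space and makes recycling correspondingly rarer. This sketch just reads the live value:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class PidMaxSketch {
  public static void main(String[] args) throws IOException {
    long pidMax = Long.parseLong(
        Files.readAllLines(Paths.get("/proc/sys/kernel/pid_max")).get(0).trim());
    System.out.println("kernel.pid_max = " + pidMax
        + (pidMax >= (1L << 22) ? " (already at/above 2^22)" : ""));
  }
}
{code}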
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560658#comment-14560658 ] Hadoop QA commented on YARN-3714: -
\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 32s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 15s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 58s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-server-web-proxy. |
| | | 39m 53s | |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735545/YARN-3714.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cdbd66b |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8097/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8097/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/8097/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8097/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8097/console |
This message was automatically generated. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch The default proxy address cannot be obtained without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-41: -- Attachment: YARN-41-8.patch The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, the RM should remove and handle the NM, which is shut down gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560776#comment-14560776 ] Weiwei Yang commented on YARN-1042: --- I am thinking about the following approach and would appreciate suggestions : ) In the ApplicationSubmissionContext class, add a new argument to indicate the container allocation rule in terms of affinity/anti-affinity. The RM will follow the given rules to allocate containers for this application. The argument is an instance of a (new) class ContainerAllocationRule; this class defines several types of allocation rules (sketched after this mail), such as:
* AFFINITY_REQUIRED: containers MUST be allocated on the same host/rack
* AFFINITY_PREFERRED: prefer to allocate containers on the same host/rack if possible
* ANTI_AFFINITY_REQUIRED: containers MUST be allocated on different hosts/racks
* ANTI_AFFINITY_PREFERRED: prefer to allocate containers on different hosts/racks if possible
Each of these rules will have a handler on the RM side to add some control on container allocation. When a client submits an application with a certain ContainerAllocationRule to the RM, this information will be added into ApplicationAttemptId (because the allocation rule is defined per application). When the RM uses a registered scheduler to allocate containers, it can retrieve the rule from ApplicationAttemptId and call the particular handler during allocation. The code can be added into SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens so as to avoid modifying all schedulers. add ability to specify affinity/anti-affinity in container requests --- Key: YARN-1042 URL: https://issues.apache.org/jira/browse/YARN-1042 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Arun C Murthy Attachments: YARN-1042-demo.patch container requests to the AM should be able to request anti-affinity to ensure that things like Region Servers don't come up on the same failure zones. Similarly, you may want to specify affinity to the same host or rack without specifying which specific host/rack. Example: bringing up a small giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
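A minimal sketch of the proposed (not yet existing) ContainerAllocationRule type, following the four rule names listed in the comment above:
{code}
// Hypothetical enum for the proposal above; none of these names exist
// in YARN yet.
public enum ContainerAllocationRule {
  AFFINITY_REQUIRED,       // containers MUST be allocated on the same host/rack
  AFFINITY_PREFERRED,      // prefer the same host/rack if possible
  ANTI_AFFINITY_REQUIRED,  // containers MUST be allocated on different hosts/racks
  ANTI_AFFINITY_PREFERRED  // prefer different hosts/racks if possible
}
{code}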
[jira] [Commented] (YARN-3712) ContainersLauncher: handle event CLEANUP_CONTAINER asynchronously
[ https://issues.apache.org/jira/browse/YARN-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560679#comment-14560679 ] Jun Gong commented on YARN-3712: [~vinodkv] Our case: the NM receives a SHUTDOWN event and starts to clean up containers. If we do it synchronously and cleaning up takes a fairly long time, some containers might not be killed and cleaned up; then the corresponding container-launching thread (ContainersLauncher #..) will not exit until the container finishes. It results in a problem like YARN-3585: NM hang. ContainersLauncher: handle event CLEANUP_CONTAINER asynchronously - Key: YARN-3712 URL: https://issues.apache.org/jira/browse/YARN-3712 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3712.01.patch, YARN-3712.02.patch It will save some time to handle the CLEANUP_CONTAINER event asynchronously. This improvement will be useful for cases where cleaning up a container costs a fairly long time (e.g. in our case we run Docker containers on the NM, and it takes over 1 second to clean up one docker container) and there are many containers to clean up (e.g. the NM needs to clean up all running containers at NM shutdown). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
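A minimal sketch of the improvement's idea, with hypothetical types (not the actual ContainersLauncher code): hand each CLEANUP_CONTAINER event to a pool instead of blocking the event-handling thread, so one slow cleanup (e.g. a ~1s "docker rm") cannot stall the remaining containers at shutdown.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncCleanupSketch {
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(4);

  void onCleanupContainerEvent(Runnable cleanup) {
    cleanupPool.submit(cleanup); // returns immediately; cleanup runs async
  }

  void shutdown() {
    cleanupPool.shutdown(); // let queued cleanups drain
  }
}
{code}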
[jira] [Commented] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed
[ https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560762#comment-14560762 ] Alan Burlison commented on YARN-3066: - As Linux, OSX, Solaris and BSD all support the setsid(2) syscall and it's part of POSIX (http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm), isn't a better solution just to wrap setsid() + exec() in a little bit of JNI? That would avoid the need to install external executables. Hadoop leaves orphaned tasks running after job is killed Key: YARN-3066 URL: https://issues.apache.org/jira/browse/YARN-3066 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1 Reporter: Dmitry Sivachenko When spawning a user task, the node manager checks for the setsid(1) utility and spawns the task program via it. See hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java for instance: String exec = Shell.isSetsidAvailable ? "exec setsid" : "exec"; FreeBSD, unlike Linux, does not have the setsid(1) utility, so a plain exec is used to spawn the user task. If that task spawns other external programs (a common case if the task program is a shell script) and the user kills the job via mapred job -kill Job, these child processes remain running. 1) Why do you silently ignore the absence of setsid(1) and spawn the task process via plain exec? This guarantees orphaned processes when a job is prematurely killed. 2) FreeBSD has a replacement third-party program called ssid (which does almost the same as Linux's setsid). It would be nice to detect which binary is present during the configure stage and put a @SETSID@ macro into the java file to use the correct name. I propose to make the Shell.isSetsidAvailable test more strict and fail to start if it is not found: at least we will know about the problem at start rather than guess why there are orphaned tasks running forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
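A hedged sketch of the stricter check the reporter proposes (hypothetical helper, not Hadoop's Shell class): look for setsid, fall back to FreeBSD's third-party ssid, and fail fast instead of silently exec'ing without a session leader.
{code}
import java.io.File;

class SetsidCheckSketch {
  static String findSessionWrapper() {
    for (String candidate : new String[] {"setsid", "ssid"}) {
      for (String dir : System.getenv("PATH").split(File.pathSeparator)) {
        File f = new File(dir, candidate);
        if (f.isFile() && f.canExecute()) {
          return f.getAbsolutePath(); // first usable wrapper wins
        }
      }
    }
    throw new IllegalStateException(
        "no setsid/ssid found; refusing to start (would orphan child tasks)");
  }

  public static void main(String[] args) {
    System.out.println("session wrapper: " + findSessionWrapper());
  }
}
{code}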
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560774#comment-14560774 ] Rohith commented on YARN-3535: -- Thanks [~peng.zhang] for working on this issue. Some comments:
# I think the method {{recoverResourceRequestForContainer}} should be synchronized; any thoughts?
# Why do we require the {{RMContextImpl.java}} changes? I think we can avoid them; they are not necessarily required.
Tests:
# Any specific reason for changing {{TestAMRestart.java}}?
# IIUC, this issue can occur in all the schedulers, given the AM-RM heartbeat interval is less than the NM-RM heartbeat interval. So can it include a functional test case applicable to both CS and FS? Maybe you can add a test in the class extending {{ParameterizedSchedulerTestBase}}, i.e. TestAbstractYarnScheduler.
ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Labels: BB2015-05-TBR Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During a rolling update of the NM, the AM's start of a container on the NM failed, and then the job hung there. AM logs attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560824#comment-14560824 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #210 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/210/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java
CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for a queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3712) ContainersLauncher: handle event CLEANUP_CONTAINER asynchronously
[ https://issues.apache.org/jira/browse/YARN-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560632#comment-14560632 ] Jun Gong commented on YARN-3712: [~sidharta-s] [~ashahab] Thanks for the suggestion. I am referring to both cleaning up the docker image and the container instance. To add a feature that restarts stopped containers, we modified DockerContainerExecutor and separated docker run ... --rm into docker run -d and docker rm $CONTAINER_NAME. docker rm takes over 1 second. ContainersLauncher: handle event CLEANUP_CONTAINER asynchronously - Key: YARN-3712 URL: https://issues.apache.org/jira/browse/YARN-3712 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3712.01.patch, YARN-3712.02.patch It will save some time to handle the CLEANUP_CONTAINER event asynchronously. This improvement will be useful for cases where cleaning up a container costs a fairly long time (e.g. in our case we run Docker containers on the NM, and it takes over 1 second to clean up one docker container) and there are many containers to clean up (e.g. the NM needs to clean up all running containers at NM shutdown). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3627) Preemption not triggered in Fair scheduler when maxResources is set on parent queue
[ https://issues.apache.org/jira/browse/YARN-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt resolved YARN-3627. Resolution: Not A Problem Closing this issue as per the comments. Preemption not triggered in Fair scheduler when maxResources is set on parent queue --- Key: YARN-3627 URL: https://issues.apache.org/jira/browse/YARN-3627 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, scheduler Environment: Suse 11 SP3, 2 NM Reporter: Bibin A Chundatt Consider the below scenario of fair configuration:
Root (10Gb cluster resource)
--Q1 (maxResources 4gb)
  Q1.1 (maxResources 4gb)
  Q1.2 (maxResources 4gb)
--Q2 (maxResources 6GB)
No applications are running in Q2. Submit one application to Q1.1 with 50 maps; 4Gb gets allocated to Q1.1. Now submit an application to Q1.2; it will always be starving for memory. Preemption will never get triggered since yarn.scheduler.fair.preemption.cluster-utilization-threshold = .8 and the cluster utilization is below .8.
*Fairscheduler.java*
{code}
private boolean shouldAttemptPreemption() {
  if (preemptionEnabled) {
    return (preemptionUtilizationThreshold < Math.max(
        (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(),
        (float) rootMetrics.getAllocatedVirtualCores()
            / clusterResource.getVirtualCores()));
  }
  return false;
}
{code}
Are we supposed to configure maxResources of 0mb and 0 cores in a running cluster so that all queues can always take the full cluster resources if available?? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
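A worked check of the scenario's numbers, assuming the 10Gb cluster and 4Gb allocation above and that vcore utilization is no higher than memory utilization:
{noformat}
utilization = max(allocatedMB / clusterMB, allocatedVCores / clusterVCores)
            = max(4Gb / 10Gb, ...) = 0.4
0.4 < 0.8 (cluster-utilization-threshold)  =>  shouldAttemptPreemption() returns false
{noformat}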
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560651#comment-14560651 ] Naganarasimha G R commented on YARN-3678: - Hi [~vvasudev] [~zhiguohong], for us it happened in a secure setup, and one key point is that the NM user and the user of the container are the same. But irrespective of this, it could have killed any other process [container] of the same/another app running on the same node, submitted by the same user. One suggestion (a crude fix; not sure how to get it working on other OSes): can we grep for the containerID to confirm it's the same process we are targeting, and then kill it? DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560676#comment-14560676 ] Varun Vasudev commented on YARN-3678: - [~zhiguohong] - sorry, my question was: after applying your fix, the problem should have gone away. However, you said "With this fix, the accident rate is reduced from several times per day to nearly zero." Do you know why it still happened? DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560630#comment-14560630 ] Varun Saxena commented on YARN-3678: Secure. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560643#comment-14560643 ] Varun Saxena commented on YARN-3678: As [~zhiguohong] mentioned, even in our case the same user is used for the NM and the app-submitter. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560748#comment-14560748 ] Hong Zhiguo commented on YARN-3678: --- The event sequence: call SEND SIGTERM -> pid recycle -> call SEND SIGKILL -> check process live time (based on current time). The time between [call SEND SIGTERM] and [call SEND SIGKILL] is 250ms. The time between [pid recycle] and [check process live time] may be shorter or longer than 250ms. When it's longer than 250ms, there's a chance we make a false-positive judgement. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
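A simplified sketch of the delayed-kill pattern under discussion (hypothetical SignalSender; the real code path goes through the container executor): SIGTERM now, SIGKILL 250ms later. Nothing revalidates the pid inside that window, so if the pid is recycled there, the SIGKILL lands on an unrelated process.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DelayedKillSketch {
  interface SignalSender { void signal(int pid, String signal); }

  static void terminateThenKill(SignalSender sender, int pid,
      ScheduledExecutorService scheduler) {
    sender.signal(pid, "SIGTERM");
    scheduler.schedule(
        () -> sender.signal(pid, "SIGKILL"), // pid may no longer be ours
        250, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    terminateThenKill((pid, sig) -> System.out.println(sig + " -> " + pid),
        12345, ses);
    Thread.sleep(300); // let the delayed SIGKILL fire in this demo
    ses.shutdown();
  }
}
{code}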
[jira] [Commented] (YARN-3718) hadoop-yarn-server-nodemanager's use of Linux Cgroups is non-portable
[ https://issues.apache.org/jira/browse/YARN-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560766#comment-14560766 ] Alan Burlison commented on YARN-3718: - As far as I can tell, the solution on BSD is just to disable all resource management features at compile-time. Whilst that approach should probably be taken initially on Solaris, if it makes sense to use RM features on Linux it almost certainly does on Solaris as well. Doing that requires taking a close look at how the Linux Cgroup features are currently used and, if necessary, abstracting that functionality so it can be implemented using both Linux and Solaris RM functionality. hadoop-yarn-server-nodemanager's use of Linux Cgroups is non-portable - Key: YARN-3718 URL: https://issues.apache.org/jira/browse/YARN-3718 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Environment: BSD OSX Solaris Windows Linux Reporter: Alan Burlison hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c makes use of the Linux-only Cgroups feature (http://en.wikipedia.org/wiki/Cgroups) when Hadoop is built on Linux, but there is no corresponding functionality for non-Linux platforms. Other platforms provide similar functionality, e.g. Solaris has an extensive range of resource management features (http://docs.oracle.com/cd/E23824_01/html/821-1460/index.html). Work is needed to abstract the resource management features of YARN so that the same facilities for resource management can be provided on all platforms that provide the requisite functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560615#comment-14560615 ] Varun Saxena commented on YARN-3678: I think if we increase the value of {{pid_max}}, the issue is unlikely to occur. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose one container finished; it will then do clean-up. The PID file still exists and will trigger signalContainer once, which kills the process with the pid in the PID file. But as the container already finished, this PID may be occupied by another process, and this may cause a serious issue. As far as I know, my NM was killed unexpectedly; what I described can be the cause, even if it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3627) Preemption not triggered in Fair scheduler when maxResources is set on parent queue
[ https://issues.apache.org/jira/browse/YARN-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560636#comment-14560636 ] Bibin A Chundatt commented on YARN-3627: [~sunilg], thank you for looking into the issue. Preemption not triggered in Fair scheduler when maxResources is set on parent queue --- Key: YARN-3627 URL: https://issues.apache.org/jira/browse/YARN-3627 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, scheduler Environment: Suse 11 SP3, 2 NM Reporter: Bibin A Chundatt Consider the below scenario of fair configuration:
Root (10Gb cluster resource)
--Q1 (maxResources 4gb)
  Q1.1 (maxResources 4gb)
  Q1.2 (maxResources 4gb)
--Q2 (maxResources 6GB)
No applications are running in Q2. Submit one application to Q1.1 with 50 maps; 4Gb gets allocated to Q1.1. Now submit an application to Q1.2; it will always be starving for memory. Preemption will never get triggered since yarn.scheduler.fair.preemption.cluster-utilization-threshold = .8 and the cluster utilization is below .8.
*Fairscheduler.java*
{code}
private boolean shouldAttemptPreemption() {
  if (preemptionEnabled) {
    return (preemptionUtilizationThreshold < Math.max(
        (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(),
        (float) rootMetrics.getAllocatedVirtualCores()
            / clusterResource.getVirtualCores()));
  }
  return false;
}
{code}
Are we supposed to configure maxResources of 0mb and 0 cores in a running cluster so that all queues can always take the full cluster resources if available?? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560817#comment-14560817 ] Hadoop QA commented on YARN-41: ---
\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 46s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 12 new or modified test files. |
| {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 2m 24s | The applied patch generated 1 new checkstyle issues (total was 14, now 12). |
| {color:green}+1{color} | whitespace | 0m 34s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 5m 9s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. |
| {color:green}+1{color} | yarn tests | 6m 8s | Tests passed in hadoop-yarn-server-nodemanager. |
| {color:green}+1{color} | yarn tests | 50m 11s | Tests passed in hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests | 1m 53s | Tests passed in hadoop-yarn-server-tests. |
| | | 102m 10s | |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735565/YARN-41-8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bb18163 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/8098/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8098/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8098/console |
This message was automatically generated. The RM should handle the graceful shutdown of the NM.
- Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, the RM should remove and handle the NM, which is shut down gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560948#comment-14560948 ] Jason Lowe commented on YARN-3585: -- Do you have the shutdown logs from the NM that hung? It seems very likely that somehow we did not close the leveldb state store cleanly, if you're seeing a leveldb non-daemon thread holding up the JVM shutdown. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission the nodemanager log shows it stopped, but the process cannot end. Non-daemon threads:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
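To make the failure mode above concrete, a tiny standalone demonstration (an analogy, not NM code): one non-daemon thread, standing in here for the leveldb background thread, keeps the JVM alive after main() returns.
{code}
class NonDaemonHangSketch {
  public static void main(String[] args) {
    Thread t = new Thread(() -> {
      try {
        Thread.sleep(Long.MAX_VALUE); // parked forever, like an unclosed store
      } catch (InterruptedException ignored) { }
    }, "leveldb-stand-in");
    // t.setDaemon(true); // uncommenting this would let the JVM exit
    t.start();
    System.out.println("main() done, but the JVM will not exit");
  }
}
{code}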
[jira] [Commented] (YARN-3690) 'mvn site' fails on JDK8
[ https://issues.apache.org/jira/browse/YARN-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560976#comment-14560976 ] Brahma Reddy Battula commented on YARN-3690: [~ajisakaa] Attached the patch. Kindly review. 'mvn site' fails on JDK8 Key: YARN-3690 URL: https://issues.apache.org/jira/browse/YARN-3690 Project: Hadoop YARN Issue Type: Bug Components: api, site Environment: CentOS 7.0, Oracle JDK 8u45. Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Attachments: YARN-3690-patch 'mvn site' failed with the following error:
{noformat}
[ERROR] /home/aajisaka/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/package-info.java:18: error: package org.apache.hadoop.yarn.factories has already been annotated
[ERROR] @InterfaceAudience.LimitedPrivate({ MapReduce, YARN })
[ERROR] ^
[ERROR] java.lang.AssertionError
[ERROR] at com.sun.tools.javac.util.Assert.error(Assert.java:126)
[ERROR] at com.sun.tools.javac.util.Assert.check(Assert.java:45)
[ERROR] at com.sun.tools.javac.code.SymbolMetadata.setDeclarationAttributesWithCompletion(SymbolMetadata.java:161)
[ERROR] at com.sun.tools.javac.code.Symbol.setDeclarationAttributesWithCompletion(Symbol.java:215)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.actualEnterAnnotations(MemberEnter.java:952)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.access$600(MemberEnter.java:64)
[ERROR] at com.sun.tools.javac.comp.MemberEnter$5.run(MemberEnter.java:876)
[ERROR] at com.sun.tools.javac.comp.Annotate.flush(Annotate.java:143)
[ERROR] at com.sun.tools.javac.comp.Annotate.enterDone(Annotate.java:129)
[ERROR] at com.sun.tools.javac.comp.Enter.complete(Enter.java:512)
[ERROR] at com.sun.tools.javac.comp.Enter.main(Enter.java:471)
[ERROR] at com.sun.tools.javadoc.JavadocEnter.main(JavadocEnter.java:78)
[ERROR] at com.sun.tools.javadoc.JavadocTool.getRootDocImpl(JavadocTool.java:186)
[ERROR] at com.sun.tools.javadoc.Start.parseAndExecute(Start.java:346)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:219)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:205)
[ERROR] at com.sun.tools.javadoc.Main.execute(Main.java:64)
[ERROR] at com.sun.tools.javadoc.Main.main(Main.java:54)
[ERROR] javadoc: error - fatal error
[ERROR]
[ERROR] Command line was: /usr/java/jdk1.8.0_45/jre/../bin/javadoc -J-Xmx1024m @options @packages
[ERROR]
[ERROR] Refer to the generated Javadoc files in '/home/aajisaka/git/hadoop/target/site/hadoop-project/api' dir.
[ERROR] - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3690) 'mvn site' fails on JDK8
[ https://issues.apache.org/jira/browse/YARN-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3690: --- Attachment: YARN-3690-patch 'mvn site' fails on JDK8 Key: YARN-3690 URL: https://issues.apache.org/jira/browse/YARN-3690 Project: Hadoop YARN Issue Type: Bug Components: api, site Environment: CentOS 7.0, Oracle JDK 8u45. Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula Attachments: YARN-3690-patch 'mvn site' failed with the following error:
{noformat}
[ERROR] /home/aajisaka/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/package-info.java:18: error: package org.apache.hadoop.yarn.factories has already been annotated
[ERROR] @InterfaceAudience.LimitedPrivate({ MapReduce, YARN })
[ERROR] ^
[ERROR] java.lang.AssertionError
[ERROR] at com.sun.tools.javac.util.Assert.error(Assert.java:126)
[ERROR] at com.sun.tools.javac.util.Assert.check(Assert.java:45)
[ERROR] at com.sun.tools.javac.code.SymbolMetadata.setDeclarationAttributesWithCompletion(SymbolMetadata.java:161)
[ERROR] at com.sun.tools.javac.code.Symbol.setDeclarationAttributesWithCompletion(Symbol.java:215)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.actualEnterAnnotations(MemberEnter.java:952)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.access$600(MemberEnter.java:64)
[ERROR] at com.sun.tools.javac.comp.MemberEnter$5.run(MemberEnter.java:876)
[ERROR] at com.sun.tools.javac.comp.Annotate.flush(Annotate.java:143)
[ERROR] at com.sun.tools.javac.comp.Annotate.enterDone(Annotate.java:129)
[ERROR] at com.sun.tools.javac.comp.Enter.complete(Enter.java:512)
[ERROR] at com.sun.tools.javac.comp.Enter.main(Enter.java:471)
[ERROR] at com.sun.tools.javadoc.JavadocEnter.main(JavadocEnter.java:78)
[ERROR] at com.sun.tools.javadoc.JavadocTool.getRootDocImpl(JavadocTool.java:186)
[ERROR] at com.sun.tools.javadoc.Start.parseAndExecute(Start.java:346)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:219)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:205)
[ERROR] at com.sun.tools.javadoc.Main.execute(Main.java:64)
[ERROR] at com.sun.tools.javadoc.Main.main(Main.java:54)
[ERROR] javadoc: error - fatal error
[ERROR]
[ERROR] Command line was: /usr/java/jdk1.8.0_45/jre/../bin/javadoc -J-Xmx1024m @options @packages
[ERROR]
[ERROR] Refer to the generated Javadoc files in '/home/aajisaka/git/hadoop/target/site/hadoop-project/api' dir.
[ERROR] - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3690) 'mvn site' fails on JDK8
[ https://issues.apache.org/jira/browse/YARN-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3690: ---
Component/s: (was: documentation)
             site
             api
'mvn site' fails on JDK8 Key: YARN-3690 URL: https://issues.apache.org/jira/browse/YARN-3690 Project: Hadoop YARN Issue Type: Bug Components: api, site Environment: CentOS 7.0, Oracle JDK 8u45. Reporter: Akira AJISAKA Assignee: Brahma Reddy Battula 'mvn site' failed with the following error:
{noformat}
[ERROR] /home/aajisaka/git/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/package-info.java:18: error: package org.apache.hadoop.yarn.factories has already been annotated
[ERROR] @InterfaceAudience.LimitedPrivate({ MapReduce, YARN })
[ERROR] ^
[ERROR] java.lang.AssertionError
[ERROR] at com.sun.tools.javac.util.Assert.error(Assert.java:126)
[ERROR] at com.sun.tools.javac.util.Assert.check(Assert.java:45)
[ERROR] at com.sun.tools.javac.code.SymbolMetadata.setDeclarationAttributesWithCompletion(SymbolMetadata.java:161)
[ERROR] at com.sun.tools.javac.code.Symbol.setDeclarationAttributesWithCompletion(Symbol.java:215)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.actualEnterAnnotations(MemberEnter.java:952)
[ERROR] at com.sun.tools.javac.comp.MemberEnter.access$600(MemberEnter.java:64)
[ERROR] at com.sun.tools.javac.comp.MemberEnter$5.run(MemberEnter.java:876)
[ERROR] at com.sun.tools.javac.comp.Annotate.flush(Annotate.java:143)
[ERROR] at com.sun.tools.javac.comp.Annotate.enterDone(Annotate.java:129)
[ERROR] at com.sun.tools.javac.comp.Enter.complete(Enter.java:512)
[ERROR] at com.sun.tools.javac.comp.Enter.main(Enter.java:471)
[ERROR] at com.sun.tools.javadoc.JavadocEnter.main(JavadocEnter.java:78)
[ERROR] at com.sun.tools.javadoc.JavadocTool.getRootDocImpl(JavadocTool.java:186)
[ERROR] at com.sun.tools.javadoc.Start.parseAndExecute(Start.java:346)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:219)
[ERROR] at com.sun.tools.javadoc.Start.begin(Start.java:205)
[ERROR] at com.sun.tools.javadoc.Main.execute(Main.java:64)
[ERROR] at com.sun.tools.javadoc.Main.main(Main.java:54)
[ERROR] javadoc: error - fatal error
[ERROR]
[ERROR] Command line was: /usr/java/jdk1.8.0_45/jre/../bin/javadoc -J-Xmx1024m @options @packages
[ERROR]
[ERROR] Refer to the generated Javadoc files in '/home/aajisaka/git/hadoop/target/site/hadoop-project/api' dir.
[ERROR] - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560853#comment-14560853 ] Devaraj K commented on YARN-41: ---
{code:xml}
-1 checkstyle 2m 24s The applied patch generated 1 new checkstyle issues (total was 14, now 12).
{code}
{code:xml}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/UnRegisterNodeManagerResponse.java:0: Missing package-info.java file.
{code}
This checkstyle issue doesn't seem to be directly related to UnRegisterNodeManagerResponse.java. I added another class, UnRegisterNodeManagerRequest.java, in the same package, which doesn't show any checkstyle issue, and locally I don't get any checkstyle error for this class either. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
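For reference, the checkstyle rule flagged above is normally silenced by adding a package-info.java to the package in question; a minimal sketch (the javadoc text here is illustrative, not taken from the actual patch):
{code:java}
// hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/package-info.java
/**
 * Protocol record classes exchanged between the NodeManager and the
 * ResourceManager.
 */
package org.apache.hadoop.yarn.server.api.protocolrecords;
{code}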
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560945#comment-14560945 ] Hudson commented on YARN-3632: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #208 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/208/]) YARN-3632. Ordering policy should be allowed to reorder an application when demand changes. Contributed by Craig Welch (jianhe: rev 10732d515f62258309f98e4d7d23249f80b1847d) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch, YARN-3632.6.patch, YARN-3632.7.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3679) Add documentation for timeline server filter ordering
[ https://issues.apache.org/jira/browse/YARN-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560959#comment-14560959 ] Mit Desai commented on YARN-3679: - [~zjshen]/[~jeagles] did you guys get a chance to take a look at this? Add documentation for timeline server filter ordering - Key: YARN-3679 URL: https://issues.apache.org/jira/browse/YARN-3679 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-3679.patch Currently the auth filter is before static user filter by default. After YARN-3624, the filter order is no longer reversed. So the pseudo auth's allowing anonymous config is useless with both filters loaded in the new order, because static user will be created before presenting it to auth filter. The user can remove static user filter from the config to get anonymous user work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560941#comment-14560941 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #208 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/208/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java * hadoop-yarn-project/CHANGES.txt CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560943#comment-14560943 ] Hudson commented on YARN-160: - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #208 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/208/]) YARN-160. Enhanced NodeManager to automatically obtain cpu/memory values from underlying OS when configured to do so. Contributed by Varun Vasudev. (vinodkv: rev 500a1d9c76ec612b4e737888f4be79951c11591d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java * hadoop-tools/hadoop-gridmix/src/test/java/org/apache/hadoop/mapred/gridmix/DummyResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, YARN-160.008.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). 
As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560965#comment-14560965 ] Rohith commented on YARN-3585: -- I have attached the NM logs and thread dump in YARN-3640. Could you get them from YARN-3640? NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and jni leveldb thread stack
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561014#comment-14561014 ] Jason Lowe commented on YARN-3585: -- Ah, my apologies. I didn't realize it is failing with the exact same logs, even after YARN-3641. Could you instrument the state store code with logging to verify the leveldb database is indeed being closed even when it hangs? I'm trying to determine whether this is a bug in Hadoop code or in the leveldb code. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and jni leveldb thread stack
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
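A minimal sketch of the kind of instrumentation being requested, assuming the NM state store closes its leveldb handle in a closeStorage()-style shutdown method (the method and field names here are approximations of the actual code, not verbatim):
{code:java}
// Hypothetical logging around the leveldb close to confirm whether
// close() is reached and whether it ever returns when the NM hangs.
protected void closeStorage() throws IOException {
  if (db == null) {
    LOG.info("NM state store: leveldb handle already null at close");
    return;
  }
  LOG.info("NM state store: closing leveldb database");
  long start = Time.monotonicNow();
  db.close(); // JNI call into native leveldb
  LOG.info("NM state store: closed leveldb database in "
      + (Time.monotonicNow() - start) + " ms");
}
{code}
If the first message appears but the second never does, the hang is inside the leveldb JNI close; if both appear, the non-exiting process points back at Hadoop-side threads.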
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560891#comment-14560891 ] Hadoop QA commented on YARN-3678: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12735592/YARN-3678.patch | | Optional Tests | | | git revision | trunk / bb18163 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8099/console | This message was automatically generated. DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Attachments: YARN-3678.patch Suppose one container finished, then it will do clean up, the PID file still exist and will trigger once singalContainer, this will kill the process with the pid in PID file, but as container already finished, so this PID may be occupied by other process, this may cause serious issue. As I know, my NM was killed unexpectedly, what I described can be the cause. Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gu-chi updated YARN-3678: - Attachment: YARN-3678.patch DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Attachments: YARN-3678.patch Suppose one container finished, then it will do clean up, the PID file still exist and will trigger once singalContainer, this will kill the process with the pid in PID file, but as container already finished, so this PID may be occupied by other process, this may cause serious issue. As I know, my NM was killed unexpectedly, what I described can be the cause. Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3051: --- Attachment: YARN-3051-YARN-2928.003.patch [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561391#comment-14561391 ] Joep Rottinghuis commented on YARN-3706: Initially I was stuck on YARN-3721, but I have my environment setup properly now. I'll work on sanitizing the patch and upload a new version. I don't expect the overall structure and approach to significantly change. The updated patch will have deletions and renames from existing classes included (and may therefore be somewhat harder to read). Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561393#comment-14561393 ] Hadoop QA commented on YARN-3051: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12735644/YARN-3051-YARN-2928.003.patch | | Optional Tests | shellcheck javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / e19566a | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8100/console | This message was automatically generated. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3569) YarnClient.getAllQueues returns a list of queues that do not display running apps.
[ https://issues.apache.org/jira/browse/YARN-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561433#comment-14561433 ] Jian He commented on YARN-3569: --- [~spandan], what is your use case? If you want to get applications for the given queues, you can use the API below in YarnClient.
{code}
public abstract List<ApplicationReport> getApplications(Set<String> queues, Set<String> users, Set<String> applicationTypes, EnumSet<YarnApplicationState> applicationStates)
{code}
YarnClient.getAllQueues returns a list of queues that do not display running apps. -- Key: YARN-3569 URL: https://issues.apache.org/jira/browse/YARN-3569 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.8.0 Reporter: Spandan Dutta Assignee: Spandan Dutta Attachments: YARN-3569.patch YarnClient.getAllQueues() returns a list of queues. If we pick a queue from this list and call getApplications on it, we always get an empty list even-though applications are running on that queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
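A minimal usage sketch of that API, assuming an already-started YarnClient; the queue name is illustrative, and null is assumed to mean "no filter" for the arguments we don't care about:
{code:java}
import java.io.IOException;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class QueueAppsExample {
  // Lists RUNNING applications submitted to the given queue; users and
  // applicationTypes are left null to skip filtering on them.
  static void printRunningApps(YarnClient client, String queue)
      throws YarnException, IOException {
    List<ApplicationReport> apps = client.getApplications(
        Collections.singleton(queue), null, null,
        EnumSet.of(YarnApplicationState.RUNNING));
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + " -> " + app.getName());
    }
  }
}
{code}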
[jira] [Updated] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3581: Attachment: YARN-3581.20150528-1.patch Hi [~wangda], I have updated the patch with your earlier review comments fixed, but please check my previous comment, and also confirm whether you require a patch for the 2.7.1 branch (per our offline discussion, you wanted the fix in 2.7.1 so that it can be used there and later removed in 2.8). Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-3581.20150525-1.patch, YARN-3581.20150528-1.patch In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3652) A SchedulerMetrics may be need for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561210#comment-14561210 ] Varun Vasudev commented on YARN-3652: - My apologies for the delay [~xinxianyin]. We do need a SchedulerMetrics class. The general idea is that SchedulerHealth should pick up values from the SchedulerMetrics class, but the SchedulerMetrics class should ideally provide more information. As an example, SchedulerHealth cares about the number of reserved containers, which the SchedulerMetrics class should provide. Ideally, though, the SchedulerMetrics class would also give me some extra information such as the mean, the distribution, and the variance of the number of reserved containers. I think purely for the purposes of YARN-3630, you should modify the SchedulerHealth class to expose the number of waiting events, but we can independently work on a SchedulerMetrics class as well. A SchedulerMetrics may be need for evaluating the scheduler's performance - Key: YARN-3652 URL: https://issues.apache.org/jira/browse/YARN-3652 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Xianyin Xin As discussed in YARN-3630, a {{SchedulerMetrics}} may be need for evaluating the scheduler's performance. The performance indexes includes #events waiting for being handled by scheduler, the throughput, the scheduling delay and/or other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
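To make the mean/variance idea concrete, here is one hypothetical shape such a class could take (none of these names exist in YARN today); it tracks the reserved-container count with Welford's online algorithm, so the extra statistics cost O(1) per sample:
{code:java}
// Hypothetical SchedulerMetrics sketch, not an existing YARN class.
public class SchedulerMetricsSketch {
  private long samples;  // number of recorded samples
  private double mean;   // running mean of reserved containers
  private double m2;     // running sum of squared deviations (Welford)

  public synchronized void recordReservedContainers(int reserved) {
    samples++;
    double delta = reserved - mean;
    mean += delta / samples;
    m2 += delta * (reserved - mean);
  }

  public synchronized double getReservedContainersMean() {
    return mean;
  }

  public synchronized double getReservedContainersVariance() {
    // Sample variance; 0 until we have at least two samples.
    return samples > 1 ? m2 / (samples - 1) : 0.0;
  }
}
{code}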
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561137#comment-14561137 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2156 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2156/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3721) build is broken on YARN-2928 branch due to possible dependency cycle
[ https://issues.apache.org/jira/browse/YARN-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561190#comment-14561190 ] Sangjin Lee commented on YARN-3721: --- [~gtCarrera9], as you point out, Failure to find org.apache.hadoop:hadoop-yarn-server-timelineservice:jar:3.0.0-SNAPSHOT is caused by the cycle in the dependencies. I just want to see whether excluding mini-cluster from the hbase-testing-util module is the correct fix. For that to be the case, none of our unit tests should depend on the mini-cluster module. A follow-up question is, if that is the case, then what do we need from hbase-testing-util? It seems like HBaseTestingUtility (used in TestHBaseTimelineWriterImpl) is provided by hbase-server:test (which is pulled in indirectly by the phoenix dependency?). Then what are we getting from hbase-testing-util that we need? [~swagle]? If we can isolate the thing we need from hbase-testing-util dependencies, then we could possibly remove hbase-testing-util from the dependencies and use that instead. I'm wondering out loud. I suppose it all depends on what we actually use from hbase-testing-util and its dependencies. [~vrushalic], could you take a look at the unit test failure Li mentioned? Is that independent of this issue? Thanks! build is broken on YARN-2928 branch due to possible dependency cycle Key: YARN-3721 URL: https://issues.apache.org/jira/browse/YARN-3721 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Li Lu Priority: Blocker Attachments: YARN-3721-YARN-2928.001.patch The build is broken on the YARN-2928 branch at the hadoop-yarn-server-timelineservice module. It's been broken for a while, but we didn't notice it because the build happens to work despite this if the maven local cache is not cleared. To reproduce, remove all hadoop (3.0.0-SNAPSHOT) artifacts from your maven local cache and build it. Almost certainly it was introduced by YARN-3529. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3723) Need to clearly document primaryFilter and otherInfo value type
Zhijie Shen created YARN-3723: - Summary: Need to clearly document primaryFilter and otherInfo value type Key: YARN-3723 URL: https://issues.apache.org/jira/browse/YARN-3723 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561134#comment-14561134 ] MENG DING commented on YARN-1197: - Correcting a typo in my previous post; it should be: bq. As an example, if a container is currently using 2G, and AM asks to increase its resource to 4G, and then asks again to increase to 6G, but AM doesn't actually use any of the token to increase the resource on NM. In this case, with the current design, RM can only revert the resource allocation back to 4G after expiration, not 2G. I forgot to discuss another important piece. We probably should not use the existing ResourceCalculator to compare two resource capabilities in this project, because: - The DefaultResourceCalculator only compares memory, which won't work if we want to only change CPU cores. - The DominantResourceCalculator may end up comparing different dimensions between two Resources, which doesn't make sense in our project. The way to compare two resources in this project should be straightforward, as follows. Let me know if you think otherwise. - For an increase request, no dimension in the target resource can be smaller than the corresponding dimension in the current resource, and at least one dimension in the target resource must be larger than the corresponding dimension in the current resource. - For a decrease request, no dimension in the target resource can be larger than the corresponding dimension in the current resource, and at least one dimension in the target resource must be smaller than the corresponding dimension in the current resource. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197_Design.pdf, mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
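A minimal sketch of the comparison rule described in the comment above, written against the two dimensions of the 2.x Resource record (memory and vcores); the class and method names are illustrative, not part of any proposed patch:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public final class ResourceChangeValidator {
  // Valid increase: no dimension shrinks and at least one grows.
  public static boolean isValidIncrease(Resource current, Resource target) {
    boolean noneSmaller = target.getMemory() >= current.getMemory()
        && target.getVirtualCores() >= current.getVirtualCores();
    boolean anyLarger = target.getMemory() > current.getMemory()
        || target.getVirtualCores() > current.getVirtualCores();
    return noneSmaller && anyLarger;
  }

  // Valid decrease: the mirror image of the increase rule.
  public static boolean isValidDecrease(Resource current, Resource target) {
    boolean noneLarger = target.getMemory() <= current.getMemory()
        && target.getVirtualCores() <= current.getVirtualCores();
    boolean anySmaller = target.getMemory() < current.getMemory()
        || target.getVirtualCores() < current.getVirtualCores();
    return noneLarger && anySmaller;
  }

  private ResourceChangeValidator() {}
}
{code}
Note that under this rule a request that grows one dimension while shrinking another is neither a valid increase nor a valid decrease, which matches the comment's point that cross-dimension comparisons don't make sense here.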
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561141#comment-14561141 ] Hudson commented on YARN-3632: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2156 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2156/]) YARN-3632. Ordering policy should be allowed to reorder an application when demand changes. Contributed by Craig Welch (jianhe: rev 10732d515f62258309f98e4d7d23249f80b1847d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch, YARN-3632.6.patch, YARN-3632.7.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561139#comment-14561139 ] Hudson commented on YARN-160: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2156 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2156/]) YARN-160. Enhanced NodeManager to automatically obtain cpu/memory values from underlying OS when configured to do so. Contributed by Varun Vasudev. (vinodkv: rev 500a1d9c76ec612b4e737888f4be79951c11591d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java * hadoop-tools/hadoop-gridmix/src/test/java/org/apache/hadoop/mapred/gridmix/DummyResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, YARN-160.008.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). 
As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3721) build is broken on YARN-2928 branch due to possible dependency cycle
[ https://issues.apache.org/jira/browse/YARN-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561218#comment-14561218 ] Vrushali C commented on YARN-3721: -- bq. Vrushali C, could you take a look at the unit test failure Li mentioned? Is that independent of this issue? Thanks! Yes, looking into this now. build is broken on YARN-2928 branch due to possible dependency cycle Key: YARN-3721 URL: https://issues.apache.org/jira/browse/YARN-3721 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Li Lu Priority: Blocker Attachments: YARN-3721-YARN-2928.001.patch The build is broken on the YARN-2928 branch at the hadoop-yarn-server-timelineservice module. It's been broken for a while, but we didn't notice it because the build happens to work despite this if the maven local cache is not cleared. To reproduce, remove all hadoop (3.0.0-SNAPSHOT) artifacts from your maven local cache and build it. Almost certainly it was introduced by YARN-3529. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561059#comment-14561059 ] Hudson commented on YARN-160: - FAILURE: Integrated in Hadoop-Hdfs-trunk #2138 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2138/]) YARN-160. Enhanced NodeManager to automatically obtain cpu/memory values from underlying OS when configured to do so. Contributed by Varun Vasudev. (vinodkv: rev 500a1d9c76ec612b4e737888f4be79951c11591d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-tools/hadoop-gridmix/src/test/java/org/apache/hadoop/mapred/gridmix/DummyResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, YARN-160.008.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). 
As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561061#comment-14561061 ] Hudson commented on YARN-3632: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2138 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2138/]) YARN-3632. Ordering policy should be allowed to reorder an application when demand changes. Contributed by Craig Welch (jianhe: rev 10732d515f62258309f98e4d7d23249f80b1847d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch, YARN-3632.6.patch, YARN-3632.7.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler
[ https://issues.apache.org/jira/browse/YARN-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3558: --- Attachment: rm.log Amlog.txt [~sunilg], attaching the RM log and AM log. Additional containers getting reserved from RM in case of Fair scheduler Key: YARN-3558 URL: https://issues.apache.org/jira/browse/YARN-3558 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.7.0 Environment: OS: Suse 11 SP3 Setup: 2 RM 2 NM Scheduler: Fair scheduler Reporter: Bibin A Chundatt Attachments: Amlog.txt, rm.log Submit PI job with 16 maps. Total containers expected: 16 maps + 1 reduce + 1 AM. Total containers reserved by RM is 21. The following containers are not being used for execution: container_1430213948957_0001_01_20 container_1430213948957_0001_01_19 RM container reservations and states:
{code}
Processing container_1430213948957_0001_01_01 of type START
Processing container_1430213948957_0001_01_01 of type ACQUIRED
Processing container_1430213948957_0001_01_01 of type LAUNCHED
Processing container_1430213948957_0001_01_02 of type START
Processing container_1430213948957_0001_01_03 of type START
Processing container_1430213948957_0001_01_02 of type ACQUIRED
Processing container_1430213948957_0001_01_03 of type ACQUIRED
Processing container_1430213948957_0001_01_04 of type START
Processing container_1430213948957_0001_01_05 of type START
Processing container_1430213948957_0001_01_04 of type ACQUIRED
Processing container_1430213948957_0001_01_05 of type ACQUIRED
Processing container_1430213948957_0001_01_02 of type LAUNCHED
Processing container_1430213948957_0001_01_04 of type LAUNCHED
Processing container_1430213948957_0001_01_06 of type RESERVED
Processing container_1430213948957_0001_01_03 of type LAUNCHED
Processing container_1430213948957_0001_01_05 of type LAUNCHED
Processing container_1430213948957_0001_01_07 of type START
Processing container_1430213948957_0001_01_07 of type ACQUIRED
Processing container_1430213948957_0001_01_07 of type LAUNCHED
Processing container_1430213948957_0001_01_08 of type RESERVED
Processing container_1430213948957_0001_01_02 of type FINISHED
Processing container_1430213948957_0001_01_06 of type START
Processing container_1430213948957_0001_01_06 of type ACQUIRED
Processing container_1430213948957_0001_01_06 of type LAUNCHED
Processing container_1430213948957_0001_01_04 of type FINISHED
Processing container_1430213948957_0001_01_09 of type START
Processing container_1430213948957_0001_01_09 of type ACQUIRED
Processing container_1430213948957_0001_01_09 of type LAUNCHED
Processing container_1430213948957_0001_01_10 of type RESERVED
Processing container_1430213948957_0001_01_03 of type FINISHED
Processing container_1430213948957_0001_01_08 of type START
Processing container_1430213948957_0001_01_08 of type ACQUIRED
Processing container_1430213948957_0001_01_08 of type LAUNCHED
Processing container_1430213948957_0001_01_05 of type FINISHED
Processing container_1430213948957_0001_01_11 of type START
Processing container_1430213948957_0001_01_11 of type ACQUIRED
Processing container_1430213948957_0001_01_11 of type LAUNCHED
Processing container_1430213948957_0001_01_07 of type FINISHED
Processing container_1430213948957_0001_01_12 of type START
Processing container_1430213948957_0001_01_12 of type ACQUIRED
Processing container_1430213948957_0001_01_12 of type LAUNCHED
Processing container_1430213948957_0001_01_13 of type RESERVED
Processing container_1430213948957_0001_01_06 of type FINISHED
Processing container_1430213948957_0001_01_10 of type START
Processing container_1430213948957_0001_01_10 of type ACQUIRED
Processing container_1430213948957_0001_01_10 of type LAUNCHED
Processing container_1430213948957_0001_01_09 of type FINISHED
Processing container_1430213948957_0001_01_14 of type START
Processing container_1430213948957_0001_01_14 of type ACQUIRED
Processing container_1430213948957_0001_01_14 of type LAUNCHED
Processing container_1430213948957_0001_01_15 of type RESERVED
Processing container_1430213948957_0001_01_08 of type FINISHED
Processing container_1430213948957_0001_01_13 of type START
Processing container_1430213948957_0001_01_16 of type RESERVED
Processing container_1430213948957_0001_01_13 of type ACQUIRED
Processing container_1430213948957_0001_01_13 of type LAUNCHED
Processing
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561057#comment-14561057 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2138 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2138/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed
[ https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561118#comment-14561118 ] Allen Wittenauer commented on YARN-3066: bq. As Linux, OSX, Solaris and BSD all support the setsid(2) syscall and it's part of POSIX (http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm), isn't a better solution just to wrap setsid() + exec() in a little bit of JNI? That would avoid the need to install external executables. That would break platforms that don't have a working libhadoop (which are plentiful). However, there could be a test here that says if libhadoop is available, use it. Hadoop leaves orphaned tasks running after job is killed Key: YARN-3066 URL: https://issues.apache.org/jira/browse/YARN-3066 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1 Reporter: Dmitry Sivachenko When spawning user task, node manager checks for setsid(1) utility and spawns task program via it. See hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java for instance: String exec = Shell.isSetsidAvailable ? "exec setsid" : "exec"; FreeBSD, unlike Linux, does not have setsid(1) utility. So plain exec is used to spawn user task. If that task spawns other external programs (this is common case if a task program is a shell script) and user kills job via mapred job -kill Job, these child processes remain running. 1) Why do you silently ignore the absence of setsid(1) and spawn task process via exec: this is the guarantee to have orphaned processes when job is prematurely killed. 2) FreeBSD has a replacement third-party program called ssid (which does almost the same as Linux's setsid). It would be nice to detect which binary is present during configure stage and put @SETSID@ macros into java file to use the correct name. I propose to make Shell.isSetsidAvailable test more strict and fail to start if it is not found: at least we will know about the problem at start rather than guess why there are orphaned tasks running forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561077#comment-14561077 ] Hudson commented on YARN-3632: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/198/]) YARN-3632. Ordering policy should be allowed to reorder an application when demand changes. Contributed by Craig Welch (jianhe: rev 10732d515f62258309f98e4d7d23249f80b1847d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/OrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FifoOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/AbstractComparatorOrderingPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch, YARN-3632.6.patch, YARN-3632.7.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
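For readers skimming the commit above: the core mechanic is that comparator-backed policies keep applications in an ordered set, so any attribute the comparator reads (such as demand, for FairOrderingPolicy with sizeBasedWeight) can only change safely via a remove/update/re-insert cycle. A schematic sketch with a stand-in type, not the real SchedulableEntity code:
{code}
import java.util.Comparator;
import java.util.TreeSet;

// Schematic sketch of reordering on demand change. App is a stand-in
// for the real SchedulableEntity; the comparator is illustrative.
class DemandAwareOrdering {
  static final class App {
    final String id;
    long demand; // pending resources; read by the comparator
    App(String id, long demand) { this.id = id; this.demand = demand; }
  }

  private final TreeSet<App> order = new TreeSet<>(
      Comparator.comparingLong((App a) -> a.demand).reversed()
          .thenComparing(a -> a.id));

  void onDemandChanged(App app, long newDemand) {
    // Mutating the sort key while the entry sits in the TreeSet would
    // corrupt the ordering, so remove first, then re-insert.
    order.remove(app);
    app.demand = newDemand;
    order.add(app);
  }
}
{code}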
[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression
[ https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561073#comment-14561073 ] Hudson commented on YARN-3686: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/198/]) YARN-3686. CapacityScheduler should trim default_node_label_expression. (Sunil G via wangda) (wangda: rev cdbd66be111c93c85a409d47284e588c453ecae9) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/QueueInfoPBImpl.java CapacityScheduler should trim default_node_label_expression --- Key: YARN-3686 URL: https://issues.apache.org/jira/browse/YARN-3686 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Priority: Critical Fix For: 2.7.1 Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 0003-YARN-3686.patch, 0004-YARN-3686.patch We should trim default_node_label_expression for queue before using it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561075#comment-14561075 ] Hudson commented on YARN-160: - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #198 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/198/]) YARN-160. Enhanced NodeManager to automatically obtain cpu/memory values from underlying OS when configured to do so. Contributed by Varun Vasudev. (vinodkv: rev 500a1d9c76ec612b4e737888f4be79951c11591d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-tools/hadoop-gridmix/src/test/java/org/apache/hadoop/mapred/gridmix/DummyResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, YARN-160.008.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). 
As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as a YARN resource); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
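A minimal sketch of the interface shape described above, with invented names (the committed code uses ResourceCalculatorPlugin and NodeManagerHardwareUtils, which differ in detail):
{code}
// Sketch only: probe the OS for totals, then subtract a configured
// offset reserved for the OS and non-YARN daemons. Names invented.
interface HardwareCapabilities {
  long totalPhysicalMemoryMB(); // e.g. parsed from /proc/meminfo on Linux
  int totalVcores();            // e.g. counted from /proc/cpuinfo on Linux

  default long yarnMemoryMB(long reservedForSystemMB) {
    return Math.max(0, totalPhysicalMemoryMB() - reservedForSystemMB);
  }

  default int yarnVcores(int reservedForSystemVcores) {
    return Math.max(0, totalVcores() - reservedForSystemVcores);
  }
}
{code}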
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561098#comment-14561098 ] MENG DING commented on YARN-1197: - Thanks [~vinodkv] and [~leftnoteasy] for the great comments! *To [~vinodkv]:* bq. Expanding containers at ACQUIRED state sounds useful in theory. But agree with you that we can punt it for later. Thanks for the confirmation :-) bq. To your example of concurrent increase/decrease sizing requests from AM, shall we simply say that only one change-in-progress is allowed for any given container? Actually we really wanted to be able to achieve this, but with the current asymmetric logic of increasing resource via the RM and decreasing resource via the NM, it doesn't seem to be possible :-( The reason is that:
* The increase action starts with the AM requesting the increase from the RM, being granted a resource increase token, then initiating the increase action on the NM, until finally the NM confirms the increase with the RM.
* Once an increase token has been granted to the AM, and before it expires (10 minutes by default), if the AM does not initiate the increase action on the NM, *the NM will have no idea that an increase is already in progress*.
* If, at this moment, the AM initiates a resource decrease action on the NM, the NM will go ahead and honor it.
So in effect, there can be concurrent decrease/increase actions going on, and there doesn't seem to be a way to block this. bq. If we do the above, this will also simplify most of the code, as we will simply have the notion of a Change, instead of an explicit increase/decrease everywhere. For e.g., we will just have a ContainerResourceChangeExpirer. I believe the ContainerResourceChangeExpirer only applies to the container resource increase action. The container decrease action goes directly through the NM, so it does not need expiration logic. bq. There will be races with container-states toggling from RUNNING to finished states, depending on when AM requests a size-change and when NMs report that a container finished. We can simply say that the state at the ResourceManager wins. Agreed. bq. Didn't understand why we need this RM-NM confirmation. The token from RM to AM to NM should be enough for NM to update its view, right? This is for the same reasons listed above. bq. Instead of adding new records for ContainerResourceIncrease / decrease in AllocationResponse, should we add a new field in the API record itself stating if it is a New/Increased/Decreased container? If we move to a single change model, it's likely we will not even need this. I am open to this suggestion. We could add a field in the existing *ContainerProto* to indicate whether this Container is a new/increased/decreased container. The only thing I am not sure about is whether we can still change the AllocateResponseProto now that ContainerResourceIncrease/Decrease is already in trunk. bq. Any obviously invalid change-requests should be rejected right-away. For e.g, an increase to more than cluster's max container size. It seemed like you were suggesting that we ignore the invalid requests. Agreed that any invalid increase requests from AM to RM, and invalid decrease requests from AM to NM, should be directly rejected. The 'ignore' case I was referring to is in the context of NodeUpdate from NM to RM. bq. Nit: In the design doc, the high-level flow for container-increase point #7 incorrectly talks about decrease instead of increase. Yes, this is a mistake, and I will correct it. bq. I propose we do this in a branch Definitely.
There is already a YARN-1197 branch, and we can simply work in that branch. *To [~leftnoteasy]:* bq. Actually the approach in design doc is this (Meng plz let me know if I misunderstood). In scheduler's implementation, it allows only one pending change request for same container, later change-request will either overwrite prior one or be rejected. The current design only allows one increase request in the whole system, which is guaranteed by the ContainerResourceIncreaseExpirer object. However, as explained above, we cannot block a decrease action while an increase action is still in progress. bq. 1) For the protocols between servers/AMs, mostly the same as the previous doc, the biggest change I can see is the ContainerResourceChangeProto in NodeHeartbeatResponseProto, which makes sense to me. Yes, the ContainerResourceChangeProto is the biggest change. Glad that you agree with this new protocol :-) bq. 2) For the client side change: 2.2.1, +1 to option 3. Great. I will remove option 1 and option 2 from the design doc. bq. 3) For 2.3.3.2 scheduling part, {{The scheduling of an outstanding resource increase request to a container will be skipped if there are either:}}. Both of the two may not be needed, since the AM can ask for more resource while a container increase is in progress (e.g. the container increased to 4G, and the AM wants it to be 6G before notifying the NM).
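To summarize the one-change-in-progress point above in code form: the bookkeeping can only live on the RM, because after the RM grants an increase token the NM learns nothing until the AM presents that token. A hypothetical sketch of such RM-side tracking (all names invented):
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of "at most one pending change per container"
// bookkeeping on the RM side. As discussed above, an NM-side guard
// cannot work: the NM is unaware of a granted-but-unused increase
// token, so it would happily accept a concurrent decrease.
class PendingChangeTracker {
  enum Change { INCREASE, DECREASE }

  private final ConcurrentMap<String, Change> pending =
      new ConcurrentHashMap<>();

  /** Returns true if no other change was in flight for this container. */
  boolean tryStart(String containerId, Change change) {
    return pending.putIfAbsent(containerId, change) == null;
  }

  void finish(String containerId) {
    pending.remove(containerId);
  }
}
{code}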
[jira] [Commented] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed
[ https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561130#comment-14561130 ] Alan Burlison commented on YARN-3066: - Yes, that's a good point about not every platform having libhadoop. Solaris for example has the syscall but not the executable, so in that case it's a better solution to use the syscall but that's not always going to be the case. Hadoop leaves orphaned tasks running after job is killed Key: YARN-3066 URL: https://issues.apache.org/jira/browse/YARN-3066 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1 Reporter: Dmitry Sivachenko When spawning user task, node manager checks for setsid(1) utility and spawns task program via it. See hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java for instance: String exec = Shell.isSetsidAvailable? exec setsid : exec; FreeBSD, unlike Linux, does not have setsid(1) utility. So plain exec is used to spawn user task. If that task spawns other external programs (this is common case if a task program is a shell script) and user kills job via mapred job -kill Job, these child processes remain running. 1) Why do you silently ignore the absence of setsid(1) and spawn task process via exec: this is the guarantee to have orphaned processes when job is prematurely killed. 2) FreeBSD has a replacement third-party program called ssid (which does almost the same as Linux's setsid). It would be nice to detect which binary is present during configure stage and put @SETSID@ macros into java file to use the correct name. I propose to make Shell.isSetsidAvailable test more strict and fail to start if it is not found: at least we will know about the problem at start rather than guess why there are orphaned tasks running forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561256#comment-14561256 ] Rohith commented on YARN-3585: -- bq. Could you instrument logs in the state store code to verify the leveldb database is indeed being closed even when it hangs? Sorry, I did not get exactly what logs I should add, and where. Do you mean I should add a log after {{NMLeveldbStateStoreService#closeStorage()}} is called? NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread:
{noformat}
"DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
"leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
"VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 nid=0x29ed runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 nid=0x29ee runnable
"Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 nid=0x29ef runnable
"Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
"Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
"Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
"Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
"Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
"VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and jni leveldb thread stack
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed
[ https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561234#comment-14561234 ] Dmitry Sivachenko commented on YARN-3066: - Solaris can use the same ssid program (it is just a simple wrapper for the setsid() syscall). I just proposed the simplest fix for that problem. A JNI wrapper sounds like a better approach. What I want to see in any case is a loud error message when the setsid binary (or the setsid() syscall, if we go the JNI way) is unavailable. Right now it pretends to work, and I spent some time digging out what was going wrong and why I saw a lot of orphans. Hadoop leaves orphaned tasks running after job is killed Key: YARN-3066 URL: https://issues.apache.org/jira/browse/YARN-3066 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1 Reporter: Dmitry Sivachenko When spawning user task, node manager checks for setsid(1) utility and spawns task program via it. See hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java for instance: String exec = Shell.isSetsidAvailable? exec setsid : exec; FreeBSD, unlike Linux, does not have setsid(1) utility. So plain exec is used to spawn user task. If that task spawns other external programs (this is common case if a task program is a shell script) and user kills job via mapred job -kill Job, these child processes remain running. 1) Why do you silently ignore the absence of setsid(1) and spawn task process via exec: this is the guarantee to have orphaned processes when job is prematurely killed. 2) FreeBSD has a replacement third-party program called ssid (which does almost the same as Linux's setsid). It would be nice to detect which binary is present during configure stage and put @SETSID@ macros into java file to use the correct name. I propose to make Shell.isSetsidAvailable test more strict and fail to start if it is not found: at least we will know about the problem at start rather than guess why there are orphaned tasks running forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Moved] (YARN-3724) Native compilation on Solaris fails on Yarn due to use of FTS
[ https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison moved HADOOP-11952 to YARN-3724: -- Assignee: (was: Alan Burlison) Target Version/s: (was: 2.8.0) Key: YARN-3724 (was: HADOOP-11952) Project: Hadoop YARN (was: Hadoop Common) Native compilation on Solaris fails on Yarn due to use of FTS - Key: YARN-3724 URL: https://issues.apache.org/jira/browse/YARN-3724 Project: Hadoop YARN Issue Type: Bug Environment: Solaris 11.2 Reporter: Malcolm Kavalsky Original Estimate: 24h Remaining Estimate: 24h Compiling the Yarn Node Manager results in fts not found. On Solaris we have an alternative ftw with similar functionality. This is isolated to a single file container-executor.c Note that this will just fix the compilation error. A more serious issue is that Solaris does not support cgroups as Linux does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed
[ https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3066: External issue ID: Bug 21156330 - Solaris should provide a setsid(1) command to run a command in a new session 21156330 is the Solaris bug which covers adding a setsid command-line utility to Solaris Hadoop leaves orphaned tasks running after job is killed Key: YARN-3066 URL: https://issues.apache.org/jira/browse/YARN-3066 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1 Reporter: Dmitry Sivachenko When spawning user task, node manager checks for setsid(1) utility and spawns task program via it. See hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java for instance: String exec = Shell.isSetsidAvailable? exec setsid : exec; FreeBSD, unlike Linux, does not have setsid(1) utility. So plain exec is used to spawn user task. If that task spawns other external programs (this is common case if a task program is a shell script) and user kills job via mapred job -kill Job, these child processes remain running. 1) Why do you silently ignore the absence of setsid(1) and spawn task process via exec: this is the guarantee to have orphaned processes when job is prematurely killed. 2) FreeBSD has a replacement third-party program called ssid (which does almost the same as Linux's setsid). It would be nice to detect which binary is present during configure stage and put @SETSID@ macros into java file to use the correct name. I propose to make Shell.isSetsidAvailable test more strict and fail to start if it is not found: at least we will know about the problem at start rather than guess why there are orphaned tasks running forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3724) Native compilation on Solaris fails on Yarn due to use of FTS
[ https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3724: Issue Type: Sub-task (was: Bug) Parent: YARN-3719 Native compilation on Solaris fails on Yarn due to use of FTS - Key: YARN-3724 URL: https://issues.apache.org/jira/browse/YARN-3724 Project: Hadoop YARN Issue Type: Sub-task Environment: Solaris 11.2 Reporter: Malcolm Kavalsky Original Estimate: 24h Remaining Estimate: 24h Compiling the Yarn Node Manager results in fts not found. On Solaris we have an alternative ftw with similar functionality. This is isolated to a single file container-executor.c Note that this will just fix the compilation error. A more serious issue is that Solaris does not support cgroups as Linux does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3700) ATS Web Performance issue at load time when large number of jobs
[ https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561268#comment-14561268 ] Zhijie Shen commented on YARN-3700: --- Almost good to me, two nits: 1. getAllApplications -> getApplications? 2. Can we use -1 or 0 instead of Long.MAX_VALUE to indicate appsNum not provided?
{code}
appsNum == Long.MAX_VALUE ? this.maxLoadedApplications : appsNum
{code}
ATS Web Performance issue at load time when large number of jobs Key: YARN-3700 URL: https://issues.apache.org/jira/browse/YARN-3700 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch Currently, we will load all the apps when we try to load the yarn timelineservice web page. If we have large number of jobs, it will be very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3700) ATS Web Performance issue at load time when large number of jobs
[ https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3700: Attachment: YARN-3700.4.patch ATS Web Performance issue at load time when large number of jobs Key: YARN-3700 URL: https://issues.apache.org/jira/browse/YARN-3700 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch Currently, we will load all the apps when we try to load the yarn timelineservice web page. If we have large number of jobs, it will be very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3700) ATS Web Performance issue at load time when large number of jobs
[ https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561576#comment-14561576 ] Xuan Gong commented on YARN-3700: - bq. getAllApplications -> getApplications? Done bq. Can we use -1 or 0 instead of Long.MAX_VALUE to indicate appsNum not provided? We cannot: the default value of GetApplicationsRequest#getLimit() is Long.MAX_VALUE. Also, tested the patch locally:
* Set yarn.timeline-service.generic-application-history.max-applications as 1
* Run two MR pi examples
* Go to http://localhost:8188/applicationhistory/apps and http://localhost:8188/ws/v1/applicationhistory/apps. Both of them show only one application, which is the latest application
* http://localhost:8188/applicationhistory/apps?apps.num=2 and http://localhost:8188/ws/v1/applicationhistory/apps?limit=2. Both of them show two applications
ATS Web Performance issue at load time when large number of jobs Key: YARN-3700 URL: https://issues.apache.org/jira/browse/YARN-3700 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch Currently, we will load all the apps when we try to load the yarn timelineservice web page. If we have large number of jobs, it will be very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
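Spelling out the sentinel argument from the exchange above: since GetApplicationsRequest#getLimit() already defaults to Long.MAX_VALUE, that value is the only marker that reliably means "caller did not set a limit". A sketch of the resulting check (class and variable names illustrative):
{code}
// Sketch of the sentinel logic discussed above. Long.MAX_VALUE is the
// existing default of GetApplicationsRequest#getLimit(), so -1 or 0
// cannot serve as a "not provided" marker without breaking callers.
final class AppsLimit {
  static long effectiveLimit(long requestedLimit, long maxLoadedApplications) {
    return requestedLimit == Long.MAX_VALUE
        ? maxLoadedApplications // no explicit limit: apply the configured cap
        : requestedLimit;       // caller asked for a specific number
  }
}
{code}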
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561573#comment-14561573 ] Wangda Tan commented on YARN-1197: -- [~mding]. For the comparison of resources, I think for both increase/decrease, it should be >= or <= (respectively) for all dimensions. But if the resource calculator is the default one, increasing v-cores makes no sense. So I think the ResourceCalculator has to be used, but we also need to check all individual dimensions. So the logic will be:
{code}
if (increase):
  delta = target - now
  if delta.mem < 0 || delta.vcore < 0:
    throw exception
  if resourceCalculator.lessOrEqualThan(delta, 0):
    throw exception
  // .. move forward
{code}
Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197_Design.pdf, mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
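Rendered as plain Java against the Resource records, the same check might look like the following sketch (not the committed validation; the class name is invented, and a real version would also consult the configured ResourceCalculator as Wangda suggests):
{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch of the per-dimension check above: for an increase, no
// dimension may shrink, and the delta must be non-trivial overall.
// Illustrative only, not the committed validation logic.
final class ResizeValidator {
  static void validateIncrease(Resource now, Resource target) {
    int deltaMem = target.getMemory() - now.getMemory();
    int deltaVcores = target.getVirtualCores() - now.getVirtualCores();
    if (deltaMem < 0 || deltaVcores < 0) {
      throw new IllegalArgumentException(
          "increase must not shrink any dimension");
    }
    if (deltaMem == 0 && deltaVcores == 0) {
      throw new IllegalArgumentException("no-op change request");
    }
  }
}
{code}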
[jira] [Commented] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561590#comment-14561590 ] Wangda Tan commented on YARN-3581: -- [~Naganarasimha], Thanks for the update, the latest patch looks good. And I think it's better to add this to 2.7.1 as well, to keep people from using the option. We will not remove these options in 2.8, but we should let people know about the risk. Wangda Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-3581.20150525-1.patch, YARN-3581.20150528-1.patch In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3581: - Target Version/s: 2.8.0, 2.7.1 (was: 2.8.0) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-3581.20150525-1.patch, YARN-3581.20150528-1.patch In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561644#comment-14561644 ] Hadoop QA commented on YARN-3581: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 29s | The applied patch generated 6 new checkstyle issues (total was 40, now 42). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 44s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 52s | Tests passed in hadoop-yarn-client. | | | | 42m 29s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12735661/YARN-3581.20150528-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c46d4ba | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8102/artifact/patchprocess/diffcheckstylehadoop-yarn-client.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8102/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8102/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8102/console | This message was automatically generated. Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-3581.20150525-1.patch, YARN-3581.20150528-1.patch In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3723) Need to clearly document primaryFilter and otherInfo value type
[ https://issues.apache.org/jira/browse/YARN-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3723: -- Attachment: YARN-3723.1.patch Add some description about the value type as well as fix a minor format issue in the document. Need to clearly document primaryFilter and otherInfo value type --- Key: YARN-3723 URL: https://issues.apache.org/jira/browse/YARN-3723 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-3723.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561641#comment-14561641 ] Varun Saxena commented on YARN-3051: In the API designed in the patch, there are a few things I wanted to discuss.
# We can either return a single timeline entity for a flow ID (having aggregated metric values) or multiple entities indicating multiple flow runs for a flow ID. I have included an API for the former as of now. I think there can be use cases for both though. [~vrushalic], did hRaven have the facility for both kinds of queries? I mean, is there a known use case?
# Do we plan to include additional info in the user table which can be used for filtering user level entities? Could not think of any use case, but just for flexibility I have added filters in the API {{getUserEntities}}.
# I have included an API to query flow information based on the appid. As of now I return the flow to which the app belongs (including multiple runs) instead of the flow run it belongs to. Which is the more viable scenario? Or do we need to support both?
# In the HBase schema design, there are 2 flow summary tables, aggregated daily and weekly respectively. So to limit the number of metric records or to see metrics in a specific time window, I have added metric start and metric end timestamps in the API design. But if metrics are aggregated daily and weekly, we won't be able to get something like the value of a specific metric for a flow from, say, Thursday 4 pm to Friday 9 am. [~vrushalic], can you confirm? If this is so, a timestamp doesn't make much sense. Dates can be specified instead.
# Will there be queue table(s) in addition to user table(s)? If yes, how will queue data be aggregated? Based on entity type? I may need an additional API for queues then.
# The doubt I have regarding flow version will anyway be addressed by YARN-3699
[Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
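To anchor the discussion above, the two flow-query shapes in point 1 could surface as separate reader methods; a purely hypothetical interface sketch (every name and type here is invented, filters elided):
{code}
import java.util.List;

// Hypothetical reader shapes for the two query styles debated above:
// an aggregated per-flow summary vs. the individual runs of a flow.
class FlowSummary { long runCount; long firstRunTime; long lastRunTime; long megabyteMillis; }
class FlowRun { String runId; long startTime; String version; }

interface FlowReader {
  /** One record per flow, metrics aggregated across its runs. */
  FlowSummary getFlowSummary(String cluster, String user, String flowId,
      long dayStartMs, long dayEndMs);

  /** Individual runs of a flow, newest first, capped by limit. */
  List<FlowRun> getFlowRuns(String cluster, String user, String flowId,
      int limit, String flowVersion);

  /** Resolve an application id to the single flow run that owns it. */
  FlowRun getFlowRunForApp(String cluster, String appId);
}
{code}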
[jira] [Commented] (YARN-3569) YarnClient.getAllQueues returns a list of queues that do not display running apps.
[ https://issues.apache.org/jira/browse/YARN-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561580#comment-14561580 ] Hadoop QA commented on YARN-3569: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 17s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 47s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 34s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 50s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 57s | Tests failed in hadoop-yarn-client. | | | | 46m 51s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.client.TestResourceTrackerOnHA | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12731637/YARN-3569.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c46d4ba | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8101/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8101/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8101/console | This message was automatically generated. YarnClient.getAllQueues returns a list of queues that do not display running apps. -- Key: YARN-3569 URL: https://issues.apache.org/jira/browse/YARN-3569 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.8.0 Reporter: Spandan Dutta Assignee: Spandan Dutta Attachments: YARN-3569.patch YarnClient.getAllQueues() returns a list of queues. If we pick a queue from this list and call getApplications on it, we always get an empty list even-though applications are running on that queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561613#comment-14561613 ] Jason Lowe commented on YARN-3585: -- Yes, the idea is to show whether we successfully closed the database or not when the problem occurs. Sorry I wasn't clear on that. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread:
{noformat}
"DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
"leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
"VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 nid=0x29ed runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 nid=0x29ee runnable
"Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 nid=0x29ef runnable
"Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
"Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
"Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
"Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
"Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
"VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and jni leveldb thread stack
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
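Concretely, the instrumentation asked for above amounts to log lines on both sides of the close; a sketch of the kind of thing meant (placement, logger, and helper are illustrative, not the actual NMLeveldbStateStoreService code):
{code}
import java.io.Closeable;
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the requested instrumentation: log before and after closing
// the leveldb store so a hung shutdown reveals whether close() returned.
final class StateStoreCloseTracer {
  private static final Logger LOG =
      LoggerFactory.getLogger(StateStoreCloseTracer.class);

  static void closeWithTrace(Closeable db) throws IOException {
    LOG.info("Closing NM state store database");
    if (db != null) {
      db.close();
    }
    LOG.info("NM state store database closed");
  }
}
{code}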
[jira] [Updated] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3603: -- Attachment: 0001-YARN-3603.patch Uploading an initial version of the patch.
* Container ID is shown only for Running containers in the App Attempt page. Change the column name to Running Container ID.
* AM Container shows the container link when the attempt is running, else shows the container ID in plain text. Here we can change the label to AM Container Link when the AM is running, and AM Container ID when the AM is finished or killed.
* AM Container logs are shown in the App page but not in the app attempt page. An entry is added for the same, as AM Container Logs.
Application Attempts page confusing --- Key: YARN-3603 URL: https://issues.apache.org/jira/browse/YARN-3603 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Thomas Graves Assignee: Sunil G Attachments: 0001-YARN-3603.patch The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing on what is going on. I think the table of containers there is for only Running containers and when the app is completed or killed it's empty. The table should have a label on it stating so. Also the AM Container field is a link when running but not when it's killed. That might be confusing. There is no link to the logs in this page but there is in the app attempt table when looking at http://rm:8088/cluster/app/application_1431101480046_0003 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561659#comment-14561659 ] Varun Saxena commented on YARN-3411: In ATSv1, we consider the timestamp when entity is added to backend store in addition to entity creation time. This is used while filtering out entities during querying. I cannot see this being captured specifically in this patch. It can be easily added to Column Family info. [~zjshen], [~sjlee0], do we need to add this info ? Zhijie, for this, any specific use case you know of in ATSv1 ? [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Fix For: YARN-2928 Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411-YARN-2928.001.patch, YARN-3411-YARN-2928.002.patch, YARN-3411-YARN-2928.003.patch, YARN-3411-YARN-2928.004.patch, YARN-3411-YARN-2928.005.patch, YARN-3411-YARN-2928.006.patch, YARN-3411-YARN-2928.007.patch, YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt, YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.7.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
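If the write-time timestamp were captured, one low-cost option in a native HBase schema is to stamp the cells with the insertion time while keeping creation time as ordinary column data; a sketch under that assumption (family and qualifier names invented, HBase 1.x client API assumed):
{code}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: record "time written" as the explicit HBase cell
// timestamp at insertion, keeping the entity's creation time as an
// ordinary column value. Column names here are invented.
final class EntityPuts {
  static Put entityPut(byte[] rowKey, long createdTimeMs,
      byte[] entityBytes) {
    long insertedTimeMs = System.currentTimeMillis();
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("created"),
        insertedTimeMs, Bytes.toBytes(createdTimeMs));
    put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("entity"),
        insertedTimeMs, entityBytes);
    return put;
  }
}
{code}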
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561770#comment-14561770 ] Vrushali C commented on YARN-3051: -- Hi Varun, Good points.. My answers inline. bq. We can either return a single timeline entity for a flow ID (having aggregated metric values) or multiple entities indicating multiple flow runs for a flow ID. I have included an API for the former as of now. I think there can be use cases for both though. Vrushali C, did hRaven have the facility for both kinds of queries? I mean, is there a known use case? Yes, there are use cases for both. hRaven has APIs for both types of calls; they are named differently though. The /flow endpoint in hRaven will return multiple flow runs (limited by filters). The /summary will return aggregated values for all the runs of that flow in that time range filter. Let me give an example (a hadoop sleep job for simplicity). Say user janedoe runs a hadoop sleep job 3 times today and has run it 5 times yesterday and say 6 times on one day about a month back. Now, we may want to see two different things:
#1 summarized stats for flow “Sleep job” invoked between last 2 days: It would say this flow was run 8 times, first was at timestamp X, last run was at timestamp Y, it took up a total of N megabytemillis, had a total of M containers across all runs, etc etc. It tells us how much of the cluster capacity a particular flow from a particular user is taking up.
#2 List of flow runs: Will show us details about each flow run. If we say limit = 3 in the query parameters, it would return the latest 3 runs of this flow. If we say limit = 100, it would return all the runs in this particular case (including the ones from a month back). If we pass in flowVersion=XXYYZZ, then it would return the list of flows that match this version.
For the initial development, I think we may want to work on #2 first (return the list of flow runs). The summary API will need aggregated tables which we can add later on; we could file a jira for that, my 2c. bq. Do we plan to include additional info in the user table which can be used for filtering user level entities? Could not think of any use case but just for flexibility I have added filters in the API getUserEntities. I haven’t looked at the code in detail, but as such, for user level entities, we would want a time range, a limit on the number of records returned, a flow name filter, and a cluster name filter. bq. I have included an API to query flow information based on the appid. As of now I return the flow to which the app belongs (including multiple runs) instead of the flow run it belongs to. Which is the more viable scenario? Or do we need to support both? An app id can belong to exactly one flow run. App id is the hadoop yarn application id, which should be unique on the cluster. Given an app id, we should be able to look up the exact flow run and return just that. The equivalent API in hRaven is /jobFlow. bq. But if metrics are aggregated daily and weekly, we won't be able to get something like the value of a specific metric for a flow from say Thursday 4 pm to Friday 9 am. Vrushali C, can you confirm? If this is so, a timestamp doesn't make much sense. Dates can be specified instead. The thinking is to split the querying across tables. We would query both the daily summary table for the complete day details and the regular flow tables for the details like those of Thursday 4 pm to Friday 9 am. But this does mean aggregating on the query side. So, I think, for starters, we could start off by allowing Date boundaries. We can enhance the API to accept finer timestamps later. bq. Will there be queue table(s) in addition to user table(s)? If yes, how will queue data be aggregated? Based on entity type? I may need an additional API for queues then. Yes, we would need a queue based aggregation table. Right now, those details are to be worked out. So perhaps we can leave aside the queue based APIs (or file a different jira to handle queue based APIs). Hope this helps. I can give you more examples if you would like to get more details or have any other questions. I will also look at the patch this week. Also, we should ensure we use the same classes/methods for key-related (flow keys, row keys) construction and parsing across reader APIs and writer APIs, else they will diverge. thanks Vrushali [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments:
[jira] [Commented] (YARN-3647) RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object
[ https://issues.apache.org/jira/browse/YARN-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561792#comment-14561792 ] Hudson commented on YARN-3647: -- FAILURE: Integrated in Hadoop-trunk-Commit #7909 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7909/]) YARN-3647. RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object. (Sunil G via wangda) (wangda: rev ec0a852a37d5c91a62d3d0ff3ddbd9d58235b312) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodeLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object --- Key: YARN-3647 URL: https://issues.apache.org/jira/browse/YARN-3647 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.8.0 Attachments: 0001-YARN-3647.patch, 0002-YARN-3647.patch After YARN-3579, RMWebServices apis can use the updated version of apis in CommonNodeLabelsManager which gives full NodeLabel object instead of creating NodeLabel object from plain label name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
Zhijie Shen created YARN-3725: - Summary: App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
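The inconvenience described above is the round trip a REST client would need: decode the token string, set the service, re-encode. A sketch using the standard Token helpers (the identifier type parameter is left generic; error handling elided):
{code}
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch of the round trip called inconvenient above: deserialize the
// URL-safe token string from the REST response, patch in the service
// address on the client side, and serialize it again.
final class TimelineDtPatcher {
  static String withService(String encodedToken, String host, int port)
      throws IOException {
    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    token.decodeFromUrlString(encodedToken);       // decode
    token.setService(new Text(host + ":" + port)); // client sets service
    return token.encodeToUrlString();              // re-encode
  }
}
{code}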
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561710#comment-14561710 ] MENG DING commented on YARN-1197: - [~leftnoteasy] Makes sense to me. Will update the doc to include this. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197_Design.pdf, mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561790#comment-14561790 ] Hudson commented on YARN-3626: -- FAILURE: Integrated in Hadoop-trunk-Commit #7909 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7909/]) YARN-3626. On Windows localized resources are not moved to the front of the classpath when they should be. Contributed by Craig Welch. (cnauroth: rev 4102e5882e17b75507ae5cf8b8979485b3e24cbc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/util/MRApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.11.patch, YARN-3626.14.patch, YARN-3626.15.patch, YARN-3626.16.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting, the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that, localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561795#comment-14561795 ] Zhijie Shen commented on YARN-3725: --- I'm proposing to do the following: 1. Short term fix for 2.7.1: check whether the service address in the timeline DT is empty. If it is, we fall back to using the configured service address. This will make app submission via the REST API work in secure mode without additional DT processing work, unless users really want to renew the DT from somewhere other than the configured address. That shouldn't be common, as we usually set up only one timeline server per YARN cluster. 2. Long term fix: we can do something similar to HDFS-6904. Let the client pass in the service address, and set the token's service address on the server side before serializing it into a string. And this problem is not limited to the ATS: the RM REST API doesn't set the service address for the RM DT either. It's better to seek a common solution. For example, we could fix DelegationTokenAuthenticationHandler to make all use cases of the hadoop http auth component set the service address properly. One step further, even RPC protocols may have a similar problem. For example, if we work with ApplicationClientProtocol directly, we should get an RM DT without a service address (correct me if I'm wrong). Thoughts? App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
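A minimal sketch of the short-term fallback in option 1 above (the class and method are illustrative, not the attached patch):
{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: if the DT carries no service address (e.g. it came back
// through the REST API), fall back to the configured timeline address.
class TimelineTokenAddressFallback {
  static <T extends TokenIdentifier> InetSocketAddress getServiceAddress(
      Token<T> timelineDT, Configuration conf) {
    Text service = timelineDT.getService();
    if (service == null || service.toString().isEmpty()) {
      return conf.getSocketAddr(
          YarnConfiguration.TIMELINE_SERVICE_ADDRESS,
          YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ADDRESS,
          YarnConfiguration.DEFAULT_TIMELINE_SERVICE_PORT);
    }
    return SecurityUtil.getTokenServiceAddr(timelineDT);
  }
}
{code}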
[jira] [Commented] (YARN-3700) ATS Web Performance issue at load time when large number of jobs
[ https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561651#comment-14561651 ] Zhijie Shen commented on YARN-3700: --- +1, the last patch LGTM. Will commit it. ATS Web Performance issue at load time when large number of jobs Key: YARN-3700 URL: https://issues.apache.org/jira/browse/YARN-3700 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch Currently, we will load all the apps when we try to load the yarn timelineservice web page. If we have a large number of jobs, it will be very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561682#comment-14561682 ] Zhijie Shen commented on YARN-3411: --- Yeah, in v1 there's a starttime for each entity, which indicates when the entity started to exist. This value is used in multiple places. For example, when we query entities, the matched entities are sorted by this timestamp before being returned. Also, in v1 the retention granularity is at the entity level: we check whether the starttime of an entity is beyond the TTL and, if so, discard the entity and its events. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Fix For: YARN-2928 Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411-YARN-2928.001.patch, YARN-3411-YARN-2928.002.patch, YARN-3411-YARN-2928.003.patch, YARN-3411-YARN-2928.004.patch, YARN-3411-YARN-2928.005.patch, YARN-3411-YARN-2928.006.patch, YARN-3411-YARN-2928.007.patch, YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt, YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.7.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
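Stated as code, the v1 entity-level retention rule described above boils down to a per-entity age check (a toy sketch; the names are purely illustrative, not the actual store code):
{code}
class EntityRetention {
  // An entity whose starttime has aged past the TTL is discarded,
  // together with its events.
  static boolean isExpired(long entityStartTimeMillis, long ttlMillis) {
    return System.currentTimeMillis() - entityStartTimeMillis > ttlMillis;
  }
}
{code}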
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561739#comment-14561739 ] Varun Saxena commented on YARN-3411: [~zjshen], I was actually talking about the store insertion time, not the entity start time. If you look at {{LevelDbTimelineStore#checkStartTimeInDb}}, you will find that a store insert time (taken as the current system time) is also written, in addition to the entity start time. Please note that the store insert time and the entity start time are not the same. In ATSv1, we could specify a timestamp in a query to ignore entities that were inserted into the store after it; this is done by matching against the store insert time (which is not the same as the entity start time). So, for backward compatibility's sake, do we need to support this? If yes, I don't see it being captured by the writer implementations as of now. If there is no use case for it, though, we can drop it in ATSv2. [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Fix For: YARN-2928 Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411-YARN-2928.001.patch, YARN-3411-YARN-2928.002.patch, YARN-3411-YARN-2928.003.patch, YARN-3411-YARN-2928.004.patch, YARN-3411-YARN-2928.005.patch, YARN-3411-YARN-2928.006.patch, YARN-3411-YARN-2928.007.patch, YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt, YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.7.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
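To make the distinction in this comment concrete, here is a tiny illustrative sketch (not LevelDbTimelineStore code) of how the two timestamps play different roles:
{code}
// Illustrative only: the store insert time is stamped when the entity is
// written, while the entity start time is supplied by the entity itself;
// the query-time cutoff matches against the former.
class InsertTimeFilter {
  // Recorded once at write time, alongside the entity start time.
  static long stampInsertTime() {
    return System.currentTimeMillis();
  }

  // Entities inserted into the store after the query timestamp are ignored.
  static boolean visibleTo(long queryTimestamp, long storeInsertTime) {
    return storeInsertTime <= queryTimestamp;
  }
}
{code}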
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560470#comment-14560470 ] Rohith commented on YARN-3585: -- I tested locally with the YARN-3641 fix; the issue still exists. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, the nodemanager log shows it stopping but the process cannot exit. Non-daemon threads:
{noformat}
"DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
"leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
"VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 nid=0x29ed runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 nid=0x29ee runnable
"Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 nid=0x29ef runnable
"Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
"Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
"Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
"Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
"Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
"VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560500#comment-14560500 ] Masatake Iwasaki commented on YARN-3714: In non-HA settings, if users do not explicitly set {{yarn.resourcemanager.webapp.address}} in the configuration, {{WebAppUtils#getResolvedRMWebAppURLWithoutScheme}} returns an RM webapp address based on the value of {{yarn.resourcemanager.hostname}}, via the default value set by yarn-default.xml:
{noformat}
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>${yarn.resourcemanager.hostname}:8088</value>
</property>
{noformat}
As a result, WebAppUtils#getProxyHostsAndPortsForAmFilter can return a proper proxy address. This does not apply to {{yarn.resourcemanager.hostname._rm-id_}} in HA mode. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor The default proxy address cannot be obtained without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
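The comment above pins down why HA mode misses the default: the :8088 fallback is wired to the plain key in yarn-default.xml, not to the _rm-id_-suffixed keys. A hedged sketch of the kind of fallback that would mirror it in HA mode (class and method names are hypothetical, not the committed fix):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical helper: when yarn.resourcemanager.webapp.address.<rm-id> is
// unset, derive it from yarn.resourcemanager.hostname.<rm-id>, mirroring
// what yarn-default.xml does for the non-HA key.
class HaWebAppAddressFallback {
  static String getWebAppAddressForRM(Configuration conf, String rmId) {
    String addr = conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS + "." + rmId);
    if (addr == null) {
      String host = conf.get(YarnConfiguration.RM_HOSTNAME + "." + rmId);
      if (host != null) {
        addr = host + ":" + YarnConfiguration.DEFAULT_RM_WEBAPP_PORT;
      }
    }
    return addr;
  }
}
{code}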
[jira] [Commented] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560518#comment-14560518 ] Varun Saxena commented on YARN-3489: [~leftnoteasy], sorry, I missed your comment... Will have a look. RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3489-branch-2.7.02.patch, YARN-3489-branch-2.7.03.patch, YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, YARN-3489.03.patch Since the label support was added, we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g. a large cluster with lots of varied locality in the requests), then it will get the queue info for each request. Since we build the queue info each time, this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
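The shape of the improvement the issue asks for, as a sketch (signatures are illustrative, not the committed patch): fetch the queue info once, outside the loop, and reuse it for every request.
{code}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.YarnScheduler;

class ValidateOnce {
  static void validateResourceRequests(YarnScheduler scheduler,
      String queueName, List<ResourceRequest> requests,
      Resource maximumResource) throws IOException {
    // Build the queue info once; the queue doesn't change between requests.
    QueueInfo queueInfo = scheduler.getQueueInfo(queueName, false, false);
    for (ResourceRequest request : requests) {
      validateResourceRequest(request, maximumResource, queueInfo);
    }
  }

  private static void validateResourceRequest(ResourceRequest request,
      Resource maximumResource, QueueInfo queueInfo) {
    // Stand-in for SchedulerUtils.validateResourceRequest taking a
    // pre-fetched QueueInfo; the real signature may differ. Per-request
    // checks (label validity, capability bounds, ...) would go here.
  }
}
{code}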
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560485#comment-14560485 ] Varun Vasudev commented on YARN-3678: - Is this in secure or non-secure mode? DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose a container finishes and then does cleanup; the PID file still exists and will trigger signalContainer once. This kills the process with the pid from the PID file, but as the container has already finished, that PID may now be occupied by another process, which can cause serious issues. As far as I know, my NM was killed unexpectedly, and what I described could be the cause, even though it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562083#comment-14562083 ] Robert Kanter commented on YARN-3528: - [~brahmareddy] are you still planning on working on this? Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible for scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep for 12345 shows up many places in the test suite where this practice has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through port scanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
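As a concrete illustration of the dynamic allocation the issue asks for (a generic sketch, not tied to any particular Hadoop test utility):
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Generic sketch: let the OS pick an ephemeral port, then hand that port
// to the service under test instead of hard-coding 12345.
class FreePortFinder {
  static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }
}
{code}
Note that a small race remains between closing the probe socket and the service binding the port, so where the service supports it, binding directly to port 0 and reading back the assigned address is preferable.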
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562180#comment-14562180 ] gu-chi commented on YARN-3678: -- I opened a pull request for this: https://github.com/apache/hadoop/pull/20/ DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical Suppose a container finishes and then does cleanup; the PID file still exists and will trigger signalContainer once. This kills the process with the pid from the PID file, but as the container has already finished, that PID may now be occupied by another process, which can cause serious issues. As far as I know, my NM was killed unexpectedly, and what I described could be the cause, even though it occurs rarely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
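For illustration only (this is not the linked pull request), one possible style of guard re-checks on Linux that the pid recorded in the PID file still looks like the container's process before signaling it, so a recycled pid belonging to an unrelated process is left alone. The helper below is hypothetical:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical guard, Linux-only: before signaling the pid read from the
// PID file, check that its /proc cmdline still mentions the container id.
// A recycled pid belonging to another process will fail this check.
class PidReuseGuard {
  static boolean looksLikeContainerProcess(String pid, String containerId)
      throws IOException {
    Path cmdline = Paths.get("/proc", pid, "cmdline");
    if (!Files.exists(cmdline)) {
      return false; // the process is already gone
    }
    String cmd = new String(Files.readAllBytes(cmdline)).replace('\0', ' ');
    return cmd.contains(containerId);
  }
}
{code}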
[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3727: Attachment: YARN-3727.000.patch For better error recovery, check if the directory exists before using it for localization. -- Key: YARN-3727 URL: https://issues.apache.org/jira/browse/YARN-3727 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3727.000.patch For better error recovery, check if the directory exists before using it for localization. We saw the following localization failure happen due to existing cache directories:
{code}
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory //8/yarn/nm/usercache//filecache/21637
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs:///X/libjars/1234.jar(-//8/yarn/nm/usercache//filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
{code}
The real cause of this failure may be a disk failure, a LevelDB operation failure in {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else. I wonder whether we can add error recovery code to avoid the localization failure by not using the existing cache directories for localization. The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted:
{code}
try {
  ...
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
    files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{code}
Since the conflicting local directory will be deleted after the localization failure, I think it would be better to check whether the directory exists before using it for localization, to avoid the failure in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
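A minimal sketch of the check this issue proposes (not the attached YARN-3727.000.patch): treat an already-existing destination directory as unusable and pick another, instead of letting the later rename fail.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

// Sketch only: a leftover directory at the candidate path means a previous
// localization did not clean up (disk failure, state-store mismatch, ...),
// so it should not be reused as a rename destination.
class LocalDirCheck {
  static boolean isUsableForLocalization(FileContext lfs, Path destDirPath)
      throws IOException {
    return !lfs.util().exists(destDirPath);
  }
}
{code}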
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562314#comment-14562314 ] Hadoop QA commented on YARN-3725: - \\ \\
| (/) *{color:green}+1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 35s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 18s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 3m 3s | Tests passed in hadoop-yarn-server-applicationhistoryservice. |
| | | 42m 50s | |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735786/YARN-3725.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5450413 |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8110/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8110/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8110/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8110/console |
This message was automatically generated. App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, so it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3726) Fix TestHBaseTimelineWriterImpl unit test failure by fixing its test data
Vrushali C created YARN-3726: Summary: Fix TestHBaseTimelineWriterImpl unit test failure by fixing its test data Key: YARN-3726 URL: https://issues.apache.org/jira/browse/YARN-3726 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vrushali C Assignee: Vrushali C There is a very fascinating bug that was introduced by the test data in the metrics time series check in the unit test in TestHBaseTimelineWriterImpl in YARN-3411. The unit test failure seen is:
{code}
Error Message

expected:<1> but was:<6>

Stacktrace

java.lang.AssertionError: expected:<1> but was:<6>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineWriterImpl.checkMetricsTimeseries(TestHBaseTimelineWriterImpl.java:219)
	at org.apache.hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineWriterImpl.testWriteEntityToHBase(TestHBaseTimelineWriterImpl.java:204)
{code}
The test data had 6 timestamps that belonged to 22nd April 2015. When the patch in YARN-3411 was submitted and tested by Hadoop QA on May 19th, the unit test was working fine. Fast forward a few more days and the test started failing. There has been no relevant code change or package version change in the interim. The change that is triggering the unit test failure is the passage of time. The reason for the test failure is that the metrics time series data lives in a column family which has a TTL set to 30 days. Metrics time series data was written to the mini HBase cluster with cell timestamps set to April 22nd. Based on the column family configuration, HBase started deleting the data that was older than 30 days, and the test started failing. The last value is retained, hence one value is fetched from HBase. Will submit a patch with the test case fixed shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
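One obvious fix direction, sketched below (the exact offsets are illustrative, not the promised patch): derive the test metric timestamps from the current time so they can never age past the column family's 30-day TTL, no matter when the test runs.
{code}
class MetricTestTimestamps {
  // Timestamps anchored to "now" stay well inside the 30-day TTL window.
  static long[] recentTimestamps() {
    long now = System.currentTimeMillis();
    return new long[] {
        now, now - 1000L, now - 2000L, now - 3000L, now - 4000L, now - 5000L
    };
  }
}
{code}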