[jira] [Commented] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.
[ https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890681#comment-16890681 ] Bibin A Chundatt commented on YARN-9690: [~Babbleshack] Looks like the AM is trying to connect to the RM. As per the configuration mentioned in the following document, [Reference|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html], the AM should connect to the *AMRMProxy* in the NodeManager:

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8049</value>
  <description>Redirects jobs to the Node Manager’s AMRMProxy port.</description>
</property>

This is a client-side property in the case of MapReduce applications. > Invalid AMRM token when distributed scheduling is enabled. > -- > > Key: YARN-9690 > URL: https://issues.apache.org/jira/browse/YARN-9690 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-scheduling, yarn >Affects Versions: 2.9.2, 3.1.2 > Environment: OS: Ubuntu 18.04 > JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03 >Reporter: Babble Shack >Priority: Major > Attachments: applicationlog, yarn-site.xml > > > Applications fail to start due to an invalid AMRM token from the application attempt. > I have tested this with 0/100% opportunistic maps and the same issue occurs > regardless.
> {code:xml}
> <configuration>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.resourcemanager.address</name><value>yarn-master-0.yarn-service.yarn:8032</value></property>
>   <property><name>yarn.resourcemanager.scheduler.address</name><value>0.0.0.0:8049</value></property>
>   <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.opportunistic-containers-max-queue-length</name><value>10</value></property>
>   <property><name>yarn.nodemanager.distributed-scheduling.enabled</name><value>true</value></property>
>   <property><name>yarn.webapp.ui2.enable</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.resource-tracker.address</name><value>yarn-master-0.yarn-service.yarn:8031</value></property>
>   <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.nodemanager.resource.memory-mb</name><value>7168</value></property>
>   <property><name>yarn.scheduler.minimum-allocation-mb</name><value>3584</value></property>
>   <property><name>yarn.scheduler.maximum-allocation-mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.resource.mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx5734m</value></property>
>   <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
> </configuration>
> {code}
> Relevant logs: > {code:java} > 2019-07-22 14:56:37,104 INFO [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the > mappers will be scheduled using OPPORTUNISTIC containers > 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: > Connecting to ResourceManager at > yarn-master-0.yarn-service.yarn/10.244.1.134:8030 > 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken from appattempt_1563805140414_0002_02 > 2019-07-22 14:56:37,152 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while > registering > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563805140414_0002_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
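For reference, the client-side override Bibin A Chundatt describes above can also be applied per MapReduce job, roughly as in the sketch below (an illustration only; the address value assumes the default AMRMProxy port 8049 on the local NodeManager, per the referenced documentation):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class OpportunisticJobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side: point the AM at the NodeManager's AMRMProxy instead of
    // the RM scheduler address, so distributed/opportunistic scheduling works.
    conf.set("yarn.resourcemanager.scheduler.address", "localhost:8049");
    Job job = Job.getInstance(conf, "opportunistic-example");
    // ... configure mapper/reducer/paths as usual, then job.submit().
  }
}
{code}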
[jira] [Commented] (YARN-9691) canceling upgrade does not work if an upgrade-failed container exists
[ https://issues.apache.org/jira/browse/YARN-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890645#comment-16890645 ] Hadoop QA commented on YARN-9691: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 3 new + 47 unchanged - 0 fixed = 50 total (was 47) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 27s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 17m 41s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 66m 1s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.0 Server=19.03.0 Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9691 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12975449/YARN-9691.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux f683ad20e1f9 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ee87e9a | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24415/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24415/testReport/ | | Max.
[jira] [Updated] (YARN-9692) ContainerAllocationExpirer is misspelled
[ https://issues.apache.org/jira/browse/YARN-9692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhou wu updated YARN-9692: - Attachment: YARN-9692.001.patch > ContainerAllocationExpirer is misspelled > - > > Key: YARN-9692 > URL: https://issues.apache.org/jira/browse/YARN-9692 > Project: Hadoop YARN > Issue Type: Bug >Reporter: runzhou wu >Assignee: runzhou wu >Priority: Trivial > Attachments: YARN-9692.001.patch > > > The class ContainerAllocationExpirer is misspelled. > I think it should be changed to ContainerAllocationExpired -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9692) ContainerAllocationExpirer is misspelled
[ https://issues.apache.org/jira/browse/YARN-9692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890641#comment-16890641 ] runzhou wu commented on YARN-9692: -- The fully qualified name is org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer > ContainerAllocationExpirer is misspelled > - > > Key: YARN-9692 > URL: https://issues.apache.org/jira/browse/YARN-9692 > Project: Hadoop YARN > Issue Type: Bug >Reporter: runzhou wu >Assignee: runzhou wu >Priority: Trivial > > The class ContainerAllocationExpirer is misspelled. > I think it should be changed to ContainerAllocationExpired -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9692) ContainerAllocationExpirer is misspelled
runzhou wu created YARN-9692: Summary: ContainerAllocationExpirer is misspelled Key: YARN-9692 URL: https://issues.apache.org/jira/browse/YARN-9692 Project: Hadoop YARN Issue Type: Bug Reporter: runzhou wu Assignee: runzhou wu The class ContainerAllocationExpirer is misspelled. I think it should be changed to ContainerAllocationExpired -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9691) canceling upgrade does not work if an upgrade-failed container exists
kyungwan nam created YARN-9691: -- Summary: canceling upgrade does not work if an upgrade-failed container exists Key: YARN-9691 URL: https://issues.apache.org/jira/browse/YARN-9691 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam If a container fails to upgrade during a YARN service upgrade, the container is released and transitions to the FAILED_UPGRADE state. I expected that it could then be rolled back to the previous version using cancel-upgrade, but it did not work. At that time, the AM log was as follows: {code} # failed to upgrade container_e62_1563179597798_0006_01_08 2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO service.ClientAMService - Upgrade container container_e62_1563179597798_0006_01_08 2019-07-16 18:21:55,153 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from NEEDS_UPGRADE -> UPGRADING 2019-07-16 18:21:55,154 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] Transitioned from READY to UPGRADING on UPGRADE event 2019-07-16 18:21:55,154 [pool-5-thread-4] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08]: Deleting registry path /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08 2019-07-16 18:21:55,156 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 1.0.1 : Creating dir on hdfs: hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - reInitializing container container_e62_1563179597798_0006_01_08 with version 1.0.1 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.AbstractLauncher - yarn docker env var has been set {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM, WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=} 2019-07-16 18:21:55,158 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for Container container_e62_1563179597798_0006_01_08 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from UPGRADING -> RUNNING_BUT_UNREADY 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] retrieve status after 30 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to REINITIALIZED on START event 2019-07-16 18:22:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:22:37,797 [pool-7-thread-1] INFO 
monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:23:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:23:08,225 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from RUNNING_BUT_UNREADY -> FAILED_UPGRADE # request canceling upgrade 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_04 true 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_03 true 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container
[jira] [Commented] (YARN-2497) Fair scheduler should support strict node labels
[ https://issues.apache.org/jira/browse/YARN-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890573#comment-16890573 ] Yufei Gu commented on YARN-2497: Hi [~chenzhaohang], AFAIK, FS doesn't support node labels in any version. > Fair scheduler should support strict node labels > > > Key: YARN-2497 > URL: https://issues.apache.org/jira/browse/YARN-2497 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Wangda Tan >Assignee: Daniel Templeton >Priority: Major > Attachments: YARN-2497.001.patch, YARN-2497.002.patch, > YARN-2497.003.patch, YARN-2497.004.patch, YARN-2497.005.patch, > YARN-2497.006.patch, YARN-2497.007.patch, YARN-2497.008.patch, > YARN-2497.009.patch, YARN-2497.010.patch, YARN-2497.011.patch, > YARN-2497.branch-3.0.001.patch, YARN-2499.WIP01.patch > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9537) Add configuration to disable AM preemption
[ https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu reassigned YARN-9537: -- Assignee: zhoukang > Add configuration to disable AM preemption > -- > > Key: YARN-9537 > URL: https://issues.apache.org/jira/browse/YARN-9537 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.2.0, 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9537.001.patch > > > In this issue, I will add a configuration to support disabling AM preemption. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9537) Add configuration to disable AM preemption
[ https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890572#comment-16890572 ] Yufei Gu commented on YARN-9537: Hi [~cane], I added you as a contributor and assigned this to you. Will you still work on this? > Add configuration to disable AM preemption > -- > > Key: YARN-9537 > URL: https://issues.apache.org/jira/browse/YARN-9537 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.2.0, 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9537.001.patch > > > In this issue, I will add a configuration to support disabling AM preemption. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
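Since the JIRA only states the intent, a hypothetical sketch of how such a switch might be consumed is shown below; the property key here is illustrative only, not the actual key introduced by the patch:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class AmPreemptionConfSketch {
  // Hypothetical key for illustration only.
  static final String AM_PREEMPTION_KEY = "yarn.scheduler.fair.am-preemption.enabled";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(AM_PREEMPTION_KEY, false);
    // A scheduler honoring this switch would skip AM containers when
    // selecting preemption candidates.
    boolean amPreemption = conf.getBoolean(AM_PREEMPTION_KEY, true);
    System.out.println("AM preemption enabled: " + amPreemption);
  }
}
{code}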
[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.
[ https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890503#comment-16890503 ] Jim Brennan commented on YARN-9647: --- [~ebadger], [~eyang], [~magnum] I think I'm following the discussion and I agree with the problem analysis. {quote}It's slightly more nuanced than this. If the lists don't match the container still could've failed because of an invalid mount. Basically if we get an invalid mount error then we need to figure out whether that invalid mount was in the original allowed-mounts lists in container-executor.cfg. If it was, then the error message should indicate a bad disk. Otherwise, the usual invalid mount error message should be fine. {quote} Do we need to maintain two lists? check_mount_permitted() is already returning -1 in the case where the normalize_mount fails for the mount_src before even checking if it is permitted. If the disk is bad, I think this is where it will fail. I don't think we'll get to the point of checking whether it is permitted. Maybe we just need to change this error message:
{noformat}
fprintf(ERRORFILE, "Invalid docker mount '%s', realpath=%s\n", values[i], mount_src);
{noformat}
to
{noformat}
fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n", mount_src, values[i]);
{noformat}
Even better, pull the normalizing of mount_src out of check_mount_permitted and do it separately:
{noformat}
char *normalized_path = normalize_mount(mount_src, 0);
if (normalized_path == NULL) {
  fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n",
          mount_src, values[i]);
  ret = INVALID_DOCKER_MOUNT;
  goto free_and_exit;
}
permitted_rw = check_mount_permitted((const char **) permitted_rw_mounts, normalized_path);
permitted_ro = check_mount_permitted((const char **) permitted_ro_mounts, normalized_path);
{noformat}
For paths coming from the NM (local dirs / log dirs), it should have already checked to ensure bad ones aren't in the list. > Docker launch fails when local-dirs or log-dirs is unhealthy. > - > > Key: YARN-9647 > URL: https://issues.apache.org/jira/browse/YARN-9647 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.2 >Reporter: KWON BYUNGCHANG >Priority: Major > Attachments: YARN-9647.001.patch, YARN-9647.002.patch > > > my /etc/hadoop/conf/container-executor.cfg > {code} > [docker] >docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local >docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local > {code} > if /data2 is unhealthy, docker launch fails although container can use > /data1 as local-dir, log-dir > error message is below > {code} > [2019-06-25 14:55:26.168]Exception from container-launch. Container id: > container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: > Launch container failed Shell error output: Could not determine real path of > mount '/data2/hadoop/yarn/local' Could not determine real path of mount > '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk > Error constructing docker command, docker error code=16, error message='Mount > access error' Shell output: main : command provided 4 main : run as user is > magnum main : requested yarn user is magnum Creating script paths... Creating > local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit > code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code > 29.
> {code} > root cause is that normalize_mounts() in docker-util.c return -1 because it > cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is > disk fault at this point) > however disk of nm local dirs and nm log dirs can fail at any time. > docker launch should succeed if there are available local dirs and log dirs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.
[ https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890443#comment-16890443 ] Eric Badger commented on YARN-9647: --- bq. We can resolve this error by keeping track of the original container-executor.cfg, and normalized list. When two lists are not matching, container-executor can provide a different error message that container failed to launch due to unhealthy disk rather than continuing. It's slightly more nuanced than this. If the lists don't match the container still could've failed because of an invalid mount. Basically if we get an invalid mount error then we need to figure out whether that invalid mount was in the original allowed-mounts lists in container-executor.cfg. If it was, then the error message should indicate a bad disk. Otherwise, the usual invalid mount error message should be fine. But as long as the logic isn't too complicated, I'm ok with this > Docker launch fails when local-dirs or log-dirs is unhealthy. > - > > Key: YARN-9647 > URL: https://issues.apache.org/jira/browse/YARN-9647 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.2 >Reporter: KWON BYUNGCHANG >Priority: Major > Attachments: YARN-9647.001.patch, YARN-9647.002.patch > > > my /etc/hadoop/conf/container-executor.cfg > {code} > [docker] >docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local >docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local > {code} > if /data2 is unhealthy, docker launch fails although container can use > /data1 as local-dir, log-dir > error message is below > {code} > [2019-06-25 14:55:26.168]Exception from container-launch. Container id: > container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: > Launch container failed Shell error output: Could not determine real path of > mount '/data2/hadoop/yarn/local' Could not determine real path of mount > '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk > Error constructing docker command, docker error code=16, error message='Mount > access error' Shell output: main : command provided 4 main : run as user is > magnum main : requested yarn user is magnum Creating script paths... Creating > local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit > code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code > 29. > {code} > root cause is that normalize_mounts() in docker-util.c return -1 because it > cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is > disk fault at this point) > however disk of nm local dirs and nm log dirs can fail at any time. > docker launch should succeed if there are available local dirs and log dirs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.
[ https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890435#comment-16890435 ] Eric Yang commented on YARN-9647: - [~ebadger], I think the approach taken is ok. We want to filter bad disks out of the allowed mounts to guard both user-defined and system-suggested mount points. The difficult part is to identify whether a mount path is user specified or system suggested. In the .cmd file, both user-specified and system-suggested paths are listed together. There is no easy way to rotate to a different disk unless the node manager relaunches the container with another set of workdir paths. [~magnum], I think [~ebadger] is also right that this patch may produce a misleading error message when a bad disk occurs. We can resolve this by keeping track of both the original container-executor.cfg list and the normalized list. When the two lists do not match, container-executor can report that the container failed to launch due to an unhealthy disk rather than continuing. Would this work? > Docker launch fails when local-dirs or log-dirs is unhealthy. > - > > Key: YARN-9647 > URL: https://issues.apache.org/jira/browse/YARN-9647 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.2 >Reporter: KWON BYUNGCHANG >Priority: Major > Attachments: YARN-9647.001.patch, YARN-9647.002.patch > > > my /etc/hadoop/conf/container-executor.cfg > {code} > [docker] >docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local >docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local > {code} > if /data2 is unhealthy, docker launch fails although container can use > /data1 as local-dir, log-dir > error message is below > {code} > [2019-06-25 14:55:26.168]Exception from container-launch. Container id: > container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: > Launch container failed Shell error output: Could not determine real path of > mount '/data2/hadoop/yarn/local' Could not determine real path of mount > '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk > Error constructing docker command, docker error code=16, error message='Mount > access error' Shell output: main : command provided 4 main : run as user is > magnum main : requested yarn user is magnum Creating script paths... Creating > local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit > code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code > 29. > {code} > root cause is that normalize_mounts() in docker-util.c return -1 because it > cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is > disk fault at this point) > however disk of nm local dirs and nm log dirs can fail at any time. > docker launch should succeed if there are available local dirs and log dirs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9689) Router does not support Kerberos proxy when in secure mode
[ https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890402#comment-16890402 ] Botong Huang commented on YARN-9689: +[~giovanni.fumarola] for help > Router does not support Kerberos proxy when in secure mode > -- > > Key: YARN-9689 > URL: https://issues.apache.org/jira/browse/YARN-9689 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Priority: Major > > When we enable Kerberos in YARN federation mode, we cannot get a new application since > it throws the Kerberos exception below, which should be handled! > {code:java} > 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > 2019-07-22,18:43:25,528 WARN > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: > Unable to create a new ApplicationId in SubCluster xxx > java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed > on local exception: java.io.IOException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Failed to find any Kerberos tgt)] > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564) > at org.apache.hadoop.ipc.Client.call(Client.java:1506) > at org.apache.hadoop.ipc.Client.call(Client.java:1416) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source) > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252) > at > 
org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691) > Caused by:
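The missing piece the stack trace points at is impersonation of the end user when the Router calls the sub-cluster RM. A minimal sketch of that pattern using the standard UserGroupInformation API is shown below (an illustration only; the surrounding interceptor code is simplified):

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public final class RouterProxyUserSketch {
  // Run an RM call as the end user, proxied by the Router's own
  // Kerberos login identity (obtained from its keytab).
  public static <T> T callAsUser(String user, PrivilegedExceptionAction<T> action)
      throws Exception {
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser(user, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(action);
  }
}
{code}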
[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.
[ https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890394#comment-16890394 ] Eric Badger commented on YARN-9647: --- [~eyang], [~Jim_Brennan], [~billie.rinaldi], any ideas on how to fix this in a clean way? > Docker launch fails when local-dirs or log-dirs is unhealthy. > - > > Key: YARN-9647 > URL: https://issues.apache.org/jira/browse/YARN-9647 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.2 >Reporter: KWON BYUNGCHANG >Priority: Major > Attachments: YARN-9647.001.patch, YARN-9647.002.patch > > > my /etc/hadoop/conf/container-executor.cfg > {code} > [docker] >docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local >docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local > {code} > if /data2 is unhealthy, docker launch fails although container can use > /data1 as local-dir, log-dir > error message is below > {code} > [2019-06-25 14:55:26.168]Exception from container-launch. Container id: > container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: > Launch container failed Shell error output: Could not determine real path of > mount '/data2/hadoop/yarn/local' Could not determine real path of mount > '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk > Error constructing docker command, docker error code=16, error message='Mount > access error' Shell output: main : command provided 4 main : run as user is > magnum main : requested yarn user is magnum Creating script paths... Creating > local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit > code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code > 29. > {code} > root cause is that normalize_mounts() in docker-util.c return -1 because it > cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is > disk fault at this point) > however disk of nm local dirs and nm log dirs can fail at any time. > docker launch should succeed if there are available local dirs and log dirs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.
[ https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890393#comment-16890393 ] Eric Badger commented on YARN-9647: --- [~magnum], thanks for the explanation. I understand what you mean since {{docker.allowed.[ro,rw]-mounts}} will always be parsed and if either is bad then all launches will fail. However, some errors might get confusing with your proposed approach. For example, the user may set bind-mounts or there may be some defined mounts for all containers. Those could be hard-coded in confs (or by users' jobs) and then, once the container is launched, the container will get an invalid docker mount message even though the mount is in the allowed list. It would be nice to be able to not fail on bad disks in the allowed lists, but also have good logging when the container fails due to a bad disk. Simply ignoring the bad disks in the allowed list gives you a misleading error message if the container attempts to use those disks. > Docker launch fails when local-dirs or log-dirs is unhealthy. > - > > Key: YARN-9647 > URL: https://issues.apache.org/jira/browse/YARN-9647 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.2 >Reporter: KWON BYUNGCHANG >Priority: Major > Attachments: YARN-9647.001.patch, YARN-9647.002.patch > > > my /etc/hadoop/conf/container-executor.cfg > {code} > [docker] >docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local >docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local > {code} > if /data2 is unhealthy, docker launch fails although container can use > /data1 as local-dir, log-dir > error message is below > {code} > [2019-06-25 14:55:26.168]Exception from container-launch. Container id: > container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: > Launch container failed Shell error output: Could not determine real path of > mount '/data2/hadoop/yarn/local' Could not determine real path of mount > '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk > Error constructing docker command, docker error code=16, error message='Mount > access error' Shell output: main : command provided 4 main : run as user is > magnum main : requested yarn user is magnum Creating script paths... Creating > local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit > code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code > 29. > {code} > root cause is that normalize_mounts() in docker-util.c return -1 because it > cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is > disk fault at this point) > however disk of nm local dirs and nm log dirs can fail at any time. > docker launch should succeed if there are available local dirs and log dirs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests
[ https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890368#comment-16890368 ] Zoltan Siegl commented on YARN-5106: Uploaded patches for branch-3.1 and branch-3.2. > Provide a builder interface for FairScheduler allocations for use in tests > -- > > Key: YARN-5106 > URL: https://issues.apache.org/jira/browse/YARN-5106 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Zoltan Siegl >Priority: Major > Labels: newbie++ > Attachments: YARN-5106-branch-3.1.001.patch, > YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, > YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, > YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch > > > Most, if not all, fair scheduler tests create an allocations XML file. Having > a helper class that potentially uses a builder would make the tests cleaner. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
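As an illustration of the kind of helper the JIRA asks for, a minimal builder sketch is shown below; the class and method names are hypothetical and may differ from the actual patch:

{code:java}
// Hypothetical fluent builder that assembles a fair-scheduler
// allocations XML string for use in tests.
public class AllocationFileBuilder {
  private final StringBuilder xml = new StringBuilder("<allocations>\n");

  public AllocationFileBuilder queue(String name, String minResources,
      String maxResources) {
    xml.append("  <queue name=\"").append(name).append("\">\n")
       .append("    <minResources>").append(minResources).append("</minResources>\n")
       .append("    <maxResources>").append(maxResources).append("</maxResources>\n")
       .append("  </queue>\n");
    return this;
  }

  public String build() {
    return xml.toString() + "</allocations>\n";
  }
}
{code}

A test would then write the build() output to the allocation file path instead of hand-assembling the XML in every test.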
[jira] [Updated] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests
[ https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl updated YARN-5106: --- Attachment: YARN-5106-branch-3.2.001.patch > Provide a builder interface for FairScheduler allocations for use in tests > -- > > Key: YARN-5106 > URL: https://issues.apache.org/jira/browse/YARN-5106 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Zoltan Siegl >Priority: Major > Labels: newbie++ > Attachments: YARN-5106-branch-3.1.001.patch, > YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, > YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, > YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch > > > Most, if not all, fair scheduler tests create an allocations XML file. Having > a helper class that potentially uses a builder would make the tests cleaner. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests
[ https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl updated YARN-5106: --- Attachment: YARN-5106-branch-3.1.001.patch > Provide a builder interface for FairScheduler allocations for use in tests > -- > > Key: YARN-5106 > URL: https://issues.apache.org/jira/browse/YARN-5106 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Zoltan Siegl >Priority: Major > Labels: newbie++ > Attachments: YARN-5106-branch-3.1.001.patch, > YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, > YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, > YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch > > > Most, if not all, fair scheduler tests create an allocations XML file. Having > a helper class that potentially uses a builder would make the tests cleaner. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup
[ https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890350#comment-16890350 ] Jonathan Hung commented on YARN-9668: - Thanks Haibo! Committed to branch-3.2, branch-3.1, branch-3.0 as well. > UGI conf doesn't read user overridden configurations on RM and NM startup > - > > Key: YARN-9668 > URL: https://issues.apache.org/jira/browse/YARN-9668 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0 >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.3, 3.2.2 > > Attachments: YARN-9668-branch-2.001.patch, > YARN-9668-branch-2.002.patch, YARN-9668-branch-3.2.001.patch, > YARN-9668.001.patch, YARN-9668.002.patch, YARN-9668.003.patch > > > Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not > passed to the UGI conf on RM or NM startup. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
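The general shape of the fix (a sketch, not the exact patch) is to hand the fully parsed startup configuration to the UGI before any security decisions are made:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class UgiConfSketch {
  public static void main(String[] args) {
    // A daemon would build this from its parsed startup options,
    // including any -D/-conf overrides.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos"); // example override
    // Without this call, UGI lazily initializes from a default
    // Configuration and never sees the overridden values.
    UserGroupInformation.setConfiguration(conf);
  }
}
{code}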
[jira] [Updated] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup
[ https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-9668: Fix Version/s: 3.2.2 3.1.3 3.0.4 > UGI conf doesn't read user overridden configurations on RM and NM startup > - > > Key: YARN-9668 > URL: https://issues.apache.org/jira/browse/YARN-9668 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0 >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.3, 3.2.2 > > Attachments: YARN-9668-branch-2.001.patch, > YARN-9668-branch-2.002.patch, YARN-9668-branch-3.2.001.patch, > YARN-9668.001.patch, YARN-9668.002.patch, YARN-9668.003.patch > > > Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not > passed to the UGI conf on RM or NM startup. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890322#comment-16890322 ] Sunil Govindan commented on YARN-9687: -- Hi [~Tao Yang] Thanks for reporting this issue. Yes, we have seen this in a few places where such cases can occur given the combination of resource values. *fitsIn* helps in such areas (we already fixed a few in the preemption modules). +1 for this patch. > Queue headroom check may let unacceptable allocation off when using > DominantResourceCalculator > -- > > Key: YARN-9687 > URL: https://issues.apache.org/jira/browse/YARN-9687 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9687.001.patch > > > Currently queue headroom check in {{RegularContainerAllocator#checkHeadroom}} > is using {{Resources#greaterThanOrEqual}} which internally compare resources > by ratio, when using DominantResourceCalculator, it may let unacceptable > allocations off in some scenarios. > For example: > cluster-resource=<10GB, 10 vcores> > queue-headroom=<2GB, 4 vcores> > required-resource=<3GB, 1 vcores> > In this way, headroom ratio(0.4) is greater than the required ratio(0.3), so > that allocations will be let off in scheduling process but will always be > rejected when committing these proposals. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
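To make the failure mode concrete, the sketch below re-implements the two checks on the numbers from the description (simplified stand-ins, not the actual Hadoop classes):

{code:java}
public class HeadroomCheckSketch {
  // Dominant share of a <memory, vcores> pair against the cluster total.
  static double dominantShare(double mem, double vcores,
      double clusterMem, double clusterVcores) {
    return Math.max(mem / clusterMem, vcores / clusterVcores);
  }

  public static void main(String[] args) {
    double cMem = 10 * 1024, cVcores = 10; // cluster <10GB, 10 vcores>
    double hMem = 2 * 1024,  hVcores = 4;  // headroom <2GB, 4 vcores>
    double rMem = 3 * 1024,  rVcores = 1;  // required <3GB, 1 vcore>

    // Ratio-based check: max(0.2, 0.4) = 0.4 >= max(0.3, 0.1) = 0.3,
    // so it wrongly lets the allocation through.
    boolean ratioCheck = dominantShare(hMem, hVcores, cMem, cVcores)
        >= dominantShare(rMem, rVcores, cMem, cVcores);

    // Component-wise fitsIn-style check: 3GB > 2GB, so it correctly fails.
    boolean fitsIn = rMem <= hMem && rVcores <= hVcores;

    System.out.println("ratio check passes: " + ratioCheck); // true
    System.out.println("fitsIn passes: " + fitsIn);          // false
  }
}
{code}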
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890284#comment-16890284 ] Muhammad Samir Khan commented on YARN-9596: --- CSQueueUtils#updateUsedCapacity is called before getMaxAvailableResourceToQueuePartition. So any checks for correct partition should be in CSQueueUtils#updateQueueStatistics so that it captures both the methods. > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, > YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890273#comment-16890273 ] Eric Yang commented on YARN-9562: - [~ebadger] Thank you for the patch. Can we create manifestJson as a json file in src/test/resources, and use TestImageTagToManifestPlugin.class.getResource("manifest.json"); to retrieve the json content, please? This might be easier to manage in the long run. Thanks > Add Java changes for the new RuncContainerRuntime > - > > Key: YARN-9562 > URL: https://issues.apache.org/jira/browse/YARN-9562 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9562.001.patch, YARN-9562.002.patch > > > This JIRA will be used to add the Java changes for the new > RuncContainerRuntime. This will work off of YARN-9560 to use much of the > existing DockerLinuxContainerRuntime code once it is moved up into an > abstract class that can be extended. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
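A sketch of the suggested pattern is shown below; the file name and test class come from the comment above, while the reading helper is just one common way to do it (readAllBytes needs Java 9+; older code would use an IOUtils equivalent):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ManifestResourceSketch {
  // Assumes manifest.json is placed under src/test/resources in the
  // package of the test class, so it resolves on the test classpath.
  static String readManifest() throws IOException {
    try (InputStream in =
        ManifestResourceSketch.class.getResourceAsStream("manifest.json")) {
      if (in == null) {
        throw new IOException("manifest.json not found on the test classpath");
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }
}
{code}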
[jira] [Commented] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.
[ https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890254#comment-16890254 ] Babble Shack commented on YARN-9690: I enabled the additional configuration `yarn.nodemanager.amrmproxy.enabled` and set `yarn.resourcemanager.scheduler.address` to `0.0.0.0:8049`. However, I still get the same issue; in particular, there is an exception while registering the application master because of an invalid AMRM token. > Invalid AMRM token when distributed scheduling is enabled. > -- > > Key: YARN-9690 > URL: https://issues.apache.org/jira/browse/YARN-9690 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-scheduling, yarn >Affects Versions: 2.9.2, 3.1.2 > Environment: OS: Ubuntu 18.04 > JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03 >Reporter: Babble Shack >Priority: Major > Attachments: applicationlog, yarn-site.xml > > > Applications fail to start due to an invalid AMRM token from the application attempt. > I have tested this with 0/100% opportunistic maps and the same issue occurs > regardless.
> {code:xml}
> <configuration>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.resourcemanager.address</name><value>yarn-master-0.yarn-service.yarn:8032</value></property>
>   <property><name>yarn.resourcemanager.scheduler.address</name><value>0.0.0.0:8049</value></property>
>   <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.opportunistic-containers-max-queue-length</name><value>10</value></property>
>   <property><name>yarn.nodemanager.distributed-scheduling.enabled</name><value>true</value></property>
>   <property><name>yarn.webapp.ui2.enable</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.resource-tracker.address</name><value>yarn-master-0.yarn-service.yarn:8031</value></property>
>   <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.nodemanager.resource.memory-mb</name><value>7168</value></property>
>   <property><name>yarn.scheduler.minimum-allocation-mb</name><value>3584</value></property>
>   <property><name>yarn.scheduler.maximum-allocation-mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.resource.mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx5734m</value></property>
>   <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
> </configuration>
> {code}
> Relevant logs: > {code:java} > 2019-07-22 14:56:37,104 INFO [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the > mappers will be scheduled using OPPORTUNISTIC containers > 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: > Connecting to ResourceManager at > yarn-master-0.yarn-service.yarn/10.244.1.134:8030 > 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken from appattempt_1563805140414_0002_02 > 2019-07-22 14:56:37,152 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while > registering > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563805140414_0002_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) 
> at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) >
[jira] [Updated] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.
[ https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Babble Shack updated YARN-9690: --- Component/s: yarn > Invalid AMRM token when distributed scheduling is enabled. > -- > > Key: YARN-9690 > URL: https://issues.apache.org/jira/browse/YARN-9690 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-scheduling, yarn >Affects Versions: 2.9.2, 3.1.2 > Environment: OS: Ubuntu 18.04 > JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03 >Reporter: Babble Shack >Priority: Major > Attachments: applicationlog, yarn-site.xml > > > Applications fail to start due to an invalid AMRM token from the application attempt. > I have tested this with 0/100% opportunistic maps and the same issue occurs > regardless.
> {code:xml}
> <configuration>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.resourcemanager.address</name><value>yarn-master-0.yarn-service.yarn:8032</value></property>
>   <property><name>yarn.resourcemanager.scheduler.address</name><value>0.0.0.0:8049</value></property>
>   <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.opportunistic-containers-max-queue-length</name><value>10</value></property>
>   <property><name>yarn.nodemanager.distributed-scheduling.enabled</name><value>true</value></property>
>   <property><name>yarn.webapp.ui2.enable</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.resource-tracker.address</name><value>yarn-master-0.yarn-service.yarn:8031</value></property>
>   <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.nodemanager.resource.memory-mb</name><value>7168</value></property>
>   <property><name>yarn.scheduler.minimum-allocation-mb</name><value>3584</value></property>
>   <property><name>yarn.scheduler.maximum-allocation-mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.resource.mb</name><value>7168</value></property>
>   <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx5734m</value></property>
>   <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
>   <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
> </configuration>
> {code}
> Relevant logs: > {code:java} > 2019-07-22 14:56:37,104 INFO [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the > mappers will be scheduled using OPPORTUNISTIC containers > 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: > Connecting to ResourceManager at > yarn-master-0.yarn-service.yarn/10.244.1.134:8030 > 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken from appattempt_1563805140414_0002_02 > 2019-07-22 14:56:37,152 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while > registering > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563805140414_0002_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at >
[jira] [Created] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.
Babble Shack created YARN-9690:
----------------------------------

             Summary: Invalid AMRM token when distributed scheduling is enabled.
                 Key: YARN-9690
                 URL: https://issues.apache.org/jira/browse/YARN-9690
             Project: Hadoop YARN
          Issue Type: Bug
          Components: distributed-scheduling
    Affects Versions: 3.1.2, 2.9.2
         Environment: OS: Ubuntu 18.04
                      JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
            Reporter: Babble Shack
         Attachments: applicationlog, yarn-site.xml


Applications fail to start due to an invalid AMRMToken from the application attempt.
I have tested this with both 0% and 100% opportunistic maps and the same issue occurs regardless.
{code:java}
<!-- ... -->
<configuration>
  <!-- mapreduce -->
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.resourcemanager.address</name><value>yarn-master-0.yarn-service.yarn:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>0.0.0.0:8049</value></property>
  <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.opportunistic-containers-max-queue-length</name><value>10</value></property>
  <property><name>yarn.nodemanager.distributed-scheduling.enabled</name><value>true</value></property>
  <property><name>yarn.webapp.ui2.enable</name><value>true</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>yarn-master-0.yarn-service.yarn:8031</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>7168</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>3584</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>7168</value></property>
  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>7168</value></property>
  <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx5734m</value></property>
  <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
  <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
  <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
</configuration>
{code}
Relevant logs:
{code:java}
2019-07-22 14:56:37,104 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the mappers will be scheduled using OPPORTUNISTIC containers
2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at yarn-master-0.yarn-service.yarn/10.244.1.134:8030
2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1563805140414_0002_02
2019-07-22 14:56:37,152 ERROR [main] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while registering
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1563805140414_0002_02
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy82.registerApplicationMaster(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:160)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:121)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:274)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
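One detail worth calling out in the log above: the AM resolves the scheduler address to the ResourceManager itself (yarn-master-0.yarn-service.yarn/10.244.1.134:8030) rather than to a local AMRMProxy. For distributed scheduling the AM is expected to register through the NodeManager's AMRMProxy. The sketch below shows roughly what the client-side configuration would look like; it is a hedged sketch, not a confirmed fix for this issue. The property names are the standard Hadoop ones as I understand them (verify against your release), and the 8049 port and localhost value are illustrative, chosen to match the port already present in the reporter's config.

{code:title=yarn-site.xml (client side, sketch)|borderStyle=solid}
<!-- Sketch only: enable the AMRMProxy on every NodeManager and point
     AM scheduler traffic at it instead of the RM scheduler port. -->
<property>
  <name>yarn.nodemanager.amrmproxy.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.distributed-scheduling.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- On the client/AM side this must resolve to the local AMRMProxy,
       not to the ResourceManager. -->
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8049</value>
</property>
{code}

Note that the YARN-6514 config quoted later in this digest does enable the NodeManager AMRMProxy explicitly, while the config attached here does not, which may be a relevant difference.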
[jira] [Comment Edited] (YARN-6514) Fail to launch container when distributed scheduling is enabled
[ https://issues.apache.org/jira/browse/YARN-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890202#comment-16890202 ]

Babble Shack edited comment on YARN-6514 at 7/22/19 2:35 PM:
-------------------------------------------------------------

I am experiencing the same issue in 2.9.2 and 3.1.2. [~lvzheng], did you ever get this resolved?

was (Author: babbleshack):
I am experiencing the same issue. [~lvzheng], did you ever get this resolved?

> Fail to launch container when distributed scheduling is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-6514
>                 URL: https://issues.apache.org/jira/browse/YARN-6514
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-scheduling, yarn
>    Affects Versions: 3.0.0-alpha2
>         Environment: Ubuntu Linux 4.4.0-72-generic with java-8-openjdk-amd64 1.8.0_121
>            Reporter: Zheng Lv
>            Priority: Major
>
> When yarn.nodemanager.distributed-scheduling.enabled is set to true, mapreduce fails to launch with Invalid AMRMToken errors.
> This error does not occur when the distributed scheduling option is disabled.
> {code:title=yarn-site.xml|borderStyle=solid}
> <configuration>
>   <property><name>yarn.resourcemanager.hostname</name><value>h3master</value></property>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.nodemanager.env-whitelist</name><value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value></property>
>   <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
>   <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
>   <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.opportunistic-containers-max-queue-length</name><value>10</value></property>
>   <property><name>yarn.nodemanager.distributed-scheduling.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.amrmproxy.enable</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.resource.memory-mb</name><value>4096</value></property>
> </configuration>
> {code}
> {code:title=Container Log|borderStyle=solid}
> 2017-04-23 05:17:50,324 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1492953411349_0001_02
> 2017-04-23 05:17:51,625 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster:
> /
> [system properties]
> os.name: Linux
> os.version: 4.4.0-72-generic
> java.home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> java.runtime.version: 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13
> java.vendor: Oracle Corporation
> java.version: 1.8.0_121
> java.vm.name: OpenJDK 64-Bit Server VM
> java.class.path:
[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890170#comment-16890170 ]

Weiwei Yang commented on YARN-9687:
-----------------------------------

This is related to the core resource calculator, so it would be good to have [~sunilg] take a look too. [~sunilg], could you please help review this? Thx.

> Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-9687
>                 URL: https://issues.apache.org/jira/browse/YARN-9687
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9687.001.patch
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} uses {{Resources#greaterThanOrEqual}}, which internally compares resources by ratio. When using DominantResourceCalculator, this may let unacceptable allocations through in some scenarios.
> For example:
> cluster-resource = <10GB, 10 vcores>
> queue-headroom = <2GB, 4 vcores>
> required-resource = <3GB, 1 vcore>
> Here the headroom's dominant ratio (4/10 = 0.4) is greater than the request's dominant ratio (3/10 = 0.3), so the allocation passes the headroom check during scheduling, but it will always be rejected when the proposal is committed, because the required 3GB of memory exceeds the 2GB memory headroom.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
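For readers less familiar with DominantResourceCalculator, the following self-contained sketch reproduces the mismatch described above with the numbers from the example. It is illustrative only: the class and method names are hypothetical, not the actual {{Resources}} or {{RegularContainerAllocator}} code.

{code:java}
// Illustrative sketch only -- not the Hadoop implementation.
public class HeadroomCheckDemo {
    // A resource vector: (memory in MB, vcores).
    record Resource(long memoryMb, int vcores) {}

    // Dominant share against the cluster, in the style of a
    // DominantResourceCalculator comparison: max ratio across types.
    static double dominantRatio(Resource r, Resource cluster) {
        return Math.max((double) r.memoryMb() / cluster.memoryMb(),
                        (double) r.vcores() / cluster.vcores());
    }

    // Ratio-based check (what the report says checkHeadroom effectively does).
    static boolean fitsByRatio(Resource headroom, Resource required, Resource cluster) {
        return dominantRatio(headroom, cluster) >= dominantRatio(required, cluster);
    }

    // Component-wise check (what committing the proposal effectively requires).
    static boolean fitsComponentwise(Resource headroom, Resource required) {
        return headroom.memoryMb() >= required.memoryMb()
            && headroom.vcores() >= required.vcores();
    }

    public static void main(String[] args) {
        Resource cluster  = new Resource(10 * 1024, 10); // <10GB, 10 vcores>
        Resource headroom = new Resource(2 * 1024, 4);   // <2GB, 4 vcores>
        Resource required = new Resource(3 * 1024, 1);   // <3GB, 1 vcore>

        System.out.println(fitsByRatio(headroom, required, cluster)); // true  (0.4 >= 0.3)
        System.out.println(fitsComponentwise(headroom, required));    // false (3GB > 2GB)
    }
}
{code}

The ratio comparison answers "whose dominant share is larger", while committing the proposal requires every resource dimension to fit individually, which appears to be exactly the gap described above.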
[jira] [Created] (YARN-9689) Router does not support kerberos proxy when in secure mode
zhoukang created YARN-9689:
-------------------------------

             Summary: Router does not support kerberos proxy when in secure mode
                 Key: YARN-9689
                 URL: https://issues.apache.org/jira/browse/YARN-9689
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: federation
    Affects Versions: 3.1.2
            Reporter: zhoukang


When we enable Kerberos in YARN Federation mode, we cannot get a new application, because the Router throws the Kerberos exception below. This should be handled.
{code:java}
2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2019-07-22,18:43:25,528 WARN org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: Unable to create a new ApplicationId in SubCluster xxx
java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
	at org.apache.hadoop.ipc.Client.call(Client.java:1506)
	at org.apache.hadoop.ipc.Client.call(Client.java:1416)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
	at org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
	at org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:801)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode
[ https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890069#comment-16890069 ]

zhoukang commented on YARN-9689:
--------------------------------

[~botong], could you help evaluate this? Thanks!

> Router does not support kerberos proxy when in secure mode
> -----------------------------------------------------------
>
>                 Key: YARN-9689
>                 URL: https://issues.apache.org/jira/browse/YARN-9689
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: federation
>    Affects Versions: 3.1.2
>            Reporter: zhoukang
>            Priority: Major
>
> When we enable Kerberos in YARN Federation mode, we cannot get a new application, because the Router throws the Kerberos exception below. This should be handled.
> {code:java}
> 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> 2019-07-22,18:43:25,528 WARN org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: Unable to create a new ApplicationId in SubCluster xxx
> java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1416)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> 	at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> 	at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
> 	at org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
> 	at org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
> 	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
> Caused by:
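A common pattern for this kind of problem is to have the Router authenticate with its own Kerberos login and impersonate the end user via a proxy UGI when calling the sub-cluster RM. The sketch below is illustrative only, not the actual YARN-9689 patch; the class name and the interceptor wiring are hypothetical, while the UserGroupInformation calls themselves are standard Hadoop security APIs.

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;

// Sketch: call the sub-cluster RM as a proxy user, so the Router's own
// Kerberos TGT backs the SASL handshake while the request is issued on
// behalf of the remote caller.
public class RouterProxyUserSketch {
    static GetNewApplicationResponse getNewApplication(
            ApplicationClientProtocol rmClient, String remoteUser) throws Exception {
        // The Router's Kerberos login (from its keytab) is the "real" user.
        UserGroupInformation realUser = UserGroupInformation.getLoginUser();
        // Impersonate the end user on top of the Router's credentials.
        UserGroupInformation proxyUgi =
                UserGroupInformation.createProxyUser(remoteUser, realUser);
        return proxyUgi.doAs(
                (PrivilegedExceptionAction<GetNewApplicationResponse>) () ->
                        rmClient.getNewApplication(
                                GetNewApplicationRequest.newInstance()));
    }
}
{code}

This assumes the RM side is configured to let the Router principal impersonate end users (the hadoop.proxyuser.*.hosts and hadoop.proxyuser.*.groups settings); without that, the doAs call would be rejected with an authorization error instead.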