[jira] [Commented] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.

2019-07-22 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890681#comment-16890681
 ] 

Bibin A Chundatt commented on YARN-9690:


[~Babbleshack]

Looks like the AM is trying to connect to the RM directly. As per the configuration mentioned in the following document,
[Reference|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html]
the AM should connect to the *AMRMProxy* in the NodeManager:

yarn.resourcemanager.scheduler.address = localhost:8049 (redirects jobs to the NodeManager's AMRMProxy port)

This is a client-side property in the case of a MapReduce application.
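
For reference, a minimal sketch of that split. The property names and the 8049 port come from the referenced document and this thread; hosts and ports are examples, so adjust them to your deployment:

{code:xml}
<!-- NodeManager side: enable the AMRMProxy and distributed scheduling -->
<property>
  <name>yarn.nodemanager.amrmproxy.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.distributed-scheduling.enabled</name>
  <value>true</value>
</property>

<!-- Client/job side only (not the RM's own scheduler address):
     point the scheduler address at the local AMRMProxy port -->
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8049</value>
</property>
{code}

With that in place the AM should register with the AMRMProxy on its own node rather than connecting to the RM scheduler address directly.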

> Invalid AMRM token when distributed scheduling is enabled.
> --
>
> Key: YARN-9690
> URL: https://issues.apache.org/jira/browse/YARN-9690
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 2.9.2, 3.1.2
> Environment: OS: Ubuntu 18.04
> JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
>Reporter: Babble Shack
>Priority: Major
> Attachments: applicationlog, yarn-site.xml
>
>
> Applications fail to start due to an invalid AMRM token from the application attempt.
> I have tested this with 0% and 100% opportunistic maps and the same issue occurs
> regardless.
> {code:java}
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>mapreduceyarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8032</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>0.0.0.0:8049</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.distributed-scheduling.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.webapp.ui2.enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8031</value>
>   </property>
>   <property>
>     <name>yarn.log-aggregation-enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>3584</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.resource.mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.command-opts</name>
>     <value>-Xmx5734m</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.generic-application-history.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.bind-host</name>
>     <value>0.0.0.0</value>
>   </property>
> </configuration>
> {code}
> Relevant logs:
> {code:java}
> 2019-07-22 14:56:37,104 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the 
> mappers will be scheduled using OPPORTUNISTIC containers
> 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: 
> Connecting to ResourceManager at 
> yarn-master-0.yarn-service.yarn/10.244.1.134:8030
> 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken from appattempt_1563805140414_0002_02
> 2019-07-22 14:56:37,152 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
> registering
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563805140414_0002_02
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>     at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> 

[jira] [Commented] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing

2019-07-22 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890645#comment-16890645
 ] 

Hadoop QA commented on YARN-9691:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 50s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 15s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core:
 The patch generated 3 new + 47 unchanged - 0 fixed = 50 total (was 47) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 17m 
41s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 66m  1s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.0 Server=19.03.0 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9691 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12975449/YARN-9691.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux f683ad20e1f9 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ee87e9a |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24415/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24415/testReport/ |
| Max. 

[jira] [Updated] (YARN-9692) ContainerAllocationExpirer is misspelled

2019-07-22 Thread runzhou wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhou wu updated YARN-9692:
-
Attachment: YARN-9692.001.patch

> ContainerAllocationExpirer is misspelled
> -
>
> Key: YARN-9692
> URL: https://issues.apache.org/jira/browse/YARN-9692
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: runzhou wu
>Assignee: runzhou wu
>Priority: Trivial
> Attachments: YARN-9692.001.patch
>
>
> The class ContainerAllocationExpirer is misspelled.
> I think it should be changed to ContainerAllocationExpired






[jira] [Commented] (YARN-9692) ContainerAllocationExpirer is misspelled

2019-07-22 Thread runzhou wu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890641#comment-16890641
 ] 

runzhou wu commented on YARN-9692:
--

The fully qualified name is 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer

> ContainerAllocationExpirer is misspelled
> -
>
> Key: YARN-9692
> URL: https://issues.apache.org/jira/browse/YARN-9692
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: runzhou wu
>Assignee: runzhou wu
>Priority: Trivial
>
> The class ContainerAllocationExpirer is misspelled.
> I think it should be changed to ContainerAllocationExpired






[jira] [Created] (YARN-9692) ContainerAllocationExpirer is misspelled

2019-07-22 Thread runzhou wu (JIRA)
runzhou wu created YARN-9692:


 Summary: ContainerAllocationExpirer is misspelled
 Key: YARN-9692
 URL: https://issues.apache.org/jira/browse/YARN-9692
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: runzhou wu
Assignee: runzhou wu


The class ContainerAllocationExpirer is misspelled.

I think it should be changed to ContainerAllocationExpired






[jira] [Created] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing

2019-07-22 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9691:
--

 Summary: canceling upgrade does not work if upgrade failed 
container is existing
 Key: YARN-9691
 URL: https://issues.apache.org/jira/browse/YARN-9691
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


If a container fails to upgrade during a YARN service upgrade, the container is
released and transitions to the FAILED_UPGRADE state.
After that, I expected it to be possible to go back to the previous version using
cancel-upgrade, but it didn't work.
The AM log at that time is as follows:

{code}
# failed to upgrade container_e62_1563179597798_0006_01_08

2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  
service.ClientAMService - Upgrade container 
container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,153 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from READY to 
UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  
registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08]: Deleting registry path 
/users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - 
[COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 
1.0.1 : Creating dir on hdfs: 
hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.ContainerLaunchService - reInitializing container 
container_e62_1563179597798_0006_01_08 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.AbstractLauncher - yarn docker env var has been set 
{LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, 
YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM,
 WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, 
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, 
LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, 
YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=}
2019-07-16 18:21:55,158 
[org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  
impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for 
Container container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
UPGRADING -> RUNNING_BUT_UNREADY
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] retrieve status after 30
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to 
REINITIALIZED on START event
2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:08,225 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
RUNNING_BUT_UNREADY -> FAILED_UPGRADE

# request canceling upgrade 

2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_04 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_03 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container 

[jira] [Commented] (YARN-2497) Fair scheduler should support strict node labels

2019-07-22 Thread Yufei Gu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890573#comment-16890573
 ] 

Yufei Gu commented on YARN-2497:


Hi [~chenzhaohang], AFAIK, FS doesn't support node labels in any version.

> Fair scheduler should support strict node labels
> 
>
> Key: YARN-2497
> URL: https://issues.apache.org/jira/browse/YARN-2497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Wangda Tan
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: YARN-2497.001.patch, YARN-2497.002.patch, 
> YARN-2497.003.patch, YARN-2497.004.patch, YARN-2497.005.patch, 
> YARN-2497.006.patch, YARN-2497.007.patch, YARN-2497.008.patch, 
> YARN-2497.009.patch, YARN-2497.010.patch, YARN-2497.011.patch, 
> YARN-2497.branch-3.0.001.patch, YARN-2499.WIP01.patch
>
>







[jira] [Assigned] (YARN-9537) Add configuration to disable AM preemption

2019-07-22 Thread Yufei Gu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yufei Gu reassigned YARN-9537:
--

Assignee: zhoukang

> Add configuration to disable AM preemption
> --
>
> Key: YARN-9537
> URL: https://issues.apache.org/jira/browse/YARN-9537
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0, 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9537.001.patch
>
>
> In this issue, I will add a configuration to support disabling AM preemption.






[jira] [Commented] (YARN-9537) Add configuration to disable AM preemption

2019-07-22 Thread Yufei Gu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890572#comment-16890572
 ] 

Yufei Gu commented on YARN-9537:


Hi [~cane], I added you as a contributor and assigned this to you. Will you still 
work on this?

> Add configuration to disable AM preemption
> --
>
> Key: YARN-9537
> URL: https://issues.apache.org/jira/browse/YARN-9537
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0, 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9537.001.patch
>
>
> In this issue, I will add a configuration to support disabling AM preemption.






[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890503#comment-16890503
 ] 

Jim Brennan commented on YARN-9647:
---

[~ebadger], [~eyang], [~magnum] I think I'm following the discussion and I 
agree with the problem analysis.
{quote}It's slightly more nuanced than this. If the lists don't match the 
container still could've failed because of an invalid mount. Basically if we 
get an invalid mount error then we need to figure out whether that invalid 
mount was in the original allowed-mounts lists in container-executor.cfg. If it 
was, then the error message should indicate a bad disk. Otherwise, the usual 
invalid mount error message should be fine.
{quote}
Do we need to maintain two lists? check_mount_permitted() is already returning 
-1 in the case where normalize_mount() fails for the mount_src, before even 
checking whether it is permitted. If the disk is bad, I think this is where it will 
fail, so I don't think we'll get to the point of checking whether it is permitted. 
Maybe we just need to change this error message:
{noformat}
fprintf(ERRORFILE, "Invalid docker mount '%s', realpath=%s\n", values[i], mount_src);
{noformat}
to
{noformat}
fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n", mount_src, values[i]);
{noformat}
Even better, pull the normalizing of mount_src out of check_mount_permitted and 
do it separately.
{noformat}
  char *normalized_path = normalize_mount(mount_src, 0);
  if (normalized_path == NULL) {
    fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n", mount_src, values[i]);
    ret = INVALID_DOCKER_MOUNT;
    goto free_and_exit;
  }
  permitted_rw = check_mount_permitted((const char **) permitted_rw_mounts, normalized_path);
  permitted_ro = check_mount_permitted((const char **) permitted_ro_mounts, normalized_path);
{noformat}
For paths coming from the NM (local dirs / log dirs), the NM should have already 
checked to ensure bad ones aren't in the list.

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails, although the container can use
> /data1 as a local-dir and log-dir.
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because it
> cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 has a
> disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> The docker launch should succeed as long as there are available local dirs and log dirs.






[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890443#comment-16890443
 ] 

Eric Badger commented on YARN-9647:
---

bq. We can resolve this error by keeping track of the original 
container-executor.cfg, and normalized list. When two lists are not matching, 
container-executor can provide a different error message that container failed 
to launch due to unhealthy disk rather than continuing.

It's slightly more nuanced than this. If the lists don't match the container 
still could've failed because of an invalid mount. Basically if we get an 
invalid mount error then we need to figure out whether that invalid mount was 
in the original allowed-mounts lists in container-executor.cfg. If it was, then 
the error message should indicate a bad disk. Otherwise, the usual invalid 
mount error message should be fine. 

But as long as the logic isn't too complicated, I'm ok with this

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails, although the container can use
> /data1 as a local-dir and log-dir.
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because it
> cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 has a
> disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> The docker launch should succeed as long as there are available local dirs and log dirs.






[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890435#comment-16890435
 ] 

Eric Yang commented on YARN-9647:
-

[~ebadger], I think the approach taken is ok. We want to filter bad disks out of 
the allowed mounts to guard against both user-defined and system-suggested mount 
points. The difficult part is identifying whether a mount path is user specified 
or system suggested; in the .cmd file, both kinds of paths are listed together. 
There is no easy way to rotate to a different disk unless the node manager 
relaunches the container with another set of workdir paths.

[~magnum] , I think [~ebadger] is also right that this patch may have 
misleading error message when bad disk happens.  We can resolve this error by 
keeping track of the original container-executor.cfg, and normalized list.  
When two lists are not matching, container-executor can provide a different 
error message that container failed to launch due to unhealthy disk rather than 
continuing.

Would this work?

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails, although the container can use
> /data1 as a local-dir and log-dir.
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because it
> cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 has a
> disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> The docker launch should succeed as long as there are available local dirs and log dirs.






[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode

2019-07-22 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890402#comment-16890402
 ] 

Botong Huang commented on YARN-9689:


+[~giovanni.fumarola] for help

> Router does not support kerberos proxy when in secure mode
> --
>
> Key: YARN-9689
> URL: https://issues.apache.org/jira/browse/YARN-9689
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Priority: Major
>
> When we enable Kerberos in YARN federation mode, we cannot get a new application,
> since it throws the Kerberos exception below, which should be handled!
> {code:java}
> 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2019-07-22,18:43:25,528 WARN 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: 
> Unable to create a new ApplicationId in SubCluster xxx
> java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed 
> on local exception: java.io.IOException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt)]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
> at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> at org.apache.hadoop.ipc.Client.call(Client.java:1416)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
> Caused by: 

[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890394#comment-16890394
 ] 

Eric Badger commented on YARN-9647:
---

[~eyang], [~Jim_Brennan], [~billie.rinaldi], any ideas on how to fix this in a 
clean way? 

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails, although the container can use
> /data1 as a local-dir and log-dir.
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because it
> cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 has a
> disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> The docker launch should succeed as long as there are available local dirs and log dirs.






[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890393#comment-16890393
 ] 

Eric Badger commented on YARN-9647:
---

[~magnum], thanks for the explanation. I understand what you mean, since 
{{docker.allowed.[ro,rw]-mounts}} will always be parsed, and if either list is bad 
then you will fail all launches. However, some errors might get confusing with 
your proposed approach. For example, the user may set bind-mounts, or there may 
be some mounts defined for all containers. Those could be hard-coded in confs 
(or by users' jobs), and then once the container is launched it will get an 
invalid docker mount message even though the mount is in the allowed list. 

It would be nice to be able to not fail on bad disks in the allowed lists, but 
also have good logging when the container fails due to a bad disk. Simply 
ignoring the bad disks in the allowed list gives you a misleading error message 
if the container attempts to use those disks. 

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails, although the container can use
> /data1 as a local-dir and log-dir.
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because it
> cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 has a
> disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> The docker launch should succeed as long as there are available local dirs and log dirs.






[jira] [Commented] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2019-07-22 Thread Zoltan Siegl (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890368#comment-16890368
 ] 

Zoltan Siegl commented on YARN-5106:


Uploaded patches for branch-3.1 and branch-3.2.

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Zoltan Siegl
>Priority: Major
>  Labels: newbie++
> Attachments: YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, 
> YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, 
> YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Updated] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2019-07-22 Thread Zoltan Siegl (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Siegl updated YARN-5106:
---
Attachment: YARN-5106-branch-3.2.001.patch

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Zoltan Siegl
>Priority: Major
>  Labels: newbie++
> Attachments: YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, 
> YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, 
> YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Updated] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2019-07-22 Thread Zoltan Siegl (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Siegl updated YARN-5106:
---
Attachment: YARN-5106-branch-3.1.001.patch

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Zoltan Siegl
>Priority: Major
>  Labels: newbie++
> Attachments: YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.2.001.patch, YARN-5106.001.patch, YARN-5106.002.patch, 
> YARN-5106.003.patch, YARN-5106.004.patch, YARN-5106.005.patch, 
> YARN-5106.006.patch, YARN-5106.007.patch, YARN-5106.008.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Commented] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-22 Thread Jonathan Hung (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890350#comment-16890350
 ] 

Jonathan Hung commented on YARN-9668:
-

Thanks Haibo! Committed to branch-3.2, branch-3.1, branch-3.0 as well.

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.3, 3.2.2
>
> Attachments: YARN-9668-branch-2.001.patch, 
> YARN-9668-branch-2.002.patch, YARN-9668-branch-3.2.001.patch, 
> YARN-9668.001.patch, YARN-9668.002.patch, YARN-9668.003.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.






[jira] [Updated] (YARN-9668) UGI conf doesn't read user overridden configurations on RM and NM startup

2019-07-22 Thread Jonathan Hung (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-9668:

Fix Version/s: 3.2.2
   3.1.3
   3.0.4

> UGI conf doesn't read user overridden configurations on RM and NM startup
> -
>
> Key: YARN-9668
> URL: https://issues.apache.org/jira/browse/YARN-9668
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.3, 3.2.2
>
> Attachments: YARN-9668-branch-2.001.patch, 
> YARN-9668-branch-2.002.patch, YARN-9668-branch-3.2.001.patch, 
> YARN-9668.001.patch, YARN-9668.002.patch, YARN-9668.003.patch
>
>
> Similar to HADOOP-15150. Configs overridden thru e.g. -D or -conf are not 
> passed to the UGI conf on RM or NM startup.






[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-22 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890322#comment-16890322
 ] 

Sunil Govindan commented on YARN-9687:
--

Hi [~Tao Yang]

Thanks for reporting this issue. Yes, we have seen this in a few places where 
such cases can occur given the combination of resource values. *fitsIn* helps 
in such areas (we already fixed a few in the preemption modules).

+1 for this patch.
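
Just to spell out the example from the description below in terms of dominant resource shares (a rough sketch of why the ratio-based check passes):

{noformat}
cluster-resource = <10GB, 10 vcores>
queue-headroom   = <2GB, 4 vcores>  -> dominant share = max(2/10, 4/10) = 0.4
required         = <3GB, 1 vcore>   -> dominant share = max(3/10, 1/10) = 0.3

greaterThanOrEqual compares by ratio: 0.4 >= 0.3, so the check passes even though
the request needs 3GB against only 2GB of memory headroom; a per-resource check
such as fitsIn would reject it.
{noformat}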

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}}
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources by
> ratio; when using DominantResourceCalculator, it may let unacceptable
> allocations pass in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so
> such allocations will pass the check in the scheduling process but will always be
> rejected when committing these proposals.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-22 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890284#comment-16890284
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

CSQueueUtils#updateUsedCapacity is called before 
getMaxAvailableResourceToQueuePartition. So any checks for correct partition 
should be in CSQueueUtils#updateQueueStatistics so that it captures both the 
methods.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on default queue (does not work if the order of two jobs 
> is swapped).
>  # While the two applications are running, the "totalMB" at 
> /ws/v1/cluster/metrics will go down by 
> the amount of MB used by the first job (screenshots attached).
> Alternately:
> In 
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
>  add the following line at the end of the test before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10*GB,
>  rootQueue.getMetrics().getAvailableMB() + 
> rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them have a non-default 
> label. The test will also fail against 20*GB check.






[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-07-22 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890273#comment-16890273
 ] 

Eric Yang commented on YARN-9562:
-

[~ebadger] Thank you for the patch.  Can we create manifestJson as a json file 
in src/test/resources, and use 
TestImageTagToManifestPlugin.class.getResource("manifest.json"); to retrieve 
the json content, please?  This might be easier to manage in the long run.  
Thanks
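
For illustration, a minimal sketch of what that could look like (just a sketch, not the patch code; it assumes a manifest.json placed under src/test/resources and uses commons-io, which Hadoop already depends on):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

public class TestImageTagToManifestPlugin {
  // Sketch: read the manifest JSON from the test classpath instead of
  // building the string up inside the test method.
  private String loadManifestJson() throws IOException {
    try (InputStream in = TestImageTagToManifestPlugin.class
        .getResourceAsStream("/manifest.json")) {
      return IOUtils.toString(in, StandardCharsets.UTF_8);
    }
  }
}
{code}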

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 






[jira] [Commented] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.

2019-07-22 Thread Babble Shack (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890254#comment-16890254
 ] 

Babble Shack commented on YARN-9690:


I enabled some additional configuration:

`yarn.nodemanager.amrmproxy.enabled`, and set `yarn.resourcemanager.scheduler.address` to `0.0.0.0:8049`.

However, I still get the same issue; in particular, there is an exception whilst 
registering the application master, because of an invalid AMRM token.

> Invalid AMRM token when distributed scheduling is enabled.
> --
>
> Key: YARN-9690
> URL: https://issues.apache.org/jira/browse/YARN-9690
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 2.9.2, 3.1.2
> Environment: OS: Ubuntu 18.04
> JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
>Reporter: Babble Shack
>Priority: Major
> Attachments: applicationlog, yarn-site.xml
>
>
> Applications fail to start due to an invalid AMRM token from the application attempt.
> I have tested this with 0% and 100% opportunistic maps and the same issue occurs
> regardless.
> {code:java}
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>mapreduceyarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8032</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>0.0.0.0:8049</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.distributed-scheduling.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.webapp.ui2.enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>yarn-master-0.yarn-service.yarn:8031</value>
>   </property>
>   <property>
>     <name>yarn.log-aggregation-enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>3584</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.resource.mb</name>
>     <value>7168</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.command-opts</name>
>     <value>-Xmx5734m</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.generic-application-history.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.timeline-service.bind-host</name>
>     <value>0.0.0.0</value>
>   </property>
> </configuration>
> {code}
> Relevant logs:
> {code:java}
> 2019-07-22 14:56:37,104 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the 
> mappers will be scheduled using OPPORTUNISTIC containers
> 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: 
> Connecting to ResourceManager at 
> yarn-master-0.yarn-service.yarn/10.244.1.134:8030
> 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken from appattempt_1563805140414_0002_02
> 2019-07-22 14:56:37,152 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
> registering
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563805140414_0002_02
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>     at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   

[jira] [Updated] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.

2019-07-22 Thread Babble Shack (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babble Shack updated YARN-9690:
---
Component/s: yarn

> Invalid AMRM token when distributed scheduling is enabled.
> --
>
> Key: YARN-9690
> URL: https://issues.apache.org/jira/browse/YARN-9690
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 2.9.2, 3.1.2
> Environment: OS: Ubuntu 18.04
> JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
>Reporter: Babble Shack
>Priority: Major
> Attachments: applicationlog, yarn-site.xml
>
>
> Applications fail to start due to an invalid AMRM token from the application attempt. 
> I have tested this with 0/100% opportunistic maps and the same issue occurs 
> regardless. 
> {code:java}
> 
> -->
> 
>   
>     mapreduceyarn.nodemanager.aux-services
>     mapreduce_shuffle
>   
>   
>       yarn.resourcemanager.address
>       yarn-master-0.yarn-service.yarn:8032
>   
>   
>       yarn.resourcemanager.scheduler.address
>       0.0.0.0:8049
>   
>   
>     
> yarn.resourcemanager.opportunistic-container-allocation.enabled
>     true
>   
>   
>     yarn.nodemanager.opportunistic-containers-max-queue-length
>     10
>   
>   
>     yarn.nodemanager.distributed-scheduling.enabled
>     true
>   
>  
>   
>     yarn.webapp.ui2.enable
>     true
>   
>   
>       yarn.resourcemanager.resource-tracker.address
>       yarn-master-0.yarn-service.yarn:8031
>   
>   
>     yarn.log-aggregation-enable
>     true
>   
>   
>       yarn.nodemanager.aux-services
>       mapreduce_shuffle
>   
>   
>   
>   
>   
>     yarn.nodemanager.resource.memory-mb
>     7168
>   
>   
>     yarn.scheduler.minimum-allocation-mb
>     3584
>   
>   
>     yarn.scheduler.maximum-allocation-mb
>     7168
>   
>   
>     yarn.app.mapreduce.am.resource.mb
>     7168
>   
>   
>   
>     yarn.app.mapreduce.am.command-opts
>     -Xmx5734m
>   
>   
>   
>     yarn.timeline-service.enabled
>     true
>   
>   
>     yarn.resourcemanager.system-metrics-publisher.enabled
>     true
>   
>   
>     yarn.timeline-service.generic-application-history.enabled
>     true
>   
>   
>     yarn.timeline-service.bind-host
>     0.0.0.0
>   
> 
> {code}
> Relevant logs:
> {code:java}
> 2019-07-22 14:56:37,104 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the 
> mappers will be scheduled using OPPORTUNISTIC containers
> 2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: 
> Connecting to ResourceManager at 
> yarn-master-0.yarn-service.yarn/10.244.1.134:8030
> 2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken from appattempt_1563805140414_0002_02
> 2019-07-22 14:56:37,152 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
> registering
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563805140414_0002_02
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>     at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>     at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     at 
> 

[jira] [Created] (YARN-9690) Invalid AMRM token when distributed scheduling is enabled.

2019-07-22 Thread Babble Shack (JIRA)
Babble Shack created YARN-9690:
--

 Summary: Invalid AMRM token when distributed scheduling is enabled.
 Key: YARN-9690
 URL: https://issues.apache.org/jira/browse/YARN-9690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: distributed-scheduling
Affects Versions: 3.1.2, 2.9.2
 Environment: OS: Ubuntu 18.04
JVM: 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
Reporter: Babble Shack
 Attachments: applicationlog, yarn-site.xml

Applications fail to start due to an invalid AMRM token from the application attempt. 

I have tested this with 0/100% opportunistic maps and the same issue occurs 
regardless. 



{code:xml}
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduceyarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>yarn-master-0.yarn-service.yarn:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>0.0.0.0:8049</value>
  </property>
  <property>
    <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.nodemanager.distributed-scheduling.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.webapp.ui2.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>yarn-master-0.yarn-service.yarn:8031</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>7168</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>3584</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>7168</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>7168</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx5734m</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.timeline-service.generic-application-history.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.timeline-service.bind-host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>
{code}
Relevant logs:
{code:java}
2019-07-22 14:56:37,104 INFO [main] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 100% of the mappers 
will be scheduled using OPPORTUNISTIC containers
2019-07-22 14:56:37,117 INFO [main] org.apache.hadoop.yarn.client.RMProxy: 
Connecting to ResourceManager at 
yarn-master-0.yarn-service.yarn/10.244.1.134:8030
2019-07-22 14:56:37,150 WARN [main] org.apache.hadoop.ipc.Client: Exception 
encountered while connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 Invalid AMRMToken from appattempt_1563805140414_0002_02
2019-07-22 14:56:37,152 ERROR [main] 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
registering
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken 
from appattempt_1563805140414_0002_02
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy82.registerApplicationMaster(Unknown Source)
    at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:160)
    at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:121)
    at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:274)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at 

[jira] [Comment Edited] (YARN-6514) Fail to launch container when distributed scheduling is enabled

2019-07-22 Thread Babble Shack (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890202#comment-16890202
 ] 

Babble Shack edited comment on YARN-6514 at 7/22/19 2:35 PM:
-

I am experiencing the same issue in 2.9.2 and 3.1.2. [~lvzheng], did you ever 
get this resolved?


was (Author: babbleshack):
I am experiencing the same issue [~lvzheng] did you ever get this resolved?

> Fail to launch container when distributed scheduling is enabled
> ---
>
> Key: YARN-6514
> URL: https://issues.apache.org/jira/browse/YARN-6514
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 3.0.0-alpha2
> Environment: Ubuntu Linux 4.4.0-72-generic with java-8-openjdk-amd64 
> 1.8.0_121
>Reporter: Zheng Lv
>Priority: Major
>
> When yarn.nodemanager.distributed-scheduling.enabled is set to true, 
> mapreduce fails to launch with Invalid AMRMToken errors.
> This error does not occur when the distributed scheduling option is disabled.
> {code:title=yarn-site.xml|borderStyle=solid}
> 
> 
> 
> 
> 
> yarn.resourcemanager.hostname
> h3master
> 
> 
> yarn.nodemanager.aux-services
> mapreduce_shuffle
> 
> 
> yarn.nodemanager.env-whitelist
> 
> JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME
> 
> 
> yarn.nodemanager.aux-services
> mapreduce_shuffle
> 
> 
> yarn.nodemanager.vmem-check-enabled
> false
> 
> 
> 
> yarn.resourcemanager.opportunistic-container-allocation.enabled
> true
> 
> 
> 
> yarn.nodemanager.opportunistic-containers-max-queue-length
> 10
> 
> 
> yarn.nodemanager.distributed-scheduling.enabled
> true
> 
> 
> yarn.nodemanager.amrmproxy.enable
> true
> 
> 
> 
> yarn.resourcemanager.opportunistic-container-allocation.enabled
> true
> 
> 
> yarn.nodemanager.resource.memory-mb
> 4096
> 
> 
> {code}
> {code:title=Container Log|borderStyle=solid}
> 2017-04-23 05:17:50,324 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for 
> application appattempt_1492953411349_0001_02
> 2017-04-23 05:17:51,625 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: 
> /
> [system properties]
> os.name: Linux
> os.version: 4.4.0-72-generic
> java.home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> java.runtime.version: 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13
> java.vendor: Oracle Corporation
> java.version: 1.8.0_121
> java.vm.name: OpenJDK 64-Bit Server VM
> java.class.path: 
> 

[jira] [Commented] (YARN-6514) Fail to launch container when distributed scheduling is enabled

2019-07-22 Thread Babble Shack (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890202#comment-16890202
 ] 

Babble Shack commented on YARN-6514:


I am experiencing the same issue. [~lvzheng], did you ever get this resolved?

> Fail to launch container when distributed scheduling is enabled
> ---
>
> Key: YARN-6514
> URL: https://issues.apache.org/jira/browse/YARN-6514
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling, yarn
>Affects Versions: 3.0.0-alpha2
> Environment: Ubuntu Linux 4.4.0-72-generic with java-8-openjdk-amd64 
> 1.8.0_121
>Reporter: Zheng Lv
>Priority: Major
>
> When yarn.nodemanager.distributed-scheduling.enabled is set to true, 
> mapreduce fails to launch with Invalid AMRMToken errors.
> This error does not occur when the distributed scheduling option is disabled.
> {code:title=yarn-site.xml|borderStyle=solid}
> 
> 
> 
> 
> 
> yarn.resourcemanager.hostname
> h3master
> 
> 
> yarn.nodemanager.aux-services
> mapreduce_shuffle
> 
> 
> yarn.nodemanager.env-whitelist
> 
> JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME
> 
> 
> yarn.nodemanager.aux-services
> mapreduce_shuffle
> 
> 
> yarn.nodemanager.vmem-check-enabled
> false
> 
> 
> 
> yarn.resourcemanager.opportunistic-container-allocation.enabled
> true
> 
> 
> 
> yarn.nodemanager.opportunistic-containers-max-queue-length
> 10
> 
> 
> yarn.nodemanager.distributed-scheduling.enabled
> true
> 
> 
> yarn.nodemanager.amrmproxy.enable
> true
> 
> 
> 
> yarn.resourcemanager.opportunistic-container-allocation.enabled
> true
> 
> 
> yarn.nodemanager.resource.memory-mb
> 4096
> 
> 
> {code}
> {code:title=Container Log|borderStyle=solid}
> 2017-04-23 05:17:50,324 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for 
> application appattempt_1492953411349_0001_02
> 2017-04-23 05:17:51,625 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: 
> /
> [system properties]
> os.name: Linux
> os.version: 4.4.0-72-generic
> java.home: /usr/lib/jvm/java-8-openjdk-amd64/jre
> java.runtime.version: 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13
> java.vendor: Oracle Corporation
> java.version: 1.8.0_121
> java.vm.name: OpenJDK 64-Bit Server VM
> java.class.path: 
> 

[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-22 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890170#comment-16890170
 ] 

Weiwei Yang commented on YARN-9687:
---

This is related to the core resource calculator, so it would be good to have 
[~sunilg] take a look too. [~sunilg], could you please help review this? 
Thanks.

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> Here the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> such allocations are let through during scheduling but will always be 
> rejected when the proposals are committed.
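
To make the example above concrete, here is a minimal, standalone illustration of the 
ratio-based comparison (this is not the scheduler code, just the arithmetic from the 
example numbers):

{code:java}
// Illustration only: with DominantResourceCalculator-style (dominant-share) comparison
// the headroom check passes, even though the memory headroom cannot fit the request.
public class HeadroomRatioExample {

  // Dominant share of a resource relative to the cluster resource.
  static double dominantShare(long memMB, long vcores, long clusterMemMB, long clusterVcores) {
    return Math.max((double) memMB / clusterMemMB, (double) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    long clusterMemMB = 10 * 1024, clusterVcores = 10; // <10GB, 10 vcores>
    long headroomMemMB = 2 * 1024, headroomVcores = 4; // <2GB, 4 vcores>
    long requiredMemMB = 3 * 1024, requiredVcores = 1; // <3GB, 1 vcore>

    double headroomShare = dominantShare(headroomMemMB, headroomVcores, clusterMemMB, clusterVcores); // 0.4
    double requiredShare = dominantShare(requiredMemMB, requiredVcores, clusterMemMB, clusterVcores); // 0.3

    // Ratio comparison lets the allocation through ...
    System.out.println("ratio check passes: " + (headroomShare >= requiredShare)); // true
    // ... even though the request does not actually fit in the memory headroom.
    System.out.println("request fits headroom: "
        + (headroomMemMB >= requiredMemMB && headroomVcores >= requiredVcores));   // false
  }
}
{code}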






[jira] [Created] (YARN-9689) Router does not support kerberos proxy when in secure mode

2019-07-22 Thread zhoukang (JIRA)
zhoukang created YARN-9689:
--

 Summary: Router does not support kerberos proxy when in secure mode
 Key: YARN-9689
 URL: https://issues.apache.org/jira/browse/YARN-9689
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: federation
Affects Versions: 3.1.2
Reporter: zhoukang


When we enable Kerberos in YARN Federation mode, we cannot get a new application, since 
the Router throws the Kerberos exception below. This should be handled!
{code:java}
2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception 
encountered while connecting to the server : javax.security.sasl.SaslException: 
GSS initiate failed [Caused by GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos tgt)]
2019-07-22,18:43:25,528 WARN 
org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: 
Unable to create a new ApplicationId in SubCluster xxx
java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed on 
local exception: java.io.IOException: javax.security.sasl.SaslException: GSS 
initiate failed [Caused by GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos tgt)]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
at org.apache.hadoop.ipc.Client.call(Client.java:1506)
at org.apache.hadoop.ipc.Client.call(Client.java:1416)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
at 
org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
at 
org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate 
failed [Caused by GSSException: No valid credentials provided (Mechanism level: 
Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:801)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 

[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode

2019-07-22 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890069#comment-16890069
 ] 

zhoukang commented on YARN-9689:


[~botong], could you help evaluate this? Thanks!
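
A rough sketch of the kind of handling I mean on the Router side (assumptions: the 
Router knows the end user's name, the Router's Kerberos principal is allowed as a 
proxy user on the RM via hadoop.proxyuser.* settings, and this is not the actual 
FederationClientInterceptor code):

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;

public class RouterProxyUserSketch {

  // Hypothetical helper: issue getNewApplication as a proxy user for the end user,
  // instead of as the Router's own Kerberos login user, so the call to the
  // sub-cluster RM carries the right identity.
  static GetNewApplicationResponse getNewApplicationAs(
      String endUser, ApplicationClientProtocol rmClient) throws Exception {
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        endUser, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(
        (PrivilegedExceptionAction<GetNewApplicationResponse>) () ->
            rmClient.getNewApplication(GetNewApplicationRequest.newInstance()));
  }
}
{code}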

> Router does not support kerberos proxy when in secure mode
> --
>
> Key: YARN-9689
> URL: https://issues.apache.org/jira/browse/YARN-9689
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Priority: Major
>
> When we enable Kerberos in YARN Federation mode, we cannot get a new application, since 
> the Router throws the Kerberos exception below. This should be handled!
> {code:java}
> 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2019-07-22,18:43:25,528 WARN 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: 
> Unable to create a new ApplicationId in SubCluster xxx
> java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed 
> on local exception: java.io.IOException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt)]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
> at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> at org.apache.hadoop.ipc.Client.call(Client.java:1416)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
> Caused by: