[jira] [Commented] (YARN-10291) Yarn service commands doesn't work when https is enabled in RM

2020-07-13 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156777#comment-17156777
 ] 

Eric Yang commented on YARN-10291:
--

[~brahmareddy] Hadoop's 
[getAcceptedIssuers|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ssl/ReloadingX509TrustManager.java#L146]
 returns either an empty list of issuers or the list from 
javax.net.ssl.X509TrustManager.  Unless CA-chained certificates are installed 
into cacerts, there is no issuer verification in Hadoop's own implementation of 
SSL.  This is why I think Hadoop's way of loading the trust store is odd.
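
For illustration, here is a hedged sketch (not the actual ReloadingX509TrustManager 
source) of a delegating trust manager whose getAcceptedIssuers() returns an empty 
array until a trust store is loaded; issuer verification then depends entirely on 
what is in the loaded store:

{code:java}
// Hedged sketch only, not the Hadoop implementation: a delegating
// X509TrustManager that returns an empty issuer list until a trust store has
// been (re)loaded. The class name and fields are made up for illustration.
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

public class DelegatingTrustManager implements X509TrustManager {
  private volatile X509TrustManager delegate; // swapped in by a reload thread

  @Override
  public X509Certificate[] getAcceptedIssuers() {
    X509TrustManager tm = delegate;
    // Empty array when nothing is loaded; otherwise the issuers from the store.
    return tm == null ? new X509Certificate[0] : tm.getAcceptedIssuers();
  }

  @Override
  public void checkServerTrusted(X509Certificate[] chain, String authType)
      throws CertificateException {
    X509TrustManager tm = delegate;
    if (tm == null) {
      throw new CertificateException("Trust store not loaded");
    }
    tm.checkServerTrusted(chain, authType);
  }

  @Override
  public void checkClientTrusted(X509Certificate[] chain, String authType)
      throws CertificateException {
    checkServerTrusted(chain, authType);
  }
}
{code}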

> Yarn service commands doesn't work when https is enabled in RM
> --
>
> Key: YARN-10291
> URL: https://issues.apache.org/jira/browse/YARN-10291
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10291.001.patch
>
>
> When we submit an application using the command "yarn app -launch sleeper-service 
> ../share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json", it throws the 
> exception below:
> {code:java}
> com.sun.jersey.api.client.ClientHandlerException: 
> javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> {code}
> We should use WebServiceClient#createClient as it takes care of setting the 
> sslFactory when https is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed

2020-07-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152338#comment-17152338
 ] 

Eric Yang commented on YARN-10341:
--

cc [~billie] [~jianhe]

> Yarn Service Container Completed event doesn't get processed 
> -
>
> Key: YARN-10341
> URL: https://issues.apache.org/jira/browse/YARN-10341
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-10341.001.patch
>
>
> If there are 10 workers running and containers get killed, after a while we 
> see that there are just 9 workers running. This is because the CONTAINER 
> COMPLETED event is not processed on the AM side. 
> The issue is in the code below:
> {code:java}
> public void onContainersCompleted(List<ContainerStatus> statuses) {
>   for (ContainerStatus status : statuses) {
>     ContainerId containerId = status.getContainerId();
>     ComponentInstance instance = liveInstances.get(status.getContainerId());
>     if (instance == null) {
>       LOG.warn(
>           "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
>           containerId, status.getExitStatus(), status.getDiagnostics());
>       return;
>     }
>     ComponentEvent event =
>         new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
>             .setStatus(status).setInstance(instance)
>             .setContainerId(containerId);
>     dispatcher.getEventHandler().handle(event);
>   }
> {code}
> If a component instance doesn't exist for a container, the loop doesn't iterate 
> over the remaining containers because the method returns early.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10341) Yarn Service Container Completed event doesn't get processed

2020-07-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152267#comment-17152267
 ] 

Eric Yang edited comment on YARN-10341 at 7/6/20, 7:34 PM:
---

[~BilwaST] I see that you changed the code from break to continue.  This 
change looks better.  Please upload a new version of the patch instead of 
replacing the existing patch 001; this will help the precommit build report 
correctly for the new patch.  Thanks
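
For readers following along, a hedged sketch of the described change (based on 
the snippet quoted in the description below, not the exact patch contents) would 
look roughly like this:

{code:java}
// Sketch of the fix being discussed: skip the unknown container and keep
// processing the remaining completed-container statuses instead of leaving
// the loop early. Fields such as liveInstances, LOG and dispatcher are the
// ones from the quoted snippet.
public void onContainersCompleted(List<ContainerStatus> statuses) {
  for (ContainerStatus status : statuses) {
    ContainerId containerId = status.getContainerId();
    ComponentInstance instance = liveInstances.get(containerId);
    if (instance == null) {
      LOG.warn("Container {} Completed. No component instance exists. "
          + "exitStatus={}. diagnostics={} ",
          containerId, status.getExitStatus(), status.getDiagnostics());
      continue; // previously the loop was exited here, dropping later statuses
    }
    ComponentEvent event =
        new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
            .setStatus(status).setInstance(instance)
            .setContainerId(containerId);
    dispatcher.getEventHandler().handle(event);
  }
}
{code}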


was (Author: eyang):
[~BilwaST] I see that you'd changed the code from break to continue.  This 
change looks better.  Please use a new version of the patch instead of 
replacing existing patch 001, this will help the recommit build to report 
correctly for the new patch.  Thanks

> Yarn Service Container Completed event doesn't get processed 
> -
>
> Key: YARN-10341
> URL: https://issues.apache.org/jira/browse/YARN-10341
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-10341.001.patch
>
>
> If there are 10 workers running and containers get killed, after a while we 
> see that there are just 9 workers running. This is because the CONTAINER 
> COMPLETED event is not processed on the AM side. 
> The issue is in the code below:
> {code:java}
> public void onContainersCompleted(List<ContainerStatus> statuses) {
>   for (ContainerStatus status : statuses) {
>     ContainerId containerId = status.getContainerId();
>     ComponentInstance instance = liveInstances.get(status.getContainerId());
>     if (instance == null) {
>       LOG.warn(
>           "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
>           containerId, status.getExitStatus(), status.getDiagnostics());
>       return;
>     }
>     ComponentEvent event =
>         new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
>             .setStatus(status).setInstance(instance)
>             .setContainerId(containerId);
>     dispatcher.getEventHandler().handle(event);
>   }
> {code}
> If a component instance doesn't exist for a container, the loop doesn't iterate 
> over the remaining containers because the method returns early.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed

2020-07-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152267#comment-17152267
 ] 

Eric Yang commented on YARN-10341:
--

[~BilwaST] I see that you changed the code from break to continue.  This 
change looks better.  Please upload a new version of the patch instead of 
replacing the existing patch 001; this will help the precommit build report 
correctly for the new patch.  Thanks

> Yarn Service Container Completed event doesn't get processed 
> -
>
> Key: YARN-10341
> URL: https://issues.apache.org/jira/browse/YARN-10341
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-10341.001.patch
>
>
> If there are 10 workers running and containers get killed, after a while we 
> see that there are just 9 workers running. This is because the CONTAINER 
> COMPLETED event is not processed on the AM side. 
> The issue is in the code below:
> {code:java}
> public void onContainersCompleted(List<ContainerStatus> statuses) {
>   for (ContainerStatus status : statuses) {
>     ContainerId containerId = status.getContainerId();
>     ComponentInstance instance = liveInstances.get(status.getContainerId());
>     if (instance == null) {
>       LOG.warn(
>           "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
>           containerId, status.getExitStatus(), status.getDiagnostics());
>       return;
>     }
>     ComponentEvent event =
>         new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
>             .setStatus(status).setInstance(instance)
>             .setContainerId(containerId);
>     dispatcher.getEventHandler().handle(event);
>   }
> {code}
> If a component instance doesn't exist for a container, the loop doesn't iterate 
> over the remaining containers because the method returns early.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed

2020-07-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152241#comment-17152241
 ] 

Eric Yang commented on YARN-10341:
--

[~BilwaST] Sorry, I am confused by this ticket and by how the proposed patch 
fixes the described problem.  
The container's "restart_policy" controls whether the container should be 
restarted in the event of a failure or kill.  If it is not set, the container 
will always restart.  If it is set to "NEVER", it will not restart.  The 
completion events are secondary information used to decide whether or not to 
restart the containers.  Using return or break in the onContainersCompleted 
method doesn't make any difference.

Maybe I am missing something; could you give more information on how this patch 
addresses the observed issue?

> Yarn Service Container Completed event doesn't get processed 
> -
>
> Key: YARN-10341
> URL: https://issues.apache.org/jira/browse/YARN-10341
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Critical
> Attachments: YARN-10341.001.patch
>
>
> If there are 10 workers running and containers get killed, after a while we 
> see that there are just 9 workers running. This is because the CONTAINER 
> COMPLETED event is not processed on the AM side. 
> The issue is in the code below:
> {code:java}
> public void onContainersCompleted(List<ContainerStatus> statuses) {
>   for (ContainerStatus status : statuses) {
>     ContainerId containerId = status.getContainerId();
>     ComponentInstance instance = liveInstances.get(status.getContainerId());
>     if (instance == null) {
>       LOG.warn(
>           "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
>           containerId, status.getExitStatus(), status.getDiagnostics());
>       return;
>     }
>     ComponentEvent event =
>         new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
>             .setStatus(status).setInstance(instance)
>             .setContainerId(containerId);
>     dispatcher.getEventHandler().handle(event);
>   }
> {code}
> If a component instance doesn't exist for a container, the loop doesn't iterate 
> over the remaining containers because the method returns early.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-30 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148889#comment-17148889
 ] 

Eric Yang commented on YARN-10311:
--

[~BilwaST] There is no secure cluster test case written for ServiceClient at 
this time.  You may need to write a new class to test this change.

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-29 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148066#comment-17148066
 ] 

Eric Yang commented on YARN-9809:
-

+1 for patch 007.  Tested both healthy and unhealthy health check scripts in my 
limited one-node environment.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-29 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148063#comment-17148063
 ] 

Eric Yang commented on YARN-10310:
--

ServiceClient is designed to work with the YARN service API.  The main 
distinctions between a YARN service app and a classic YARN app are:

1.  A classic YARN app can start multiple instances with the same name.
2.  A YARN service app cannot start multiple instances with the same name.  
The application name and user name are used to construct DNS hostnames (see the 
sketch below), so they must be unique per service.  
3.  Using appType=unit-test with a YARN service can result in conflicting 
behavior between the YARN app and YARN service code paths.
4.  A YARN service app can be completely suspended with only a footprint in 
HDFS.  A classic YARN app does not handle suspension and resume.

Those are the main differences between classic YARN apps and YARN services.  
The classic YARN API may not fully support YARN service, and vice versa, because 
the overlap in features between the two is small.  Neither the classic YARN API 
nor the YARN service API can operate as a union of the API calls for both types 
of applications.  Hence, supporting appType=unit-test in the YARN service client 
will create ambiguity that the classic YARN code was not designed to handle.  I 
am inclined to close this as won't fix, to prevent using appType=unit-test with 
the YARN service API.
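
As a hedged illustration of point 2 above (this is not the actual Registry DNS 
code), container hostnames are derived from the instance, service, and user 
names, which is why a service name cannot be reused per user:

{code:java}
// Illustration only; the helper and sample values are made up. Registry DNS
// style hostnames combine instance, service, user, and the DNS domain, so two
// services with the same name under the same user would collide.
public class ServiceDnsName {
  static String hostname(String instance, String service, String user,
      String domain) {
    // e.g. worker-0.sleeper-service.hdfs.example.com
    return String.format("%s.%s.%s.%s", instance, service, user, domain);
  }

  public static void main(String[] args) {
    System.out.println(hostname("worker-0", "sleeper-service", "hdfs",
        "example.com"));
  }
}
{code}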

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition below in 
> ClientRMService#getApplications() fails:
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10328) Too many ZK Curator NodeExists exception logs in YARN Service AM logs

2020-06-29 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147918#comment-17147918
 ] 

Eric Yang commented on YARN-10328:
--

+1 looks good.  Committing shortly.

> Too many ZK Curator NodeExists exception logs in YARN Service AM logs
> -
>
> Key: YARN-10328
> URL: https://issues.apache.org/jira/browse/YARN-10328
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10328.001.patch
>
>
> The following debug logs are printed every time a component is started.
> {code:java}
> [pool-6-thread-3] DEBUG zk.CuratorService - path already present: 
> /registry/users/server/services/yarn-service/default-worker/components
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for 
> /registry/users/hetuserver/services/yarn-service/default-worker/components
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:128)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1480)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:740)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:723)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:720)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:484)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:474)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:454)
>   at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
>   at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587)
>   at 
> org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
>   at 
> org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putComponent(YarnRegistryViewForProviders.java:146)
>   at 
> org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putComponent(YarnRegistryViewForProviders.java:128)
>   at 
> org.apache.hadoop.yarn.service.component.instance.ComponentInstance.updateServiceRecord(ComponentInstance.java:511)
>   at 
> org.apache.hadoop.yarn.service.component.instance.ComponentInstance.updateContainerStatus(ComponentInstance.java:449)
>   at 
> org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:620)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-18 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139511#comment-17139511
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] [~Jim_Brennan] I agree that health check script handling is 
separate from registering the health check status.  +1 on patch 004.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138894#comment-17138894
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] Sorry, my statement was not clear.  If the script name is incorrect 
(so the resulting exit code is non-zero), or the execution exits with a non-zero 
code, the health check still reports the node as healthy.  I think those 
conditions must be considered unhealthy, in case the check script does not have 
the proper prerequisites.  The errors can be caught.  Is this something that we 
can fix to make this more user friendly?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138853#comment-17138853
 ] 

Eric Yang commented on YARN-9809:
-

[~Jim_Brennan] Thank you for the instruction.  I updated my check script 
accordingly to:

{code}
#!/bin/bash
echo "ERROR test"
{code}

This works.  The script must also return a 0 exit code to work; otherwise, the 
node will report as healthy.  This implies that if the health check script 
doesn't exist, the node reports as healthy.  Is this right?


> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138685#comment-17138685
 ] 

Eric Yang commented on YARN-9809:
-

[~ebadger] Thank you for the patch.  The patch looks very close to the final 
product.  I have confirmed that the test case failure doesn't happen if there is 
a sufficient amount of RAM on the testing node.  I also validated that the new 
node manager can work with an unpatched resource manager.  However, I could not 
get the health check script to fail in a way that causes the node to register as 
unhealthy.

Here is my check script:
{code}
#!/bin/bash
echo "i am here" > /tmp/hello
exit 1
{code}

It would be nice to have a verbose message showing the exit code of the health 
check script in the node manager log file.  The script is executed, but the node 
still shows healthy.  What am I doing wrong?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10308) Update javadoc and variable names for keytab in yarn services as it supports filesystems other than hdfs and local file system

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138608#comment-17138608
 ] 

Eric Yang edited comment on YARN-10308 at 6/17/20, 4:07 PM:


+1 I just committed patch 002 to trunk.  Thank you [~BilwaST] for the patch.


was (Author: eyang):
+1 I just committed this to trunk.  Thank you [~BilwaST] for the patch.

> Update javadoc and variable names for keytab in yarn services as it supports 
> filesystems other than hdfs and local file system
> --
>
> Key: YARN-10308
> URL: https://issues.apache.org/jira/browse/YARN-10308
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10308.001.patch, YARN-10308.002.patch
>
>
> 1.  Below description should be updated
> {code:java}
> @ApiModelProperty(value = "The URI of the kerberos keytab. It supports two " +
>   "schemes \"hdfs\" and \"file\". If the URI starts with \"hdfs://\" " +
>   "scheme, it indicates the path on hdfs where the keytab is stored. The 
> " +
>   "keytab will be localized by YARN and made available to AM in its 
> local" +
>   " directory. If the URI starts with \"file://\" scheme, it indicates a 
> " +
>   "path on the local host where the keytab is presumbaly installed by " +
>   "admins upfront. ")
>   public String getKeytab() {
> return keytab;
>   }
> {code}
> 2. The variables below are still named after hdfs, which is confusing:
> {code:java}
> if ("file".equals(keytabURI.getScheme())) {
>   LOG.info("Using a keytab from localhost: " + keytabURI);
> } else {
>   Path keytabOnhdfs = new Path(keytabURI);
>   if (!fileSystem.getFileSystem().exists(keytabOnhdfs)) {
> LOG.warn(service.getName() + "'s keytab (principalName = "
> + principalName + ") doesn't exist at: " + keytabOnhdfs);
> return;
>   }
>   LocalResource keytabRes = fileSystem.createAmResource(keytabOnhdfs,
>   LocalResourceType.FILE);
>   localResource.put(String.format(YarnServiceConstants.KEYTAB_LOCATION,
>   service.getName()), keytabRes);
>   LOG.info("Adding " + service.getName() + "'s keytab for "
>   + "localization, uri = " + keytabOnhdfs);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10308) Update javadoc and variable names for keytab in yarn services as it supports filesystems other than hdfs and local file system

2020-06-17 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10308:
-
   Fix Version/s: 3.4.0
Target Version/s: 3.4.0

> Update javadoc and variable names for keytab in yarn services as it supports 
> filesystems other than hdfs and local file system
> --
>
> Key: YARN-10308
> URL: https://issues.apache.org/jira/browse/YARN-10308
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10308.001.patch, YARN-10308.002.patch
>
>
> 1.  Below description should be updated
> {code:java}
> @ApiModelProperty(value = "The URI of the kerberos keytab. It supports two " +
>   "schemes \"hdfs\" and \"file\". If the URI starts with \"hdfs://\" " +
>   "scheme, it indicates the path on hdfs where the keytab is stored. The 
> " +
>   "keytab will be localized by YARN and made available to AM in its 
> local" +
>   " directory. If the URI starts with \"file://\" scheme, it indicates a 
> " +
>   "path on the local host where the keytab is presumbaly installed by " +
>   "admins upfront. ")
>   public String getKeytab() {
> return keytab;
>   }
> {code}
> 2. The variables below are still named after hdfs, which is confusing:
> {code:java}
> if ("file".equals(keytabURI.getScheme())) {
>   LOG.info("Using a keytab from localhost: " + keytabURI);
> } else {
>   Path keytabOnhdfs = new Path(keytabURI);
>   if (!fileSystem.getFileSystem().exists(keytabOnhdfs)) {
> LOG.warn(service.getName() + "'s keytab (principalName = "
> + principalName + ") doesn't exist at: " + keytabOnhdfs);
> return;
>   }
>   LocalResource keytabRes = fileSystem.createAmResource(keytabOnhdfs,
>   LocalResourceType.FILE);
>   localResource.put(String.format(YarnServiceConstants.KEYTAB_LOCATION,
>   service.getName()), keytabRes);
>   LOG.info("Adding " + service.getName() + "'s keytab for "
>   + "localization, uri = " + keytabOnhdfs);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138573#comment-17138573
 ] 

Eric Yang edited comment on YARN-10310 at 6/17/20, 3:56 PM:


[~BilwaST] The root cause is the parameter -appTypes unit-test.

Using the hdfs/had...@example.com principal, the error message is the same as 
when using h...@example.com.

{code}
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:17,867 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,320 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,323 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:21,104 INFO client.ApiServiceClient: Application ID: 
application_1592406514799_0003
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:32,401 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,971 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,974 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:35,320 ERROR client.ApiServiceClient: Service name abc is 
already taken.
{code}

verifyNoLiveAppInRM only looks for appTypes == YarnServiceConstants.APP_TYPE.
The correct fix might be adding appTypes == unit-test to the 
GetApplicationsRequest to obtain the correct set of applications (see the sketch 
below).  

The HDFS error message "Dir existing on hdfs." is there to safeguard an instance 
of the yarn-service application in suspended mode (where there is no copy running 
in the RM) whose working directory still exists.  The error message is not wrong 
for the suspended use case, and I agree that there might be a better way to 
support the --appTypes flag for the YARN service API to yield consistent output.  
Could you refine the patch accordingly?  Thanks
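
To make the suggestion concrete, here is a hedged sketch of a live-app check 
that also covers the unit-test type.  The class and setter names are from the 
public YARN client API; the surrounding wiring is illustrative, not the proposed 
patch:

{code:java}
// Hedged sketch: query the RM for live applications of both types so that a
// "unit-test" submission with the same name is also seen.
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public class LiveAppCheck {
  static boolean hasLiveApp(ApplicationClientProtocol rm, String serviceName)
      throws Exception {
    GetApplicationsRequest request = GetApplicationsRequest.newInstance();
    request.setApplicationTypes(
        new HashSet<>(Arrays.asList("yarn-service", "unit-test")));
    request.setApplicationStates(EnumSet.of(
        YarnApplicationState.NEW, YarnApplicationState.NEW_SAVING,
        YarnApplicationState.SUBMITTED, YarnApplicationState.ACCEPTED,
        YarnApplicationState.RUNNING));
    for (ApplicationReport report :
        rm.getApplications(request).getApplicationList()) {
      if (serviceName.equals(report.getName())) {
        return true; // an application with this name is still live in the RM
      }
    }
    return false;
  }
}
{code}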


was (Author: eyang):
[~BilwaST] The root cause is the parameter -appTypes unit-test.

Using hdfs/had...@example.com principal, the error message is same as using 
h...@example.com.

{code}
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:17,867 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,320 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,323 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:21,104 INFO client.ApiServiceClient: Application ID: 
application_1592406514799_0003
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:32,401 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,971 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,974 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:35,320 ERROR client.ApiServiceClient: Service name abc is 
already taken.
{code}

verifyNoLiveAppInRM only look for appTypes == YarnServiceConstants.APP_TYPE.
The correct fix might be adding appTypes == unit-test to the 
GetApplicationRequest to obtain the correct type of applications.  

HDFS error message "Dir existing on hdfs." is to safe guard that a instance of 
the yarn-service application in suspended mode (where there is no copy running 
in RM), and it's working directory.  The error message is not wrong for the 
suspended use case, and I agree that there might be better way to support 
--appTypes flag for YARN service API to yield consistent output.  Could you 
refine the patch according it?  Thanks

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> As 

[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138597#comment-17138597
 ] 

Eric Yang commented on YARN-10311:
--

[~prabhujoseph] [~kyungwan nam], thank you for your input clarifying the 
use case.  I find it difficult to manage multiple delegation tokens from 
multiple namenodes and use the appropriate token with the corresponding 
namenode.  However, that is a longer conversation to have in hadoop-common for 
Hadoop security.  While I think this is a good addition to address the immediate 
problem, I do not have the ability to spin up multiple HDFS clusters at this 
time.  The patch looks good on the surface, and a test case would really help to 
prevent regression.  I would appreciate it if you could step in to review this 
patch and test it on real clusters.  Thanks

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-17 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138573#comment-17138573
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] The root cause is the parameter -appTypes unit-test.

Using the hdfs/had...@example.com principal, the error message is the same as 
when using h...@example.com.

{code}
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:17,867 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,320 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:18,323 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:21,104 INFO client.ApiServiceClient: Application ID: 
application_1592406514799_0003
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch abc sleeper 
2020-06-17 08:17:32,401 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,971 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-17 08:17:32,974 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-17 08:17:35,320 ERROR client.ApiServiceClient: Service name abc is 
already taken.
{code}

verifyNoLiveAppInRM only looks for appTypes == YarnServiceConstants.APP_TYPE.
The correct fix might be adding appTypes == unit-test to the 
GetApplicationsRequest to obtain the correct set of applications.  

The HDFS error message "Dir existing on hdfs." is there to safeguard an instance 
of the yarn-service application in suspended mode (where there is no copy running 
in the RM) whose working directory still exists.  The error message is not wrong 
for the suspended use case, and I agree that there might be a better way to 
support the --appTypes flag for the YARN service API to yield consistent output.  
Could you refine the patch accordingly?  Thanks

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition below in 
> ClientRMService#getApplications() fails:
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138031#comment-17138031
 ] 

Eric Yang commented on YARN-10311:
--

mapreduce.job.hdfs-servers is used by distcp jobs to obtain delegation tokens 
for copying data across HDFS clusters.  YARN service works with a single HDFS 
cluster, and the application inside the container can perform its own credential 
login in the MapReduce client to obtain a DT for another HDFS cluster.  
There is no apparent reason for YARN service itself to request delegation tokens 
for another HDFS cluster.  Sorry, the reason for this patch is unclear to me.  
Can you explain the use case for this code?
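
For reference, here is a hedged sketch of how a MapReduce-style client gathers 
tokens for the extra clusters listed in mapreduce.job.hdfs-servers.  TokenCache 
and Configuration are real APIs; the class and method below are illustrative 
wiring only:

{code:java}
// Hedged sketch: collect an HDFS delegation token per additional cluster
// before job submission, the way distcp-style jobs do.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.security.TokenCache;
import org.apache.hadoop.security.Credentials;

public class MultiClusterTokens {
  static Credentials collect(Configuration conf) throws Exception {
    // Comma-separated list of extra HDFS clusters, e.g. hdfs://nn1,hdfs://nn2
    String[] servers = conf.getTrimmedStrings("mapreduce.job.hdfs-servers");
    Path[] paths = new Path[servers.length];
    for (int i = 0; i < servers.length; i++) {
      paths[i] = new Path(servers[i]);
    }
    Credentials credentials = new Credentials();
    // Fetches a delegation token for each distinct filesystem in paths.
    TokenCache.obtainTokensForNamenodes(credentials, paths, conf);
    return credentials;
  }
}
{code}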

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137833#comment-17137833
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] Thanks for explaining this.  If the application type is unit-test and 
the user purposely deletes the json file of the previous instance of the 
yarn-service, a second instance of the service is allowed to run.  YARN allows 
multiple application submissions with the same name if the application type is 
unit-test or mapreduce; verifyNoLiveAppInRM only safeguards the yarn-service 
application type.  By using appTypes unit-test, you are triggering an unintended 
way to launch a yarn-service.  This is not a bug in YARN service, but a case of 
the user rigging the system to trigger an unintended code execution path.  
Shortening the username will not prevent verifyNoLiveAppInRM from throwing an 
exception for the unit-test application type either.  This is working as 
designed for yarn-service, and allows services and applications to co-exist in 
the same system with different working modes.  My recommendation is to submit 
the app without appTypes to avoid slipping past verifyNoLiveAppInRM.

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition below in 
> ClientRMService#getApplications() fails:
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137779#comment-17137779
 ] 

Eric Yang commented on YARN-10311:
--

[~BilwaST], thank you for patch 002.  I am not sure if this change is good.

1.  Removing final from org.apache.hadoop.security.token.Token is dangerous 
and can allow third-party code to inject malicious credentials after the token's 
creation. 
2.  Delegation tokens should work across namenodes.  There is no reason to 
obtain separate DTs individually.  The token is always renewed with the active 
namenode; a get-delegation-token request is redirected from a standby namenode 
to the active namenode.  Otherwise, this solution would require a lot more 
internal tracking to know which token must be renewed with which name service, 
and the complexity would quickly grow out of hand.
3.  There is no precedent for doing manual token renewals with each name service 
in the Hadoop code.

Can you explain in more detail why this is necessary?  Thanks

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136795#comment-17136795
 ] 

Eric Yang commented on YARN-10310:
--

Trunk code without patch 001 produces:

Launching application using hdfs/had...@example.com principal:
{code}
$ kinit hdfs/had...@example.com
Password for hdfs/had...@example.com: 
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch rr sleeper
2020-06-16 09:08:28,553 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:08:29,325 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:08:29,329 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-16 09:08:45,835 INFO client.ApiServiceClient: Application ID: 
application_1592323643465_0001
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/hdfs dfs -ls .yarn/services/rr
Found 3 items
drwxr-x---   - hdfs supergroup  0 2020-06-16 09:08 
.yarn/services/rr/conf
drwxr-xr-x   - hdfs supergroup  0 2020-06-16 09:08 .yarn/services/rr/lib
-rw-rw-rw-   1 hdfs supergroup831 2020-06-16 09:08 
.yarn/services/rr/rr.json
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/hdfs dfs -rmr .yarn/services/rr
rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted .yarn/services/rr
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch rr sleeper
2020-06-16 09:10:18,754 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:10:19,206 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:10:19,209 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-16 09:10:21,421 ERROR client.ApiServiceClient: Service name rr is 
already taken.
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/hdfs dfs -ls .yarn/services
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ klist
Ticket cache: FILE:/tmp/krb5cc_123
Default principal: hdfs/had...@example.com

Valid starting   Expires  Service principal
06/16/2020 09:08:15  06/17/2020 09:08:15  krbtgt/example@example.com
{code}

Launching application using hdfs principal while service file is already 
deleted from hdfs:

{code}
$ kinit
Password for h...@example.com: 
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/yarn app -launch rr sleeper
2020-06-16 09:20:05,737 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:20:06,405 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at kerberos.example.com/192.168.1.9:8032
2020-06-16 09:20:06,409 INFO client.ApiServiceClient: Loading service 
definition from local FS: 
/usr/local/hadoop-3.4.0-SNAPSHOT/share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json
2020-06-16 09:20:10,082 ERROR client.ApiServiceClient: Service name rr is 
already taken.
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ ./bin/hdfs dfs -ls .yarn/services
[hdfs@kerberos hadoop-3.4.0-SNAPSHOT]$ 
{code}

If the application is running, verifyNoLiveAppInRM does throw an exception.  I 
cannot reproduce the claimed issue.  I suspect that verifyNoLiveAppInRM did not 
throw an exception on your side due to cluster configuration issues.  

We should not use the getShortUserName() API on the client side.  The client 
must pass the full principal name to the server, and only the server resolves 
the short name when necessary.

Please check that the following properties have been configured in core-site.xml:

{code}
  <property>
    <name>hadoop.http.authentication.type</name>
    <value>kerberos</value>
  </property>

  <property>
    <name>hadoop.http.filter.initializers</name>
    <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
  </property>
{code}

If they are not configured correctly, you may be accessing ServiceClient 
insecurely, which results in the errors you were seeing.

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition in 
> ClientRMService#getApplications() fails.
> 

[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136349#comment-17136349
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] Without the patch, I was unable to resubmit an app with the same name 
twice.  If you delete the service json from HDFS, you are allowed to submit the 
app again.  I think this is working as designed.  The check is based on data in 
HDFS rather than what is in Resource Manager memory, which is safer for 
preventing data loss in case the Resource Manager crashes.  In my system, 
hdfs/had...@example.com maps to the hdfs user principal.  It appears that it 
didn't on your side.  I am not sure how the difference arises, and may need more 
information.

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition below in 
> ClientRMService#getApplications() fails:
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136186#comment-17136186
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] I was unable to reproduce the error reported here using the 
hdfs/had...@example.com principal.  The failure to create the service may be 
caused by an existing instance of the sleeper service: the service finished 
running, but it was not destroyed, so the state file was not removed from HDFS.  
I could not reproduce the described problem, nor does patch 001 look like a 
solution that would address it.  Please clarify.  Thanks

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get the user, whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set the application 
> username.
> In case of a user named hdfs/had...@hadoop.com, the condition below in 
> ClientRMService#getApplications() fails:
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136176#comment-17136176
 ] 

Eric Yang commented on YARN-10311:
--

[~BilwaST] This may introduce additional challenges for system admins to 
configure yarn.service.hdfs-servers properly.  Would it be possible to perform 
the lookup based on hdfs-site.xml values, without an additional config in YARN 
service?
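
A hedged sketch of what such a lookup could start from (dfs.nameservices is the 
standard HDFS key; the helper itself is illustrative, not an existing API):

{code:java}
// Illustrative only: derive one root path per name service configured in
// hdfs-site.xml, instead of introducing a separate YARN-side server list.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class NameServiceLookup {
  static Path[] nameServiceRoots(Configuration conf) {
    // Returns an empty array when dfs.nameservices is not set.
    String[] nameServices = conf.getTrimmedStrings("dfs.nameservices");
    Path[] roots = new Path[nameServices.length];
    for (int i = 0; i < nameServices.length; i++) {
      roots[i] = new Path("hdfs://" + nameServices[i] + "/");
    }
    return roots;
  }
}
{code}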

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-12 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134470#comment-17134470
 ] 

Eric Yang commented on YARN-10311:
--

Delegation tokens must be issued by the active namenode only.  What is the use 
case for this?

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called "yarn.service.hdfs-servers" to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-12 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134388#comment-17134388
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] I am in the process of setting up a new development environment to 
test this patch.  Give me a few days to complete my validations.  Thanks

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> As ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get user whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set application 
> username.
> In case of user with name hdfs/had...@hadoop.com. below condition fails
> ClientRMService#getApplications()
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130356#comment-17130356
 ] 

Eric Yang commented on YARN-10310:
--

It sounds like you need to check the auth_to_local rules to make sure that the 
username is mapped correctly.

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> As ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get user whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set application 
> username.
> In case of user with name hdfs/had...@hadoop.com. below condition fails
> ClientRMService#getApplications()
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130321#comment-17130321
 ] 

Eric Yang commented on YARN-10310:
--

[~BilwaST] Sorry, I am confused by your comment.  Are you saying that after 
applying this patch, the hdfs/had...@hadoop.com principal generates an 
"already exists" error message?

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> As ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get user whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set application 
> username.
> In case of user with name hdfs/had...@hadoop.com. below condition fails
> ClientRMService#getApplications()
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10308) Update javadoc and variable names for keytab in yarn services as it supports filesystems other than hdfs and local file system

2020-06-09 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129645#comment-17129645
 ] 

Eric Yang commented on YARN-10308:
--

{quote}It supports all filesystems like "hdfs", "file", "viewfs", "s3" 
etc.{quote}

"All filesystems" maybe claiming more than implementations that support this.  
Have you tested with viewfs, httpfs, and s3?  How about changing it to Hadoop 
supported filesystem types.  e.g. hdfs?  This will bring expectation closer to 
reality.



> Update javadoc and variable names for keytab in yarn services as it supports 
> filesystems other than hdfs and local file system
> --
>
> Key: YARN-10308
> URL: https://issues.apache.org/jira/browse/YARN-10308
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10308.001.patch
>
>
> 1.  Below description should be updated
> {code:java}
> @ApiModelProperty(value = "The URI of the kerberos keytab. It supports two " +
>   "schemes \"hdfs\" and \"file\". If the URI starts with \"hdfs://\" " +
>   "scheme, it indicates the path on hdfs where the keytab is stored. The 
> " +
>   "keytab will be localized by YARN and made available to AM in its 
> local" +
>   " directory. If the URI starts with \"file://\" scheme, it indicates a 
> " +
>   "path on the local host where the keytab is presumbaly installed by " +
>   "admins upfront. ")
>   public String getKeytab() {
> return keytab;
>   }
> {code}
> 2. Variables below are still named on hdfs which is confusing
> {code:java}
> if ("file".equals(keytabURI.getScheme())) {
>   LOG.info("Using a keytab from localhost: " + keytabURI);
> } else {
>   Path keytabOnhdfs = new Path(keytabURI);
>   if (!fileSystem.getFileSystem().exists(keytabOnhdfs)) {
> LOG.warn(service.getName() + "'s keytab (principalName = "
> + principalName + ") doesn't exist at: " + keytabOnhdfs);
> return;
>   }
>   LocalResource keytabRes = fileSystem.createAmResource(keytabOnhdfs,
>   LocalResourceType.FILE);
>   localResource.put(String.format(YarnServiceConstants.KEYTAB_LOCATION,
>   service.getName()), keytabRes);
>   LOG.info("Adding " + service.getName() + "'s keytab for "
>   + "localization, uri = " + keytabOnhdfs);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10310) YARN Service - User is able to launch a service with same name

2020-06-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128632#comment-17128632
 ] 

Eric Yang commented on YARN-10310:
--

Patch 001 looks good, pending Jenkins validation.

> YARN Service - User is able to launch a service with same name
> --
>
> Key: YARN-10310
> URL: https://issues.apache.org/jira/browse/YARN-10310
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10310.001.patch
>
>
> As ServiceClient uses UserGroupInformation.getCurrentUser().getUserName() to 
> get user whereas ClientRMService#submitApplication uses 
> UserGroupInformation.getCurrentUser().getShortUserName() to set application 
> username.
> In case of user with name hdfs/had...@hadoop.com. below condition fails
> ClientRMService#getApplications()
> {code:java}
> if (users != null && !users.isEmpty() &&
>   !users.contains(application.getUser())) {
> continue;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10291) Yarn service commands doesn't work when https is enabled in RM

2020-05-26 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116823#comment-17116823
 ] 

Eric Yang commented on YARN-10291:
--

[~BilwaST] Have you tried installing the CA certificate into the Java cacerts 
trust store, or using -Djavax.net.ssl.trustStore= to define the trust store 
path?  Additional code to set up the trust store shouldn't be necessary.  Most 
of the TLS verification can fall back to the JVM default implementation without 
an override.  The odd end of Hadoop SSL is its own implementation of SSL 
support, which does not have reliable accepted-issuer validation.  This is one 
of the reasons that the Jersey client was used: to make Hadoop TLS support 
behave more like standard Java instead of continuing on the forked path of 
ignoring certificate signer validation.
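
For example, setting the standard JVM trust store properties before any HTTPS 
call is made (sketch only; the path and password below are placeholders, and 
the same values can also be passed on the command line via -D flags):

{code:java}
// Illustrative only: point the JVM at an existing trust store so the default
// TLS validation can verify the RM certificate.  Paths are placeholders.
System.setProperty("javax.net.ssl.trustStore",
    "/etc/security/clientKeys/truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "changeit");
{code}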

Let me know if the Java cacerts option works.  It is good to have validation 
in this area.  Thanks

> Yarn service commands doesn't work when https is enabled in RM
> --
>
> Key: YARN-10291
> URL: https://issues.apache.org/jira/browse/YARN-10291
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10291.001.patch
>
>
> when we submit application using command "yarn app -launch sleeper-service 
> ../share/hadoop/yarn/yarn-service-examples/sleeper/sleeper.json" , it throws 
> below exception 
> {code:java}
> com.sun.jersey.api.client.ClientHandlerException: 
> javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> {code}
> We should use WebServiceClient#createClient as it takes care of setting 
> sslfactory when https is called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10228) Yarn Service fails if am java opts contains ZK authentication file path

2020-05-20 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10228:
-
Fix Version/s: 3.4.0
 Target Version/s: 3.4.0
Affects Version/s: 3.3.0

> Yarn Service fails if am java opts contains ZK authentication file path
> ---
>
> Key: YARN-10228
> URL: https://issues.apache.org/jira/browse/YARN-10228
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10228.001.patch
>
>
> If i configure 
> {code:java}
> yarn.service.am.java.opts=-Xmx768m 
> -Djava.security.auth.login.config=/opt/hadoop/etc/jaas-zk.conf
> {code}
> Invalid character error is getting printed .
> This is due to jvm opts validation added in YARN-9718



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10228) Yarn Service fails if am java opts contains ZK authentication file path

2020-05-19 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111369#comment-17111369
 ] 

Eric Yang commented on YARN-10228:
--

[~BilwaST] Thank you for the patch.  +1 LGTM, pending Jenkins reports.

> Yarn Service fails if am java opts contains ZK authentication file path
> ---
>
> Key: YARN-10228
> URL: https://issues.apache.org/jira/browse/YARN-10228
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10228.001.patch
>
>
> If i configure 
> {code:java}
> yarn.service.am.java.opts=-Xmx768m 
> -Djava.security.auth.login.config=/opt/hadoop/etc/jaas-zk.conf
> {code}
> Invalid character error is getting printed .
> This is due to jvm opts validation added in YARN-9718



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10228) Yarn Service fails if am java opts contains ZK authentication file path

2020-05-19 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111307#comment-17111307
 ] 

Eric Yang commented on YARN-10228:
--

[~BilwaST] I think excessive config management for validating one character is 
not good usability design and is prone to more mistakes.  The "/" character is 
mostly safe, unless there are incorrect file permissions on the file system.  
After more thought, I am more comfortable allowing the "/" character.

> Yarn Service fails if am java opts contains ZK authentication file path
> ---
>
> Key: YARN-10228
> URL: https://issues.apache.org/jira/browse/YARN-10228
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
>
> If i configure 
> {code:java}
> yarn.service.am.java.opts=-Xmx768m 
> -Djava.security.auth.login.config=/opt/hadoop/etc/jaas-zk.conf
> {code}
> Invalid character error is getting printed .
> This is due to jvm opts validation added in YARN-9718



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10228) Yarn Service fails if am java opts contains ZK authentication file path

2020-05-18 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110407#comment-17110407
 ] 

Eric Yang commented on YARN-10228:
--

Thank you [~BilwaST] for the bug report.  I think there are two methods to 
resolve this issue:

1.  Use YARN service json with:

{code}
"configuration" : {
"properties" : {
"java.security.auth.login.config" : "/opt/hadoop/etc/jaas-zk.conf"
}
}
{code}

2.  Allow "/" character to pass through in launch command.

{code}
-Pattern pattern = Pattern.compile("[!~#?@*&%${}()<>\\[\\]|\"\\/,`;]");
+Pattern pattern = Pattern.compile("[!~#?@*&%${}()<>\\[\\]|\",`;]");
{code}
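
For reference, a minimal, self-contained check of the relaxed pattern against 
the opts value from the issue description (sketch only, not part of the patch):

{code:java}
import java.util.regex.Pattern;

public class JvmOptsCheck {
  // Approach 2: "/" removed from the set of disallowed characters.
  private static final Pattern INVALID =
      Pattern.compile("[!~#?@*&%${}()<>\\[\\]|\",`;]");

  public static void main(String[] args) {
    String opts = "-Xmx768m "
        + "-Djava.security.auth.login.config=/opt/hadoop/etc/jaas-zk.conf";
    // Prints "ok" because none of the disallowed characters appear in opts.
    System.out.println(INVALID.matcher(opts).find() ? "invalid" : "ok");
  }
}
{code}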

The second approach will allow the specified yarn.service.am.java.opts value to 
accept the "/" character.  I am not 100% sure whether this could open any 
undesired loophole.  Some confirmation from your side would be great.  Thanks

> Yarn Service fails if am java opts contains ZK authentication file path
> ---
>
> Key: YARN-10228
> URL: https://issues.apache.org/jira/browse/YARN-10228
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
>
> If i configure 
> {code:java}
> yarn.service.am.java.opts=-Xmx768m 
> -Djava.security.auth.login.config=/opt/hadoop/etc/jaas-zk.conf
> {code}
> Invalid character error is getting printed .
> This is due to jvm opts validation added in YARN-9718



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-05-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108617#comment-17108617
 ] 

Eric Yang commented on YARN-9809:
-

[~Jim_Brennan] This feature is a great addition to make admin tasks easier for 
large-scale clusters.  What is the latency that we are talking about in the 
health-check script?  If it is a few seconds or less, I agree that there is a 
marginal difference in startup time, and the potential benefit is great.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2020-04-23 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091058#comment-17091058
 ] 

Eric Yang commented on YARN-8417:
-

A Dockerfile can predefine a list of environment variables.  Users can override 
the defaults by supplying {{-e}} or {{--env-file}} parameters.  There is a grey 
area, though, where yarn-default.xml implicitly allows 
{code}JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,HADOOP_YARN_HOME{code}
 variables to pass through.  I can see arguments both for and against 
pass-through.  This seems to be a site-specific decision, and defaulting to 
pass-through is less work to run existing YARN workloads on Docker, either by 
mounting the Java and Hadoop binaries to the same paths as the host, or by 
building the Docker image as a mirror of the host binaries.

YARN service JSON allows users to explicitly override environment variables if 
the site default doesn't match their use case.  This arrangement is similar to 
the way a Docker image has defaults that the user can override.  If users would 
like everything to come from the Docker defaults, they can simply ask the 
system admin to remove the environment whitelist, which lets the system work 
with the Docker image defaults.

It is also easier to catch a user overriding the Java environment variables 
when the YARN service JSON contains the override value.  A Docker image may 
contain a third-party Java, which is much harder to detect.  This is mostly a 
security question of whether a site allows a third-party Java with a custom 
cacerts truststore or not.  There is no easy answer, and it is most likely a 
site-driven question.  My vote is to mark this as won't fix.

> Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker 
> container.
> 
>
> Key: YARN-8417
> URL: https://issues.apache.org/jira/browse/YARN-8417
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Currently, YARN NM passes JAVA_HOME, HDFS_HOME, CLASSPATH environments before 
> launching Docker container no matter if ENTRY_POINT is used or not. This will 
> overwrite environments defined inside Dockerfile (by using \{{ENV}}). For 
> Docker container, it actually doesn't make sense to pass JAVA_HOME, 
> HDFS_HOME, etc. because inside docker image we have a separate Java/Hadoop 
> installed or mounted to exactly same directory of host machine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10219) YARN service placement constraints is broken

2020-04-14 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083402#comment-17083402
 ] 

Eric Yang commented on YARN-10219:
--

Thank you [~prabhujoseph].

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch, YARN-10219.004.patch, YARN-10219.005.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-13 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Attachment: YARN-10219.005.patch

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch, YARN-10219.004.patch, YARN-10219.005.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10219) YARN service placement constraints is broken

2020-04-13 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082437#comment-17082437
 ] 

Eric Yang commented on YARN-10219:
--

[~prabhujoseph] Patch 4 updated the indentation and reduced the number of 
vcores that can be used, so the test passes on more powerful nodes.

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch, YARN-10219.004.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-13 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Attachment: YARN-10219.004.patch

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch, YARN-10219.004.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-13 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Target Version/s: 3.3.0  (was: 3.4.0)
Priority: Blocker  (was: Major)

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10219) YARN service placement constraints is broken

2020-04-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076724#comment-17076724
 ] 

Eric Yang commented on YARN-10219:
--

I am unsure why the unit test failed for patch 002, which only fixed a 
checkstyle issue in patch 001.  The exact anti-affinity test passes in my 
cluster environment, and I am unable to get it to fail locally.  I suspect the 
dynamic detection of the number of vcores to use per node manager in the 
Jenkins environment differs from my laptop.  My laptop is saturated at 4 CPU 
cores, which may prevent additional containers from starting and allowed the 
test case to pass.  I resubmitted patch 002 as patch 003 for a retest.  If it 
fails again, I will add a vcore restriction to this test case to prevent 
failures on more powerful hardware.

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-06 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Attachment: YARN-10219.003.patch

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch, 
> YARN-10219.003.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Attachment: YARN-10219.002.patch

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch, YARN-10219.002.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Affects Version/s: 3.1.1
   3.1.2
   3.2.1
   3.1.3

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Affects Version/s: 3.3.0
   3.1.0
   3.2.0

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10219:
-
Attachment: YARN-10219.001.patch

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-10219:


Assignee: Eric Yang

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10219) YARN service placement constraints is broken

2020-04-02 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074159#comment-17074159
 ] 

Eric Yang commented on YARN-10219:
--

Placement constraints can define AFFINITY and ANTI_AFFINITY based on node 
labels, node attributes, or allocation tags.  The syntax used for matching node 
labels and node attributes is different from the syntax used for matching 
allocation tags.

Node attributes and node labels use the documented syntax:

{code}
PlacementConstraints
.targetNodeAttribute(PlacementConstraints.NODE,
NodeAttributeOpCode.EQ,
PlacementConstraints.PlacementTargets
.nodeAttribute("java", "1.8")))
{code}

Allocation tags use the following syntax:

{code}
PlacementConstraints
.targetIn(yarnServiceConstraint.getScope().getValue(),
targetExpressions.toArray(new TargetExpression[0]))
.build();
{code}

The correct expression of the placement constraint is supposed to have separate 
constraint policies for each use case, for example:

{code:java}
      "placement_policy": {
        "constraints": [
          {
            "type": "AFFINITY",
            "scope": "NODE",
            "node_partitions": [
              "label2"
            ]
          },
          {
            "type": "ANTI_AFFINITY",
            "scope": "NODE",
            "target_tags": [
              "pong"
            ]
          }
        ]
      }, {code}

"pong" containers are spread out on nodes with label2 partitions.  The existing 
documented syntax is self conflicting to cause problem in code logic.

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Moved] (YARN-10219) YARN service placement constraints is broken

2020-04-01 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang moved HIVE-23125 to YARN-10219:
-

   Key: YARN-10219  (was: HIVE-23125)
Issue Type: Bug  (was: Task)
   Project: Hadoop YARN  (was: Hive)

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10152) Fix findbugs warnings in hadoop-yarn-applications-mawo-core module

2020-02-20 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041563#comment-17041563
 ] 

Eric Yang commented on YARN-10152:
--

+1

> Fix findbugs warnings in hadoop-yarn-applications-mawo-core module
> --
>
> Key: YARN-10152
> URL: https://issues.apache.org/jira/browse/YARN-10152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>
> {noformat}
>     FindBugs :
>        
> module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-mawo/hadoop-yarn-applications-mawo-core
>        Class org.apache.hadoop.applications.mawo.server.common.TaskStatus 
> implements Cloneable but does not define or use clone method At 
> TaskStatus.java:does not define or use clone method At TaskStatus.java:[lines 
> 39-346]
>        Equals method for 
> org.apache.hadoop.applications.mawo.server.worker.WorkerId assumes the 
> argument is of type WorkerId At WorkerId.java:the argument is of type 
> WorkerId At WorkerId.java:[line 114]
>        
> org.apache.hadoop.applications.mawo.server.worker.WorkerId.equals(Object) 
> does not check for null argument At WorkerId.java:null argument At 
> WorkerId.java:[lines 114-115] {noformat}
> Detail: 
> [https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/1414/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-mawo_hadoop-yarn-applications-mawo-core-warnings.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10113) SystemServiceManagerImpl fails to initialize

2020-02-11 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034548#comment-17034548
 ] 

Eric Yang commented on YARN-10113:
--

[~kyungwan nam] Thank you for the patch.  The patch looks good.  Can we have a 
test case to cover this?

> SystemServiceManagerImpl fails to initialize 
> -
>
> Key: YARN-10113
> URL: https://issues.apache.org/jira/browse/YARN-10113
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10113-001.patch, YARN-10113-002.patch
>
>
> RM fails to start with SystemServiceManagerImpl failed to initialize.
> {code}
> 2020-01-28 12:20:43,631 WARN  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:881)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1257)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1298)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1294)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1294)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
> ... 5 more
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1645)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1219)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1235)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1202)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1181)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1177)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1189)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 16 more
> 

[jira] [Commented] (YARN-10113) SystemServiceManagerImpl fails to initialize

2020-02-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034042#comment-17034042
 ] 

Eric Yang commented on YARN-10113:
--

[~prabhujoseph] The patch seems to be creating another configuration object 
instead of using the one passed in from serviceInit.  Could this be problematic 
in other places that have similar overrides and parameter passing?  It might be 
good to use a clone of the conf object instead of doing new Configuration(), 
for performance reasons.  Thoughts?
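
For illustration, the clone idea could look roughly like this (the overridden 
property is only a placeholder, not what the patch does):

{code:java}
import org.apache.hadoop.conf.Configuration;

final class ConfCloneSketch {
  // Copy the configuration handed in from serviceInit instead of discarding it
  // with "new Configuration()"; the copy keeps every value already loaded and
  // only the service-local override is changed.
  static Configuration withLocalOverride(Configuration conf) {
    Configuration copy = new Configuration(conf);         // copy constructor
    copy.setBoolean("fs.hdfs.impl.disable.cache", true);  // placeholder override
    return copy;
  }
}
{code}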

> SystemServiceManagerImpl fails to initialize 
> -
>
> Key: YARN-10113
> URL: https://issues.apache.org/jira/browse/YARN-10113
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10113-001.patch, YARN-10113-002.patch
>
>
> RM fails to start with SystemServiceManagerImpl failed to initialize.
> {code}
> 2020-01-28 12:20:43,631 WARN  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:881)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1257)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1298)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1294)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1294)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
> ... 5 more
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1645)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1219)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1235)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1202)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1181)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1177)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1189)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)

[jira] [Resolved] (YARN-8472) YARN Container Phase 2

2020-01-22 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-8472.
-
Fix Version/s: 3.3.0
 Release Note: 
- Improved debugging of Docker containers on YARN
- Improved security for running Docker containers
- Improved cgroup management for Docker containers.
   Resolution: Fixed

> YARN Container Phase 2
> --
>
> Key: YARN-8472
> URL: https://issues.apache.org/jira/browse/YARN-8472
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
>
> In YARN-3611, we have implemented basic Docker container support for YARN.  
> This story is the next phase to improve container usability.
> Several area for improvements are:
>  # Software defined network support
>  # Interactive shell to container
>  # User management sss/nscd integration
>  # Runc/containerd support
>  # Metrics/Logs integration with Timeline service v2 
>  # Docker container profiles
>  # Docker cgroup management



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-22 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021545#comment-17021545
 ] 

Eric Yang commented on YARN-9292:
-

>From today's YARN Docker community meeting, we have decided to abandon this 
>patch.  There is possibilities that AM can fail over a node which has 
>different latest tag than previous node.  The frame of reference to latest tag 
>is relative to the node where AM is running.  If there are inconsistency in 
>the cluster, this patch will not solve the consistency problem.  Newly spawned 
>AM will use a different sha id that maps to latest tag, which leads to 
>inconsistent sha id used by the same application.

The ideal design is to have the YARN client discover what the latest tag is 
referencing, then propagate that information to the rest of the job.  
Unfortunately, there is no connection between YARN and wherever the docker 
registry might be running.  Hence, it is not possible to implement this 
properly for the YARN and Docker integration.  The community settled on 
documenting this wrinkle and recommending, as a best practice, to avoid using 
the latest tag.

For runc containers, it will be possible to use HDFS as the source of truth to 
look up the global hash designation for the runc image.  The YARN client can 
query HDFS for the latest tag, and it will be consistent on all nodes.  This 
will add some extra protocol interactions between the YARN client and the RM to 
solve this problem along the lines of the ideal design.
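
For illustration only (this is not part of any patch, and the registry host 
below is an assumption), the lookup in that ideal design would look roughly 
like this if the client knew which registry serves the image:

{code}
# Ask the registry (Docker Registry HTTP API v2) which digest :latest points to,
# then submit the job against the digest instead of the mutable tag.
DIGEST=$(curl -sI \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  https://registry.example.com/v2/library/centos/manifests/latest \
  | awk 'tolower($1) == "docker-content-digest:" {print $2}' | tr -d '\r')
echo "latest currently resolves to ${DIGEST}"
{code}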

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch, YARN-9292.008.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-16 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017364#comment-17017364
 ] 

Eric Yang commented on YARN-9292:
-

[~ebadger] {quote}I do have some questions on why we can't move the AM into a 
docker container though. What is it that is special about the AM that we need 
to run it directly on the host? What does it depend on the host for? We should 
be able to use the distributed cache to localize any libraries/jars that it 
needs. And as far as nscd/sssd, those can be bind-mounted into the container 
via configs. If they don't have nscd/sssd then they can bind-mount /etc/passwd. 
Since they would've been using the host anyway, this is no different.{quote}

YARN native service was a code merge from Apache Slider, and it was developed 
to run directly in a YARN container, like MapReduce tasks.  If the AM docker 
image is a mirror image of the host system, the AM can run in a docker 
container.  The AM code still depends on all of the Hadoop client libraries, 
Hadoop configuration, and Hadoop environment variables.

{quote}As far as the docker image itself, why does Hadoop need to provide an 
image? Everything needed can be provided via the distributed cache or 
bind-mounts, right? I don't see why we need a specialized image that is tied to 
Hadoop. You just need an image with Java and Bash.{quote}

From a 10,000 feet point of view, yes, the AM only requires Java and Bash.  If 
Hadoop provides the image, our users can deploy the image without worrying 
about how to create a docker image that mirrors the host structure.  Without 
Hadoop supplying an image and an agreed-upon image format, it is up to the 
system admin's interpretation of where the Hadoop client configuration and 
client binaries are located.  He/she can run the job with ENTRY point mode 
disabled and bind mount the Hadoop configuration and binaries.  As I recall, 
this is the less secure approach to running the container, because it requires 
bind mounting a writable Hadoop log directory into the container for the 
launcher script to write its output.  This is a hassle with no container 
benefit, and the method still exposes the host-level environment and binaries 
to the container.  There are maybe 5 people on planet Earth who know how to 
wire this together, and I am unlikely to suggest this approach.
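
For illustration only (the image name and host paths below are assumptions, not 
anything from a patch), the bind-mount wiring described above looks roughly 
like this for a container launched through the docker runtime:

{code}
# Minimal sketch: the image carries only Java and Bash; Hadoop client bits,
# configuration and /etc/passwd are bind mounted read-only from the host.
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command "hadoop version" \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=openjdk:8 \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop/conf:/etc/hadoop/conf:ro,/usr/lib/hadoop:/usr/lib/hadoop:ro" \
  -num_containers 1
{code}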

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch, YARN-9292.008.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-13 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014687#comment-17014687
 ] 

Eric Yang commented on YARN-9292:
-

[~ebadger] {quote}the image wouldn't have been pulled to that node before the 
task is run, right? That's my concern here.{quote}

The concern about inconsistent docker images spread across the cluster is a 
valid one.  There are two possibilities: the docker image exists on the AM 
node, or it doesn't.
# In the case where the image exists on the AM node, launching the docker image 
using the sha from the AM node will result in a warning or failure on nodes 
that carry a different image.  This depends on the application anti-affinity 
policy.  The error message about failing to launch the docker container using 
the sha signature should give administrators some clues for fixing the docker 
images on the other nodes.
# If the request is for an image that doesn't exist on the AM node, it will 
proceed with the latest tag.  The images used will stay consistent if YARN-9184 
is enabled.  If YARN-9184 is turned off, it will follow the same pattern as 1.

{quote}The command you ran doesn't even work for my version of Docker.{quote}

I think my mouse cursor jumped when I copied and pasted the information.  I 
couldn't find where it changed the output.  Your syntax is the correct one to 
use.  Sorry for the confusion.

{quote}Reading around on the internet, it looks like Docker takes the manifest 
sha and then recalculates the digest with some other stuff added on (maybe the 
tag data?) to get a new digest. I'm worried that this could break if we 
randomly choose the last sha. For example, maybe centos:7 is installed 
everywhere, but centos:latest is only installed on this one node by accident. 
If we grab the centos:latest sha, it won't work on the rest of the nodes in the 
cluster because the sha won't match the tag of the image on those nodes, even 
though they have the same manifest hash. Or maybe it only does the check based 
on the manifest hash. I can't seem to reproduce this with my version of Docker, 
so I can't test out what actually happens.{quote}

When the list contains multiple entries, they all point to the same image; only 
the repository id is different.  At that point, using any of the repo digest 
ids has the same outcome.  This was tested carefully before I went ahead with 
the implementation.

This patch has the most impact when the system admin does not use a docker 
registry to manage docker images and has inconsistent docker latest images 
sitting on the nodes.  Such admins may get an extra nudge when launching an 
application with inconsistent images and an anti-affinity policy defined.  The 
majority of users are not affected by this change.  If the AM picks an older 
image than the latest on the docker registry, the application docker images 
still remain uniform.  There is a possibility that more of the same containers 
end up on the same node.  However, this should be fine when the user does not 
specify placement policy rules.

I think this problem has been dissected into as small a piece as possible, and 
I haven't come up with a more elegant solution to keep the docker image 
consistent with the latest tag while supporting setups both with and without a 
docker registry.  Let me know if any new ideas come to mind.
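
As an illustration of case 1 above (this is not code from the patch, and the 
digest is a placeholder), an administrator can check whether a node already 
carries the digest the AM resolved:

{code}
# DIGEST_FROM_AM is a placeholder for the sha256 digest the AM resolved.
# A non-zero exit status means the digest is not known on this node, which is
# the situation that produces the warning or failure described above.
if docker image inspect "centos@sha256:${DIGEST_FROM_AM}" > /dev/null 2>&1; then
  echo "digest present on this node"
else
  echo "digest missing on this node; pull or fix the local image"
fi
{code}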

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch, YARN-9292.008.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If 

[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013212#comment-17013212
 ] 

Eric Yang commented on YARN-9292:
-

Patch 008 fixes the spacing issues suggested by [~ebadger].

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch, YARN-9292.008.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-10 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9292:

Attachment: YARN-9292.008.patch

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch, YARN-9292.008.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013200#comment-17013200
 ] 

Eric Yang commented on YARN-9292:
-

{quote}I see. Why isn't the AM run inside of a Docker container?{quote}

Very good question, and the answer is somewhat complicated.  For the AM to run 
in a docker container, the AM must have identical Hadoop client bits (Java, 
Hadoop, etc.) and credential mapping (nscd/sssd).  Many of those pieces could 
not be moved cleanly into a Docker container in the first implementation of 
YARN native service (LLAP/Slider-like projects) because of resistance to 
building an agreed-upon docker image as part of the Hadoop project.  The AM 
remains outside of a docker container for simplicity.

{quote}The node might have an old Docker image on it. It would be nice to get 
the image information from the registry and only fall back to the local node's 
version if the registry lookup fails. An indirect way to do this would be to do 
a {{docker pull}} before calling {{docker images}}.{quote}

The same can be argued for people who do not want automatic pulling of the 
docker image to latest.  As a result, there is a flag implemented in YARN-9184. 
The flag decides whether the behavior is based on the local latest or the 
repository latest.  This change should work in combination with YARN-9184.

{quote}If we can hit the docker registry directly via its REST API then we 
won't need to invoke the container-executor at all and we can avoid this 
problem. This looks like it should be fairly trivial, but I don't know how much 
more difficult secure registries would be.{quote}

We don't contact the docker registry directly, nor do we have code to connect 
to a secure docker registry.  I think it is too risky to contact the registry 
directly because the registry could be a private registry defined in the user's 
docker config.json.  It would be going down a rabbit hole to follow this path.

{quote}Do you have documentation handy for docker image inspect that talks 
about the fuzzy matching?{quote}

Sorry, I don't have any, but here is an example that you can try locally:

{code}$ docker images | grep centos
centos    7         9f38484d220f   10 months ago   202MB
centos    latest    9f38484d220f   10 months ago   202MB{code}

Suppose that you have used both the centos:7 and centos:latest tags, and they 
both point to the same image.  Listing the repository digests for that image 
produces two different hashes:

{code}$ docker image inspect centos -f "{{.RepoDigests}}"
[centos@sha256:a799dd8a2ded4a83484bbae769d97655392b3f86533ceb7dd96bbac929809f3c 
centos@sha256:b5e66c4651870a1ad435cd75922fe2cb943c9e973a9673822d1414824a1d0475]{code}

Using either hash is fine; they resolve to the same image.  It is somewhat 
fuzzy because they are aliases of one another.
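
To make the alias relationship concrete (illustration only, using the two 
digests printed above), either digest reference resolves to the same local 
image ID:

{code}
# Both repo digests point at image ID 9f38484d220f, so either one can be used
# to reference the image.
docker image inspect centos@sha256:a799dd8a2ded4a83484bbae769d97655392b3f86533ceb7dd96bbac929809f3c -f '{{.Id}}'
docker image inspect centos@sha256:b5e66c4651870a1ad435cd75922fe2cb943c9e973a9673822d1414824a1d0475 -f '{{.Id}}'
{code}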

{quote}A space before and after != and ==. If the purpose of omitting the 
spaces is to show operation bundling, then I would just add () around the two 
separate comparisons around the &&{quote}

I see, I will update accordingly.  Thanks

{quote}Thanks for clearing up the quoting issue. But I'm still getting what 
appears to be a less than ideal result. Is this expected behavior?{quote}

You may need to upgrade your docker version.  The output looks like this on my 
system:

{code}$ docker images --format="{{json .}}" --filter="dangling=false"
{"Containers":"N/A","CreatedAt":"2019-11-11 11:23:08 -0500 
EST","CreatedSince":"2 months 
ago","Digest":"\u003cnone\u003e","ID":"7317640d555e","Repository":"prom/prometheus","SharedSize":"N/A","Size":"130MB","Tag":"latest","UniqueSize":"N/A","VirtualSize":"130.2MB"}
{"Containers":"N/A","CreatedAt":"2019-07-15 16:14:12 -0400 
EDT","CreatedSince":"5 months 
ago","Digest":"\u003cnone\u003e","ID":"771e0613a264","Repository":"ozonesecure_kdc","SharedSize":"N/A","Size":"127MB","Tag":"latest","UniqueSize":"N/A","VirtualSize":"127.4MB"}
{"Containers":"N/A","CreatedAt":"2019-07-15 00:04:39 -0400 
EDT","CreatedSince":"5 months 
ago","Digest":"\u003cnone\u003e","ID":"48b0eebc96f0","Repository":"jaegertracing/all-in-one","SharedSize":"N/A","Size":"48.7MB","Tag":"latest","UniqueSize":"N/A","VirtualSize":"48.71MB"}
{"Containers":"N/A","CreatedAt":"2019-07-02 14:56:10 -0400 
EDT","CreatedSince":"6 months 
ago","Digest":"\u003cnone\u003e","ID":"f38d9c7e49be","Repository":"flokkr/hadoop","SharedSize":"N/A","Size":"503MB","Tag":"2.7.7","UniqueSize":"N/A","VirtualSize":"503.3MB"}
{"Containers":"N/A","CreatedAt":"2019-06-25 14:27:08 -0400 
EDT","CreatedSince":"6 months 
ago","Digest":"\u003cnone\u003e","ID":"c912f3f026ed","Repository":"grafana/grafana","SharedSize":"N/A","Size":"249MB","Tag":"latest","UniqueSize":"N/A","VirtualSize":"248.5MB"}
{"Containers":"N/A","CreatedAt":"2019-06-24 19:37:36 -0400 
EDT","CreatedSince":"6 months 

[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-10 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013075#comment-17013075
 ] 

Eric Yang commented on YARN-9292:
-

Patch 007 rebases to current trunk with the DOCKER_IMAGE_REGEX fix suggested by 
[~ebadger].

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-10 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9292:

Attachment: YARN-9292.007.patch

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch, YARN-9292.007.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-09 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012300#comment-17012300
 ] 

Eric Yang commented on YARN-9292:
-

[~ebadger] Thanks for the review.  Here is my feedback:

{quote}Doesn't the container know what image it was started with in its 
environment?{quote}

ServiceScheduler runs as part of the application master for a YARN service.  
The YARN AM is not containerized.  The docker command to resolve the image 
digest id happens before any docker container is launched.  The lookup for the 
docker image is done on the node where the AM is running.  We use the sha256 
digest from the AM node as the authoritative signature so the application has 
an equal chance of acquiring the docker digest id on any node manager.

{quote}If we don't care about the container and just want to know what the sha 
of the image:tag is, then I agree with Chandni Singh that we don't need to use 
the containerId.{quote}

The container ID is used by the container executor to properly permission the 
working directory and to generate the .cmd file for the container-executor 
binary, and all output and exit codes are stored in the container id directory. 
Without a container ID, we would need to craft a completely separate path to 
acquire privileges to launch docker commands, which is extra code duplication 
and does not follow the security practice that was baked in to prevent 
parameter hijacking.  I chose to follow the existing process to avoid code 
bloat.

{quote}But if there are many, couldn't that not be the correct one?{quote}

The output of docker image inspect [image-id] -f "{{.RepoDigests}}" may contain 
similar names, like local/centos and centos at the same time, due to fuzzy 
matching.  The for loop matches the exact name instead of doing prefix 
matching.  Hence, the matched entry is always the correct one.
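
As a shell-level analogue of that exact-name match (illustration only; the 
patch does the equivalent in Java), only entries whose repository name is 
exactly "centos" are kept:

{code}
# local/centos@sha256:... never shadows centos@sha256:... because the match is
# anchored on the full repository name.
for d in $(docker image inspect centos -f '{{range .RepoDigests}}{{println .}}{{end}}'); do
  case "$d" in
    centos@sha256:*) echo "matched: $d" ;;
  esac
done
{code}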

{quote}I think we should import this instead of including the full path{quote}

Sorry, can't do.  There is another import that references 
org.apache.hadoop.yarn.service.component.Component, which prevents use of the 
same name.

{quote}Spacing issues on the operators.{quote}

Checkstyle did not find a spacing issue with the existing patch, and the issue 
is not clear to me.  Care to elaborate?

{quote}The first part of both of these regexes is identical. I think we should 
create a subregex and append to it to avoid having to make changes in multiple 
places in the future. One is the image followed by a tag and the other is an 
image followed by a sha. Should be easy to do.{quote}

Sure, I will compact this when I rebase this patch to trunk.

{quote}The else clause syntax doesn't seem to work for me. Did I do something 
wrong?{quote}

Yes, unlike a C exec call, when running the docker command on the cli the 
format string needs to be quoted to prevent shell expansion:

{code}docker images --format="{{json .}}" --filter="dangling=false"{code}

For clarity, we are using:
{code}docker image inspect [image-id] -f "{{.RepoDigests}}"{code}
to find the real digest hash, due to bugs with the docker images output.

{quote}Another possible solution is to have the AM get the sha256 hash of the 
image that it is running in and then passing that sha to all of the containers 
that it starts. This would move the query into the Hadoop cluster itself.{quote}

I think the patch implements what you are suggesting: Hadoop queries into the 
cluster itself via a node manager REST endpoint.

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be 

[jira] [Commented] (YARN-9052) Replace all MockRM submit method definitions with a builder

2020-01-09 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012080#comment-17012080
 ] 

Eric Yang commented on YARN-9052:
-

Code cleanup and performance optimization usually go hand in hand to ensure the 
net gain is positive.  Some basic level of comprehension shows up in how the 
code executes: if code is hard for a human to understand, it tends to run 
poorly on the machine as well.  Although machines can juggle a much larger set 
of variables, poorly understood code can result in bugs.  [~snemeth] has been 
doing code rewrites for Hadoop for many years.  There have been a few hiccups, 
but I think there are positive net gains from his help, however slight they may 
seem.  Rewrites do put people on edge close to release time, because some code 
has been baked well during the development cycle.  It would be helpful to show 
some performance numbers for the net result to boost the confidence of the rest 
of the community.  In this case, I think Sunil's pain point about submitApp is 
covered by this issue.  Other issues should be discussed separately.  We are 
near the 3.3.0 release; unless we have good solid data points for a performance 
gain, I would suggest slowing down on the code rewrites for now.

> Replace all MockRM submit method definitions with a builder
> ---
>
> Key: YARN-9052
> URL: https://issues.apache.org/jira/browse/YARN-9052
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: 
> YARN-9052-004withlogs-patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt,
>  YARN-9052-testlogs003-justfailed.txt, 
> YARN-9052-testlogs003-patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt,
>  YARN-9052-testlogs004-justfailed.txt, YARN-9052.001.patch, 
> YARN-9052.002.patch, YARN-9052.003.patch, YARN-9052.004.patch, 
> YARN-9052.004.withlogs.patch, YARN-9052.005.patch, YARN-9052.006.patch, 
> YARN-9052.007.patch, YARN-9052.008.patch, YARN-9052.009.patch, 
> YARN-9052.009.patch, YARN-9052.testlogs.002.patch, 
> YARN-9052.testlogs.002.patch, YARN-9052.testlogs.003.patch, 
> YARN-9052.testlogs.patch
>
>
> MockRM has 31 definitions of submitApp, most of them having more than 
> acceptable number of parameters, ranging from 2 to even 22 parameters, which 
> makes the code completely unreadable.
> On top of unreadability, it's very hard to follow what RmApp will be produced 
> for tests as they often pass a lot of empty / null values as parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8283) [Umbrella] MaWo - A Master Worker framework on top of YARN Services

2020-01-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011068#comment-17011068
 ] 

Eric Yang commented on YARN-8283:
-

[~brahmareddy] This looks like a feature that will not be closed by the 3.3.0 
release.  There are checkstyle errors in the patches, which is the reason I did 
not commit them.  Python 2.7 reached end of life on Jan 1, 2020, so this 
contribution will need some updates to keep it going.  Please skip this feature 
in the release notes.  Thanks

> [Umbrella] MaWo - A Master Worker framework on top of YARN Services
> ---
>
> Key: YARN-8283
> URL: https://issues.apache.org/jira/browse/YARN-8283
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: [Design Doc] [YARN-8283] MaWo - A Master Worker 
> framework on top of YARN Services.pdf
>
>
> There is a need for an application / framework to handle Master-Worker 
> scenarios. There are existing frameworks on YARN which can be used to run a 
> job in distributed manner such as Mapreduce, Tez, Spark etc. But 
> master-worker use-cases usually are force-fed into one of these existing 
> frameworks which have been designed primarily around data-parallelism instead 
> of generic Master Worker type of computations.
> In this JIRA, we’d like to contribute MaWo - a YARN Service based framework 
> that achieves this goal. The overall goal is to create an app that can take 
> an input job specification with tasks, their durations and have a Master dish 
> the tasks off to a predetermined set of workers. The components will be 
> responsible for making sure that the tasks and the overall job finish in 
> specific time durations.
> We have been using a version of the MaWo framework for running unit tests of 
> Hadoop in a parallel manner on an existing Hadoop YARN cluster. What 
> typically takes 10 hours to run all of Hadoop project’s unit-tests can finish 
> under 20 minutes on a MaWo app of about 50 containers!
> YARN-3307 was an original attempt at this but through a first-class YARN app. 
> In this JIRA, we instead use YARN Service for orchestration so that our code 
> can focus on the core Master Worker paradigm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9414) Application Catalog for YARN applications

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-9414.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

[~brahmareddy] I moved the enhancement work to the next release.  This feature 
can go GA without the enhancements.

> Application Catalog for YARN applications
> -
>
> Key: YARN-9414
> URL: https://issues.apache.org/jira/browse/YARN-9414
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN Appstore.pdf, YARN-Application-Catalog.pdf
>
>
> YARN native services provides web services API to improve usability of 
> application deployment on Hadoop using collection of docker images.  It would 
> be nice to have an application catalog system which provides an editorial and 
> search interface for YARN applications.  This improves usability of YARN for 
> managing the life cycle of applications.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9523) Build application catalog docker image as part of hadoop dist build

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9523:

Parent Issue: YARN-10078  (was: YARN-9414)

> Build application catalog docker image as part of hadoop dist build
> ---
>
> Key: YARN-9523
> URL: https://issues.apache.org/jira/browse/YARN-9523
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9523.001.patch
>
>
> It would be nice to make Application catalog docker image as part of the 
> distribution.  The suggestion is to change from:
> {code:java}
> mvn clean package -Pnative,dist,docker{code}
> to
> {code:java}
> mvn clean package -Pnative,dist{code}
> User can still build tarball only using:
> {code:java}
> mvn clean package -DskipDocker -DskipTests -DskipShade -Pnative,dist{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8533) Multi-user support for application catalog

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8533:

Parent Issue: YARN-10078  (was: YARN-9414)

> Multi-user support for application catalog
> --
>
> Key: YARN-8533
> URL: https://issues.apache.org/jira/browse/YARN-8533
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Eric Yang
>Priority: Major
>
> The current application catalog will launch applications as the user who runs 
> the application catalog.  This allows personalized application catalog.  It 
> would be nice if the application catalog can launch application as the end 
> user who is viewing the application catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9499) Support application catalog high availability

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9499:

Parent Issue: YARN-10078  (was: YARN-9414)

> Support application catalog high availability
> -
>
> Key: YARN-9499
> URL: https://issues.apache.org/jira/browse/YARN-9499
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Priority: Major
>
> Application catalog is mostly a stateless web application.  It depends on 
> backend services to store states.  At this time, Solr is a single instance 
> server running in the same application catalog container.  It is possible to 
> externalize application catalog data to Solr Cloud to remove the single 
> instance Solr server.  This improves high availability of application catalog.
> This task is to focus on how to configure connection to external Solr cloud 
> for application catalog container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8532) Consolidate Yarn UI2 Service View with Application Catalog

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8532:

Parent Issue: YARN-10078  (was: YARN-9414)

> Consolidate Yarn UI2 Service View with Application Catalog
> --
>
> Key: YARN-8532
> URL: https://issues.apache.org/jira/browse/YARN-8532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services, yarn-ui-v2
>Reporter: Eric Yang
>Priority: Major
>
> There are some overlaps between YARN UI2, and Application Catalog.  The same 
> deployment feature exists in YARN UI2 and Application Catalog.  It would be 
> nice to present the application catalog as the first view to end user to 
> speed up deployment of application.  UI2 is a monitoring and resource 
> allocation and prioritization UI.  It might be more user friendly to transfer 
> UI2 deployment feature into Application Catalog to improve usability for both 
> end user who launches the apps, and system administrator who monitors the 
> apps usage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8531) Link container logs from App detail page

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8531:

Parent Issue: YARN-10078  (was: YARN-9414)

> Link container logs from App detail page
> 
>
> Key: YARN-8531
> URL: https://issues.apache.org/jira/browse/YARN-8531
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-ui-v2
>Reporter: Eric Yang
>Priority: Major
>
> It would be nice to have the container log files for a running application 
> viewable from the application detail page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9414) Application Catalog for YARN applications

2020-01-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010893#comment-17010893
 ] 

Eric Yang commented on YARN-9414:
-

Move some enhancement work to next release.

> Application Catalog for YARN applications
> -
>
> Key: YARN-9414
> URL: https://issues.apache.org/jira/browse/YARN-9414
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN Appstore.pdf, YARN-Application-Catalog.pdf
>
>
> YARN native services provides web services API to improve usability of 
> application deployment on Hadoop using collection of docker images.  It would 
> be nice to have an application catalog system which provides an editorial and 
> search interface for YARN applications.  This improves usability of YARN for 
> managing the life cycle of applications.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10078) YARN Application Catalog enhancement

2020-01-08 Thread Eric Yang (Jira)
Eric Yang created YARN-10078:


 Summary: YARN Application Catalog enhancement
 Key: YARN-10078
 URL: https://issues.apache.org/jira/browse/YARN-10078
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Yang


This story continues the development work started in YARN-9414.  Some 
enhancement for YARN application catalog can make the application more user 
friendly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9137) Get the IP and port of the docker container and display it on WEB UI2

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9137:

Parent: (was: YARN-8472)
Issue Type: Wish  (was: Sub-task)

> Get the IP and port of the docker container and display it on WEB UI2
> -
>
> Key: YARN-9137
> URL: https://issues.apache.org/jira/browse/YARN-9137
> Project: Hadoop YARN
>  Issue Type: Wish
>Reporter: Xun Liu
>Priority: Major
>
> 1) When using a container network such as Calico, the IP of the container is 
> not the IP of the host, but is allocated in the private network, and the 
> different containers can be directly connected.
>  Exposing the services in the container through a reverse proxy such as Nginx 
> makes it easy for users to view the IP and port on WEB UI2 to use the 
> services in the container, such as Tomcat, TensorBoard, and so on.
>  2) When not using a container network such as Calico, the container also has 
> its own container port.
> So you need to display the IP and port of the docker container on WEB UI2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7994) Add support for network-alias in docker run for user defined networks

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-7994.
-
Resolution: Later

This feature doesn't seem to be making progress in container phase 2.  Marking 
it for later.

> Add support for network-alias in docker run for user defined networks 
> --
>
> Key: YARN-7994
> URL: https://issues.apache.org/jira/browse/YARN-7994
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>  Labels: Docker
>
> Docker Embedded DNS supports DNS resolution for containers by one or more of 
> its configured {{--network-alias}} within a user-defined network. 
> DockerRunCommand should support this option for DNS resolution to work 
> through docker embedded DNS 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8744) In some cases docker kill is used to stop non-privileged containers instead of sending the signal directly

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-8744.
-
Resolution: Incomplete

Nice to have, but inconsequential detail.  There is no plan to fix this. 

> In some cases docker kill is used to stop non-privileged containers instead 
> of sending the signal directly
> --
>
> Key: YARN-8744
> URL: https://issues.apache.org/jira/browse/YARN-8744
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> With YARN-8706, stopping docker containers was achieved by 
> 1. parsing the user specified {{STOPSIGNAL}} via docker inspect
> 2. executing {{docker kill --signal=}}
> Quoting [~ebadger]
> {quote}
> Additionally, for non-privileged containers, we don't need to call docker 
> kill. Instead, we can follow the code in handleContainerKill() and send the 
> signal directly. I think this code could probably be combined, since at this 
> point handleContainerKill() and handleContainerStop() will be doing the same 
> thing. The only difference is that the STOPSIGNAL will be used for the stop.
> {quote}
> To achieve the above, we need native code that accepts the name of the signal 
> rather than the value (number) of the signal. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8472) YARN Container Phase 2

2020-01-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010873#comment-17010873
 ] 

Eric Yang commented on YARN-8472:
-

[~brahmareddy] Thank you for the heads up.  We can close this umbrella for 
3.3.0.  I think the only outstanding issue is YARN-9292, which would be good to 
have in 3.3.0 but is not absolutely required.  I will ask [~billie] to review 
to see if we can make the window.

> YARN Container Phase 2
> --
>
> Key: YARN-8472
> URL: https://issues.apache.org/jira/browse/YARN-8472
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>
> In YARN-3611, we have implemented basic Docker container support for YARN.  
> This story is the next phase to improve container usability.
> Several areas for improvement are:
>  # Software defined network support
>  # Interactive shell to container
>  # User management sssd/nscd integration
>  # Runc/containerd support
>  # Metrics/Logs integration with Timeline service v2 
>  # Docker container profiles
>  # Docker cgroup management



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-08 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9292:

Target Version/s: 3.3.0

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2020-01-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010811#comment-17010811
 ] 

Eric Yang commented on YARN-9292:
-

[~billie] Can you help with the review of this issue?  If I recall correctly, 
the container ID is used to determine the latest docker image tag used by the 
application.  Without the container ID, it will not compute the latest image 
correctly for the given application.  It would be nice to have this issue 
closed for the Hadoop 3.3.0 release.  Thanks

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8672) TestContainerManager#testLocalingResourceWhileContainerRunning occasionally times out

2020-01-07 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009944#comment-17009944
 ] 

Eric Yang commented on YARN-8672:
-

[~Jim_Brennan] No objection to the backport.  [~ebadger] Could you shepherd the 
process if the precommit build passes?  Thanks

> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally 
> times out
> -
>
> Key: YARN-8672
> URL: https://issues.apache.org/jira/browse/YARN-8672
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Jason Darrell Lowe
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8672-branch-2.10.001.patch, YARN-8672.001.patch, 
> YARN-8672.002.patch, YARN-8672.003.patch, YARN-8672.004.patch, 
> YARN-8672.005.patch, YARN-8672.006.patch, YARN-8672.007.patch, 
> YARN-8672.008.patch
>
>
> Precommit builds have been failing in 
> TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been 
> able to reproduce the problem without any patch applied if I run the test 
> enough times.  It looks like something is removing container tokens from the 
> nmPrivate area just as a new localizer starts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2020-01-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009079#comment-17009079
 ] 

Eric Yang commented on YARN-9956:
-

Thank you [~prabhujoseph] for the patch.
+1 for patch 5.  Committing shortly.

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9956-001.patch, YARN-9956-002.patch, 
> YARN-9956-003.patch, YARN-9956-004.patch, YARN-9956-005.patch
>
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> 

[jira] [Commented] (YARN-10018) container-executor: possible -1 return value of fork() is not always checked

2019-12-12 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995229#comment-16995229
 ] 

Eric Yang commented on YARN-10018:
--

+1 looks good to me.

> container-executor: possible -1 return value of fork() is not always checked
> 
>
> Key: YARN-10018
> URL: https://issues.apache.org/jira/browse/YARN-10018
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10018-001.patch, YARN-10018-001.patch
>
>
> There are some places in the container-executor native code where the 
> {{fork()}} call is not handled properly. This call can fail with -1, but 
> sometimes the if branch needed to validate that it succeeded is missing.
> Also, at one location, the return value is defined as an {{int}}, not 
> {{pid_t}}. It's better to handle this transparently and change it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10019) container-executor: misc improvements in child processes and exec() calls

2019-12-12 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995225#comment-16995225
 ] 

Eric Yang commented on YARN-10019:
--

+1 looks good to me.

> container-executor: misc improvements in child processes and exec() calls
> -
>
> Key: YARN-10019
> URL: https://issues.apache.org/jira/browse/YARN-10019
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
> Attachments: YARN-10019-001.patch, YARN-10019-002.patch
>
>
> There are a couple of improvements that we can do in container-executor 
> regarding how we exit from child processes and how we handle failed exec() 
> calls:
> 1. If we're in the child code path and we detect an erroneous condition, the 
> usual way is to simply call {{_exit()}}. Normal {{exit()}} occurs in the 
> parent. Calling {{_exit()}} prevents flushing stdio buffers twice, and any 
> cleanup logic registered with {{atexit()}} or {{on_exit()}} will run only 
> once.
> 2. There's code like {{if (execlp(script_file_dest, script_file_dest, NULL) 
> != 0) ...}} where the check is not necessary. Exec functions are not supposed 
> to return. If they do, it's definitely an error, so there is no need to check 
> the return value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-12-09 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991851#comment-16991851
 ] 

Eric Yang commented on YARN-9956:
-

[~prabhujoseph] Something is still wrong with patch 003 in pre-commit test.  
Can you double check?  Thanks

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9956-001.patch, YARN-9956-002.patch, 
> YARN-9956-003.patch
>
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> 

[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-12-05 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989383#comment-16989383
 ] 

Eric Yang commented on YARN-9561:
-

+1 for patch 15.  Thank you [~ebadger].

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, 
> YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch, 
> YARN-9561.015.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-12-05 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989378#comment-16989378
 ] 

Eric Yang commented on YARN-9956:
-

Thank you for the patch, [~prabhujoseph].  Overall, the patch looks good.

The failed unit test looks a bit concerning.  I am not sure how it is related 
to the changes.  Can you confirm this is not an issue?


> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9956-001.patch, YARN-9956-002.patch
>
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at 

[jira] [Updated] (YARN-10008) Add a warning message for accessing untrusted application master UI for webproxy

2019-12-01 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-10008:
-
Description: 
Quote from Hadoop document: "In addition to this the proxy also tries to reduce 
the impact that a malicious AM could have on a user. It primarily does this by 
stripping out cookies from the user, and replacing them with a single cookie 
providing the user name of the logged in user. This is because most web based 
authentication systems will identify a user based off of a cookie. By providing 
this cookie to an untrusted application it opens up the potential for an 
exploit. If the cookie is designed properly that potential should be fairly 
minimal, but this is just to reduce that potential attack vector."

The YARN web application proxy passes the user name to the application master.  
A YARN application master UI can be developed by a third party to look like the 
resource manager and fool the user.  It would be nice to add a warning message 
before users access an untrusted application master UI via the YARN application 
web proxy.  Users can decide for themselves whether they should trust the 
"Untrusted Enterprise developer" before proceeding.

  was:YARN web application proxy passes the user credential to application 
master.  YARN application master UI can be developed by third party.  It would 
be nice to add a warning message to warn user from accessing the untrusted 
application master UI via YARN application web proxy.  User can decide for 
themselves if they should trust "Untrusted Enterprise developer" before 
proceeding.


> Add a warning message for accessing untrusted application master UI for 
> webproxy
> 
>
> Key: YARN-10008
> URL: https://issues.apache.org/jira/browse/YARN-10008
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Yang
>Priority: Major
>
> Quote from Hadoop document: "In addition to this the proxy also tries to 
> reduce the impact that a malicious AM could have on a user. It primarily does 
> this by stripping out cookies from the user, and replacing them with a single 
> cookie providing the user name of the logged in user. This is because most 
> web based authentication systems will identify a user based off of a cookie. 
> By providing this cookie to an untrusted application it opens up the 
> potential for an exploit. If the cookie is designed properly that potential 
> should be fairly minimal, but this is just to reduce that potential attack 
> vector."
> The YARN web application proxy passes the user name to the application master.  
> A YARN application master UI can be developed by a third party to look like the 
> resource manager and fool the user.  It would be nice to add a warning message 
> before users access an untrusted application master UI via the YARN application 
> web proxy.  Users can decide for themselves whether they should trust the 
> "Untrusted Enterprise developer" before proceeding.
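
As a purely hypothetical illustration of the warning idea (no such filter exists 
in the web proxy today, and the parameter name is invented), an interstitial 
could sit in front of the proxied AM UI and require an explicit confirmation:

{code}
// Hypothetical servlet filter, for illustration only: show a warning page and
// require a "proceed=true" query parameter (an invented name) before letting
// the web proxy forward the user to the application master UI.
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UntrustedAmWarningFilter implements Filter {

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp,
      FilterChain chain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) resp;
    if (!"true".equals(request.getParameter("proceed"))) {
      // No confirmation yet: render the warning instead of proxying.
      response.setContentType("text/html");
      response.getWriter().println("<p>WARNING: this page is served by an "
          + "application master written by an untrusted enterprise developer, "
          + "not by the resource manager. Add proceed=true to continue at "
          + "your own risk.</p>");
      return;
    }
    chain.doFilter(req, resp);   // user confirmed; let the proxy continue
  }

  @Override
  public void init(FilterConfig filterConfig) {
  }

  @Override
  public void destroy() {
  }
}
{code}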



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10008) Add a warning message for accessing untrusted application master UI for webproxy

2019-12-01 Thread Eric Yang (Jira)
Eric Yang created YARN-10008:


 Summary: Add a warning message for accessing untrusted application 
master UI for webproxy
 Key: YARN-10008
 URL: https://issues.apache.org/jira/browse/YARN-10008
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Yang


YARN web application proxy passes the user credential to application master.  
YARN application master UI can be developed by third party.  It would be nice 
to add a warning message to warn user from accessing the untrusted application 
master UI via YARN application web proxy.  User can decide for themselves if 
they should trust "Untrusted Enterprise developer" before proceeding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9983) Typo in YARN Service overview documentation

2019-11-19 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977764#comment-16977764
 ] 

Eric Yang commented on YARN-9983:
-

+1 

Thank you for the patch, [~denes.gerencser].

> Typo in YARN Service overview documentation
> ---
>
> Key: YARN-9983
> URL: https://issues.apache.org/jira/browse/YARN-9983
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.1.1
>Reporter: Denes Gerencser
>Assignee: Denes Gerencser
>Priority: Trivial
> Attachments: YARN-9983.001.patch
>
>
> There is a typo in 
> https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html
>  : "A restful API-server to for users to interact with YARN to deploy/manage 
> their services..." should be "A restful API-server for users to interact with 
> YARN to deploy/manage their services...".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9923) Introduce HealthReporter interface and implement running Docker daemon checker

2019-11-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975492#comment-16975492
 ] 

Eric Yang commented on YARN-9923:
-

{quote}Just to be sure, we can enforce the code to have no more than like 4 
threads (that means running at max 4 individual scripts) and no more, if you 
don't reject this solution.{quote}

4 threads sounds like a reasonable way to limit the resource usage problem.  
Users also have the flexibility to implement only one script.  Hence, I have no 
objection if you want to support the multi-script approach.
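
To make the capping idea concrete, a minimal sketch of bounding the scripts to 
four worker threads with a fixed-size executor; the class and method names here 
are hypothetical, not from the attached patches:

{code}
// Illustrative only: at most four health check scripts run in parallel,
// regardless of how many scripts the user configures.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class BoundedHealthScriptRunner {

  private static final int MAX_SCRIPT_THREADS = 4;

  private final ExecutorService pool =
      Executors.newFixedThreadPool(MAX_SCRIPT_THREADS);

  public void runAll(List<Runnable> scriptRunners) {
    // Extra scripts simply queue behind the four worker threads.
    for (Runnable scriptRunner : scriptRunners) {
      pool.submit(scriptRunner);
    }
  }

  public void stop() throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.SECONDS);
  }
}
{code}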

> Introduce HealthReporter interface and implement running Docker daemon checker
> --
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: : No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.
> 
> A new interface called {{HealthChecker}} is introduced which is used in the 
> {{NodeHealthCheckerService}}. Currently existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement this giving a clear 
> abstraction to the node's health. The {{DockerHealthChecker}} implements this 
> new interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9923) Introduce HealthReporter interface and implement running Docker daemon checker

2019-11-14 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974676#comment-16974676
 ] 

Eric Yang commented on YARN-9923:
-

[~adam.antal] 

{quote}do you mean that there was no public or hadoop-public API for 
health-checking on purpose?{quote}

I don't know if it was intentionally omitted, but I admit I didn't spend much 
time thinking about this API.  A pluggable health check interface is good.  
There is no doubt that a health check interface is a good feature that lets 
other people plug in their own health check implementation.  I only disagree 
that using Java to check Docker is a good pattern, due to missing permissions 
to access privileged operations.

{quote}One improvement I can think of is to enable to set these things on a per 
script basis (allowing multiple scripts to run paralel).{quote}

Personally, I would prefer to avoid the multi-script approach.  Apache Commons 
Logging is one of the real lessons I learned from Hadoop: having too many 
runaway threads makes logging expensive and hard to debug when tracking down 
where the failure is.  We have moved to slf4j to reduce some of that bloat.  A 
single script that runs under 30 seconds on a 15-minute interval is preferable 
to most system administrators.  We don't want to burn too many CPU cycles on 
healthcheck scripts.  The script itself can be organized into functions to keep 
things tidy, and some of the functions could potentially move to Hadoop libexec 
scripts to keep the parts hackable and tidy.

{quote}For the sake of completeness a use case: In a cluster where Dockerized 
nodes with GPU are running TF jobs and nodes may depend on the availability of 
the Docker daemon as well as the GPU device, as of now we can only be sure that 
the node is working fine, if a container allocation is started on that node. 
{quote}

If the config toggle via environment variables works, the node manager can 
decide which of the health check functions to run based on its own config.  
This can prevent containers from being scheduled on an unhealthy node in the 
use case above.  I think the outcome could be a better overall solution.  
Wouldn't you agree?
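
As an illustration of the config-toggle idea (the variable and class names here 
are hypothetical, not from the patch), the node manager could hand its own 
setting to the health check script through the script's environment:

{code}
// Illustrative only: the node manager decides whether the Docker portion of
// the health check should run and exposes that decision as an environment
// variable; the script can guard its Docker section with
//   if [ "$NM_DOCKER_HEALTH_CHECK" = "true" ]; then ... fi
import java.io.File;
import java.io.IOException;

public final class HealthScriptLauncher {

  public static Process launch(File script, boolean dockerCheckEnabled)
      throws IOException {
    ProcessBuilder pb = new ProcessBuilder("bash", script.getAbsolutePath());
    // NM_DOCKER_HEALTH_CHECK is an invented name used only for this sketch.
    pb.environment().put("NM_DOCKER_HEALTH_CHECK",
        String.valueOf(dockerCheckEnabled));
    pb.redirectErrorStream(true);
    return pb.start();
  }
}
{code}

A single script keeps working for clusters that never enable Docker, and 
backward compatibility with existing user healthcheck scripts is preserved 
because the extra variable is simply ignored.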

> Introduce HealthReporter interface and implement running Docker daemon checker
> --
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: : No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.
> 
> A new interface called {{HealthChecker}} is introduced which is used in the 
> {{NodeHealthCheckerService}}. Currently existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement this giving a clear 
> abstraction to the node's health. The {{DockerHealthChecker}} implements this 
> new interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9923) Introduce HealthReporter interface and implement running Docker daemon checker

2019-11-14 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974392#comment-16974392
 ] 

Eric Yang commented on YARN-9923:
-

[~adam.antal], thank you for the patch.  A number of people share the same 
concerns about the current implementation.  The main issues are:

# It would be better written as a health checker script, so that proper 
privileges can be obtained to check whether the process is alive.
# Worries about future maintenance of DockerHealthCheckerService.  It sets a 
precedent of writing complex, system-specific logic in Java that would be 
easier to implement in scripts.
# Future code may end up with too many timer threads that run at different 
intervals and use more system resources than necessary.

Could we change DockerHealthCheckerService into a node manager healthcheck 
script?  The toggle for enabling the Docker check can be based on environment 
variables passed down to the node manager healthcheck script.  This approach 
would set a good example of how to pass config toggles to the healthcheck 
script, while maintaining backward compatibility with users' existing 
healthcheck scripts.

Let us know your thoughts.  Thanks

> Introduce HealthReporter interface and implement running Docker daemon checker
> --
>
> Key: YARN-9923
> URL: https://issues.apache.org/jira/browse/YARN-9923
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: : No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.
> 
> A new interface called {{HealthChecker}} is introduced which is used in the 
> {{NodeHealthCheckerService}}. Currently existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement this giving a clear 
> abstraction to the node's health. The {{DockerHealthChecker}} implements this 
> new interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-08 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970646#comment-16970646
 ] 

Eric Yang commented on YARN-9561:
-

[~ebadger] What is the right way to run test_runc_util with patch 11?  Cetest 
is crashing on my machine:

{code}
mvn clean test -Dtest=cetest -Pnative
{code}

Maven output looks like this:
{code}
[INFO] ---
[INFO]  C M A K E B U I L D E RT E S T
[INFO] ---
[INFO] cetest: running 
/home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/test/cetest
 --gtest_filter=-Perf. 
--gtest_output=xml:/home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/surefire-reports/TEST-cetest.xml
[INFO] with extra environment variables {}
[INFO] STATUS: ERROR CODE 139 after 5 millisecond(s).
[INFO] ---
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 01:05 min
[INFO] Finished at: 2019-11-08T18:29:30-05:00
[INFO] Final Memory: 56M/575M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.hadoop:hadoop-maven-plugins:3.3.0-SNAPSHOT:cmake-test (cetest) on 
project hadoop-yarn-server-nodemanager: Test 
/home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/test/cetest
 returned ERROR CODE 139 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
{code}

Running the test manually reveals:

{code}
$ ./cetest 
Determining user details
Requested user eyang is not whitelisted and has id 501,which is below the 
minimum allowed 1000

Setting NM UID
Segmentation fault
{code}

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968632#comment-16968632
 ] 

Eric Yang commented on YARN-9956:
-

[~prabhujoseph] can you help out with this issue?  Thanks

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)

[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968629#comment-16968629
 ] 

Eric Yang commented on YARN-9956:
-

The krb5kdc log shows ApiServiceClient acquiring a TGS for both resource 
managers and also attempting one for a non-existent RM address:

{code}
Oct 30 22:01:41 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for HTTP/host1.example@example.com
Oct 30 22:01:41 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for krbtgt/example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for HTTP/host2.example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for krbtgt/example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: LOOKING_UP_SERVER: authtime 0,  
hb...@example.com for HTTP/0.0@example.com, Server not found in Kerberos 
database
{code}
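
For illustration, a minimal sketch (an assumption about one possible direction, 
not the committed patch) of a pre-flight check that reports the missing property 
by name instead of letting the Kerberos layer fail against a default 0.0.0.0 
address:

{code}
// Hypothetical pre-flight check: resolve the per-RM webapp address from the
// HA configuration and fail fast with a clear message when it is not set.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

public final class RmWebAddressCheck {

  public static String webAppAddressFor(Configuration conf, String rmId)
      throws IOException {
    boolean httpsOnly =
        "HTTPS_ONLY".equalsIgnoreCase(conf.get("yarn.http.policy", "HTTP_ONLY"));
    String key = (httpsOnly
        ? "yarn.resourcemanager.webapp.https.address."
        : "yarn.resourcemanager.webapp.address.") + rmId;
    String address = conf.getTrimmed(key);
    if (address == null || address.isEmpty() || address.startsWith("0.0.0.0")) {
      throw new IOException("Cannot determine the RM web address for " + rmId
          + "; please set " + key + " explicitly in yarn-site.xml");
    }
    return address;
  }
}
{code}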


> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> 

[jira] [Created] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)
Eric Yang created YARN-9956:
---

 Summary: Improve connection error message for YARN ApiServerClient
 Key: YARN-9956
 URL: https://issues.apache.org/jira/browse/YARN-9956
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Yang


In HA environment, yarn.resourcemanager.webapp.address configuration is 
optional.  ApiServiceClient may produce confusing error message like this:

{code}
19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
host1.example.com:8090
19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
host2.example.com:8090
19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
GSSException: No valid credentials provided (Mechanism level: Server not found 
in Kerberos database (7) - LOOKING_UP_SERVER)
at 
java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
at 
java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
at 
java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
Caused by: KrbException: Server not found in Kerberos database (7) - 
LOOKING_UP_SERVER
at 
java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
at 
java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
at 
java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
at 
java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
at 
java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
at 
java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
at 
java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
... 15 more
Caused by: KrbException: Identifier doesn't match expected value (906)
at 
java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at 
java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
at 
java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
at 
java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
... 21 more
19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
java.io.IOException: java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
Caused by: java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
... 6 more
Caused by: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Server 

[jira] [Commented] (YARN-9953) YARN Service dependency should be configurable for each app

2019-11-05 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967829#comment-16967829
 ] 

Eric Yang commented on YARN-9953:
-

The programmable API for YARN service is meant to maintain backward 
compatibility at the yarnfile level instead of at private Java API calls.  YARN 
service depends on the API server, which is part of the Resource Manager 
process.  There is a fair bit of dependency between the YARN framework and the 
YARN service application.  By exposing the YARN service dependency as a 
configurable version, it would be harder to manage upgrades and would create 
more obstacles for future versions of the YARN framework, because an older 
version of YARN service uses internal YARN APIs that may not work in a future 
version of YARN.  Given those reasons, we can't move forward with this patch.  
Sorry.

> YARN Service dependency should be configurable for each app
> ---
>
> Key: YARN-9953
> URL: https://issues.apache.org/jira/browse/YARN-9953
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9953.001.patch
>
>
> Currently, YARN Service dependency can be set as yarn.service.framework.path.
> But, It works only as configured in RM.
> This makes it impossible for the user to choose their YARN Service dependency.
> It should be configurable for each app.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-10-28 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961562#comment-16961562
 ] 

Eric Yang commented on YARN-9562:
-

[~ebadger] Can we do something to reduce the number of checkstyle warnings?  
Most of them are fixable.  Thanks

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


