[jira] [Commented] (YARN-10721) YARN Service containers are restarted when RM failover

2021-04-04 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314485#comment-17314485
 ] 

kyungwan nam commented on YARN-10721:
-

I've confirmed that the attached patch solves this problem in my cluster.

[~csingh], [~billie]
I believe this issue is related to YARN-6168 and YARN-7565.
Can you take a look at this issue?
Thanks


> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10721.001.patch, YARN-10721.002.patch
>
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.






[jira] [Updated] (YARN-10721) YARN Service containers are restarted when RM failover

2021-03-30 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10721:

Attachment: YARN-10721.002.patch

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10721.001.patch, YARN-10721.002.patch
>
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.






[jira] [Updated] (YARN-10721) YARN Service containers are restarted when RM failover

2021-03-29 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10721:

Description: 
Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.

I think it is related to YARN-6168.

  was:
Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.


> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10721.001.patch
>
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.






[jira] [Assigned] (YARN-10721) YARN Service containers are restarted when RM failover

2021-03-29 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-10721:
---

Attachment: YARN-10721.001.patch
  Assignee: kyungwan nam

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10721.001.patch
>
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.






[jira] [Created] (YARN-10721) YARN Service containers are restarted when RM failover

2021-03-29 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10721:
---

 Summary: YARN Service containers are restarted when RM failover
 Key: YARN-10721
 URL: https://issues.apache.org/jira/browse/YARN-10721
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.






[jira] [Comment Edited] (YARN-10603) Failed to reinitialize for recovered container

2021-01-31 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276087#comment-17276087
 ] 

kyungwan nam edited comment on YARN-10603 at 2/1/21, 6:33 AM:
--

I've attached a patch. This patch works well in our cluster.
Please review and comment.
Thanks.


was (Author: kyungwan nam):
I've attached a patch.
Please review and comment.
Thanks

> Failed to reinitialize for recovered container
> --
>
> Key: YARN-10603
> URL: https://issues.apache.org/jira/browse/YARN-10603
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10603.001.patch
>
>
> A container reinitialization request does not work after restarting the NM.
> I found the following problems:
> - When a recovered container is terminated, it always ends with either 
> CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
> - The container’s *recoveredStatus* is set at the time of NM recovery and is 
> never changed, even after the container is terminated.
> As a result, a newly reinitialized container will be launched as a recovered 
> container, and that does not work.






[jira] [Updated] (YARN-10603) Failed to reinitialize for recovered container

2021-01-31 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10603:

Attachment: YARN-10603.001.patch

> Failed to reinitialize for recovered container
> --
>
> Key: YARN-10603
> URL: https://issues.apache.org/jira/browse/YARN-10603
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10603.001.patch
>
>
> A container reinitialization request does not work after restarting the NM.
> I found the following problems:
> - When a recovered container is terminated, it always ends with either 
> CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
> - The container’s *recoveredStatus* is set at the time of NM recovery and is 
> never changed, even after the container is terminated.
> As a result, a newly reinitialized container will be launched as a recovered 
> container, and that does not work.






[jira] [Created] (YARN-10603) Failed to reinitialize for recovered container

2021-01-31 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10603:
---

 Summary: Failed to reinitialize for recovered container
 Key: YARN-10603
 URL: https://issues.apache.org/jira/browse/YARN-10603
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


A container reinitialization request does not work after restarting the NM.

I found the following problems:

- When a recovered container is terminated, it always ends with either 
CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
- The container’s *recoveredStatus* is set at the time of NM recovery and is 
never changed, even after the container is terminated.
As a result, a newly reinitialized container will be launched as a recovered 
container, and that does not work.






[jira] [Updated] (YARN-10567) Support parallelism for YARN Service

2021-01-11 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10567:

Attachment: YARN-10567.001.patch

> Support parallelism for YARN Service
> 
>
> Key: YARN-10567
> URL: https://issues.apache.org/jira/browse/YARN-10567
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-10567.001.patch
>
>
> YARN Service supports job-like workloads by using the "restart_policy" introduced in 
> YARN-8080.
> But we cannot set how many containers can be launched concurrently.
> This feature is something like "parallelism" in Kubernetes.
> https://kubernetes.io/docs/concepts/workloads/controllers/job/






[jira] [Created] (YARN-10567) Support parallelism for YARN Service

2021-01-11 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10567:
---

 Summary: Support parallelism for YARN Service
 Key: YARN-10567
 URL: https://issues.apache.org/jira/browse/YARN-10567
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: kyungwan nam


YARN Service supports job-like workloads by using the "restart_policy" introduced in YARN-8080.
But we cannot set how many containers can be launched concurrently.
This feature is something like "parallelism" in Kubernetes.
https://kubernetes.io/docs/concepts/workloads/controllers/job/






[jira] [Commented] (YARN-10305) Lost system-credentials when restarting RM

2020-06-19 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140308#comment-17140308
 ] 

kyungwan nam commented on YARN-10305:
-

Hi. [~eyang] [~prabhujoseph]
I have confirmed that this problem is solved with this patch.
Could you take a look at this patch?
Thanks


> Lost system-credentials when restarting RM
> --
>
> Key: YARN-10305
> URL: https://issues.apache.org/jira/browse/YARN-10305
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10305.001.patch
>
>
> System-credentials, introduced in YARN-2704, make it possible to keep 
> long-running apps alive.
> I’ve met a situation where the system-credentials were lost when restarting the RM.
> Since then, if an app’s AM is stopped, restarting the AM fails because the 
> NMs do not have the HDFS delegation token, which is needed for resource 
> localization.
> The app has a couple of delegation tokens, including a timeline-server token and 
> an HDFS delegation token.
> When restarting the RM, the RM will request a new HDFS delegation token for an app 
> that was submitted long ago (this was fixed by YARN-5098).
> But if an app has a couple of delegation tokens and an exception occurs in the 
> token processed first, the remaining tokens are not processed.
> I think that is why the system-credentials are lost.
> Here are the RM’s logs at the time of restarting the RM.
> {code}
> 2020-05-19 14:25:05,712 WARN  security.DelegationTokenRenewer 
> (DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to 
> add the application to the delegation token renewer on recovery.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
> renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
> sequenceNumber=2193, masterKeyId=340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [403], message 
> [org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to 
> renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
> renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
> sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 
> 10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250)
> at 
> 

[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-06-17 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138226#comment-17138226
 ] 

kyungwan nam commented on YARN-10311:
-

Hi, I've met the same issue in YARN-9905.
I wanted to separate the HDFS used for log aggregation under HDFS federation, but 
it doesn't work due to this issue.
Thanks~




> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently, YARN services support tokens for a single name service only. We can add a new 
> conf called "yarn.service.hdfs-servers" to support this.






[jira] [Commented] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs

2020-06-15 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136305#comment-17136305
 ] 

kyungwan nam commented on YARN-9905:


This looks the same as YARN-10311. Closing as duplicate.

> yarn-service is failed to setup application log if app-log-dir is not 
> default-fs
> 
>
> Key: YARN-9905
> URL: https://issues.apache.org/jira/browse/YARN-9905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9905.001.patch, YARN-9905.002.patch
>
>
> Currently, yarn-service takes a token for the default namenode only.
> This might cause an authentication failure under HDFS federation.
> How to reproduce:
>  - kerberized cluster
>  - multiple namespaces via HDFS federation
>  - yarn.nodemanager.remote-app-log-dir is set to a namespace that is not the 
> default-fs
> Here are the nodemanager logs at that time.
> {code:java}
> 2019-10-15 11:52:50,217 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new 
> application reference for app application_1569373267731_9571
> 2019-10-15 11:52:50,217 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(655)) - Application 
> application_1569373267731_9571 transitioned from NEW to INITING
> ...
>  Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]
> at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
> at org.apache.hadoop.ipc.Client.call(Client.java:1457)
> at org.apache.hadoop.ipc.Client.call(Client.java:1367)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
> at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254)
> at 
> 
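The reproduction steps above point yarn.nodemanager.remote-app-log-dir at a namespace other than the default-fs. A minimal yarn-site.xml sketch of such a setup, with hypothetical federation nameservices (default-fs on ns1, aggregated logs on ns2):

{code}
<configuration>
  <!-- Sketch only: aggregated application logs are written to ns2 while the
       cluster's fs.defaultFS resolves to ns1. ns1/ns2 and the path are placeholders. -->
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://ns2/app-logs</value>
  </property>
</configuration>
{code}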

[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM

2020-06-02 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10305:
---

 Summary: Lost system-credentials when restarting RM
 Key: YARN-10305
 URL: https://issues.apache.org/jira/browse/YARN-10305
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


System-credentials, introduced in YARN-2704, make it possible to keep 
long-running apps alive.
I’ve met a situation where the system-credentials were lost when restarting the RM.
Since then, if an app’s AM is stopped, restarting the AM fails because the NMs 
do not have the HDFS delegation token, which is needed for resource localization.


The app has a couple of delegation tokens, including a timeline-server token and 
an HDFS delegation token.
When restarting the RM, the RM will request a new HDFS delegation token for an app that 
was submitted long ago (this was fixed by YARN-5098).
But if an app has a couple of delegation tokens and an exception occurs in the 
token processed first, the remaining tokens are not processed.
I think that is why the system-credentials are lost.

Here are the RM’s logs at the time of restarting the RM.
{code}
2020-05-19 14:25:05,712 WARN  security.DelegationTokenRenewer 
(DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to 
add the application to the delegation token renewer on recovery.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [403], message 
[org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to 
renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 
10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:512)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626)
at java.security.AccessController.doPrivileged(Native Method)
at 

[jira] [Created] (YARN-10267) Add description, version as allocationTags for YARN Service

2020-05-14 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10267:
---

 Summary: Add description, version as allocationTags for YARN 
Service   
 Key: YARN-10267
 URL: https://issues.apache.org/jira/browse/YARN-10267
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam


The applicationTags for a YARN Service contain only the service name.

That makes it difficult to identify what kind of app it is.

It would be good if the description and version were added to the applicationTags.






[jira] [Created] (YARN-10262) Support application ACLs for YARN Service

2020-05-10 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10262:
---

 Summary: Support application ACLs for YARN Service
 Key: YARN-10262
 URL: https://issues.apache.org/jira/browse/YARN-10262
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, a user can access only their own yarn-service.
There’s no way to access another user’s yarn-service.
That makes it difficult for users to collaborate.
Users should be able to set application ACLs for a yarn-service,
similar to mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job for 
MapReduce.
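For reference, the MapReduce job ACLs mentioned above are set in the job configuration roughly as follows; the ACL values here are only examples, and the corresponding yarn-service property names are exactly what this issue proposes to define:

{code}
<configuration>
  <!-- Existing MapReduce job-level ACLs, shown only as the model being referenced.
       ACL values use the Hadoop format: comma-separated users, a space, then
       comma-separated groups. -->
  <property>
    <name>mapreduce.job.acl-view-job</name>
    <value>user1,user2 group1</value>
  </property>
  <property>
    <name>mapreduce.job.acl-modify-job</name>
    <value>user1 group1</value>
  </property>
</configuration>
{code}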






[jira] [Commented] (YARN-10196) destroying app leaks zookeeper connection

2020-04-28 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094308#comment-17094308
 ] 

kyungwan nam commented on YARN-10196:
-

Hi. [~prabhujoseph], this definitely seems like a bug.
Can you please take a look at this?
The patch works well in my cluster.
Thanks~

> destroying app leaks zookeeper connection
> -
>
> Key: YARN-10196
> URL: https://issues.apache.org/jira/browse/YARN-10196
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10196.001.patch, YARN-10196.002.patch
>
>
> When destroying an app, the curatorClient in ServiceClient is started, but it is 
> never closed.






[jira] [Updated] (YARN-10196) destroying app leaks zookeeper connection

2020-04-28 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10196:

Attachment: YARN-10196.002.patch

> destroying app leaks zookeeper connection
> -
>
> Key: YARN-10196
> URL: https://issues.apache.org/jira/browse/YARN-10196
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10196.001.patch, YARN-10196.002.patch
>
>
> When destroying an app, the curatorClient in ServiceClient is started, but it is 
> never closed.






[jira] [Created] (YARN-10206) Service stuck in the STARTED state when it has a component having no instance

2020-03-24 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10206:
---

 Summary: Service stuck in the STARTED state when it has a 
component having no instance
 Key: YARN-10206
 URL: https://issues.apache.org/jira/browse/YARN-10206
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam



* 'compb' has no instance, i.e. its 'number_of_containers' is 0.
* 'compb' has a dependency on 'compa'.

{code}
"components": [
  {
    "name": "compa",
    "number_of_containers": 1,
    "dependencies": []
  },
  {
    "name": "compb",
    "number_of_containers": 0,
    "dependencies": [
      "compa"
    ]
  }
]
{code}
When launching the service, it gets stuck in the STARTED state.







[jira] [Created] (YARN-10203) Stuck in express_upgrading if there is any component which has no instance

2020-03-20 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10203:
---

 Summary: Stuck in express_upgrading if there is any component 
which has no instance
 Key: YARN-10203
 URL: https://issues.apache.org/jira/browse/YARN-10203
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


I was trying to "express upgrade" which introduced in YARN-8298.
https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceUpgrade.html

but, service state stuck in EXPRESS_UPGRADING.
It happens only If there is any component that has no instance. 
("number_of_containers" : 0)

the component which has no instance should be excepted from upgrade target







[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission

2020-03-19 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062421#comment-17062421
 ] 

kyungwan nam commented on YARN-10034:
-

[~prabhujoseph], [~adam.antal]
Thank you for the review and commit!

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch, 
> YARN-10034.003.patch
>
>
> When a node is decommissioned, the allocation tags attached to the node 
> are not removed.
> I could see that the allocation tags are revived when recommissioning the node.
> The RM removes allocation tags only after the NM confirms the container releases 
> (YARN-8511), but a decommissioned NM does not connect to the RM anymore.
> Once a node is decommissioned, the allocation tags attached to the node 
> should be removed immediately.






[jira] [Updated] (YARN-10184) NPE happens in NMClient when reinitializeContainer

2020-03-18 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10184:

Attachment: YARN-10184.002.patch

> NPE happens in NMClient when reinitializeContainer
> --
>
> Key: YARN-10184
> URL: https://issues.apache.org/jira/browse/YARN-10184
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10184.001.patch, YARN-10184.002.patch
>
>
> An NPE happens in NMClient when upgrading a yarn-service app whose AM has been 
> restarted.
> Here is AM’s log at the time of the NPE.
> {code}
> 2020-02-20 16:43:35,962 [Container  Event Dispatcher] ERROR 
> yarn.YarnUncaughtExceptionHandler - Thread Thread[Container  Event 
> Dispatcher,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$1.run(NMClientAsyncImpl.java:172)
> 2020-02-20 16:43:36,398 [AMRM Callback Handler Thread] WARN  
> service.ServiceScheduler - Container 
> container_e58_1581930783345_1954_01_06 Completed. No component instance 
> exists. exitStatus=-100. diagnostics=Container released by application 
> {code}
> NMClient keeps track of containers once they have been started.
> But when the AM is restarted, NMClient is re-initialized and the previously tracked 
> containers are lost.
> After that, an NPE happens whenever reinitializeContainer is requested.






[jira] [Updated] (YARN-10196) destroying app leaks zookeeper connection

2020-03-13 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10196:

Attachment: YARN-10196.001.patch

> destroying app leaks zookeeper connection
> -
>
> Key: YARN-10196
> URL: https://issues.apache.org/jira/browse/YARN-10196
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10196.001.patch
>
>
> When destroying an app, the curatorClient in ServiceClient is started, but it is 
> never closed.






[jira] [Created] (YARN-10196) destroying app leaks zookeeper connection

2020-03-13 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10196:
---

 Summary: destroying app leaks zookeeper connection
 Key: YARN-10196
 URL: https://issues.apache.org/jira/browse/YARN-10196
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


When destroying an app, the curatorClient in ServiceClient is started, but it is never 
closed.






[jira] [Created] (YARN-10190) Typo in NMClientAsyncImpl

2020-03-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10190:
---

 Summary: Typo in NMClientAsyncImpl
 Key: YARN-10190
 URL: https://issues.apache.org/jira/browse/YARN-10190
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


Small typo in NMClientAsyncImpl.java

* ReInitializeContainerEvevnt -> ReInitializeContainerEvent
* containerLaunchContex -> containerLaunchContext






[jira] [Created] (YARN-10184) NPE happens in NMClient when reinitializeContainer

2020-03-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10184:
---

 Summary: NPE happens in NMClient when reinitializeContainer
 Key: YARN-10184
 URL: https://issues.apache.org/jira/browse/YARN-10184
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


An NPE happens in NMClient when upgrading a yarn-service app whose AM has been 
restarted.
Here is AM’s log at the time of the NPE.

{code}
2020-02-20 16:43:35,962 [Container  Event Dispatcher] ERROR 
yarn.YarnUncaughtExceptionHandler - Thread Thread[Container  Event 
Dispatcher,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$1.run(NMClientAsyncImpl.java:172)
2020-02-20 16:43:36,398 [AMRM Callback Handler Thread] WARN  
service.ServiceScheduler - Container container_e58_1581930783345_1954_01_06 
Completed. No component instance exists. exitStatus=-100. diagnostics=Container 
released by application 
{code}

NMClient keeps track of containers once they have been started.
But when the AM is restarted, NMClient is re-initialized and the previously tracked 
containers are lost.
After that, an NPE happens whenever reinitializeContainer is requested.








[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission

2020-02-21 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041690#comment-17041690
 ] 

kyungwan nam commented on YARN-10034:
-

[~prabhujoseph], Can you please take a look at this?
 Thanks

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch, 
> YARN-10034.003.patch
>
>
> When a node is decommissioned, the allocation tags attached to the node 
> are not removed.
> I could see that the allocation tags are revived when recommissioning the node.
> The RM removes allocation tags only after the NM confirms the container releases 
> (YARN-8511), but a decommissioned NM does not connect to the RM anymore.
> Once a node is decommissioned, the allocation tags attached to the node 
> should be removed immediately.






[jira] [Commented] (YARN-10119) Cannot reset the AM failure count for YARN Service

2020-02-20 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041424#comment-17041424
 ] 

kyungwan nam commented on YARN-10119:
-

Thanks [~prabhujoseph] for your review and commit.

> Cannot reset the AM failure count for YARN Service
> --
>
> Key: YARN-10119
> URL: https://issues.apache.org/jira/browse/YARN-10119
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
>  Labels: Reviewed
> Fix For: 3.3.0
>
> Attachments: YARN-10119.001.patch
>
>
> Currently, YARN Service does not support resetting the AM failure count, a feature 
> introduced in YARN-611.
> Since the AM failure count is never reset, it will eventually reach 
> yarn.service.am-restart.max-attempts and the app will be stopped.
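For context, the limit referenced above can be set in yarn-site.xml; a minimal sketch, with an example value rather than a confirmed default:

{code}
<configuration>
  <!-- Example only: cap on YARN Service AM restart attempts. Because the failure
       count is never reset (this issue), failures accumulate toward this limit
       over the whole lifetime of the service. -->
  <property>
    <name>yarn.service.am-restart.max-attempts</name>
    <value>20</value>
  </property>
</configuration>
{code}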






[jira] [Commented] (YARN-10119) Cannot reset the AM failure count for YARN Service

2020-02-13 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036621#comment-17036621
 ] 

kyungwan nam commented on YARN-10119:
-

[~prabhujoseph], Can you please take a look at this?
Thanks!

> Cannot reset the AM failure count for YARN Service
> --
>
> Key: YARN-10119
> URL: https://issues.apache.org/jira/browse/YARN-10119
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10119.001.patch
>
>
> Currently, YARN Service does not support resetting the AM failure count, a feature 
> introduced in YARN-611.
> Since the AM failure count is never reset, it will eventually reach 
> yarn.service.am-restart.max-attempts and the app will be stopped.






[jira] [Commented] (YARN-9521) RM failed to start due to system services

2020-02-12 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035913#comment-17035913
 ] 

kyungwan nam commented on YARN-9521:


[~prabhujoseph] Thank you for your review and commit

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
>  Labels: Reviewed
> Fix For: 3.3.0
>
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch, 
> YARN-9521.003.patch, YARN-9521.004.patch
>
>
> When starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.
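A minimal sketch of the workaround mentioned in the description, disabling the HDFS FileSystem cache via yarn-site.xml:

{code}
<configuration>
  <!-- Workaround noted above: avoid handing out a shared, cached FileSystem
       instance that may already have been closed by another component. -->
  <property>
    <name>fs.hdfs.impl.disable.cache</name>
    <value>true</value>
  </property>
</configuration>
{code}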






[jira] [Updated] (YARN-9521) RM failed to start due to system services

2020-02-12 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9521:
---
Attachment: YARN-9521.004.patch

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch, 
> YARN-9521.003.patch, YARN-9521.004.patch
>
>
> When starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.






[jira] [Commented] (YARN-9521) RM failed to start due to system services

2020-02-11 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035021#comment-17035021
 ] 

kyungwan nam commented on YARN-9521:


Attached a new patch including test code.
[~eyang], [~prabhujoseph] Could you take a look at it when you are available?
Thanks!

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch, 
> YARN-9521.003.patch
>
>
> When starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.






[jira] [Updated] (YARN-9521) RM failed to start due to system services

2020-02-11 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9521:
---
Attachment: YARN-9521.003.patch

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch, 
> YARN-9521.003.patch
>
>
> When starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to
> yarn-site.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10113) SystemServiceManagerImpl fails to initialize

2020-02-10 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034114#comment-17034114
 ] 

kyungwan nam commented on YARN-10113:
-

Hi. [~prabhujoseph], [~eyang].

I believe this is the same as YARN-9521. The FileSystem object for the RM login
user can be closed by ApiServiceClient.actionCleanUp.
The patch in YARN-9521 performs ApiServiceClient.actionCleanUp inside
ugi.doAs().
It works well in my cluster (Hadoop 3.1.2).
Please let me know if I'm wrong.
Thanks!
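
To make the doAs() point above concrete, here is a hedged sketch; everything except UserGroupInformation is a hypothetical placeholder, not the actual YARN-9521 patch.

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Illustration only: the class and method names below are hypothetical.
public final class CleanupRunner {
  public static int cleanUpAs(UserGroupInformation caller) throws Exception {
    // Work done inside doAs() is keyed to 'caller', so any cached FileSystem it
    // obtains (and later closes) belongs to that user, not to the RM login user
    // whose shared FileSystem instance SystemServiceManagerImpl relies on.
    return caller.doAs((PrivilegedExceptionAction<Integer>) () -> {
      // hypothetical cleanup work, e.g. deleting the app dir via FileSystem.get(conf)
      return 0;
    });
  }
}
{code}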

 

> SystemServiceManagerImpl fails to initialize 
> -
>
> Key: YARN-10113
> URL: https://issues.apache.org/jira/browse/YARN-10113
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10113-001.patch, YARN-10113-002.patch
>
>
> The RM fails to start because SystemServiceManagerImpl fails to initialize.
> {code}
> 2020-01-28 12:20:43,631 WARN  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:881)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1257)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1298)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1294)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1294)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
> ... 5 more
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1645)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1219)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1235)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1202)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1181)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1177)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1189)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)

[jira] [Created] (YARN-10119) Cannot reset the AM failure count for YARN Service

2020-02-06 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10119:
---

 Summary: Cannot reset the AM failure count for YARN Service
 Key: YARN-10119
 URL: https://issues.apache.org/jira/browse/YARN-10119
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, YARN Service does not support resetting the AM failure count, a
mechanism that was introduced in YARN-611.

Since the AM failure count is never reset, it will eventually reach
yarn.service.am-restart.max-attempts and the app will be stopped.
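
For comparison, the sketch below shows the YARN-611 reset window as exposed to a plain YARN application through ApplicationSubmissionContext; the values are illustrative only, and YARN Service currently offers no equivalent knob, which is what this issue asks for.

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class AmRetryWindowExample {
  public static ApplicationSubmissionContext newContext() {
    ApplicationSubmissionContext ctx =
        Records.newRecord(ApplicationSubmissionContext.class);
    ctx.setMaxAppAttempts(5);
    // AM failures older than 10 minutes are dropped from the failure count, so a
    // long-running app is not killed by failures spread far apart in time.
    ctx.setAttemptFailuresValidityInterval(10 * 60 * 1000L);
    return ctx;
  }
}
{code}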



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission

2020-01-01 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006521#comment-17006521
 ] 

kyungwan nam commented on YARN-10034:
-

Fixes checkstyle.
I don't think the test failure is related to this issue.
[~cheersyang] Sorry for bothering you. Could you review this?

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch, 
> YARN-10034.003.patch
>
>
> When a node is decommissioned, the allocation tags that are attached to the node
> are not removed.
> I could see that those allocation tags are revived when the node is recommissioned.
> The RM removes allocation tags only after the NM confirms the container releases
> (YARN-8511), but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it
> should be removed immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10034) Allocation tags are not removed when node decommission

2020-01-01 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10034:

Attachment: YARN-10034.003.patch

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch, 
> YARN-10034.003.patch
>
>
> When a node is decommissioned, the allocation tags that are attached to the node
> are not removed.
> I could see that those allocation tags are revived when the node is recommissioned.
> The RM removes allocation tags only after the NM confirms the container releases
> (YARN-8511), but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it
> should be removed immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission

2019-12-30 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005923#comment-17005923
 ] 

kyungwan nam commented on YARN-10034:
-

Attaches a new patch including test code.

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch
>
>
> When a node is decommissioned, the allocation tags that are attached to the node
> are not removed.
> I could see that those allocation tags are revived when the node is recommissioned.
> The RM removes allocation tags only after the NM confirms the container releases
> (YARN-8511), but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it
> should be removed immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10034) Allocation tags are not removed when node decommission

2019-12-30 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-10034:

Attachment: YARN-10034.002.patch

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch
>
>
> When a node is decommissioned, the allocation tags that are attached to the node
> are not removed.
> I could see that those allocation tags are revived when the node is recommissioned.
> The RM removes allocation tags only after the NM confirms the container releases
> (YARN-8511), but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it
> should be removed immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10034) Allocation tags are not removed when node decommission

2019-12-16 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-10034:
---

Attachment: YARN-10034.001.patch
  Assignee: kyungwan nam

Attaches a patch.
Please review or comment.
Thanks.

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10034.001.patch
>
>
> When a node is decommissioned, the allocation tags that are attached to the node
> are not removed.
> I could see that those allocation tags are revived when the node is recommissioned.
> The RM removes allocation tags only after the NM confirms the container releases
> (YARN-8511), but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it
> should be removed immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10034) Allocation tags are not removed when node decommission

2019-12-16 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10034:
---

 Summary: Allocation tags are not removed when node decommission
 Key: YARN-10034
 URL: https://issues.apache.org/jira/browse/YARN-10034
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


When a node is decommissioned, the allocation tags that are attached to the node
are not removed.
I could see that those allocation tags are revived when the node is recommissioned.

The RM removes allocation tags only after the NM confirms the container releases
(YARN-8511), but a decommissioned NM no longer connects to the RM.
Once a node is decommissioned, the allocation tags attached to it should
be removed immediately.
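
To make the intended behavior concrete, here is a hypothetical sketch; TagStore and its methods are placeholders for the RM-side tag bookkeeping, not the real AllocationTagsManager and not the attached patch.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: maps a node to the allocation tags attached to it.
public class TagStore {
  private final Map<String, Set<String>> tagsByNode = new ConcurrentHashMap<>();

  public void addTags(String nodeId, Set<String> tags) {
    tagsByNode.computeIfAbsent(nodeId, k -> ConcurrentHashMap.newKeySet()).addAll(tags);
  }

  // Normal path (YARN-8511): tags go away only once the NM reports the release.
  public void onContainerReleaseConfirmedByNm(String nodeId, Set<String> tags) {
    Set<String> current = tagsByNode.get(nodeId);
    if (current != null) {
      current.removeAll(tags);
    }
  }

  // Behavior proposed here: on decommission, drop everything for the node right
  // away, because the NM will never reconnect to confirm the releases.
  public void onNodeDecommissioned(String nodeId) {
    tagsByNode.remove(nodeId);
  }
}
{code}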



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10021) NPE in YARN Registry DNS when wrong DNS message is incoming

2019-12-09 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-10021:
---

Attachment: YARN-10021.001.patch
  Assignee: kyungwan nam

> NPE in YARN Registry DNS when wrong DNS message is incoming
> ---
>
> Key: YARN-10021
> URL: https://issues.apache.org/jira/browse/YARN-10021
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-10021.001.patch
>
>
> I've encountered an NPE in YARN Registry DNS, as shown below.
> It looks like this happens when the incoming DNS request is in the wrong format.
> {code:java}
> 2019-11-29 10:51:12,178 ERROR dns.RegistryDNS (RegistryDNS.java:call(932)) - 
> Error initializing DNS UDP listener
> java.lang.NullPointerException
> at java.nio.ByteBuffer.put(ByteBuffer.java:859)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception 
> when running task in RegistryDNS 1
> 2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
> thread RegistryDNS 1:
> java.lang.NullPointerException
> at java.nio.ByteBuffer.put(ByteBuffer.java:859)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
> at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10021) NPE in YARN Registry DNS when wrong DNS message is incoming

2019-12-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10021:
---

 Summary: NPE in YARN Registry DNS when wrong DNS message is 
incoming
 Key: YARN-10021
 URL: https://issues.apache.org/jira/browse/YARN-10021
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


I've encountered an NPE in YARN Registry DNS, as shown below.
It looks like this happens when the incoming DNS request is in the wrong format.

{code:java}
2019-11-29 10:51:12,178 ERROR dns.RegistryDNS (RegistryDNS.java:call(932)) - 
Error initializing DNS UDP listener
java.lang.NullPointerException
at java.nio.ByteBuffer.put(ByteBuffer.java:859)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
(ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception 
when running task in RegistryDNS 1
2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
(ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
thread RegistryDNS 1:
java.lang.NullPointerException
at java.nio.ByteBuffer.put(ByteBuffer.java:859)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
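
As a hedged illustration of the kind of guard such a UDP responder needs, here is a sketch; the class and method names are placeholders and do not reflect the RegistryDNS internals or the attached patch.

{code:java}
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public class UdpResponder {

  /** May return null when the incoming datagram is not a well-formed DNS message. */
  interface ReplyBuilder {
    byte[] buildReply(byte[] request);
  }

  static void respond(DatagramChannel channel, SocketAddress client,
                      byte[] request, ReplyBuilder builder) throws IOException {
    byte[] reply = builder.buildReply(request);
    if (reply == null) {
      // Without this check, putting a null array into a ByteBuffer throws the
      // NullPointerException shown above; dropping the packet (or answering
      // with FORMERR) keeps the listener thread alive instead.
      return;
    }
    channel.send(ByteBuffer.wrap(reply), client);
  }
}
{code}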




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner

2019-11-19 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977263#comment-16977263
 ] 

kyungwan nam commented on YARN-9986:


[~prabhujoseph], thank you for your comment.
I've attached a new patch with the modified test code.

> signalToContainer REST API does not work even if requested by the app owner
> ---
>
> Key: YARN-9986
> URL: https://issues.apache.org/jira/browse/YARN-9986
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: restapi
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9986.001.patch, YARN-9986.002.patch
>
>
> The signalToContainer REST API introduced in YARN-8693 does not work even when
> requested by the app owner.
> It works only when requested by an admin user.
> {code}
> $ kinit kwnam
> Password for kw...@test.org:
> $ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
> https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
> {"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception:
>  Only admins can carry out this 
> operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$
> $ kinit admin
> Password for ad...@test.org:
> $
> $ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
> https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
> $
> {code}
> In contrast, the app owner can do it using the command line, as shown below.
> {code}
> $ kinit kwnam
> Password for kw...@test.org:
> $ yarn container -signal container_e58_1573625560605_29927_01_02  
> GRACEFUL_SHUTDOWN
> Signalling container container_e58_1573625560605_29927_01_02
> 2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container 
> container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN
> 2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> $
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner

2019-11-19 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9986:
---
Attachment: YARN-9986.002.patch

> signalToContainer REST API does not work even if requested by the app owner
> ---
>
> Key: YARN-9986
> URL: https://issues.apache.org/jira/browse/YARN-9986
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: restapi
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9986.001.patch, YARN-9986.002.patch
>
>
> The signalToContainer REST API introduced in YARN-8693 does not work even when
> requested by the app owner.
> It works only when requested by an admin user.
> {code}
> $ kinit kwnam
> Password for kw...@test.org:
> $ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
> https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
> {"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception:
>  Only admins can carry out this 
> operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$
> $ kinit admin
> Password for ad...@test.org:
> $
> $ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
> https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
> $
> {code}
> In contrast, the app owner can do it using the command line, as shown below.
> {code}
> $ kinit kwnam
> Password for kw...@test.org:
> $ yarn container -signal container_e58_1573625560605_29927_01_02  
> GRACEFUL_SHUTDOWN
> Signalling container container_e58_1573625560605_29927_01_02
> 2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container 
> container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN
> 2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> $
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner

2019-11-18 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9986:
--

 Summary: signalToContainer REST API does not work even if 
requested by the app owner
 Key: YARN-9986
 URL: https://issues.apache.org/jira/browse/YARN-9986
 Project: Hadoop YARN
  Issue Type: Bug
  Components: restapi
Reporter: kyungwan nam
Assignee: kyungwan nam


The signalToContainer REST API introduced in YARN-8693 does not work even when
requested by the app owner.
It works only when requested by an admin user.

{code}
$ kinit kwnam
Password for kw...@test.org:
$ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
{"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception:
 Only admins can carry out this 
operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$
$ kinit admin
Password for ad...@test.org:
$
$ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
$
{code}

In contrast, the app owner can do it using the command line, as shown below.

{code}
$ kinit kwnam
Password for kw...@test.org:
$ yarn container -signal container_e58_1573625560605_29927_01_02  
GRACEFUL_SHUTDOWN
Signalling container container_e58_1573625560605_29927_01_02
2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container 
container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN
2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
$
{code}
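
As a hypothetical sketch of the access rule the REST endpoint arguably should apply (the names are placeholders, not the actual RM web-services code): allow the signal when the caller is an admin or owns the application, matching what the CLI path permits.

{code:java}
import java.util.Collections;
import java.util.Set;

public class SignalAccessCheck {

  // Admin OR application owner may signal; everyone else is rejected.
  public static boolean maySignal(String caller, String appOwner, Set<String> admins) {
    boolean isAdmin = admins.contains(caller);
    boolean isOwner = caller != null && caller.equals(appOwner);
    return isAdmin || isOwner;
  }

  public static void main(String[] args) {
    Set<String> admins = Collections.singleton("admin");
    System.out.println(maySignal("kwnam", "kwnam", admins)); // true: the app owner
    System.out.println(maySignal("kwnam", "other", admins)); // false: neither role
  }
}
{code}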



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9953) YARN Service dependency should be configurable for each app

2019-11-04 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-9953:
--

Attachment: YARN-9953.001.patch
  Assignee: kyungwan nam

yarn.service.framework.path can be set in the yarnfile.
If it is not set in the yarnfile, the value configured in the RM is used.
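
For illustration, here is a hedged sketch of that resolution order; 'yarnfileProps' stands in for the per-app configuration properties of the yarnfile and this is not the patch code itself.

{code:java}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class FrameworkPathResolver {
  static final String KEY = "yarn.service.framework.path";

  public static String resolve(Map<String, String> yarnfileProps, Configuration rmConf) {
    // The per-app value wins when present...
    String perApp = yarnfileProps.get(KEY);
    if (perApp != null && !perApp.isEmpty()) {
      return perApp;
    }
    // ...otherwise fall back to whatever the RM was configured with (may be null).
    return rmConf.get(KEY);
  }
}
{code}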


> YARN Service dependency should be configurable for each app
> ---
>
> Key: YARN-9953
> URL: https://issues.apache.org/jira/browse/YARN-9953
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9953.001.patch
>
>
> Currently, the YARN Service dependency can be set via yarn.service.framework.path.
> However, it works only as configured in the RM.
> This makes it impossible for users to choose their own YARN Service dependency.
> It should be configurable for each app.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9953) YARN Service dependency should be configurable for each app

2019-11-04 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9953:
---
Affects Version/s: 3.1.2

> YARN Service dependency should be configurable for each app
> ---
>
> Key: YARN-9953
> URL: https://issues.apache.org/jira/browse/YARN-9953
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
>
> Currently, the YARN Service dependency can be set via yarn.service.framework.path.
> However, it works only as configured in the RM.
> This makes it impossible for users to choose their own YARN Service dependency.
> It should be configurable for each app.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9953) YARN Service dependency should be configurable for each app

2019-11-04 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9953:
--

 Summary: YARN Service dependency should be configurable for each 
app
 Key: YARN-9953
 URL: https://issues.apache.org/jira/browse/YARN-9953
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


Currently, the YARN Service dependency can be set via yarn.service.framework.path.
However, it works only as configured in the RM.
This makes it impossible for users to choose their own YARN Service dependency.
It should be configurable for each app.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9929) NodeManager OOM because of stuck DeletionService

2019-10-22 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957039#comment-16957039
 ] 

kyungwan nam commented on YARN-9929:


Attaches a patch, which sets a timeout for _ShellCommandExecutor_.
Any comments and suggestions are welcome.
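
For illustration, the sketch below shows what a timeout on ShellCommandExecutor looks like; it is a hedged example of the approach described above, not the attached patch, and the command line mirrors the 'docker inspect' calls shown in the issue.

{code:java}
import java.io.IOException;
import org.apache.hadoop.util.Shell.ShellCommandExecutor;

public class DockerInspectWithTimeout {
  public static String containerStatus(String containerId, long timeoutMs)
      throws IOException {
    String[] cmd = {"/usr/bin/docker", "inspect",
        "--format={{.State.Status}}", containerId};
    // The four-argument constructor takes a timeout in milliseconds; 0 means
    // "wait forever", which is effectively what the stuck threads above were doing.
    ShellCommandExecutor exec = new ShellCommandExecutor(cmd, null, null, timeoutMs);
    try {
      exec.execute();
    } catch (IOException e) {
      // On timeout the process is destroyed and execute() surfaces an error;
      // isTimedOut() lets the caller tell that apart from a real docker failure.
      if (exec.isTimedOut()) {
        throw new IOException("docker inspect timed out for " + containerId, e);
      }
      throw e;
    }
    return exec.getOutput().trim();
  }
}
{code}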

> NodeManager OOM because of stuck DeletionService
> 
>
> Key: YARN-9929
> URL: https://issues.apache.org/jira/browse/YARN-9929
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9929.001.patch, nm_heapdump.png
>
>
> NMs go through frequent full GCs due to a lack of heap memory.
> We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in
> the heap dump (screenshot is attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets
> stuck in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.readBytes(Native Method)
>   at java.io.FileInputStream.read(FileInputStream.java:255)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.read1(BufferedReader.java:212)
>   at java.io.BufferedReader.read(BufferedReader.java:286)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
>   at org.apache.hadoop.util.Shell.run(Shell.java:902)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>Locked ownable synchronizers:
>   - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
> {code}
> Also, we found 'docker inspect' processes that had been running for a long time,
> as follows.
> {code:java}
>  root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e30_1555419799458_0014_01_30
> root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_25316_01_001455
> root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e49_1560851258686_2107_01_24
> root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_8111_01_002657{code}
>  

[jira] [Updated] (YARN-9929) NodeManager OOM because of stuck DeletionService

2019-10-22 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9929:
---
Attachment: YARN-9929.001.patch

> NodeManager OOM because of stuck DeletionService
> 
>
> Key: YARN-9929
> URL: https://issues.apache.org/jira/browse/YARN-9929
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9929.001.patch, nm_heapdump.png
>
>
> NMs go through frequent full GCs due to a lack of heap memory.
> We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in
> the heap dump (screenshot is attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets
> stuck in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.readBytes(Native Method)
>   at java.io.FileInputStream.read(FileInputStream.java:255)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.read1(BufferedReader.java:212)
>   at java.io.BufferedReader.read(BufferedReader.java:286)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
>   at org.apache.hadoop.util.Shell.run(Shell.java:902)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>Locked ownable synchronizers:
>   - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
> {code}
> Also, we found 'docker inspect' processes that had been running for a long time,
> as follows.
> {code:java}
>  root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e30_1555419799458_0014_01_30
> root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_25316_01_001455
> root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e49_1560851258686_2107_01_24
> root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_8111_01_002657{code}
>  
> I think this has occurred since the docker daemon was restarted.
> The 'docker inspect' that was run while restarting 

[jira] [Updated] (YARN-9929) NodeManager OOM because of stuck DeletionService

2019-10-22 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9929:
---
Attachment: nm_heapdump.png

> NodeManager OOM because of stuck DeletionService
> 
>
> Key: YARN-9929
> URL: https://issues.apache.org/jira/browse/YARN-9929
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: nm_heapdump.png
>
>
> NMs go through frequent full GCs due to a lack of heap memory.
> We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in
> the heap dump (screenshot is attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets
> stuck in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.readBytes(Native Method)
>   at java.io.FileInputStream.read(FileInputStream.java:255)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.read1(BufferedReader.java:212)
>   at java.io.BufferedReader.read(BufferedReader.java:286)
>   - locked <3e45c938> (a java.io.InputStreamReader)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
>   at org.apache.hadoop.util.Shell.run(Shell.java:902)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>Locked ownable synchronizers:
>   - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
> {code}
> Also, we found 'docker inspect' processes that had been running for a long time,
> as follows.
> {code:java}
>  root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e30_1555419799458_0014_01_30
> root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_25316_01_001455
> root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e49_1560851258686_2107_01_24
> root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 
> /usr/bin/docker inspect --format={{.State.Status}} 
> container_e50_1561100493387_8111_01_002657{code}
>  
> I think this has occurred since the docker daemon was restarted.
> The 'docker inspect' that was run while restarting the docker daemon was not 

[jira] [Created] (YARN-9929) NodeManager OOM because of stuck DeletionService

2019-10-22 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9929:
--

 Summary: NodeManager OOM because of stuck DeletionService
 Key: YARN-9929
 URL: https://issues.apache.org/jira/browse/YARN-9929
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam
Assignee: kyungwan nam


NMs go through frequent full GCs due to a lack of heap memory.
We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in
the heap dump (screenshot is attached).

After analyzing the thread dump, we can see that _DeletionService_ gets
stuck in _executeStatusCommand_, which runs 'docker inspect'.
{code:java}
"DeletionService #0" - Thread t@41
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <3e45c938> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <3e45c938> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
at org.apache.hadoop.util.Shell.run(Shell.java:902)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
{code}
Also, we found 'docker inspect' processes that had been running for a long time,
as follows.
{code:java}
 root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e30_1555419799458_0014_01_30
root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e50_1561100493387_25316_01_001455
root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e49_1560851258686_2107_01_24
root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e50_1561100493387_8111_01_002657{code}
 

I think this has occurred since the docker daemon was restarted.
A 'docker inspect' that was run while the docker daemon was restarting did not
work, and it was not even terminated.

This can be considered a docker issue, but it could happen whenever
'docker inspect' does not work due to a docker daemon restart or a docker bug.
It would be good to set a timeout for 'docker inspect' to avoid this issue.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs

2019-10-16 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9905:
---
Attachment: YARN-9905.002.patch

> yarn-service is failed to setup application log if app-log-dir is not 
> default-fs
> 
>
> Key: YARN-9905
> URL: https://issues.apache.org/jira/browse/YARN-9905
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9905.001.patch, YARN-9905.002.patch
>
>
> Currently, yarn-service takes a delegation token for the default namenode only.
>  This might cause an authentication failure under HDFS federation.
> How to reproduce:
>  - kerberized cluster
>  - multiple namespaces via HDFS federation
>  - yarn.nodemanager.remote-app-log-dir is set to a namespace that is not the
> default-fs
> Here are the nodemanager logs at that time.
> {code:java}
> 2019-10-15 11:52:50,217 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new 
> application reference for app application_1569373267731_9571
> 2019-10-15 11:52:50,217 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(655)) - Application 
> application_1569373267731_9571 transitioned from NEW to INITING
> ...
>  Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]
> at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
> at org.apache.hadoop.ipc.Client.call(Client.java:1457)
> at org.apache.hadoop.ipc.Client.call(Client.java:1367)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
> at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254)
> at 
> 
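
The YARN-9905 report above boils down to a missing delegation token for the namespace that hosts the aggregated logs. As a hedged sketch of the general shape of a fix (not the attached patch), tokens for both the default filesystem and the remote-app-log-dir filesystem can be collected up front; the default log-dir value used here is an assumption for illustration.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

public class LogDirTokens {
  public static Credentials collect(Configuration conf, String renewer)
      throws IOException {
    Credentials creds = new Credentials();

    // Token for the default filesystem (roughly what is collected today).
    FileSystem.get(conf).addDelegationTokens(renewer, creds);

    // Additional token for the namespace that hosts the aggregated logs.
    String logDir = conf.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs");
    FileSystem logFs = new Path(logDir).getFileSystem(conf);
    logFs.addDelegationTokens(renewer, creds);

    return creds;
  }
}
{code}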

[jira] [Created] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs

2019-10-15 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9905:
--

 Summary: yarn-service is failed to setup application log if 
app-log-dir is not default-fs
 Key: YARN-9905
 URL: https://issues.apache.org/jira/browse/YARN-9905
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, yarn-service takes a delegation token for the default namenode only.
 This might cause an authentication failure under HDFS federation.

How to reproduce:
 - kerberized cluster
 - multiple namespaces via HDFS federation
 - yarn.nodemanager.remote-app-log-dir is set to a namespace that is not the
default-fs

Here are the nodemanager logs at that time.
{code:java}
2019-10-15 11:52:50,217 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new 
application reference for app application_1569373267731_9571
2019-10-15 11:52:50,217 INFO  application.ApplicationImpl 
(ApplicationImpl.java:handle(655)) - Application application_1569373267731_9571 
transitioned from NEW to INITING
...

 Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]
at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:204)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:347)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:69)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 

[jira] [Commented] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-29 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919098#comment-16919098
 ] 

kyungwan nam commented on YARN-9790:


[~Prabhu Joseph] I've attached a new patch.
It fixes the failed test case, and a test case for this issue has been added.
Thanks

> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch, 
> YARN-9790.003.patch, YARN-9790.004.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-29 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9790:
---
Attachment: YARN-9790.004.patch

> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch, 
> YARN-9790.003.patch, YARN-9790.004.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-29 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9790:
---
Attachment: YARN-9790.003.patch

> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch, 
> YARN-9790.003.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-29 Thread kyungwan nam (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9790:
---
Attachment: (was: YARN-9790.003.patch)

> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch, 
> YARN-9790.003.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-28 Thread kyungwan nam (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918227#comment-16918227
 ] 

kyungwan nam commented on YARN-9790:


[~Prabhu Joseph] Thank you for your review and helpful comment!

If maximum-lifetime is -1 or 0, it means there is no limit.
Therefore, default-lifetime should be allowed to take any value.
In my opinion, the check should be as follows:
{code}
-  if (defaultApplicationLifetime > maxApplicationLifetime) {
+  if (maxApplicationLifetime > 0 &&
+  defaultApplicationLifetime > maxApplicationLifetime) {
{code}

I also think a fix is needed in CapacityScheduler#checkAndGetApplicationLifetime:
if an app does not specify a lifetime, it should fall back to the
default-lifetime, even when maximum-lifetime is -1 or 0.

CapacityScheduler#checkAndGetApplicationLifetime
{code}
   // check only for maximum, that's enough because default can't
   // exceed maximum
   if (maximumApplicationLifetime <= 0) {
-return lifetimeRequestedByApp;
+return (lifetimeRequestedByApp <= 0) ? defaultApplicationLifetime :
+lifetimeRequestedByApp;
   }
{code}
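
Putting the two changes together, the intended behaviour can be sketched as a
standalone method like the one below. The method name (resolveLifetime) and the
capping of explicit requests at a positive maximum are illustrative assumptions,
not the exact CapacityScheduler code.

{code:java}
// Sketch of the proposed semantics: a maximum of -1 or 0 means "no limit", and an
// app that does not request a lifetime falls back to the queue's default-lifetime.
static long resolveLifetime(long lifetimeRequestedByApp,
                            long defaultApplicationLifetime,
                            long maximumApplicationLifetime) {
  if (maximumApplicationLifetime <= 0) {
    // unlimited queue: only substitute the default when the app did not ask for one
    return (lifetimeRequestedByApp <= 0)
        ? defaultApplicationLifetime : lifetimeRequestedByApp;
  }
  if (lifetimeRequestedByApp <= 0) {
    return defaultApplicationLifetime;
  }
  // assumption: an explicit request above a positive maximum is capped at the maximum
  return Math.min(lifetimeRequestedByApp, maximumApplicationLifetime);
}
{code}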

Please let me know if you have any thoughts about this.
Thanks.


> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> 

[jira] [Created] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-28 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9790:
--

 Summary: Failed to set default-application-lifetime if 
maximum-application-lifetime is less than or equal to zero
 Key: YARN-9790
 URL: https://issues.apache.org/jira/browse/YARN-9790
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


capacity-scheduler
{code}
...
yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
{code}

refreshQueues failed as follows:

{code}
2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
(AdminService.java:logAndWrapException(910)) - Exception refresh queues.
java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
exceed maximum lifetime -1
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
at 
org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
at 
org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
lifetime604800 can't exceed maximum lifetime -1
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
... 12 more
{code}
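
For reference, the validation behind this error compares the two values numerically,
so -1 (intended to mean "no limit") looks smaller than any positive default. A rough
sketch of the failing check (illustrative only, not the exact LeafQueue code):

{code:java}
// With maxLifetime = -1 and defaultLifetime = 604800, the comparison 604800 > -1
// is true and the queue refresh is rejected.
static void validateLifetimes(long defaultLifetime, long maxLifetime) {
  if (defaultLifetime > maxLifetime) {
    throw new IllegalArgumentException("Default lifetime " + defaultLifetime
        + " can't exceed maximum lifetime " + maxLifetime);
  }
}
{code}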




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-12 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904875#comment-16904875
 ] 

kyungwan nam commented on YARN-9719:


[~eyang], [~Prabhu Joseph]
The 007 patch passed without any failures.
Could you review it? Thanks.

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch, 
> YARN-9719.006.patch, YARN-9719.007.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-09 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.007.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch, 
> YARN-9719.006.patch, YARN-9719.007.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-09 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.006.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch, 
> YARN-9719.006.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-09 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: (was: YARN-9719.006.patch)

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-09 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.006.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch, 
> YARN-9719.006.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-09 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.005.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-08 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902951#comment-16902951
 ] 

kyungwan nam commented on YARN-9719:


I've attached a new patch, which clears the config used by the completed test.

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-08 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.004.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch, YARN-9719.004.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-05 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900546#comment-16900546
 ] 

kyungwan nam commented on YARN-9719:


[~Prabhu Joseph], [~eyang] Thank you for your comments.
I've attached a new patch including test code.

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-05 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.003.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch, 
> YARN-9719.003.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-04 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899711#comment-16899711
 ] 

kyungwan nam commented on YARN-9719:


[~Prabhu Joseph] Thank you for your review and comment.
I've attached a new patch based on trunk.

 

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-04 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9719:
---
Attachment: YARN-9719.002.patch

> Failed to restart yarn-service if it doesn’t exist in RM
> 
>
> Key: YARN-9719
> URL: https://issues.apache.org/jira/browse/YARN-9719
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9719.001.patch, YARN-9719.002.patch
>
>
> Sometimes, restarting a yarn-service is failed as follows.
> {code}
> {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
> exist in RM. Please check that the job submission was successful.\n\tat 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
>  
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
>  
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
> java.security.AccessController.doPrivileged(Native Method)\n\tat 
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
> {code}
> It seems like that it occurs when restarting a yarn-service who was stopped 
> long ago.
> by default, RM keeps up to 1000 completed applications 
> (yarn.resourcemanager.max-completed-applications)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-02 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9719:
--

 Summary: Failed to restart yarn-service if it doesn’t exist in RM
 Key: YARN-9719
 URL: https://issues.apache.org/jira/browse/YARN-9719
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam
Assignee: kyungwan nam


Sometimes, restarting a yarn-service fails as follows.

{code}
{"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
exist in RM. Please check that the job submission was successful.\n\tat 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
java.security.AccessController.doPrivileged(Native Method)\n\tat 
javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
{code}

It seems that this occurs when restarting a yarn-service that was stopped 
long ago.
By default, the RM keeps up to 1000 completed applications 
(yarn.resourcemanager.max-completed-applications).
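
To illustrate the symptom only (a hand-written sketch, not the attached patch; 
the wrapper class below is made up): once the application has been evicted from 
the RM's completed-application cache, getApplicationReport answers with 
ApplicationNotFoundException, so a restart path has to treat that case as 
"no record in RM" rather than as a hard failure.
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

public class AppEvictionCheck {
  /**
   * Returns true only if the RM still has a record of the application.
   * When the app finished long ago and more than
   * yarn.resourcemanager.max-completed-applications apps have completed since,
   * the RM answers with ApplicationNotFoundException instead of a report.
   */
  static boolean isKnownToRM(YarnClient yarnClient, ApplicationId appId)
      throws Exception {
    try {
      return yarnClient.getApplicationReport(appId) != null;
    } catch (ApplicationNotFoundException e) {
      return false;   // evicted from the completed-application cache
    }
  }
}
{code}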



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9703) Failed to cancel yarn service upgrade when canceling multiple times

2019-07-25 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-9703:
--

  Assignee: kyungwan nam
Attachment: YARN-9703.001.patch

I've attached a patch that fixes this.
Please review or comment.
Thanks.

> Failed to cancel yarn service upgrade when canceling multiple times
> ---
>
> Key: YARN-9703
> URL: https://issues.apache.org/jira/browse/YARN-9703
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9703.001.patch
>
>
> sleeptest.yarnfile
> {code:java}
> {
>"name":"sleeptest",
>"version":"1.0.0",
>"lifetime":"-1",
>"components":[
>   {
>  "name":"sleep",
>  "number_of_containers":3,
> …
> }
> {code}
> How to reproduce:
>  * initiate upgrade
>  * upgrade instance sleep-0
>  * cancel upgrade -> it succeeded without any problem
>  * initiate upgrade
>  * upgrade instance sleep-0
>  * cancel upgrade -> it didn’t work. At that time, the AM logs were as follows:
> {code:java}
> 2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
> instance.ComponentInstance - container_e72_1564103075282_0002_01_04 
> pending cancellation
> 2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-2 : 
> container_e72_1564103075282_0002_01_04] Transitioned from READY to 
> CANCEL_UPGRADING on CANCEL_UPGRADE event
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9703) Failed to cancel yarn service upgrade when canceling multiple times

2019-07-25 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9703:
--

 Summary: Failed to cancel yarn service upgrade when canceling 
multiple times
 Key: YARN-9703
 URL: https://issues.apache.org/jira/browse/YARN-9703
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


sleeptest.yarnfile
{code:java}
{
   "name":"sleeptest",
   "version":"1.0.0",
   "lifetime":"-1",
   "components":[
  {
 "name":"sleep",
 "number_of_containers":3,
…
}
{code}
How to reproduce:
 * initiate upgrade
 * upgrade instance sleep-0
 * cancel upgrade -> it succeeded without any problem
 * initiate upgrade
 * upgrade instance sleep-0
 * cancel upgrade -> it didn’t work. At that time, the AM logs were as follows:

{code:java}
2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
instance.ComponentInstance - container_e72_1564103075282_0002_01_04 pending 
cancellation
2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-2 : 
container_e72_1564103075282_0002_01_04] Transitioned from READY to 
CANCEL_UPGRADING on CANCEL_UPGRADE event
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing

2019-07-24 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9691:
---
Attachment: YARN-9691.002.patch

> canceling upgrade does not work if upgrade failed container is existing
> ---
>
> Key: YARN-9691
> URL: https://issues.apache.org/jira/browse/YARN-9691
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9691.001.patch, YARN-9691.002.patch
>
>
> If a container fails to upgrade during a yarn-service upgrade, the container 
> will be released and will transition to the FAILED_UPGRADE state.
> After that, I expected it to be able to go back to the previous version using 
> cancel-upgrade, but it didn’t work.
> At that time, the AM log was as follows:
> {code}
> # failed to upgrade container_e62_1563179597798_0006_01_08
> 2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  
> service.ClientAMService - Upgrade container 
> container_e62_1563179597798_0006_01_08
> 2019-07-16 18:21:55,153 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] spec state state changed from 
> NEEDS_UPGRADE -> UPGRADING
> 2019-07-16 18:21:55,154 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] Transitioned from READY to 
> UPGRADING on UPGRADE event
> 2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  
> registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08]: Deleting registry path 
> /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08
> 2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - 
> [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 
> 1.0.1 : Creating dir on hdfs: 
> hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
> 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
> containerlaunch.ContainerLaunchService - reInitializing container 
> container_e62_1563179597798_0006_01_08 with version 1.0.1
> 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
> containerlaunch.AbstractLauncher - yarn docker env var has been set 
> {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, 
> YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM,
>  WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, 
> LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, 
> YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=}
> 2019-07-16 18:21:55,158 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  
> impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER 
> for Container container_e62_1563179597798_0006_01_08
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] spec state state changed from 
> UPGRADING -> RUNNING_BUT_UNREADY
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] retrieve status after 30
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to 
> REINITIALIZED on START event
> 2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:23:08,225 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_08] spec state state changed from 
> RUNNING_BUT_UNREADY -> FAILED_UPGRADE
> # 

[jira] [Created] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing

2019-07-22 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9691:
--

 Summary: canceling upgrade does not work if upgrade failed 
container is existing
 Key: YARN-9691
 URL: https://issues.apache.org/jira/browse/YARN-9691
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


If a container fails to upgrade during a yarn-service upgrade, the container 
will be released and will transition to the FAILED_UPGRADE state.
After that, I expected it to be able to go back to the previous version using 
cancel-upgrade, but it didn’t work.
At that time, the AM log was as follows:

{code}
# failed to upgrade container_e62_1563179597798_0006_01_08

2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  
service.ClientAMService - Upgrade container 
container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,153 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from READY to 
UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  
registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08]: Deleting registry path 
/users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - 
[COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 
1.0.1 : Creating dir on hdfs: 
hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.ContainerLaunchService - reInitializing container 
container_e62_1563179597798_0006_01_08 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.AbstractLauncher - yarn docker env var has been set 
{LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, 
YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM,
 WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, 
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, 
LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, 
YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=}
2019-07-16 18:21:55,158 
[org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  
impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for 
Container container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
UPGRADING -> RUNNING_BUT_UNREADY
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] retrieve status after 30
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to 
REINITIALIZED on START event
2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:08,225 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
RUNNING_BUT_UNREADY -> FAILED_UPGRADE

# request canceling upgrade 

2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_04 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_03 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container 

[jira] [Commented] (YARN-9682) Wrong log message when finalizing the upgrade

2019-07-16 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886641#comment-16886641
 ] 

kyungwan nam commented on YARN-9682:


[~cheersyang] Thank you for your review and comment.

> Wrong log message when finalizing the upgrade
> -
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following wrong message when finalizing an upgrade for a 
> yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9682) wrong log message when finalize upgrade

2019-07-16 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-9682:
--

  Assignee: kyungwan nam
Attachment: YARN-9682.001.patch

> wrong log message when finalize upgrade
> ---
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following wrong message when finalizing an upgrade for a 
> yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9682) wrong log message when finalize upgrade

2019-07-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9682:
--

 Summary: wrong log message when finalize upgrade
 Key: YARN-9682
 URL: https://issues.apache.org/jira/browse/YARN-9682
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


I've seen the following wrong message when finalizing an upgrade for a yarn-service:
{code:java}
2019-07-16 17:44:09,204 INFO  client.ServiceClient 
(ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
upgrade{code}
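
For reference, a minimal standalone sketch of how a literal {} can end up in a 
log line (the class below is made up for illustration and is not the 
ServiceClient code): an SLF4J placeholder that is never given an argument is 
printed verbatim.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PlaceholderDemo {
  private static final Logger LOG =
      LoggerFactory.getLogger(PlaceholderDemo.class);

  public static void main(String[] args) {
    String serviceName = "sleeptest";
    // No argument supplied: prints "Finalize service {} upgrade" literally.
    LOG.info("Finalize service {} upgrade");
    // Argument supplied: prints "Finalize service sleeptest upgrade".
    LOG.info("Finalize service {} upgrade", serviceName);
  }
}
{code}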



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-07-01 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876007#comment-16876007
 ] 

kyungwan nam commented on YARN-9521:


I attached a new patch in which ApiServiceClient.actionCleanUp is performed 
inside ugi.doAs().
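
Roughly, the idea looks like the sketch below (simplified by hand; whether the 
patch builds the ugi with createProxyUser or in some other way is an assumption 
here, please see the attached patch for the actual change). Running the cleanup 
inside doAs() gives the ServiceClient its own UGI-keyed FileSystem, so closing 
it no longer touches the instance cached for the RM login user.
{code:java}
@Override
public int actionCleanUp(String appName, String userName) throws
    IOException, YarnException {
  try {
    UserGroupInformation proxyUser = UserGroupInformation.createProxyUser(
        userName, UserGroupInformation.getLoginUser());
    return proxyUser.doAs((PrivilegedExceptionAction<Integer>) () -> {
      ServiceClient sc = new ServiceClient();
      sc.init(getConfig());
      sc.start();
      try {
        // FileSystem.get() inside doAs() is keyed on the new ugi,
        // so sc.close() cannot close the RM's cached FileSystem.
        return sc.actionCleanUp(appName, userName);
      } finally {
        sc.close();
      }
    });
  } catch (InterruptedException e) {
    throw new YarnException(e);
  }
}
{code}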

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is due to the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-07-01 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876006#comment-16876006
 ] 

kyungwan nam commented on YARN-9521:


After some further digging, I think I have figured out the cause of this issue 
more precisely.

Normally, when a yarn-service API is requested, a new ugi is created and the 
request is handled inside ugi.doAs().
When FileSystem.get() is called inside ugi.doAs(), it always creates a new 
FileSystem, because the ugi is part of the key of FileSystem.CACHE (YARN-3336 
is helpful for understanding this).
So in this case, sc.close() does not close the FileSystem instance that the RM 
obtained from FileSystem.CACHE.
{code}
  UserGroupInformation ugi = getProxyUser(request);
  LOG.info("POST: createService = {} user = {}", service, ugi);
  if(service.getState()==ServiceState.STOPPED) {
> ugi.doAs(new PrivilegedExceptionAction<Void>() {
  @Override
  public Void run() throws YarnException, IOException {
ServiceClient sc = getServiceClient();
try {
  sc.init(YARN_CONFIG);
  sc.start();
  sc.actionBuild(service);
} finally {
  sc.close();
}
return null;
  }
});
{code}

On the other hand, ApiServiceClient.actionCleanUp, which is called from 
RMAppImpl.appAdminClientCleanUp, is performed as the RM loginUser instead of 
inside doAs().
In this case, FileSystem.get() can return the cached instance that 
SystemServiceManagerImpl and FileSystemNodeLabelsStore also refer to.
{code}
  @Override
  public int actionCleanUp(String appName, String userName) throws
  IOException, YarnException {
ServiceClient sc = new ServiceClient();
sc.init(getConfig());
sc.start();
int result = sc.actionCleanUp(appName, userName);
sc.close();
return result;
  }
{code}
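
To make the sharing concrete, here is a small standalone sketch (not RM code; 
it assumes fs.defaultFS points at HDFS and that the FileSystem cache is 
enabled, which is the default):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedFsCacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Both calls run as the same login user, so they resolve to the same
    // FileSystem.CACHE entry.
    FileSystem fsA = FileSystem.get(conf);  // e.g. held by SystemServiceManagerImpl
    FileSystem fsB = FileSystem.get(conf);  // e.g. fetched again by ServiceClient
    System.out.println(fsA == fsB);         // true: one shared instance

    fsB.close();                            // closing "the client's" handle closes fsA too

    // With HDFS as the default filesystem this now fails with
    // "java.io.IOException: Filesystem closed", as in the stack trace above.
    fsA.listStatus(new Path("/services"));
  }
}
{code}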




> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> at 
> 

[jira] [Updated] (YARN-9521) RM failed to start due to system services

2019-07-01 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9521:
---
Attachment: YARN-9521.002.patch

> RM failed to start due to system services
> -
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is due to the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-06-20 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868310#comment-16868310
 ] 

kyungwan nam commented on YARN-9521:


{code:java}
2019-06-18 18:47:38,634 INFO  nodelabels.CommonNodeLabelsManager 
(CommonNodeLabelsManager.java:internalUpdateLabelsOnNodes(664)) - REPLACE 
labels on nodes:
2019-06-18 18:47:38,634 INFO  nodelabels.CommonNodeLabelsManager 
(CommonNodeLabelsManager.java:internalUpdateLabelsOnNodes(666)) -   
NM=test.nm1.com:0, labels=[test]
2019-06-18 18:47:38,635 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1560841031202_0111_01 
container=null queue=dev clusterResource= 
type=OFF_SWITCH requestedPartition=
2019-06-18 18:47:38,635 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1560841031202_0111_01 
container=null queue=dev clusterResource= 
type=OFF_SWITCH requestedPartition=
2019-06-18 18:47:38,635 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1560841031202_0111_01 
container=null queue=dev clusterResource= 
type=OFF_SWITCH requestedPartition=
2019-06-18 18:47:38,635 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1560841031202_0111_01 
container=null queue=dev clusterResource= 
type=OFF_SWITCH requestedPartition=
2019-06-18 18:47:38,636 INFO  rmcontainer.RMContainerImpl 
(RMContainerImpl.java:handle(480)) - container_e48_1560841031202_0111_01_002020 
Container Transitioned from NEW to ALLOCATED
2019-06-18 18:47:38,636 ERROR nodelabels.CommonNodeLabelsManager 
(CommonNodeLabelsManager.java:handleStoreEvent(201)) - Failed to store label 
modification to storage
2019-06-18 18:47:38,637 INFO  fica.FiCaSchedulerNode 
(FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container 
container_e48_1560841031202_0111_01_002020 of capacity  
on host test.nm3.com:8454, which has 3 containers,  
used and  available after allocation
2019-06-18 18:47:38,637 FATAL event.AsyncDispatcher 
(AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Filesystem closed
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:202)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:174)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:169)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1412)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1383)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:427)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:435)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:404)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1379)
at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.ensureAppendEditlogFile(FileSystemNodeLabelsStore.java:107)
at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.updateNodeToLabelsMappings(FileSystemNodeLabelsStore.java:118)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
... 5 more
2019-06-18 18:47:38,637 INFO  capacity.ParentQueue 
(ParentQueue.java:apply(1340)) - assignedContainer queue=root 
usedCapacity=0.08724866 absoluteUsedCapacity=0.08724866 used= cluster=
2019-06-18 18:47:38,637 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2894)) - Allocation proposal accepted
2019-06-18 18:47:38,637 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2900)) - Failed to accept allocation proposal
2019-06-18 18:47:38,637 INFO  capacity.CapacityScheduler 

[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state

2019-06-16 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864996#comment-16864996
 ] 

kyungwan nam commented on YARN-9386:


[~billie.rinaldi], [~wangda]
Sorry for bothering you...
Could you please review this when you are available?
Thanks :)

> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch, 
> YARN-9386.003.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance

2019-06-16 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam reassigned YARN-9628:
--

Assignee: kyungwan nam

> incorrect ‘number of containers’ is written when decommission for 
> non-existing component instance
> -
>
> Key: YARN-9628
> URL: https://issues.apache.org/jira/browse/YARN-9628
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9628.001.patch
>
>
> Decommission for component instances was introduced in YARN-8761.
> Currently, decommission succeeds even though the component instance does 
> not exist.
> As a result, an incorrect ‘number of containers’ is written to the service 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance

2019-06-16 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9628:
---
Attachment: YARN-9628.001.patch

> incorrect ‘number of containers’ is written when decommission for 
> non-existing component instance
> -
>
> Key: YARN-9628
> URL: https://issues.apache.org/jira/browse/YARN-9628
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9628.001.patch
>
>
> Decommission for component instances was introduced in YARN-8761.
> Currently, decommission succeeds even though the component instance does 
> not exist.
> As a result, an incorrect ‘number of containers’ is written to the service 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance

2019-06-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9628:
--

 Summary: incorrect ‘number of containers’ is written when 
decommission for non-existing component instance
 Key: YARN-9628
 URL: https://issues.apache.org/jira/browse/YARN-9628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


Decommission for component instances was introduced in YARN-8761.
Currently, decommission succeeds even though the component instance does 
not exist.
As a result, an incorrect ‘number of containers’ is written to the service 
spec file.
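
For illustration, a self-contained sketch of the kind of validation that would 
avoid this (the class and field names are made up, not the real yarn-service 
classes): check that the instance exists before number_of_containers is 
reduced and persisted.
{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a component's spec, only for illustration.
class ComponentSpecSketch {
  private final Set<String> instanceNames = new HashSet<>();
  private long numberOfContainers;

  ComponentSpecSketch(Set<String> instances) {
    instanceNames.addAll(instances);
    numberOfContainers = instances.size();
  }

  void decommissionInstance(String instanceName) {
    if (!instanceNames.contains(instanceName)) {
      // Reject instead of silently shrinking the container count.
      throw new IllegalArgumentException("Component instance " + instanceName
          + " does not exist; number_of_containers is left unchanged");
    }
    instanceNames.remove(instanceName);
    numberOfContainers--;   // only reached for an instance that really existed
  }

  long getNumberOfContainers() {
    return numberOfContainers;
  }
}
{code}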



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state

2019-06-04 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856300#comment-16856300
 ] 

kyungwan nam commented on YARN-9386:


[~billie.rinaldi], I've attached a new patch including your suggestion.
Thanks


> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch, 
> YARN-9386.003.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state

2019-06-04 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9386:
---
Attachment: YARN-9386.003.patch

> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch, 
> YARN-9386.003.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state

2019-05-31 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852719#comment-16852719
 ] 

kyungwan nam commented on YARN-9386:


Thank you for your comment!

[~billie.rinaldi]
I agree with you. I will upload it shortly.

[~wangda]
Yes, as you said, only the owner or an admin can perform operations like start/stop/destroy.
This is not about granular permissions.

A stopped service can be restarted with its existing configuration whenever we want.
Unlike stop, destroy is irreversible: once destroy is requested, the service is 
deleted permanently.
If a running service is destroyed by mistake, it cannot be recovered.
That is the danger I am thinking of, so destroy should be allowed only for a 
stopped service.
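
The check I have in mind is roughly like the sketch below (hand-written and 
simplified, not the attached patch; getStatus() returning the live Service and 
the destroy helper are assumptions here):
{code:java}
// Sketch only: refuse to destroy a service whose application is still live.
public int actionDestroy(String serviceName) throws IOException, YarnException {
  Service liveService = getStatus(serviceName);   // assumption: current spec/state of the deployed service
  if (liveService != null
      && liveService.getState() != ServiceState.STOPPED) {
    throw new YarnException("Service " + serviceName
        + " is in state " + liveService.getState()
        + ". Stop it before destroying it.");
  }
  return destroyStoppedService(serviceName);      // hypothetical helper holding the existing destroy logic
}
{code}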


> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-05-22 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845571#comment-16845571
 ] 

kyungwan nam commented on YARN-9521:


Please let me know if anyone has any ideas on how to resolve this.
Thanks.

> RM failed to start due to system services
> 
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is due to the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-05-15 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840096#comment-16840096
 ] 

kyungwan nam commented on YARN-9521:


I think the cause of this problem is as follows.

1. _fs_ is set by calling FileSystem.get() in 
SystemServiceManagerImpl.serviceInit.

2. RMAppImpl.appAdminClientCleanUp is called in RMAppImpl.FinalTransition 
if an APP_COMPLETED event occurs during RMStateStore recovery.

{code}
  static void appAdminClientCleanUp(RMAppImpl app) {
try {
  AppAdminClient client = AppAdminClient.createAppAdminClient(app
  .applicationType, app.conf);
  int result = client.actionCleanUp(app.name, app.user);
{code}

ApiServiceClient.actionCleanUp
{code}
  @Override
  public int actionCleanUp(String appName, String userName) throws
  IOException, YarnException {
ServiceClient sc = new ServiceClient();
sc.init(getConfig());
sc.start();
int result = sc.actionCleanUp(appName, userName);
sc.close();
return result;
  }
{code}

The ServiceClient instance obtains a FileSystem by calling FileSystem.get() at 
initialization time, but it might be a cached one.
That cached FileSystem is then closed by _sc.close()_.

3. scanForUserServices is called in SystemServiceManagerImpl.serviceStart, but 
_fs_ has already been closed.
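
As a side note, one way to decouple the handles would be to give the RM-side 
component a private, non-cached FileSystem, so that a close() elsewhere cannot 
invalidate it. This is only a sketch (the class name is made up), not a 
statement of what the attached patch does:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PrivateFsHolder {
  private final FileSystem fs;

  PrivateFsHolder(Configuration conf) throws IOException {
    // newInstance() bypasses FileSystem.CACHE, so a close() issued by some
    // other component on the cached instance cannot turn this handle into
    // "Filesystem closed".
    this.fs = FileSystem.newInstance(conf);
  }

  FileSystem getFs() {
    return fs;
  }

  void close() throws IOException {
    fs.close();   // this holder alone owns and closes its instance
  }
}
{code}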



RM log

{code}

// 1. SystemServiceManagerImpl.serviceInit
//
2019-05-15 10:27:59,445 DEBUG service.AbstractService 
(AbstractService.java:enterState(443)) - Service: 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl entered state 
INITED
2019-05-15 10:27:59,446 INFO  client.SystemServiceManagerImpl 
(SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is 
configured to /services
2019-05-15 10:27:59,472 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3209)) - Loading filesystems
2019-05-15 10:27:59,483 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - file:// = class 
org.apache.hadoop.fs.LocalFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,488 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - viewfs:// = class 
org.apache.hadoop.fs.viewfs.ViewFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,491 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - har:// = class 
org.apache.hadoop.fs.HarFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,492 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - http:// = class 
org.apache.hadoop.fs.http.HttpFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,493 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - https:// = class 
org.apache.hadoop.fs.http.HttpsFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,503 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - hdfs:// = class 
org.apache.hadoop.hdfs.DistributedFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,511 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - webhdfs:// = class 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,512 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - swebhdfs:// = class 
org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem 
(FileSystem.java:loadFileSystems(3221)) - s3n:// = class 
org.apache.hadoop.fs.s3native.NativeS3FileSystem from 
/usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-aws-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem 
(FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs
2019-05-15 10:27:59,514 DEBUG fs.FileSystem 
(FileSystem.java:getFileSystemClass(3268)) - looking for configuration option 
fs.hdfs.impl
2019-05-15 10:27:59,528 DEBUG fs.FileSystem 
(FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for 
implementation class
2019-05-15 10:27:59,528 DEBUG fs.FileSystem 
(FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class 
org.apache.hadoop.hdfs.DistributedFileSystem

// 2. APP_COMPLETED event occurs
//
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:handle(895)) - 
Processing event for application_1556612756829_0001 of type RECOVER
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:recover(933)) - 
Recovering app: application_1556612756829_0001 with 2 attempts and final state 
= FAILED
2019-05-15 10:28:02,931 DEBUG attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:(544)) - yarn.app.attempt.diagnostics.limit.kc : 64

[jira] [Updated] (YARN-9521) RM failed to start due to system services

2019-04-30 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9521:
---
Attachment: YARN-9521.001.patch

> RM failed to start due to system services
> 
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch
>
>
> When starting RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9521) RM failed to start due to system services

2019-04-30 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9521:
--

 Summary: RM failed to start due to system services
 Key: YARN-9521
 URL: https://issues.apache.org/jira/browse/YARN-9521
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam


When starting RM, listing the system services directory failed as follows.

{code}
2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
(SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is 
configured to /services
2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
(SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
initialized to yarn (auth:SIMPLE)
2019-04-30 17:18:25,467 INFO  service.AbstractService 
(AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
state STARTED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
Filesystem closed
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
... 13 more
{code}

It looks like this is caused by the FileSystem cache.
This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
yarn-site
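
For reference, the workaround above amounts to a single property; a minimal 
yarn-site.xml snippet (assuming hdfs:// is the scheme whose cache needs to be 
disabled) would look like:

{code}
<!-- Disable the client-side FileSystem cache for hdfs:// so that every
     FileSystem.get() call returns a fresh instance instead of a shared one. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
{code}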




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9307) node_partitions constraint does not work

2019-04-26 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1682#comment-1682
 ] 

kyungwan nam commented on YARN-9307:


Thank you, [~cheersyang]!

> node_partitions constraint does not work
> 
>
> Key: YARN-9307
> URL: https://issues.apache.org/jira/browse/YARN-9307
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Fix For: 3.1.3
>
> Attachments: YARN-9307.branch-3.1.001.patch
>
>
> When a yarn-service app is submitted with the configuration below, the 
> node_partitions constraint does not work.
> {code}
> …
>  "placement_policy": {
>"constraints": [
>  {
>"type": "ANTI_AFFINITY",
>"scope": "NODE",
>"target_tags": [
>  "ws"
>],
>"node_partitions": [
>  ""
>]
>  }
>]
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state

2019-03-21 Thread kyungwan nam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798650#comment-16798650
 ] 

kyungwan nam commented on YARN-9386:


Attached a new patch, which fixes the test code.

> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}
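
As an illustration only (not necessarily what the attached patches do), the 
destroy path could first ask the RM whether a live application is still 
registered under the service name and refuse to proceed if so. A rough sketch of 
such a guard follows; the class and method names are hypothetical.

{code}
import java.io.IOException;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class DestroyGuard {

  /**
   * Returns true if a live (SUBMITTED/ACCEPTED/RUNNING) yarn-service application
   * with the given name still exists, in which case destroy should be refused.
   */
  public static boolean isServiceStillLive(Configuration conf, String serviceName)
      throws IOException, YarnException {
    try (YarnClient yarnClient = YarnClient.createYarnClient()) {
      yarnClient.init(conf);
      yarnClient.start();
      // Only look at applications that have not finished yet.
      EnumSet<YarnApplicationState> liveStates = EnumSet.of(
          YarnApplicationState.SUBMITTED,
          YarnApplicationState.ACCEPTED,
          YarnApplicationState.RUNNING);
      List<ApplicationReport> reports = yarnClient.getApplications(
          Collections.singleton("yarn-service"), liveStates);
      for (ApplicationReport report : reports) {
        if (serviceName.equals(report.getName())) {
          return true;   // still running: do not destroy
        }
      }
      return false;      // nothing live: safe to destroy
    }
  }
}
{code}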



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state

2019-03-21 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9386:
---
Attachment: YARN-9386.002.patch

> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state

2019-03-14 Thread kyungwan nam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9386:
---
Attachment: YARN-9386.001.patch

> destroying yarn-service is allowed even though running state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


