[jira] [Created] (YARN-10721) YARN Service containers are restarted when RM failover

2021-03-29 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10721:
---

 Summary: YARN Service containers are restarted when RM failover
 Key: YARN-10721
 URL: https://issues.apache.org/jira/browse/YARN-10721
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


Our cluster has a large number of NMs.
After an RM failover, it took 7 minutes for most of the NMs to register with the new RM.
Afterward, I observed that a lot of containers were restarted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10603) Failed to reinitialize for recovered container

2021-01-31 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10603:
---

 Summary: Failed to reinitialize for recovered container
 Key: YARN-10603
 URL: https://issues.apache.org/jira/browse/YARN-10603
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


A container reinitialization request does not work after the NM is restarted.

I found the following problems:

- When a recovered container is terminated, it always exits with either 
CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
- The container's *recoveredStatus* is set at the time of NM recovery and is 
never cleared, even after the container is terminated.
As a result, a newly reinitialized container is launched as a recovered 
container, which does not work.
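The problem can be sketched with a simplified, hypothetical model (illustrative names only, not the actual NM classes): the recovery marker has to be cleared when the container terminates, otherwise a later reinitialize goes down the recovery path.

{code:java}
// Simplified, hypothetical model of the lifecycle described above.
// Names are illustrative; this is not the actual YARN NM code.
enum RecoveredStatus { NONE, LAUNCHED }

class ContainerModel {
    private RecoveredStatus recoveredStatus = RecoveredStatus.NONE;

    void onNmRecovery() {
        // Set once when the NM recovers the container after a restart.
        recoveredStatus = RecoveredStatus.LAUNCHED;
    }

    void onTerminated() {
        // The missing step this report points out: clear the recovery marker
        // when the container exits, so a later reinitialize starts fresh.
        recoveredStatus = RecoveredStatus.NONE;
    }

    boolean launchesAsRecovered() {
        return recoveredStatus != RecoveredStatus.NONE;
    }
}
{code}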






[jira] [Created] (YARN-10567) Support parallelism for YARN Service

2021-01-11 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10567:
---

 Summary: Support parallelism for YARN Service
 Key: YARN-10567
 URL: https://issues.apache.org/jira/browse/YARN-10567
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: kyungwan nam


YARN Service supports job-like workloads via the "restart_policy" introduced in YARN-8080.
However, there is no way to set how many containers can be launched concurrently.
This feature would be similar to "parallelism" for Kubernetes Jobs:
https://kubernetes.io/docs/concepts/workloads/controllers/job/
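The requested behavior could look like the following sketch (a hypothetical helper, not the YARN Service API): a parallelism limit gates concurrent container launches with a semaphore.

{code:java}
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only, not the YARN Service API: a "parallelism" limit
// caps how many container launches run at once, similar to the Kubernetes
// Job field linked above.
class ParallelLauncher {
    private final Semaphore slots;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxObserved = new AtomicInteger();

    ParallelLauncher(int parallelism) {
        slots = new Semaphore(parallelism);
    }

    void launch(Runnable containerStart) throws InterruptedException {
        slots.acquire();                      // block while the limit is reached
        try {
            int now = inFlight.incrementAndGet();
            maxObserved.accumulateAndGet(now, Math::max);
            containerStart.run();
        } finally {
            inFlight.decrementAndGet();
            slots.release();
        }
    }

    int maxConcurrency() { return maxObserved.get(); }
}
{code}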






[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM

2020-06-02 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10305:
---

 Summary: Lost system-credentials when restarting RM
 Key: YARN-10305
 URL: https://issues.apache.org/jira/browse/YARN-10305
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


System-credentials, introduced in YARN-2704, keep long-running apps working.
I ran into a situation where the system-credentials were lost when the RM was restarted.
After that, if an app's AM stopped, restarting the AM failed because the NMs 
no longer had the HDFS delegation token needed for resource localization.


The app has several delegation tokens, including a timeline-server token and 
an HDFS delegation token.
When the RM restarts, it requests a new HDFS delegation token for an app that 
was submitted long ago. (This was fixed by YARN-5098.)
However, if an app has several delegation tokens and an exception occurs on 
the first token processed, the remaining tokens are not processed.
I think that is why the system-credentials were lost.
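The fix this implies can be sketched as follows (hypothetical names, not the actual DelegationTokenRenewer code): process each token independently so one expired token does not abort the rest.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: renew each delegation token independently
// instead of failing fast on the first bad one.
class TokenRecovery {
    interface Token { void renew() throws IOException; }

    // Returns the tokens that were renewed successfully.
    static List<Token> renewAll(List<Token> tokens) {
        List<Token> renewed = new ArrayList<>();
        for (Token t : tokens) {
            try {
                t.renew();
                renewed.add(t);
            } catch (IOException e) {
                // Log and skip: an expired timeline token should not prevent
                // the HDFS delegation token from being processed.
            }
        }
        return renewed;
    }
}
{code}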

Here are RM’s logs at the time of restarting RM.
{code}
2020-05-19 14:25:05,712 WARN  security.DelegationTokenRenewer 
(DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to 
add the application to the delegation token renewer on recovery.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [403], message 
[org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to 
renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 
10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:512)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626)
at java.security.AccessController.doPrivileged(Native Method)
at 

[jira] [Created] (YARN-10267) Add description, version as allocationTags for YARN Service

2020-05-14 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10267:
---

 Summary: Add description, version as allocationTags for YARN 
Service   
 Key: YARN-10267
 URL: https://issues.apache.org/jira/browse/YARN-10267
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam


The applicationTags for a YARN Service contain only the service name.

This makes it difficult to identify what kind of app it is.

It would be good if the description and version were also added to the applicationTags.






[jira] [Created] (YARN-10262) Support application ACLs for YARN Service

2020-05-10 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10262:
---

 Summary: Support application ACLs for YARN Service
 Key: YARN-10262
 URL: https://issues.apache.org/jira/browse/YARN-10262
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, a user can access only their own yarn-service.
There is no way to access another user's yarn-service, which makes it 
difficult for users to collaborate.
Users should be able to set application ACLs for a yarn-service, similar to 
mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job for MapReduce.






[jira] [Created] (YARN-10206) Service stuck in the STARTED state when it has a component having no instance

2020-03-24 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10206:
---

 Summary: Service stuck in the STARTED state when it has a 
component having no instance
 Key: YARN-10206
 URL: https://issues.apache.org/jira/browse/YARN-10206
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam



* 'compb' has no instance, i.e., its 'number_of_containers' is 0.
* 'compb' has a dependency on 'compa'.

{code}
"components": [
  {
    "name": "compa",
    "number_of_containers": 1,
    "dependencies": []
  },
  {
    "name": "compb",
    "number_of_containers": 0,
    "dependencies": [
      "compa"
    ]
  }
]
{code}
When the service is launched, it gets stuck in the STARTED state.
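The underlying readiness rule can be sketched like this (illustrative code, not the actual ServiceScheduler logic): a dependency is satisfied when all desired instances are ready, which must hold trivially for a component whose number_of_containers is 0.

{code:java}
// Illustrative sketch only: a component with zero desired instances has
// nothing to launch, so it must count as ready; otherwise dependent
// components never start and the service stays in STARTED forever.
class ReadinessCheck {
    static boolean isReady(int desiredInstances, int readyInstances) {
        return readyInstances >= desiredInstances;
    }
}
{code}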







[jira] [Created] (YARN-10203) Stuck in express_upgrading if there is any component which has no instance

2020-03-20 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10203:
---

 Summary: Stuck in express_upgrading if there is any component 
which has no instance
 Key: YARN-10203
 URL: https://issues.apache.org/jira/browse/YARN-10203
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


I was trying the "express upgrade" introduced in YARN-8298.
https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceUpgrade.html

However, the service state got stuck in EXPRESS_UPGRADING.
This happens only if there is a component that has no instance 
("number_of_containers": 0).

A component that has no instance should be excluded from the upgrade targets.







[jira] [Created] (YARN-10196) destroying app leaks zookeeper connection

2020-03-13 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10196:
---

 Summary: destroying app leaks zookeeper connection
 Key: YARN-10196
 URL: https://issues.apache.org/jira/browse/YARN-10196
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


When an app is destroyed, the curatorClient in ServiceClient is started but 
never closed, leaking a ZooKeeper connection.






[jira] [Created] (YARN-10190) Typo in NMClientAsyncImpl

2020-03-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10190:
---

 Summary: Typo in NMClientAsyncImpl
 Key: YARN-10190
 URL: https://issues.apache.org/jira/browse/YARN-10190
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


Small typos in NMClientAsyncImpl.java:

* ReInitializeContainerEvevnt -> ReInitializeContainerEvent
* containerLaunchContex -> containerLaunchContext






[jira] [Created] (YARN-10184) NPE happens in NMClient when reinitializeContainer

2020-03-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10184:
---

 Summary: NPE happens in NMClient when reinitializeContainer
 Key: YARN-10184
 URL: https://issues.apache.org/jira/browse/YARN-10184
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


An NPE happens in NMClient when upgrading a yarn-service app whose AM has 
been restarted.
Here is the AM's log at the time of the NPE.

{code}
2020-02-20 16:43:35,962 [Container  Event Dispatcher] ERROR 
yarn.YarnUncaughtExceptionHandler - Thread Thread[Container  Event 
Dispatcher,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$1.run(NMClientAsyncImpl.java:172)
2020-02-20 16:43:36,398 [AMRM Callback Handler Thread] WARN  
service.ServiceScheduler - Container container_e58_1581930783345_1954_01_06 
Completed. No component instance exists. exitStatus=-100. diagnostics=Container 
released by application 
{code}

NMClient tracks containers from the time they are started.
However, when the AM is restarted, NMClient is re-initialized and the 
previously tracked containers are lost.
After that, an NPE occurs whenever reinitializeContainer is requested.








[jira] [Created] (YARN-10119) Cannot reset the AM failure count for YARN Service

2020-02-06 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10119:
---

 Summary: Cannot reset the AM failure count for YARN Service
 Key: YARN-10119
 URL: https://issues.apache.org/jira/browse/YARN-10119
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, YARN Service does not support resetting the AM failure count, a 
feature introduced in YARN-611.

Since the AM failure count is never reset, it will eventually reach 
yarn.service.am-restart.max-attempts and the app will be stopped.
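The reset semantics from YARN-611 can be sketched as follows (a hypothetical helper, not the actual RM code): failures older than a validity interval stop counting toward the maximum attempts.

{code:java}
// Illustrative sketch only: count just the AM failures that happened within
// the validity interval ending at 'now' (all times in milliseconds).
class AmFailureWindow {
    static int countRecentFailures(long[] failureTimes, long now, long validityMs) {
        int count = 0;
        for (long t : failureTimes) {
            if (now - t <= validityMs) {
                count++;
            }
        }
        return count;
    }
}
{code}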






[jira] [Created] (YARN-10034) Allocation tags are not removed when node decommission

2019-12-16 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10034:
---

 Summary: Allocation tags are not removed when node decommission
 Key: YARN-10034
 URL: https://issues.apache.org/jira/browse/YARN-10034
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


When a node is decommissioned, the allocation tags attached to that node are 
not removed.
I could see the allocation tags revive when the node was recommissioned.

Since YARN-8511, the RM removes allocation tags only after the NM confirms 
the container releases, but a decommissioned NM no longer connects to the RM.
Once a node is decommissioned, the allocation tags attached to it should be 
removed immediately.






[jira] [Created] (YARN-10021) NPE in YARN Registry DNS when wrong DNS message is incoming

2019-12-09 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10021:
---

 Summary: NPE in YARN Registry DNS when wrong DNS message is 
incoming
 Key: YARN-10021
 URL: https://issues.apache.org/jira/browse/YARN-10021
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


I hit an NPE in YARN Registry DNS, as shown below.
It appears to happen when the incoming DNS request is malformed.

{code:java}
2019-11-29 10:51:12,178 ERROR dns.RegistryDNS (RegistryDNS.java:call(932)) - 
Error initializing DNS UDP listener
java.lang.NullPointerException
at java.nio.ByteBuffer.put(ByteBuffer.java:859)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
(ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception 
when running task in RegistryDNS 1
2019-11-29 10:51:12,180 WARN  concurrent.ExecutorHelper 
(ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
thread RegistryDNS 1:
java.lang.NullPointerException
at java.nio.ByteBuffer.put(ByteBuffer.java:859)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930)
at 
org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
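The trace suggests that parsing a malformed request can leave a null reply buffer, which ByteBuffer.put then rejects with an NPE. A defensive pattern could look like this (illustrative only, not the actual RegistryDNS fix):

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only: guard the reply bytes before ByteBuffer.put,
// since ByteBuffer.put(byte[]) throws NullPointerException on a null array.
class DnsReplyWriter {
    // Returns true when a reply was written, false when the request
    // could not be answered (e.g. an unparseable message).
    static boolean writeReply(ByteBuffer out, byte[] reply) {
        if (reply == null) {
            return false;   // drop the datagram instead of crashing the listener
        }
        out.put(reply);
        return true;
    }
}
{code}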







[jira] [Created] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner

2019-11-18 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9986:
--

 Summary: signalToContainer REST API does not work even if 
requested by the app owner
 Key: YARN-9986
 URL: https://issues.apache.org/jira/browse/YARN-9986
 Project: Hadoop YARN
  Issue Type: Bug
  Components: restapi
Reporter: kyungwan nam
Assignee: kyungwan nam


The signalToContainer REST API introduced in YARN-8693 does not work even 
when requested by the app owner.
It works only when requested by an admin user.

{code}
$ kinit kwnam
Password for kw...@test.org:
$ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
{"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception:
 Only admins can carry out this 
operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$
$ kinit admin
Password for ad...@test.org:
$
$ curl  -H 'Content-Type: application/json' --negotiate -u : -X POST 
https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN
$
{code}

In contrast, the app owner can do it from the command line, as shown below.

{code}
$ kinit kwnam
Password for kw...@test.org:
$ yarn container -signal container_e58_1573625560605_29927_01_02  
GRACEFUL_SHUTDOWN
Signalling container container_e58_1573625560605_29927_01_02
2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container 
container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN
2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
$
{code}
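The expected authorization rule can be sketched like this (hypothetical helper, not the actual RMWebServices code): the REST endpoint should accept the app owner as well as admins, matching the CLI behavior above.

{code:java}
import java.util.Set;

// Illustrative sketch only: allow the signal operation for admins or the
// application owner, instead of admins only.
class SignalAuth {
    static boolean canSignal(String caller, String appOwner, Set<String> admins) {
        return admins.contains(caller) || caller.equals(appOwner);
    }
}
{code}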






[jira] [Created] (YARN-9953) YARN Service dependency should be configurable for each app

2019-11-04 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9953:
--

 Summary: YARN Service dependency should be configurable for each 
app
 Key: YARN-9953
 URL: https://issues.apache.org/jira/browse/YARN-9953
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


Currently, the YARN Service dependency can be set via yarn.service.framework.path.
However, only the value configured in the RM takes effect.
This makes it impossible for a user to choose their own YARN Service dependency.
It should be configurable per app.






[jira] [Created] (YARN-9929) NodeManager OOM because of stuck DeletionService

2019-10-22 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9929:
--

 Summary: NodeManager OOM because of stuck DeletionService
 Key: YARN-9929
 URL: https://issues.apache.org/jira/browse/YARN-9929
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam
Assignee: kyungwan nam


NMs go through frequent full GCs due to a lack of heap memory.
The heap dump shows a large number of FileDeletionTask and 
DockerContainerDeletionTask objects (screenshot attached).

Analyzing the thread dump shows that _DeletionService_ gets stuck in 
_executeStatusCommand_, which runs 'docker inspect':
{code:java}
"DeletionService #0" - Thread t@41
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <3e45c938> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <3e45c938> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
at org.apache.hadoop.util.Shell.run(Shell.java:902)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
{code}
We also found 'docker inspect' processes that had been running for a long 
time, as follows.
{code:java}
 root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e30_1555419799458_0014_01_30
root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e50_1561100493387_25316_01_001455
root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e49_1560851258686_2107_01_24
root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 
/usr/bin/docker inspect --format={{.State.Status}} 
container_e50_1561100493387_8111_01_002657{code}
 

I think this started happening after the docker daemon was restarted.
A 'docker inspect' that was run while the docker daemon was restarting did 
not work, and it was never terminated either.

This can be considered a docker issue, but it can happen whenever 
'docker inspect' hangs, whether due to a docker daemon restart or a docker bug.
It would be good to set a timeout for 'docker inspect' to avoid this issue.
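The proposed mitigation can be sketched as follows (illustrative code, not the LinuxContainerExecutor implementation): run the external command with a hard timeout so a hung 'docker inspect' cannot block a DeletionService thread forever.

{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: run an external command with a hard timeout,
// destroying the process if it does not finish in time.
class TimedCommand {
    // Returns the exit code, or -1 if the command timed out.
    static int runWithTimeout(long timeoutSeconds, String... cmd)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            return -1;
        }
        return p.exitValue();
    }
}
{code}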

 





[jira] [Created] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs

2019-10-15 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9905:
--

 Summary: yarn-service is failed to setup application log if 
app-log-dir is not default-fs
 Key: YARN-9905
 URL: https://issues.apache.org/jira/browse/YARN-9905
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


Currently, yarn-service obtains a token for the default namenode only.
 This can cause an authentication failure under HDFS federation.

How to reproduce:
 - kerberized cluster
 - multiple namespaces via HDFS federation
 - yarn.nodemanager.remote-app-log-dir set to a namespace that is not the 
default-fs

Here are the nodemanager logs at that time:
{code:java}
2019-10-15 11:52:50,217 INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new 
application reference for app application_1569373267731_9571
2019-10-15 11:52:50,217 INFO  application.ApplicationImpl 
(ApplicationImpl.java:handle(655)) - Application application_1569373267731_9571 
transitioned from NEW to INITING
...

 Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]
at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:204)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:347)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:69)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 

[jira] [Created] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2019-08-28 Thread kyungwan nam (Jira)
kyungwan nam created YARN-9790:
--

 Summary: Failed to set default-application-lifetime if 
maximum-application-lifetime is less than or equal to zero
 Key: YARN-9790
 URL: https://issues.apache.org/jira/browse/YARN-9790
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


capacity-scheduler
{code}
...
yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
{code}

refreshQueues failed as follows:

{code}
2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
(AdminService.java:logAndWrapException(910)) - Exception refresh queues.
java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
exceed maximum lifetime -1
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
at 
org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
at 
org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
lifetime604800 can't exceed maximum lifetime -1
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
... 12 more
{code}
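The failure above suggests the validation does not treat a non-positive maximum lifetime as "unlimited". Below is a minimal sketch of the intended check, assuming (as the lifetime properties imply) that a maximum-application-lifetime <= 0 means unlimited; this is a simplified illustration, not the actual LeafQueue code:

```java
public class LifetimeCheck {
  // Returns the effective default lifetime, or throws on a truly invalid pair.
  // A non-positive maximum is treated as "unlimited", so any default is valid.
  static long validate(long maxLifetime, long defaultLifetime) {
    if (maxLifetime > 0 && defaultLifetime > maxLifetime) {
      throw new IllegalArgumentException("Default lifetime " + defaultLifetime
          + " can't exceed maximum lifetime " + maxLifetime);
    }
    return defaultLifetime;
  }

  public static void main(String[] args) {
    // maximum of -1 (unlimited) with a default of 604800 should be accepted
    System.out.println(validate(-1, 604800));
  }
}
```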




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM

2019-08-02 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9719:
--

 Summary: Failed to restart yarn-service if it doesn’t exist in RM
 Key: YARN-9719
 URL: https://issues.apache.org/jira/browse/YARN-9719
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam
Assignee: kyungwan nam


Sometimes, restarting a yarn-service fails as follows.

{code}
{"diagnostics":"Application with id 'application_1562735362534_10461' doesn't 
exist in RM. Please check that the job submission was successful.\n\tat 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat
 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat
 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat 
java.security.AccessController.doPrivileged(Native Method)\n\tat 
javax.security.auth.Subject.doAs(Subject.java:422)\n\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"}
{code}

It seems to occur when restarting a yarn-service that was stopped long ago.
By default, the RM keeps up to 1000 completed applications
(yarn.resourcemanager.max-completed-applications).
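For reference, that retention limit is set in yarn-site.xml; a sketch of raising it (the value shown is illustrative, the default is 1000):

```xml
<property>
  <!-- Maximum number of completed applications the RM keeps in memory.
       A service stopped longer ago than this window can fail to restart. -->
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>10000</value>
</property>
```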






[jira] [Created] (YARN-9703) Failed to cancel yarn service upgrade when canceling multiple times

2019-07-25 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9703:
--

 Summary: Failed to cancel yarn service upgrade when canceling 
multiple times
 Key: YARN-9703
 URL: https://issues.apache.org/jira/browse/YARN-9703
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


sleeptest.yarnfile
{code:java}
{
   "name":"sleeptest",
   "version":"1.0.0",
   "lifetime":"-1",
   "components":[
  {
 "name":"sleep",
 "number_of_containers":3,
…
}
{code}
How to reproduce:
 * initiate upgrade
 * upgrade instance sleep-0
 * cancel upgrade -> it succeeded without any problem
 * initiate upgrade
 * upgrade instance sleep-0
 * cancel upgrade -> it didn’t work; at that time, the AM logs are as follows.

{code:java}
2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
instance.ComponentInstance - container_e72_1564103075282_0002_01_04 pending 
cancellation
2019-07-26 10:12:20,057 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-2 : 
container_e72_1564103075282_0002_01_04] Transitioned from READY to 
CANCEL_UPGRADING on CANCEL_UPGRADE event
{code}






[jira] [Created] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing

2019-07-22 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9691:
--

 Summary: canceling upgrade does not work if upgrade failed 
container is existing
 Key: YARN-9691
 URL: https://issues.apache.org/jira/browse/YARN-9691
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


If a container fails to upgrade during a yarn service upgrade, the container is 
released and transitions to the FAILED_UPGRADE state.
After that, I expected it could go back to the previous version using 
cancel-upgrade, but it didn’t work.
The AM log at that time is as follows.

{code}
# failed to upgrade container_e62_1563179597798_0006_01_08

2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  
service.ClientAMService - Upgrade container 
container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,153 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from READY to 
UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  
registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08]: Deleting registry path 
/users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - 
[COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 
1.0.1 : Creating dir on hdfs: 
hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.ContainerLaunchService - reInitializing container 
container_e62_1563179597798_0006_01_08 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
containerlaunch.AbstractLauncher - yarn docker env var has been set 
{LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, 
YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM,
 WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, 
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, 
LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, 
YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=}
2019-07-16 18:21:55,158 
[org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  
impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for 
Container container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
UPGRADING -> RUNNING_BUT_UNREADY
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] retrieve status after 30
2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to 
REINITIALIZED on START event
2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 
2019", outcome="failure", message="Failure in Default probe: IP presence", 
exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:08,225 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
container_e62_1563179597798_0006_01_08] spec state state changed from 
RUNNING_BUT_UNREADY -> FAILED_UPGRADE

# request canceling upgrade 

2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_04 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container container_e62_1563179597798_0006_01_03 true
2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - 
Upgrade container 

[jira] [Created] (YARN-9682) wrong log message when finalize upgrade

2019-07-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9682:
--

 Summary: wrong log message when finalize upgrade
 Key: YARN-9682
 URL: https://issues.apache.org/jira/browse/YARN-9682
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


I've seen the following wrong message when finalizing an upgrade for a yarn-service; the service name placeholder ({}) is left unsubstituted:
{code:java}
2019-07-16 17:44:09,204 INFO  client.ServiceClient 
(ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
upgrade{code}
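The symptom above can be reproduced in a few lines of plain Java. This is a minimal stand-in for an SLF4J-style "{}" template whose argument was never substituted, not the actual ServiceClient code:

```java
public class LogTemplate {
  // Tiny stand-in for SLF4J's single-argument "{}" substitution.
  static String format(String template, Object arg) {
    return template.replaceFirst("\\{\\}", String.valueOf(arg));
  }

  public static void main(String[] args) {
    String template = "Finalize service {} upgrade";
    System.out.println(template);                      // buggy: placeholder survives
    System.out.println(format(template, "sleeptest")); // fixed: name substituted
  }
}
```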






[jira] [Created] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance

2019-06-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9628:
--

 Summary: incorrect ‘number of containers’ is written when 
decommission for non-existing component instance
 Key: YARN-9628
 URL: https://issues.apache.org/jira/browse/YARN-9628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


Decommission for component instances was introduced in YARN-8761.
Currently, the decommission succeeds even though the component instance does 
not exist.
As a result, an incorrect ‘number of containers’ is written to the service 
spec file.






[jira] [Created] (YARN-9521) RM failed to start due to system services

2019-04-30 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9521:
--

 Summary: RM failed to start due to system services
 Key: YARN-9521
 URL: https://issues.apache.org/jira/browse/YARN-9521
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam


When starting the RM, listing the system services directory failed as follows.

{code}
2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
(SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is 
configured to /services
2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
(SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
initialized to yarn (auth:SIMPLE)
2019-04-30 17:18:25,467 INFO  service.AbstractService 
(AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
state STARTED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
Filesystem closed
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
at 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
... 13 more
{code}

It looks like this is caused by the use of the FileSystem cache.
The issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
yarn-site.
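The workaround mentioned above, as a yarn-site.xml fragment:

```xml
<property>
  <!-- Disable the shared FileSystem cache so a close() elsewhere cannot
       invalidate the instance used by the system-services scanner. -->
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```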







[jira] [Created] (YARN-9386) destroying yarn-service is allowed even though running state

2019-03-14 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9386:
--

 Summary: destroying yarn-service is allowed even though running 
state
 Key: YARN-9386
 URL: https://issues.apache.org/jira/browse/YARN-9386
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


It looks very dangerous to destroy a running app. It should not be allowed.

{code}
[yarn-ats@test ~]$ yarn app -list
19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
test1.com/10.1.1.11:8050
19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
server at test1.com/10.1.1.101:10200
Total number of applications (application-types: [], states: [SUBMITTED, 
ACCEPTED, RUNNING] and tags: []):3
Application-Id  Application-NameApplication-Type
  User   Queue   State Final-State  
   ProgressTracking-URL
application_1551250841677_0003fbyarn-service
 ambari-qa default RUNNING   UNDEFINED  
   100% N/A
application_1552379723611_0002   fb1yarn-service
  yarn-ats default RUNNING   UNDEFINED  
   100% N/A
application_1550801435420_0001 ats-hbaseyarn-service
  yarn-ats default RUNNING   UNDEFINED  
   100% N/A
[yarn-ats@test ~]$ yarn app -destroy fb1
19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
test1.com/10.1.1.11:8050
19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
server at test1.com/10.1.1.101:10200
19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
test1.com/10.1.1.11:8050
19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
server at test1.com/10.1.1.101:10200
19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed service 
fb1

{code}






[jira] [Created] (YARN-9307) node_partitions constraint does not work

2019-02-14 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9307:
--

 Summary: node_partitions constraint does not work
 Key: YARN-9307
 URL: https://issues.apache.org/jira/browse/YARN-9307
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: kyungwan nam


When a yarn-service app is submitted with the configuration below, the 
node_partitions constraint does not work.

{code}
…
 "placement_policy": {
   "constraints": [
 {
   "type": "ANTI_AFFINITY",
   "scope": "NODE",
   "target_tags": [
 "ws"
   ],
   "node_partitions": [
 ""
   ]
 }
   ]
 }
{code}







[jira] [Created] (YARN-9243) Support limiting network outbound bandwidth for multiple interfaces

2019-01-27 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9243:
--

 Summary: Support limiting network outbound bandwidth for multiple 
interfaces
 Key: YARN-9243
 URL: https://issues.apache.org/jira/browse/YARN-9243
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam


YARN-3366 introduced limiting network outbound bandwidth; currently it is 
available for only one network interface.
But we need to set limits for multiple interfaces in some circumstances.
It would be good if the limit could be set for multiple interfaces.






[jira] [Created] (YARN-9197) NPE in service AM when failed to launch container

2019-01-14 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-9197:
--

 Summary: NPE in service AM when failed to launch container
 Key: YARN-9197
 URL: https://issues.apache.org/jira/browse/YARN-9197
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: kyungwan nam


I’ve met NPE in service AM as follows.

{code}
2019-01-02 22:35:47,582 [Component  dispatcher] INFO  component.Component - 
[COMPONENT regionserver]: Assigned container_e15_1542704944343_0001_01_01 
to component instance regionserver-1 and launch on host test2.com:45454 
2019-01-02 22:35:47,588 [pool-6-thread-5] WARN  ipc.Client - Exception 
encountered while connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for yarn-ats: HDFS_DELEGATION_TOKEN owner=yarn-ats, renewer=yarn, 
realUser=rm/test1.nfra...@example.com, issueDate=1542704946397, 
maxDate=1543309746397, sequenceNumber=97, masterKeyId=90) can't be found in 
cache
2019-01-02 22:35:47,592 [pool-6-thread-5] ERROR 
containerlaunch.ContainerLaunchService - [COMPINSTANCE regionserver-1 : 
container_e15_1542704944343_0001_01_01]: Failed to launch container.
java.io.IOException: Package doesn't exist as a resource: 
/hdp/apps/3.0.0.0-1634/hbase/hbase.tar.gz
at 
org.apache.hadoop.yarn.service.provider.tarball.TarballProviderService.processArtifact(TarballProviderService.java:41)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:144)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:107)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-01-02 22:35:47,592 [Component  dispatcher] INFO  component.Component - 
[COMPONENT regionserver] Requesting for 1 container(s)
2019-01-02 22:35:47,592 [Component  dispatcher] INFO  component.Component - 
[COMPONENT regionserver] Submitting scheduling request: 
SchedulingRequestPBImpl{priority=1, allocationReqId=1, executionType={Execution 
Type: GUARANTEED, Enforce Execution Type: true}, allocationTags=[regionserver], 
resourceSizing=ResourceSizingPBImpl{numAllocations=1, resources=}, placementConstraint=notin,node,regionserver}
2019-01-02 22:35:47,593 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE regionserver-1 : 
container_e15_1542704944343_0001_01_01]: 
container_e15_1542704944343_0001_01_01 completed. Reinsert back to pending 
list and requested a new container.
 exitStatus=null, diagnostics=failed before launch
2019-01-02 22:35:47,593 [Component  dispatcher] INFO  
instance.ComponentInstance - Publishing component instance status 
container_e15_1542704944343_0001_01_01 FAILED 
2019-01-02 22:35:47,593 [Component  dispatcher] ERROR service.ServiceScheduler 
- [COMPINSTANCE regionserver-1 : container_e15_1542704944343_0001_01_01]: 
Error in handling event type STOP
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.service.component.instance.ComponentInstance.handleComponentInstanceRelaunch(ComponentInstance.java:342)
at 
org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStoppedTransition.transition(ComponentInstance.java:482)
at 
org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStoppedTransition.transition(ComponentInstance.java:375)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.service.component.instance.ComponentInstance.handle(ComponentInstance.java:679)
at 
org.apache.hadoop.yarn.service.ServiceScheduler$ComponentInstanceEventHandler.handle(ServiceScheduler.java:654)
at 
org.apache.hadoop.yarn.service.ServiceScheduler$ComponentInstanceEventHandler.handle(ServiceScheduler.java:643)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
{code}




[jira] [Created] (YARN-8935) stopped yarn service app does not show when "yarn app -list"

2018-10-23 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8935:
--

 Summary: stopped yarn service app does not show when "yarn app 
-list"
 Key: YARN-8935
 URL: https://issues.apache.org/jira/browse/YARN-8935
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
 Environment: stopped yarn service app can be re-started or destroyed 
even if it does not exist in RM.
The listing should also include stopped yarn service apps.

{code}
$ yarn app -list
18/10/23 15:24:19 INFO client.RMProxy: Connecting to ResourceManager at 
a.com/10.1.1.100:8050
18/10/23 15:24:19 INFO client.AHSProxy: Connecting to Application History 
server at a.com/10.1.1.100:10200
Total number of applications (application-types: [], states: [SUBMITTED, 
ACCEPTED, RUNNING] and tags: []):0
Application-Id  Application-NameApplication-Type
  User   Queue   State Final-State  
   ProgressTracking-URL
$
$ yarn app -destroy ats-hbase
18/10/23 15:24:50 INFO client.RMProxy: Connecting to ResourceManager at 
a.com/10.1.1.100:8050
18/10/23 15:24:51 INFO client.AHSProxy: Connecting to Application History 
server at a.com/10.1.1.100:10200
18/10/23 15:24:51 INFO client.RMProxy: Connecting to ResourceManager at 
a.com/10.1.1.100:8050
18/10/23 15:24:51 INFO client.AHSProxy: Connecting to Application History 
server at a.com/10.1.1.100:10200
18/10/23 15:24:51 INFO util.log: Logging initialized @1617ms
18/10/23 15:24:52 INFO client.ApiServiceClient: Successfully destroyed service 
ats-hbase
{code}
Reporter: kyungwan nam









[jira] [Created] (YARN-8694) app flex with relative changes does not work

2018-08-21 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8694:
--

 Summary: app flex with relative changes does not work
 Key: YARN-8694
 URL: https://issues.apache.org/jira/browse/YARN-8694
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.1.1
Reporter: kyungwan nam


I'd like to increase the number of containers by 2 as below.
{code:java}
yarn app -flex my-sleeper -component sleeper +2{code}
But it did not work; it seems to set the count to 2 rather than increase it by 2.

 

ApiServiceClient.actionFlex
{code:java}
@Override
public int actionFlex(String appName, Map<String, String> componentCounts)
throws IOException, YarnException {
  int result = EXIT_SUCCESS;
  try {
Service service = new Service();
service.setName(appName);
service.setState(ServiceState.FLEX);
for (Map.Entry<String, String> entry : componentCounts.entrySet()) {
  Component component = new Component();
  component.setName(entry.getKey());

  Long numberOfContainers = Long.parseLong(entry.getValue());
  component.setNumberOfContainers(numberOfContainers);
  service.addComponent(component);
}
String buffer = jsonSerDeser.toJson(service);
ClientResponse response = getApiClient(getServicePath(appName))
.put(ClientResponse.class, buffer);{code}
It looks like there is no code that handles “+” or “-” in 
ApiServiceClient.actionFlex.
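The missing handling could look like the sketch below: a count with a leading sign is interpreted relative to the current container count, and an unsigned count as an absolute target. This is a simplified stand-in, not the actual ApiServiceClient code:

```java
public class FlexCount {
  // Resolve a flex request such as "+2", "-1", or "4" against the current count.
  static long resolve(long current, String requested) {
    if (requested.isEmpty()) {
      throw new IllegalArgumentException("empty container count");
    }
    char c = requested.charAt(0);
    if (c == '+' || c == '-') {
      // Long.parseLong accepts a leading sign, so "+2" parses as 2 and "-2" as -2.
      return current + Long.parseLong(requested);
    }
    return Long.parseLong(requested); // absolute target
  }

  public static void main(String[] args) {
    System.out.println(resolve(3, "+2")); // relative increase
    System.out.println(resolve(3, "2"));  // absolute target
  }
}
```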






[jira] [Created] (YARN-8421) when moving app, activeUsers is increased, even though app does not have outstanding request

2018-06-12 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8421:
--

 Summary: when moving app, activeUsers is increased, even though 
app does not have outstanding request 
 Key: YARN-8421
 URL: https://issues.apache.org/jira/browse/YARN-8421
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.4
Reporter: kyungwan nam


All containers for app1 have been allocated.
Move app1 from the default queue to the test queue as follows.
{code}
  yarn rmadmin application -movetoqueue app1 -queue test
{code}
_activeUsers_ of the test queue is increased even though app1 does not 
have any outstanding request.







[jira] [Created] (YARN-8269) NPE happens when submit distributed shell with non-existing queue

2018-05-09 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8269:
--

 Summary: NPE happens when submit distributed shell with 
non-existing queue
 Key: YARN-8269
 URL: https://issues.apache.org/jira/browse/YARN-8269
 Project: Hadoop YARN
  Issue Type: Bug
  Components: distributed-shell
Reporter: kyungwan nam


When submitting the distributed shell with a non-existing queue, a NullPointerException 
happens as follows.
{code:java}
18/05/04 12:20:20 INFO distributedshell.Client: Initializing Client
18/05/04 12:20:20 INFO distributedshell.Client: Running Client
18/05/04 12:20:20 INFO client.AHSProxy: Connecting to Application History 
server at node1/10.1.1.1:10200
18/05/04 12:20:21 INFO distributedshell.Client: Got Cluster metric info from 
ASM, numNodeManagers=1
18/05/04 12:20:21 INFO distributedshell.Client: Got Cluster node info from ASM
18/05/04 12:20:21 INFO distributedshell.Client: Got node report from ASM for, 
nodeId=node2:45454, nodeAddressnode2:8042, nodeRackName/rack-9, 
nodeNumContainers0
18/05/04 12:20:21 FATAL distributedshell.Client: Error running Client
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:462)
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:215)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
{code}
 
The NPE masks the real error (the non-existing queue), which makes it difficult to find the cause.






[jira] [Created] (YARN-8179) Preemption does not happen due to natural_termination_factor when DRF is used

2018-04-19 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8179:
--

 Summary: Preemption does not happen due to 
natural_termination_factor when DRF is used
 Key: YARN-8179
 URL: https://issues.apache.org/jira/browse/YARN-8179
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam



cluster
* DominantResourceCalculator
* QueueA : 50 (capacity) ~ 100 (max capacity)
* QueueB : 50 (capacity) ~ 50 (max capacity)

All resources have been allocated to QueueA (all vcores are allocated to 
QueueA).
If App1 is submitted to QueueB, the over-utilized QueueA should be preempted.
But I’ve met a problem where preemption does not happen, so the App1 AM can 
never be allocated.

When App1 is submitted, the pending resources for asking the App1 AM would be 

So, the number of vcores that need to be preempted from QueueA should be 1.
But it can be 0 due to natural_termination_factor (the default is 0.2).

We should guarantee that the amount to preempt does not become 0 even when 
applying natural_termination_factor.
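The rounding problem described above can be sketched in a few lines; the factor value and the one-vcore guard are illustrative, not the actual CapacityScheduler code:

```java
public class NaturalTermination {
  static final double FACTOR = 0.2; // natural_termination_factor default

  // Amount of vcores to preempt this round for a given deficit.
  static long toPreempt(long deficitVcores) {
    long amount = (long) (deficitVcores * FACTOR); // truncates toward zero
    if (deficitVcores > 0 && amount == 0) {
      amount = 1; // guarantee progress on small deficits (e.g. a 1-vcore AM)
    }
    return amount;
  }

  public static void main(String[] args) {
    System.out.println(toPreempt(1));  // without the guard this would be 0 forever
    System.out.println(toPreempt(50)); // 20% of the deficit
  }
}
```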







[jira] [Created] (YARN-8095) Allow disable non-exclusive allocation

2018-03-30 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8095:
--

 Summary: Allow disable non-exclusive allocation
 Key: YARN-8095
 URL: https://issues.apache.org/jira/browse/YARN-8095
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Affects Versions: 2.8.3
Reporter: kyungwan nam


We have a 'longlived' queue, which is used for long-lived apps.
When the default partition's resources are not enough, containers for a long-lived app 
can be allocated to a sharable partition.
From that point on, those containers can easily be preempted, and we don't want 
long-lived apps to be killed abruptly.

Currently, non-exclusive allocation can happen whenever the queue has access to the 
sharable partition.
It would be good if non-exclusive allocation could be disabled at the queue level.
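A queue-level switch could look like the following capacity-scheduler.xml fragment. The property name `disable-non-exclusive-allocation` is only a sketch of the proposal, not an existing setting.

```xml
<!-- Hypothetical property sketching the proposal; not an existing setting. -->
<property>
  <name>yarn.scheduler.capacity.root.longlived.disable-non-exclusive-allocation</name>
  <value>true</value>
  <description>
    If true, containers from this queue are never allocated to a
    non-exclusive (sharable) partition, so they cannot be preempted
    later when the partition owner reclaims its resources.
  </description>
</property>
```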






[jira] [Created] (YARN-8020) when DRF is used, preemption does not trigger due to incorrect idealAssigned

2018-03-09 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-8020:
--

 Summary: when DRF is used, preemption does not trigger due to 
incorrect idealAssigned
 Key: YARN-8020
 URL: https://issues.apache.org/jira/browse/YARN-8020
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


I’ve found that inter-queue preemption does not work.
It happens when DRF is used and an application is submitted with a large number of 
vcores.

IMHO, idealAssigned can be set incorrectly by the following code.
{code}
// This function "accepts" all the resources it can (pending) and return
// the unused ones
Resource offer(Resource avail, ResourceCalculator rc,
Resource clusterResource, boolean considersReservedResource) {
  Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
  Resources.subtract(getMax(), idealAssigned),
  Resource.newInstance(0, 0));
  // accepted = min{avail,
  //   max - assigned,
  //   current + pending - assigned,
  //   # Make sure a queue will not get more than max of its
  //   # used/guaranteed, this is to make sure preemption won't
  //   # happen if all active queues are beyond their guaranteed
  //   # This is for leaf queue only.
  //   max(guaranteed, used) - assigned}
  // remain = avail - accepted
  Resource accepted = Resources.min(rc, clusterResource,
  absMaxCapIdealAssignedDelta,
  Resources.min(rc, clusterResource, avail, Resources
  /*
   * When we're using FifoPreemptionSelector (considerReservedResource
   * = false).
   *
   * We should deduct reserved resource from pending to avoid excessive
   * preemption:
   *
   * For example, if an under-utilized queue has used = reserved = 20.
   * Preemption policy will try to preempt 20 containers (which is not
   * satisfied) from different hosts.
   *
   * In FifoPreemptionSelector, there's no guarantee that preempted
   * resource can be used by pending request, so policy will preempt
   * resources repeatedly.
   */
  .subtract(Resources.add(getUsed(),
  (considersReservedResource ? pending : pendingDeductReserved)),
  idealAssigned)));
{code}

Let's say:

* cluster resource: 
* idealAssigned (assigned): 
* avail: 
* current: 
* pending: 

current + pending - assigned: 
min(avail, (current + pending - assigned)): 
accepted: 

As a result, idealAssigned will be , which does not 
trigger preemption.
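Since the concrete numbers above were lost in transit, here is a self-contained sketch with made-up values of why a min taken by dominant share, rather than componentwise, can return a resource with zero vcores. `Res` and `minByDominantShare` are illustrative stand-ins for Hadoop's `Resource` and `Resources.min` under `DominantResourceCalculator`, which compares operands by dominant share and returns one of them whole.

```java
// Illustrative sketch with made-up numbers; not the actual Hadoop code.
public class DrfMinDemo {
    record Res(long memory, int vcores) {}

    // Dominant share: the larger of the two per-resource shares.
    static double dominantShare(Res r, Res cluster) {
        return Math.max((double) r.memory() / cluster.memory(),
                        (double) r.vcores() / cluster.vcores());
    }

    // DRF-style min: picks the operand with the smaller dominant share,
    // returning it whole rather than taking a componentwise minimum.
    static Res minByDominantShare(Res a, Res b, Res cluster) {
        return dominantShare(a, cluster) <= dominantShare(b, cluster) ? a : b;
    }

    public static void main(String[] args) {
        Res cluster = new Res(100, 10);
        Res avail   = new Res(50, 0);  // memory left, but no free vcores
        Res demand  = new Res(10, 8);  // vcore-heavy pending request

        // dominantShare(avail) = 0.5 < dominantShare(demand) = 0.8, so the
        // whole 'avail' operand is returned: 0 vcores are accepted,
        // idealAssigned never grows in the vcore dimension, and
        // preemption is never triggered.
        Res accepted = minByDominantShare(avail, demand, cluster);
        System.out.println(accepted.vcores()); // prints 0
    }
}
```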






[jira] [Created] (YARN-7408) total capacity could be occupied by a large container request

2017-10-27 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-7408:
--

 Summary: total capacity could be occupied by a large container 
request
 Key: YARN-7408
 URL: https://issues.apache.org/jira/browse/YARN-7408
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam


If an NM cannot afford to allocate a large container request, the container will be 
reserved.
But in a cluster with long-running apps, running containers are not released very often.
In cases like this, reserved containers keep increasing as time goes on; as a result, 
the total capacity could be occupied by reserved resources,
which makes other container requests starve.






[jira] [Created] (YARN-6401) terminating signal should be able to specify per application to support graceful-stop

2017-03-27 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-6401:
--

 Summary: terminating signal should be able to specify per 
application to support graceful-stop
 Key: YARN-6401
 URL: https://issues.apache.org/jira/browse/YARN-6401
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam


When stopping a container, YARN first sends SIGTERM to the process;
after a while, it sends SIGKILL if the process is still alive.

This procedure is always the same for every application.
But to stop gracefully, some applications need a different signal to be sent instead of 
SIGTERM.

For instance, if Apache httpd on Slider is running, SIGWINCH should be sent to stop it 
gracefully.
The way to stop gracefully depends on the application.
It would be good if we could define the terminating signal per application.
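A minimal sketch of the idea, assuming a per-application setting that falls back to SIGTERM. The `STOP_SIGNAL` key is a hypothetical example for illustration, not an existing YARN property.

```java
import java.util.Map;

// Minimal sketch of per-application stop signals; the STOP_SIGNAL key is
// a hypothetical example, not an existing YARN configuration.
public class StopSignalDemo {
    // Build the command used to deliver the configured terminating signal.
    static String[] stopCommand(long pid, Map<String, String> appEnv) {
        String signal = appEnv.getOrDefault("STOP_SIGNAL", "SIGTERM");
        return new String[] {"kill", "-s", signal, Long.toString(pid)};
    }

    public static void main(String[] args) {
        // httpd-style graceful stop: deliver SIGWINCH instead of SIGTERM.
        String[] cmd = stopCommand(4242L, Map.of("STOP_SIGNAL", "SIGWINCH"));
        System.out.println(String.join(" ", cmd)); // kill -s SIGWINCH 4242
    }
}
```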






[jira] [Created] (YARN-6153) keepContainer does not work when AM retry window is set

2017-02-06 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-6153:
--

 Summary: keepContainer does not work when AM retry window is set
 Key: YARN-6153
 URL: https://issues.apache.org/jira/browse/YARN-6153
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: kyungwan nam


yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster.
I submitted a YARN application (a Slider app) with keepContainers=true and 
attemptFailuresValidityInterval=30.

It worked properly when the AM failed for the first time:
all containers launched by the previous AM were resynced with the new AM (attempt 2) 
without being killed.

After 10 minutes, I expected the AM failure count to have been reset by 
attemptFailuresValidityInterval (5 minutes).
But all containers were killed when the AM failed for the second time (the new AM, 
attempt 3, was launched properly).







[jira] [Created] (YARN-5844) fair ordering policy with DRF

2016-11-07 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-5844:
--

 Summary: fair ordering policy with DRF
 Key: YARN-5844
 URL: https://issues.apache.org/jira/browse/YARN-5844
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: kyungwan nam


FairOrderingPolicy, which was added in YARN-3319, implements memory-based fair sharing;
therefore, it does not respect vcore demand.
Multi-resource fair sharing with Dominant Resource Fairness (DRF) should be added.
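A DRF-based ordering can be sketched as a comparator on each app's dominant share, i.e. the largest of its per-resource usage shares. This is an illustration of the idea only, not the actual FairOrderingPolicy code; all names and numbers here are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of DRF-based fair ordering: schedule the app whose dominant
// resource share is smallest first. Illustrative only.
public class DrfOrderingDemo {
    record Usage(String app, long memory, int vcores) {}

    static double dominantShare(Usage u, long clusterMem, int clusterVcores) {
        return Math.max((double) u.memory() / clusterMem,
                        (double) u.vcores() / clusterVcores);
    }

    public static void main(String[] args) {
        long clusterMem = 1000;
        int clusterVcores = 100;
        List<Usage> apps = new ArrayList<>(List.of(
                new Usage("memHeavy", 200, 5),    // shares 0.2 / 0.05 -> 0.2
                new Usage("cpuHeavy", 100, 40))); // shares 0.1 / 0.4  -> 0.4

        // Memory-only fairness would favor cpuHeavy (100 < 200 memory),
        // ignoring that its dominant (vcore) share is actually the larger one.
        apps.sort(Comparator.comparingDouble(
                (Usage u) -> dominantShare(u, clusterMem, clusterVcores)));
        System.out.println(apps.get(0).app()); // prints memHeavy
    }
}
```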






[jira] [Created] (YARN-5696) add container_tag to Distributedshell's env

2016-10-02 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-5696:
--

 Summary: add container_tag to Distributedshell's env
 Key: YARN-5696
 URL: https://issues.apache.org/jira/browse/YARN-5696
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Reporter: kyungwan nam
Assignee: kyungwan nam
Priority: Minor


A number of containers can be allocated with the "num_containers" option,
but there is no way to assign a different input to each container.

This issue proposes giving each container a unique id between 0 and (num_containers - 1),
so that the user can assign a different input (or role) per container.
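The proposal can be sketched as below; the `YARN_CONTAINER_TAG` env var name is an assumption for illustration, and the actual patch may use a different name.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of assigning a unique tag per container; the YARN_CONTAINER_TAG
// env name is an assumption for illustration, not the actual patch.
public class ContainerTagDemo {
    // Build the launch env for the container with the given tag.
    static Map<String, String> launchEnv(int tag) {
        Map<String, String> env = new HashMap<>();
        env.put("YARN_CONTAINER_TAG", Integer.toString(tag));
        return env;
    }

    public static void main(String[] args) {
        int numContainers = 3;
        // Each container receives a distinct tag in [0, num_containers - 1],
        // which its shell command can read to pick its input or role.
        for (int tag = 0; tag < numContainers; tag++) {
            System.out.println(launchEnv(tag).get("YARN_CONTAINER_TAG"));
        }
    }
}
```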






[jira] [Created] (YARN-4703) blacklist option for distributedshell

2016-02-18 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-4703:
--

 Summary: blacklist option for distributedshell 
 Key: YARN-4703
 URL: https://issues.apache.org/jira/browse/YARN-4703
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Affects Versions: 2.7.1
Reporter: kyungwan nam
Assignee: kyungwan nam
Priority: Minor


We need a “blacklist” option in distributed shell.
It can be set so that containers are not allocated to specific nodes.






[jira] [Created] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-3931:
--

 Summary: default-node-label-expression doesn’t apply when an 
application is submitted by RM rest api
 Key: YARN-3931
 URL: https://issues.apache.org/jira/browse/YARN-3931
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: hadoop-2.6.0
Reporter: kyungwan nam


* yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk
* submit an application using the RM REST API without “app-node-label-expression” and 
“am-container-node-label-expression”
* the RM doesn’t allocate containers to the hosts associated with the large_disk node 
label
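As a workaround until the queue default applies, the label can be set explicitly in the submit payload. The field names below are from the RM submit-application REST API as I understand it, and the application id is a placeholder; verify both against your Hadoop version.

```json
{
  "application-id": "application_1436784229290_0001",
  "application-name": "test-app",
  "queue": "default",
  "application-node-label-expression": "large_disk",
  "am-container-node-label-expression": "large_disk"
}
```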



