Prabhu Joseph created YARN-5933:
-----------------------------------
Summary: ATS stale entries in active directory causes
ApplicationNotFoundException in RM
Key: YARN-5933
URL: https://issues.apache.org/jira/browse/YARN-5933
Project: Hadoop YARN
Issue Type: Bug
Components: ATSv2
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
On a secure cluster where ATS is down, a submitted Tez job fails while fetching
the TIMELINE_DELEGATION_TOKEN, with the exception below:
{code}
0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from alltypesorc group by csmallint;
INFO : Session is already open
INFO : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
INFO : Tez session was closed. Reopening...
ERROR : Failed to execute tez graph.
java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
at org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
at org.apache.tez.client.TezClient.start(TezClient.java:409)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
By this point the Tez YarnClient has already received an applicationId from the
RM. When ATS is restarted, it tries to fetch the application report for that ID
from the RM, and the RM throws ApplicationNotFoundException. ATS keeps repeating
the request, which floods the RM.
{code}
RM logs:
2016-11-23 13:53:57,345 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 5
2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8050, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 172.26.71.120:37699 Call#26 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1479897867169_0005' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
{code}
There is a stale application entry inside the /ats/active directory. ATS stops
sending requests once this application directory is removed.
[hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
drwxrwx--- - hive hadoop 0 2016-11-23 13:54 /ats/active/application_1479897867169_0005
This ATS issue is exposed by Tez jobs because Tez uses the putDomain method. The
call chain TimelineClientImpl#putDomain() -> writeDomain() -> getAppAttemptDir()
-> createApplicationDir() creates an application directory inside the ATS
activePath. After the Tez job creates this directory, it fails because it cannot
connect to ATS. When ATS comes back, it scans activePath every 60 seconds
(yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls
getApplicationReport, which leads to ApplicationNotFoundException in the RM.
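The ordering described above (directory created first, timeline server
contacted afterwards) can be illustrated with a small stand-alone sketch. It
uses only java.nio on a local temporary directory; StaleEntryDemo, submit(),
newTempActiveDir() and TimelineUnavailableException are hypothetical names made
up for illustration and are not Hadoop or Tez APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the client-side write path; not Hadoop code.
final class StaleEntryDemo {

    /** Simulates the timeline server being unreachable. */
    static final class TimelineUnavailableException extends RuntimeException {
        TimelineUnavailableException(String msg) {
            super(msg);
        }
    }

    /** Creates a local temporary directory that plays the role of activePath. */
    static Path newTempActiveDir() {
        try {
            return Files.createTempDirectory("ats-active");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Mirrors the order in the description: the application directory is
     * created first, and only afterwards is the timeline server contacted.
     */
    static void submit(Path activePath, String appId, boolean timelineUp) {
        try {
            // The side effect on activePath happens before any connection attempt.
            Files.createDirectories(activePath.resolve(appId));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!timelineUp) {
            // Submission aborts here; nothing removes the directory again.
            throw new TimelineUnavailableException("Failed to connect to timeline server");
        }
    }

    public static void main(String[] args) {
        Path active = newTempActiveDir();
        String appId = "application_1479897867169_0005";
        try {
            submit(active, appId, false);
        } catch (TimelineUnavailableException e) {
            System.out.println("submit failed: " + e.getMessage());
        }
        System.out.println("stale dir exists: " + Files.isDirectory(active.resolve(appId)));
    }
}
```

Because the failure happens after the directory is created, the entry outlives
the failed submission and is picked up by every later scan of activePath.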
For this negative case, ATS can delete the application directory inside
activePath from EntityGroupFSTimelineStore#getAppState() once the RM throws
ApplicationNotFoundException.
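A minimal sketch of the cleanup idea follows, assuming one RM lookup per
application directory per scan. It is not the EntityGroupFSTimelineStore code:
StaleDirCleanupSketch, scanOnce() and the knownToRm predicate are hypothetical,
the predicate stands in for the getApplicationReport call (false plays the role
of ApplicationNotFoundException), and a local temporary directory stands in for
activePath on HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the proposed cleanup; not Hadoop code.
final class StaleDirCleanupSketch {

    /** Creates a local temporary directory that plays the role of activePath. */
    static Path newTempActiveDir() {
        try {
            return Files.createTempDirectory("ats-active");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates an application directory (with an attempt subdirectory) for the demo. */
    static void createAppDir(Path activePath, String appId) {
        try {
            Files.createDirectories(activePath.resolve(appId).resolve("attempt_1"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes a directory tree, children before parents. */
    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * One scan pass over activePath. Each application directory costs one
     * lookup; directories the RM does not know are removed so that later
     * scans no longer query for them. Returns the number of lookups made.
     */
    static int scanOnce(Path activePath, Predicate<String> knownToRm) {
        try {
            List<Path> appDirs = new ArrayList<>();
            try (DirectoryStream<Path> dirs =
                     Files.newDirectoryStream(activePath, "application_*")) {
                for (Path dir : dirs) {
                    appDirs.add(dir);
                }
            }
            int lookups = 0;
            for (Path appDir : appDirs) {
                lookups++;
                if (!knownToRm.test(appDir.getFileName().toString())) {
                    deleteRecursively(appDir); // the "application not found" branch
                }
            }
            return lookups;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path active = newTempActiveDir();
        createAppDir(active, "application_1479897867169_0005");
        Predicate<String> rm = id -> false; // the RM has no record of the application
        System.out.println("lookups in first scan: " + scanOnce(active, rm));
        System.out.println("lookups in second scan: " + scanOnce(active, rm));
    }
}
```

The sketch deletes unconditionally on a negative lookup. A real change would
presumably need to avoid removing the directory of an application whose ID has
been allocated but not yet submitted to the RM; that case is not handled here.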