[
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773670#comment-16773670
]
Bibin A Chundatt commented on YARN-5933:
----------------------------------------
[~Prabhu Joseph]
YARN-8201 solves the log flooding issue, right?
> ATS stale entries in active directory causes ApplicationNotFoundException in
> RM
> -------------------------------------------------------------------------------
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.3
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
>
> On a secure cluster where ATS is down, a submitted Tez job fails while
> getting the TIMELINE_DELEGATION_TOKEN, with the exception below:
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from
> alltypesorc group by csmallint;
> INFO : Session is already open
> INFO : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection
> retries limit exceeded. The posted timeline event may be missing
> at
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
> at
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
> at
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
> at
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
> at
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
> at
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
> at
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
> at org.apache.tez.client.TezClient.start(TezClient.java:409)
> at
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
> at
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
> at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
> at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> By this point the Tez YarnClient has received an application ID from the RM.
> When ATS is restarted, it tries to get the application report from the RM,
> and the RM throws ApplicationNotFoundException. ATS keeps repeating the
> request, which floods the RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new
> applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 9 on 8050, call
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport
> from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application
> with id 'application_1479897867169_0005' doesn't exist in RM.
> at
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
> at
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
> at
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> {code}
> There is a stale application entry inside the /ats/active directory. ATS
> stops requesting once we remove this directory.
> [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
> drwxrwx--- - hive hadoop 0 2016-11-23 13:54 /ats/active/application_1479897867169_0005
> This ATS issue is exposed by Tez jobs because Tez uses the putDomain method.
> The call chain TimelineClientImpl#putDomain() -> writeDomain() ->
> getAppAttemptDir() -> createApplicationDir() creates an application directory
> inside the ATS activePath. After creating it, the Tez job fails because it
> cannot connect to ATS. When ATS comes back, it scans activePath every 60
> seconds (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds)
> and calls GetApplicationReport, which leads to ApplicationNotFoundException
> in the RM.
> For this negative case, ATS can delete the application directory inside
> activePath from EntityGroupFSTimelineStore#getAppState() once the RM throws
> ApplicationNotFoundException.
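
The cleanup proposed in the description can be sketched roughly as follows.
This is only an illustration, not the actual patch: it uses java.nio.file on a
local directory instead of the Hadoop FileSystem API, and the names
StaleAppDirCleaner, RmClient, AppNotFoundException and scanActiveDir are
hypothetical stand-ins (the real exception is
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: drop active-dir entries for applications the RM does not know.
class StaleAppDirCleaner {

    // Stand-in for YARN's ApplicationNotFoundException (assumption, not the real class).
    static class AppNotFoundException extends Exception {
        AppNotFoundException(String appId) {
            super("Application with id '" + appId + "' doesn't exist in RM.");
        }
    }

    // Stand-in for the RM lookup that ATS performs on each scan (assumption).
    interface RmClient {
        String getAppState(String appId) throws AppNotFoundException;
    }

    // Scans activeDir once; deletes every application directory the RM reports
    // as unknown and returns how many directories were removed.
    static int scanActiveDir(Path activeDir, RmClient rm) throws IOException {
        List<Path> appDirs;
        try (Stream<Path> entries = Files.list(activeDir)) {
            appDirs = entries.filter(Files::isDirectory).collect(Collectors.toList());
        }
        int removed = 0;
        for (Path appDir : appDirs) {
            String appId = appDir.getFileName().toString();
            try {
                rm.getAppState(appId);
            } catch (AppNotFoundException e) {
                // Stale entry: remove it so the next scan does not ask the RM again.
                deleteRecursively(appDir);
                removed++;
            }
        }
        return removed;
    }

    // Deletes children before parents by walking the tree in reverse order.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Deleting the entry on the first ApplicationNotFoundException means the RM is
asked about a stale application at most once, instead of once per scan
interval.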