[ https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708237#comment-15708237 ]
Prabhu Joseph commented on YARN-5933:
-------------------------------------

Hi [~gtCarrera9]

Okay, I think AppLogs#parseSummaryLogs() can skip subsequent getAppState calls for Unknown apps and move them to complete after unknownActiveSecs.

> ATS stale entries in active directory causes ApplicationNotFoundException in RM
> -------------------------------------------------------------------------------
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.3
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
>
> On a secure cluster where ATS is down, a submitted Tez job fails while getting the TIMELINE_DELEGATION_TOKEN, with the exception below:
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from alltypesorc group by csmallint;
> INFO : Session is already open
> INFO : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>     at org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>     at org.apache.tez.client.TezClient.start(TezClient.java:409)
>     at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>     at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>     at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>     at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>     at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>     at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>     at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>     at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> By this point the Tez YarnClient has already received an applicationId from the RM. When ATS is restarted, it tries to get the application report from the RM, and the RM throws ApplicationNotFoundException. ATS keeps repeating the request, which floods the RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8050, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1479897867169_0005' doesn't exist in RM.
>     at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>     at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> {code}
> There is a stale application entry inside the /ats/active directory. ATS stops sending requests when we remove this directory.
> [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
> drwxrwx--- - hive hadoop 0 2016-11-23 13:54 /ats/active/application_1479897867169_0005
> The Tez job exposes this ATS issue because Tez uses the putDomain method. The call chain TimelineClientImpl#putDomain() -> writeDomain() -> getAppAttemptDir() -> createApplicationDir() creates an application directory inside the ATS activePath. After creating it, the Tez job fails because it cannot connect to ATS. When ATS comes back, it scans activePath every 60 seconds (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls GetApplicationReport, which leads to ApplicationNotFoundException in the RM.
> For this negative case, ATS could delete the appDirectory inside activePath from EntityGroupFSTimelineStore#getAppState() once the RM throws ApplicationNotFoundException.
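
To make the idea in the comment concrete, here is a rough sketch of a scan loop that asks the RM about an app only once, remembers that the app is Unknown, and moves it out of the active directory after unknownActiveSecs. This is only a toy model under stated assumptions: StaleAppScanSketch, rmKnowsApp, unknownSince, activeDir and doneDir are hypothetical names invented for illustration, the directories are plain in-memory sets, and none of it is the real EntityGroupFSTimelineStore or AppLogs code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/**
 * Hypothetical, simplified model of the ATS active-directory scan.
 * Nothing here is the real EntityGroupFSTimelineStore API.
 */
class StaleAppScanSketch {
    // Stand-ins for the active and done directories (assumption: in-memory sets).
    final Set<String> activeDir = new TreeSet<>();
    final Set<String> doneDir = new TreeSet<>();

    // appId -> time (ms) at which the RM first reported the app as not found.
    private final Map<String, Long> unknownSince = new HashMap<>();

    // Stand-in for getApplicationReport: returns false where the real RM
    // would throw ApplicationNotFoundException.
    private final Predicate<String> rmKnowsApp;
    private final long unknownActiveMillis;

    // Counts how many times the RM was queried.
    int rmCalls = 0;

    StaleAppScanSketch(Predicate<String> rmKnowsApp, long unknownActiveMillis) {
        this.rmKnowsApp = rmKnowsApp;
        this.unknownActiveMillis = unknownActiveMillis;
    }

    /** One pass over the active directory at the given time. */
    void scan(long nowMillis) {
        // Iterate over a copy because entries may be moved during the pass.
        for (String appId : new ArrayList<>(activeDir)) {
            Long firstUnknown = unknownSince.get(appId);
            if (firstUnknown == null) {
                rmCalls++;
                if (rmKnowsApp.test(appId)) {
                    continue; // the RM knows the app, keep it active
                }
                // First "not found" answer: remember it and stop asking the RM.
                unknownSince.put(appId, nowMillis);
                firstUnknown = nowMillis;
            }
            // Once the app has been Unknown long enough, treat it as complete.
            if (nowMillis - firstUnknown >= unknownActiveMillis) {
                activeDir.remove(appId);
                doneDir.add(appId);
                unknownSince.remove(appId);
            }
        }
    }

    public static void main(String[] args) {
        // The RM knows no apps; Unknown apps are kept for 120 seconds.
        StaleAppScanSketch ats = new StaleAppScanSketch(appId -> false, 120_000L);
        ats.activeDir.add("application_1479897867169_0005");
        // Three scans, 60 seconds apart.
        for (long now = 0; now <= 120_000L; now += 60_000L) {
            ats.scan(now);
        }
        System.out.println("RM calls: " + ats.rmCalls);
        System.out.println("active: " + ats.activeDir + ", done: " + ats.doneDir);
    }
}
```

With this shape the RM sees one request per stale entry instead of one per scan interval. Deleting the directory immediately on the first ApplicationNotFoundException, as the issue description suggests, would correspond to passing 0 for the hypothetical unknownActiveMillis.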