[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074636#comment-14074636 ]
Zhijie Shen commented on YARN-2262: ----------------------------------- [~nishan], thanks for sharing the logs. I've done a preliminary investigation into the RM logs. It seems that the FS history store messed up after RM failed over. There're two types of exception: 1. The history file of application_1406035038624_0005 shouldn't occur, because before failover, I didn't see application_1406035038624_0005 was already started according to the log (or the log is not complete?). However, FS store found the history file on HDFS, and wanted to append more information into the file, but failed to open the file in append mode. {code} 2014-07-23 17:00:03,066 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Error when openning history file of application application_1406035038624_0005 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/home/testos/timelinedata/generic-history/ApplicationHistoryDataRoot/application_1406035038624_0005] for [DFSClient_NONMAPREDUCE_-903472038_1] for client [10.18.40.84], because this file is already being created by [DFSClient_NONMAPREDUCE_1878412866_1] on [10.18.40.84] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2549) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2378) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2613) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2576) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:537) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at $Proxy14.append(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:276) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at $Proxy15.append(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1569) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1609) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1597) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:316) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:723) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) {code} 2. It seems that this exception happened because the file was written by another process simultaneously. Perhaps, the previous RM that went to standby didn't close the outstanding history files properly. {code} 2014-07-23 17:00:03,497 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Error when openning history file of application application_1406002968974_0004 java.io.IOException: Output file not at zero offset. at org.apache.hadoop.io.file.tfile.BCFile$Writer.<init>(BCFile.java:288) at org.apache.hadoop.io.file.tfile.TFile$Writer.<init>(TFile.java:288) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:728) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) {code} > Few fields displaying wrong values in Timeline server after RM restart > ---------------------------------------------------------------------- > > Key: YARN-2262 > URL: https://issues.apache.org/jira/browse/YARN-2262 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver > Affects Versions: 2.4.0 > Reporter: Nishan Shetty > Assignee: Naganarasimha G R > Attachments: Capture.PNG, Capture1.PNG, > yarn-testos-historyserver-HOST-10-18-40-95.log, > yarn-testos-resourcemanager-HOST-10-18-40-84.log, > yarn-testos-resourcemanager-HOST-10-18-40-95.log > > > Few fields displaying wrong values in Timeline server after RM restart > State: null > FinalStatus: UNDEFINED > Started: 8-Jul-2014 14:58:08 > Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)