[ https://issues.apache.org/jira/browse/YARN-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808769#comment-15808769 ]
Naganarasimha G R commented on YARN-6054:
-----------------------------------------
Thanks [~raviprakashu] for the patch. Overall the patch looks fine technically, but
has it been tested in the actual scenario? I am asking on the assumption that you
encountered this issue and tried this option. Also, the test only ensures that
the API is called, so if the fix has been tried and found useful at least once,
then it is OK.
Some points:
# We also use LevelDb in multiple other places, such as the NM state store.
Would it be good to handle those places too as part of this JIRA itself?
# Since we are backing up the files, I hope the test case can verify that
scenario too.
# {{setTestFactory}} can be annotated with {{@VisibleForTesting}} and the name
can be just {{setFactory}}.
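To make points 2 and 3 concrete, here is a minimal Java sketch, under stated assumptions, of a store whose factory can be swapped in tests through a {{@VisibleForTesting}} {{setFactory}} method, plus a test double that checks a backup copy is taken before a repair is attempted. {{DbFactory}}, {{Store}}, {{backup}} and {{lastBackup}} are hypothetical names, not the patch's or leveldbjni's actual API, and {{VisibleForTesting}} is declared locally as a stand-in for Guava's annotation.

```java
import java.io.IOException;
import java.lang.annotation.Documented;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class LevelDbRecoverySketch {

  // Local stand-in for Guava's @VisibleForTesting so the sketch has no
  // external dependencies; it only documents intent.
  @Documented
  @interface VisibleForTesting {}

  // Hypothetical factory seam (NOT the leveldbjni API): open a store
  // directory, or try to repair it in place.
  interface DbFactory {
    Object open(Path dir) throws IOException;
    void repair(Path dir) throws IOException;
  }

  static class Store {
    // Production code would default to the real LevelDb factory here.
    private DbFactory factory = new DbFactory() {
      public Object open(Path dir) { throw new UnsupportedOperationException("real factory"); }
      public void repair(Path dir) { throw new UnsupportedOperationException("real factory"); }
    };
    Path lastBackup; // exposed so a test can inspect the copy

    @VisibleForTesting
    void setFactory(DbFactory factory) {
      this.factory = factory;
    }

    // Open the store; on failure, back up the directory, repair, and retry once.
    Object open(Path dir) throws IOException {
      try {
        return factory.open(dir);
      } catch (IOException e) {
        lastBackup = backup(dir); // copy first, so repair cannot destroy the originals
        factory.repair(dir);
        return factory.open(dir); // a second failure propagates to the caller
      }
    }

    // Recursively copy dir to a timestamped sibling directory.
    static Path backup(Path dir) throws IOException {
      Path target = dir.resolveSibling(dir.getFileName() + ".backup-" + System.nanoTime());
      try (Stream<Path> paths = Files.walk(dir)) { // pre-order: parents before children
        for (Path src : (Iterable<Path>) paths::iterator) {
          Path dst = target.resolve(dir.relativize(src));
          if (Files.isDirectory(src)) {
            Files.createDirectories(dst);
          } else {
            Files.copy(src, dst, StandardCopyOption.COPY_ATTRIBUTES);
          }
        }
      }
      return target;
    }
  }

  public static void main(String[] args) throws IOException {
    Path db = Files.createDirectories(
        Files.createTempDirectory("ldb").resolve("leveldb-timeline-store.ldb"));
    Files.writeString(db.resolve("CURRENT"), "MANIFEST-000001\n");

    // Test double: the first open() fails as if state files were missing.
    int[] opens = {0};
    boolean[] repaired = {false};
    Store store = new Store();
    store.setFactory(new DbFactory() {
      public Object open(Path dir) throws IOException {
        if (opens[0]++ == 0) throw new IOException("Corruption: 9 missing files");
        return "db-handle";
      }
      public void repair(Path dir) { repaired[0] = true; }
    });

    Object handle = store.open(db);
    System.out.println(handle + " repaired=" + repaired[0]
        + " backupHasCURRENT=" + Files.exists(store.lastBackup.resolve("CURRENT")));
  }
}
```

Running {{main}} should print {{db-handle repaired=true backupHasCURRENT=true}}. Because {{backup}} only depends on a directory path, a helper of this shape could, in principle, be shared with other LevelDb users such as the NM state store (point 1).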
> TimelineServer fails to start when some LevelDb state files are missing.
> ------------------------------------------------------------------------
>
> Key: YARN-6054
> URL: https://issues.apache.org/jira/browse/YARN-6054
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0-alpha2
> Reporter: Ravi Prakash
> Assignee: Ravi Prakash
> Attachments: YARN-6054.01.patch, YARN-6054.02.patch
>
>
> We encountered an issue recently where the TimelineServer failed to start
> because some state files went missing.
> {code}
> 2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
> 2016-11-21 20:46:43,135 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
>     at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
>     at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>     at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>     at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>     at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 5 more
> 2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with status -1
> {code}
> Ideally we shouldn't have any missing state files. However, I'd posit that the
> TimelineServer should degrade gracefully instead of failing to start at all.
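One possible reading of "graceful degradation" is to move the unreadable store aside and start with an empty one, trading stored history for availability. The Java sketch below assumes a hypothetical {{Opener}} callback in place of the real LevelDb open call; a fake opener rejects any directory that still contains a {{CURRENT}} file, mimicking the corruption in the log above. This quarantine strategy is an illustration only, distinct from attempting a repair, and not necessarily what the attached patches do.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuarantineOnCorruption {

  // Hypothetical stand-in for the real LevelDb open call;
  // throws IOException when the store cannot be opened.
  @FunctionalInterface
  interface Opener<T> {
    T open(Path dir) throws IOException;
  }

  // Try to open the store at dir. On failure, rename the directory aside and
  // open a fresh, empty store at the original location, so the service starts
  // with lost history instead of not starting at all.
  static <T> T openOrStartFresh(Path dir, Opener<T> opener) throws IOException {
    try {
      return opener.open(dir);
    } catch (IOException e) {
      Path quarantine =
          dir.resolveSibling(dir.getFileName() + ".corrupt-" + System.currentTimeMillis());
      System.err.println("WARN: could not open " + dir + " (" + e.getMessage()
          + "); moving it to " + quarantine);
      Files.move(dir, quarantine); // same parent, so this is a rename
      Files.createDirectories(dir);
      return opener.open(dir); // a second failure still propagates
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createDirectories(
        Files.createTempDirectory("tls").resolve("leveldb-timeline-store.ldb"));
    Files.writeString(dir.resolve("CURRENT"), "MANIFEST-000001\n");

    // Fake opener: refuses any directory that still has a CURRENT file.
    Opener<String> fake = d -> {
      if (Files.exists(d.resolve("CURRENT"))) {
        throw new IOException("Corruption: 9 missing files");
      }
      return "empty store at " + d.getFileName();
    };

    System.out.println(openOrStartFresh(dir, fake));
  }
}
```

Running {{main}} should print {{empty store at leveldb-timeline-store.ldb}} and log a warning to stderr; the old directory stays on disk under a {{.corrupt-}} suffix for offline inspection, so nothing is deleted outright.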
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)