[
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240801#comment-16240801
]
Jason Lowe commented on YARN-7272:
----------------------------------
bq. Another possible case to handle is the case where storage is down i.e.
instead of waiting for sync entity call to wait, it can be potentially
committed to WAL till backend is unavailable. We can potentially explore this
option.
My guess here is that this is going to be problematic because:
# By the time you get a robust, performant WAL implemented on HDFS you've
practically reinvented the core of HBase.
# The point of having a synchronous call is to tell the client, "yes, I promise
this has been persisted to the ATS database," yet with a WAL-only commit it hasn't.
If the AM side-band signals another client to start reading from ATS then that
other client will not see those writes despite the AM's synchronous call to the
collector returning success. The synchronous call cannot return until HBase
says it has it.
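
To make the visibility problem concrete, here is a minimal sketch in Python. It is not ATSv2 code: `Backend`, `WalAckCollector`, `StrictCollector` and `put_entities_sync` are hypothetical names invented for illustration, and the backend is a plain dict standing in for HBase.

```python
class Backend:
    """Stand-in for the ATS storage backend (hypothetical, just a dict)."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("backend unavailable")
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class WalAckCollector:
    """Hypothetical collector that acks a sync write once it is in a local WAL."""

    def __init__(self, backend):
        self.backend = backend
        self.wal = []

    def put_entities_sync(self, key, value):
        try:
            self.backend.put(key, value)
        except ConnectionError:
            # Parked locally; the backend has not seen this write.
            self.wal.append((key, value))
        return True  # the caller is told "persisted" either way


class StrictCollector:
    """Hypothetical collector that only returns once the backend accepted the write."""

    def __init__(self, backend):
        self.backend = backend

    def put_entities_sync(self, key, value):
        self.backend.put(key, value)  # propagates the failure to the caller
        return True


backend = Backend()
backend.available = False           # storage is down while the AM writes
acked = WalAckCollector(backend).put_entities_sync("app_1", "FINISHED")
backend.available = True            # storage is back, WAL not yet replayed
seen_by_reader = backend.get("app_1")
```

In this sketch `acked` is `True` while `seen_by_reader` is `None`: the second client reads nothing even though the synchronous call reported success. `StrictCollector` raises instead, which at least keeps the promise honest.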
In that sense, I don't see the WAL as so much a fault tolerance tool.
Instead I see it as a performance enhancement tool where it can buffer more
asynchronous events before blocking the caller or potentially recover more
asynchronous events in the case of a collector crash. The latter requires
a lot of work where I can see us essentially requiring or reinventing systems
like Apache BookKeeper. I don't see how the WAL helps in the synchronous call
scenario, since the whole point of the synchronous call is to guarantee the
result appears in the ATSv2 database.
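
The "buffer more asynchronous events before blocking the caller" role can be sketched as a bounded queue. Again this is only an illustration under assumptions: `BufferedAsyncWriter`, `put_entities_async` and `drain` are made-up names, the buffer is in memory rather than a durable log, and the backend is a plain list.

```python
import queue


class BufferedAsyncWriter:
    """Hypothetical async path: events queue up while the backend is slow or down."""

    def __init__(self, capacity):
        self.pending = queue.Queue(maxsize=capacity)

    def put_entities_async(self, event):
        """Return True if buffered without blocking, False if the caller would have to wait."""
        try:
            self.pending.put_nowait(event)
            return True
        except queue.Full:
            return False

    def drain(self, sink):
        """Flush buffered events, in order, to the backend stand-in (a list here)."""
        drained = 0
        while True:
            try:
                sink.append(self.pending.get_nowait())
            except queue.Empty:
                return drained
            drained += 1


writer = BufferedAsyncWriter(capacity=2)
results = [writer.put_entities_async(e) for e in ("e1", "e2", "e3")]
sink = []
flushed = writer.drain(sink)
```

A larger capacity only moves the point at which the caller has to wait. It does nothing for the synchronous path, and surviving a collector crash would additionally need the buffer to be durable and replicated, which is the BookKeeper-like work mentioned above.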
> Enable timeline collector fault tolerance
> -----------------------------------------
>
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineclient, timelinereader, timelineserver
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
> Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.