[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240801#comment-16240801
 ] 

Jason Lowe commented on YARN-7272:
----------------------------------

bq. Another possible case to handle is the case where storage is down i.e. 
instead of waiting for sync entity call to wait, it can be potentially 
committed to WAL till backend is unavailable. We can potentially explore this 
option.

My guess here is that this is going to be problematic because:

# By the time you get a robust, performant WAL implemented on HDFS you've 
practically reinvented the core of HBase.
# The point of having a synchronous call is to tell the client, "yes, I promise 
this has been persisted to the ATS database," yet with the entity sitting only 
in the WAL it hasn't been.

If the AM side-band signals another client to start reading from ATS then that 
other client will not see those writes despite the AM's synchronous call to the 
collector returning success.  The synchronous call cannot return until HBase 
says it has it.
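
To make the visibility problem concrete, here is a minimal Python sketch. Every name in it (Backend, WalAckCollector, StrictCollector, put_sync) is hypothetical and stands in for the real pieces; none of it is the actual collector or HBase API. It only illustrates the ordering: a put that is acknowledged from a local WAL while the backend is down returns success, yet a reader that checks the backend afterwards finds nothing.

```python
class Backend:
    """Stand-in for the ATS storage backend (hypothetical, not the HBase API)."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("backend unavailable")
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class WalAckCollector:
    """Acknowledges a 'synchronous' put once the entity is in a local WAL."""

    def __init__(self, backend):
        self.backend = backend
        self.wal = []

    def put_sync(self, key, value):
        try:
            self.backend.put(key, value)
        except ConnectionError:
            # Backend is down: park the write in the WAL and report success anyway.
            self.wal.append((key, value))
        return True


class StrictCollector:
    """Acknowledges a synchronous put only after the backend accepted it."""

    def __init__(self, backend):
        self.backend = backend

    def put_sync(self, key, value):
        # A backend failure propagates to the caller instead of being hidden.
        self.backend.put(key, value)
        return True


backend = Backend()
backend.available = False  # storage outage while the AM is writing

# WAL-acknowledged write: the AM is told "success" ...
acked = WalAckCollector(backend).put_sync("app_1/entity_1", "FINISHED")

# Strict write: the AM learns the write did not land and can retry or wait.
try:
    StrictCollector(backend).put_sync("app_1/entity_2", "FINISHED")
    strict_failed = False
except ConnectionError:
    strict_failed = True

# The backend recovers, but the WAL has not been replayed yet. A reader that
# the AM signalled out of band looks for the acknowledged entity.
backend.available = True
visible_after_wal_ack = backend.get("app_1/entity_1")
```

In this sketch the WAL-acknowledged put reports success while the reader sees no row, which is exactly the broken promise described above. The strict variant surfaces the outage to the caller instead.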

In that sense, I don't see the WAL so much as a fault tolerance tool.  
Instead I see it as a performance enhancement tool where it can buffer more 
asynchronous events before blocking the caller or potentially recover more 
asynchronous events in the case of a collector crash.  The latter requires 
a lot of work where I can see us essentially requiring or reinventing systems 
like Apache BookKeeper.  I don't see how the WAL helps in the synchronous call 
scenario, since the whole point of the synchronous call is to guarantee the 
result appears in the ATSv2 database.
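
The buffering role can be sketched the same way. The following Python example is an assumption-laden illustration, not the collector's design: BufferedAsyncWriter, put_async, flush, and the capacity of 2 are all made up for the example, and the buffer is in memory only, so it shows the "buffer more before blocking the caller" half and not crash recovery, which would need a durable log.

```python
import queue


class BufferedAsyncWriter:
    """Buffers asynchronous events up to a fixed capacity (hypothetical sketch)."""

    def __init__(self, sink, capacity):
        self.sink = sink  # callable that persists one event to the backend
        self.pending = queue.Queue(maxsize=capacity)

    def put_async(self, event, timeout=None):
        # Returns at once while there is room. When the buffer is full the
        # caller blocks, up to `timeout` seconds, then queue.Full is raised.
        self.pending.put(event, timeout=timeout)

    def flush(self):
        # Drain everything currently buffered into the sink, in FIFO order.
        flushed = 0
        while True:
            try:
                event = self.pending.get_nowait()
            except queue.Empty:
                return flushed
            self.sink(event)
            flushed += 1


stored = []  # stands in for the backend table
writer = BufferedAsyncWriter(stored.append, capacity=2)

writer.put_async("e1")
writer.put_async("e2")

# The buffer is full, so the third event would block the caller.
try:
    writer.put_async("e3", timeout=0.01)
    blocked = False
except queue.Full:
    blocked = True

flushed = writer.flush()               # backend catches up
writer.put_async("e3", timeout=0.01)   # room again, returns immediately
```

A larger capacity lets the caller run further ahead of the backend before it blocks, which is the performance benefit. Nothing here helps a synchronous put, since that call still has to wait for the backend itself.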

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a 
> running yarn app, we would like that yarn app to re-establish connection with 
> a new timeline collector. 



