[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133483#comment-14133483 ]

bc Wong commented on YARN-1530:
-------------------------------

Hi [~zjshen]. My main concern with the write path is: *Does the ATS write path 
have the right reliability, robustness and scalability so that its failures 
would not affect my apps?* I'll try to explain it with specific scenarios and 
technology choices. Then maybe you can tell me if those are valid concerns.

First, to make it easy for other readers here, I'm advocating that this event 
flow:\\
_Client/App -> Reliable channel where event is persisted (HDFS/Kafka) -> ATS_ \\
is a lot better than:\\
_Client/App -> RPC -> ATS_
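
To make this concrete, here's a minimal sketch of what the app side of the 
reliable channel could look like, assuming HDFS as the channel. The path, class 
name and JSON-lines layout are just my illustration, not an existing API. The 
point is that the app's write depends only on HDFS being up; the ATS tails 
these files and loads them into its backing store on its own schedule.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: the app appends serialized events to a per-app file on HDFS.
// The ATS never sits on the app's write path.
public class HdfsTimelineChannel implements AutoCloseable {
  private final FSDataOutputStream out;

  public HdfsTimelineChannel(Configuration conf, String appId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // One event file per app, under a well-known directory that the ATS scans.
    out = fs.create(new Path("/ats/events/" + appId + ".jsonl"), true);
  }

  // Append one event. Durability comes from HDFS, not from the ATS being up.
  public void writeEvent(String eventJson) throws IOException {
    out.write((eventJson + "\n").getBytes("UTF-8"));
    out.hflush();  // make the event visible to the ATS reader without closing the file
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}

The ATS side then becomes a pull-based reader that can lag, restart or fail 
over without the app ever noticing.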

h4. Scenario 1. ATS service goes down
If we use a reliable channel (e.g. HDFS) for writes, then apps do not suffer at 
all even when the ATS goes down. The ATS service going down is a valid 
scenario, due to causes ranging from bugs to hardware failures. Having the write 
path decoupled from the ATS service being up all the time seems like a clear win 
to me. Writing decoupled components is also a good distributed systems design 
principle.

On the other hand, one may argue that _the ATS service will never go down 
entirely, or is not supposed to go down entirely_, just like we don't expect 
all the ZK nodes or all the RM nodes to go down. That argument then justifies 
using direct RPC for writes. Yes, you can design such an ATS service. To this 
I'll say:

* YARN apps already depend on ZK/RM/HDFS being up. Every new service dependency 
we add will only increase the chances of YARN apps failing or slowing down. 
That's true even if the ATS service's uptime is as good as ZK or RM.
* Realistically, getting the ATS service's uptime to the same level as ZK or 
HDFS is a long and winding road, especially when most discussions here assume 
HBase as the backing store. HBase's uptime is lower than that of HDFS/ZK/RM 
because it's more complex to operate. If HBase going down means the ATS service 
going down, then we should certainly guard against this failure scenario.

h4. Scenario 2. ATS service partially down
If the client writes directly to the ATS service over an unreliable channel 
(RPC), then the write path has to fail over when one of the ATS nodes fails. 
This transient failure still affects the performance of YARN apps. One can 
argue that _non-blocking RPC writes resolve this issue_. To this I'll say:

* Non-blocking RPC writes only work for *long-duration apps*. We already have 
short-lived applications, in the range of a few minutes. With Spark getting 
more popular, this will continue to happen. How short will the app duration 
get? The answer is a few seconds, if we want YARN to be the generic cluster 
scheduler. Google already sees that kind of job profile, if you look at their 
cluster traces. Of course, our scheduler and container allocation need to get 
a lot better for that to happen. But I think that's the goal. Our ATS design 
here should consider short-lived applications (see the sketch after this list).
* It sucks if you're running an app that's supposed to finish in under a 
minute, but then the ATS writes are stalled for an extra minute because one ATS 
node fails over. Again, we can go back to the counter-argument in scenario #1, 
about how unlikely this is. I'll repeat that it's more likely than we think. 
And if we have a choice to decouple the write path from the ATS service, why 
not?
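
To illustrate the point about non-blocking writes and short-lived apps, here is 
the sketch mentioned above. The class and method names are hypothetical, not 
the real TimelineClient API: even if every put returns immediately, the queued 
events still have to be drained before the app exits, so an ATS failover shows 
up as a stall at shutdown.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: "non-blocking" RPC writes push the latency to shutdown,
// they don't remove it.
public class AsyncTimelineWriter implements AutoCloseable {
  private final ExecutorService sender = Executors.newSingleThreadExecutor();

  // Queue the event and return immediately; the RPC happens on the sender thread.
  public void postEvent(final String event) {
    sender.submit(new Runnable() {
      @Override
      public void run() {
        sendOverRpc(event);  // blocks and retries while an ATS node fails over
      }
    });
  }

  private void sendOverRpc(String event) {
    // Stand-in for the real RPC plus retry/failover logic.
  }

  // A short-lived app must drain the queue before exiting, so it still waits here.
  @Override
  public void close() throws InterruptedException {
    sender.shutdown();
    // If the ATS is failing over, this wait can exceed the app's own runtime.
    sender.awaitTermination(60, TimeUnit.SECONDS);
  }
}
{code}

That shutdown wait is exactly the extra minute described in the second bullet 
above.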

h4. Scenario 3. ATS backing store fails
By backing store, I mean the storage system where ATS persists the events, such 
as LevelDB and HBase. In a naive implementation, it seems that if the backing 
store fails, then the ATS service will be unavailable. Does that mean the event 
write path will fail, and the YARN apps will stall or fail? I hope not. It's 
not an issue if we use HDFS as the default write channel, because most YARN 
apps already depend on HDFS.

One may argue that _the ATS service will buffer writes (persist them elsewhere) 
if the backing store fails_. To this I'll say:

* If we have an alternate code path that persists events before they hit the 
final backing store, why not do that all the time? Such a path would address 
scenarios #1 and #2 as well.
* HBase has been mentioned as if it's the penicillin of event storage here. 
That is probably true for big shops like Twitter and Yahoo, who have the 
expertise to operate an HBase cluster well. But most enterprise users or 
startups don't. We should assume that those HBase instances will run 
suboptimally with occasional widespread failures. Using HBase for event storage 
is a poor fit for most people. And I think it's difficult to achieve good 
uptime for the ATS service as a whole.

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
> application timeline design-20140116.pdf, application timeline 
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework 
> data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
> to solve the storage, management and serving of per-framework data from 
> various applications, both running and finished. The aim is to change YARN to 
> collect and store data in a generic manner with plugin points for frameworks 
> to do their own thing w.r.t interpretation and serving.


