[
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133483#comment-14133483
]
bc Wong commented on YARN-1530:
-------------------------------
Hi [~zjshen]. My main concern with the write path is: *Does the ATS write path
have the right reliability, robustness and scalability so that its failures
would not affect my apps?* I'll try to explain it with specific scenarios and
technology choices. Then maybe you can tell me if those are valid concerns.
First, to make it easy for other readers here, I'm advocating that this event
flow:\\
_Client/App -> Reliable channel where event is persisted (HDFS/Kafka) -> ATS_ \\
is a lot better than:\\
_Client/App -> RPC -> ATS_
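To make the first flow concrete, here's a minimal sketch of what the client-side half of a reliable channel could look like on HDFS. The path layout, the JSON-lines encoding, and the class name are just my illustration, not something taken from the attached proposal:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client-side channel: the app appends events to a per-app file
// on HDFS and never talks to the ATS directly. The ATS tails these files.
public class HdfsEventChannel {
  private final FSDataOutputStream out;

  public HdfsEventChannel(Configuration conf, String appId) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // Illustrative layout: one append-only JSON-lines file per application.
    Path eventFile = new Path("/ats/events/" + appId + ".jsonl");
    this.out = fs.create(eventFile, true);
  }

  public synchronized void emit(String jsonEvent) throws Exception {
    out.write((jsonEvent + "\n").getBytes("UTF-8"));
    // hflush makes the event visible to readers; the app is done at this
    // point, whether or not any ATS daemon is up.
    out.hflush();
  }

  public void close() throws Exception {
    out.close();
  }
}
{code}
The point is that emit() only depends on HDFS, which the app almost certainly depends on already.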
h4. Scenario 1. ATS service goes down
If we use a reliable channel (e.g. HDFS) for writes, then apps do not suffer at
all even when the ATS goes down. The ATS service going down is a valid
scenario, due to causes ranging from bugs to hardware failures. Having the write
path decoupled from the ATS service being up all the time seems a clear win to
me. Writing decoupled components is also a good distributed systems design
principle.
On the other hand, one may argue that _the ATS service will never go down
entirely, or is not supposed to go down entirely_, just like we don't expect
all the ZK nodes or all the RM nodes to go down. That argument then justifies
using direct RPC for writes. Yes, you can design such an ATS service. To this
I'll say:
* YARN apps already depend on ZK/RM/HDFS being up. Every new service dependency
we add will only increase the chances of YARN apps failing or slowing down.
That's true even if the ATS service's uptime is as good as ZK or RM.
* Realistically, getting the ATS service's uptime to the same level as ZK or
HDFS is a long and winding road, especially when most discussions here assume
HBase as the backing store. HBase's uptime is lower than HDFS/ZK/RM because
it's more complex to operate. If HBase going down means ATS service going down,
then we certainly should guard against this failure scenario.
h4. Scenario 2. ATS service partially down
If the client writes directly to the ATS service over an unreliable channel
(RPC), then the write path has to fail over when one of the ATS nodes dies. This
transient failure still affects the performance of YARN apps. One can argue
that _non-blocking RPC writes resolve this issue_. To this I'll say:
* Non-blocking RPC writes only work for *long-duration apps*. We already have
short-lived applications, in the range of a few minutes. With Spark getting
more popular, this trend will continue. How short will the app duration
get? The answer is a few seconds, if we want YARN to be the generic cluster
scheduler. Google already sees that kind of job profile, if you look at their
cluster traces. Of course, our scheduler and container allocation need to get
a lot better for that to happen. But I think that's the goal. Our ATS design
here should consider short-lived applications (see the sketch after this list).
* It sucks if you're running an app that's supposed to finish under a minute,
but then the ATS writes are stalled for an extra minute because one ATS node
does a failover. Again, we can go back to the counter-argument in scenario #1,
about how unlikely this is. I'll repeat that it's more likely than we think.
And if we have a choice to decouple the write path from the ATS service, why
not?
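To illustrate the short-lived app problem, here's a sketch of the queue-and-drain emitter that a non-blocking RPC write path implies. The class and method names are made up for illustration; the real TimelineClient API may look nothing like this. Notice that the drain at close() is exactly where a short app eats the failover delay:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical non-blocking emitter: events are handed to a background thread,
// so emit() itself never blocks the app while the ATS is slow or failing over.
public class AsyncAtsEmitter {
  private final ExecutorService sender = Executors.newSingleThreadExecutor();

  public void emit(Runnable rpcPut) {
    sender.submit(rpcPut);           // rpcPut wraps the actual RPC to the ATS
  }

  // The catch for short-lived apps: before exiting we still have to drain the
  // queue. If an ATS node is failing over, this is where the app stalls, or
  // where we silently drop events once the timeout expires.
  public void close(long timeoutSec) throws InterruptedException {
    sender.shutdown();
    if (!sender.awaitTermination(timeoutSec, TimeUnit.SECONDS)) {
      sender.shutdownNow();
    }
  }
}
{code}
For an app that finishes in well under a minute, that drain timeout is a real fraction of its runtime.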
h4. Scenario 3. ATS backing store fails
By backing store, I mean the storage system where ATS persists the events, such
as LevelDB and HBase. In a naive implementation, it seems that if the backing
store fails, then the ATS service will be unavailable. Does that mean the event
write path will fail, and the YARN apps will stall or fail? I hope not. It's
not an issue if we use HDFS as the default write channel, because most YARN
apps already depend on HDFS.
One may argue that _the ATS service will buffer writes (persist them elsewhere)
if the backing store fails_. To this I'll say:
* If we have an alternate code path to persist events first before they hit the
final backing store, why not do that all the time? Such a path would address
scenarios #1 and #2 as well (see the sketch after this list).
* HBase has been mentioned as if it's the penicillin of event storage here.
That is probably true for big shops like Twitter and Yahoo, who have the
expertise to operate an HBase cluster well. But most enterprise users or
startups don't. We should assume that those HBase instances will run
suboptimally, with occasional widespread failures. Using HBase for event storage
is a poor fit for most people, and I think it makes it difficult to achieve good
uptime for the ATS service as a whole.
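Here's the kind of ATS-side consumer I have in mind when I say the alternate path can simply be the only path. Both interfaces are placeholders for whatever channel reader and backing-store writer we end up with; the retry loop is the point:
{code:java}
// Placeholder interfaces, just for illustration.
interface EventChannelReader {
  String next() throws InterruptedException;  // blocks until an event arrives
  void ack(String event);                     // advance the channel offset
}
interface BackingStore {
  void put(String event) throws Exception;    // e.g. an HBase-backed writer
}

// Hypothetical ingester: the reliable channel is the source of truth, so a
// backing-store outage only delays indexing. It never loses events and never
// touches the app's write path.
public class EventIngester implements Runnable {
  private final EventChannelReader channel;
  private final BackingStore store;

  public EventIngester(EventChannelReader channel, BackingStore store) {
    this.channel = channel;
    this.store = store;
  }

  @Override
  public void run() {
    try {
      while (true) {
        String event = channel.next();
        while (true) {
          try {
            store.put(event);        // may fail while HBase is down
            channel.ack(event);
            break;
          } catch (Exception e) {
            Thread.sleep(5000);      // back off; the event stays in the channel
          }
        }
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}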
> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework
> data all by itself, as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t interpretation and serving.