[
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125238#comment-14125238
]
Zhijie Shen commented on YARN-1530:
-----------------------------------
[~rkanter], thanks for proposing the idea of pipeline writing, which sounds
interesting. With respect to the two proposed solutions, I have some general comments.
1. HDFS-based implementation:
Using HDFS to persist unpublished timeline entities sounds like an interesting
idea, but I’m not sure it’s going to solve the scalability problem. Though
each application can write its timeline entities into HDFS in a distributed
manner, there’s still a single timeline server that fetches the files of the
timeline entities written by ALL applications. The bottleneck is still there.
Essentially, I don’t see any difference between publishing entities via the HTTP
REST interface and via HDFS in terms of scalability. And given the same
timeline server, I’m afraid the HTTP REST interface is very likely to be
more efficient:
a) Less I/O against the secondary storage;
b) Built-in multithreading mechanism from the web server;
c) No delay of waiting until next round of file fetching.
The current writing channel makes the data available on the timeline
server immediately, so that we can support real-time or near-real-time
queries of application metrics. Because of c), I’m not sure this is still
feasible with the HDFS writing channel (wouldn’t too-frequent fetching cause
performance problems?).
IMHO, the ESSENTIAL problem is that a single instance of the timeline server is
not going to be able to take in a huge number of concurrent requests (no matter
whether they come via HTTP REST, RPC or HDFS). Let’s assume we’re going to have the HBase
timeline store (YARN-2032), which is scalable and reliable. Thanks to the
timeline server’s stateless nature, the proper way to scale up the timeline service
is to start a number of federated timeline servers and connect them all to the same
HBase timeline store. This design solves the high-availability problem as well. Moreover,
it scales not just the writing channel, but also the reading one. I’ll file a
separate ticket about the scalable and highly available timeline server.
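As a very rough sketch of what the federated deployment could look like from the client side: each stateless timeline server instance is configured with the same store (e.g. the HBase store from YARN-2032), and the client simply picks one of them. The host list and the hash-based pick below are hypothetical; there is no such federation support today.
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederatedTimelineClientSketch {
  public static TimelineClient createClient(String appId) {
    // Hypothetical pool of federated timeline servers, all pointing at the
    // same backing timeline store via yarn.timeline-service.store-class.
    List<String> servers = Arrays.asList(
        "ats-1.example.com:8188", "ats-2.example.com:8188", "ats-3.example.com:8188");
    String picked = servers.get((appId.hashCode() & Integer.MAX_VALUE) % servers.size());

    YarnConfiguration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS, picked);

    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    return client;
  }
}
{code}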
2. Direct-writing implementation:
The biggest concern here is that this solution breaks the current timeline
server architecture:
a) Accepting requests from users ->
b) Pre-processing the timeline entities and verifying users’ access ->
c) Translating the timeline entities into key/value pairs
If you want to open the option for the client to write into the data store
directly, all of this logic has to move to the client (see the sketch below). It’s a
much more complex stack than a few simple put calls to the data store. It not only adds
dependencies to the timeline client, but turns it from a thin client into a fat
one. This makes it more difficult to distribute the timeline client and to upgrade
it in the future, and puts a heavy burden on the user. More importantly, I’m not sure
we will be able to verify users’ access on the client side, or whether HBase access
control is enough for our specific timeline ACLs.
Another concern is that the client becomes strongly coupled to a particular type
of data store, such as HBase. If we choose to use Leveldb (or Rocksdb) instead, the
HBase client currently used on the application side is going to break. In
other words, you have to change the client and the data store simultaneously.
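To illustrate the fat-client concern, here is a rough sketch of what a direct-writing client would have to carry, assuming a hypothetical HBase table/schema and a hypothetical client-side ACL helper (none of these names come from an existing implementation). Steps a)–c) all end up in application code, and the whole thing is tied to the HBase client API:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

public class DirectHBaseWriterSketch {
  private static final byte[] INFO_FAMILY = Bytes.toBytes("i"); // hypothetical schema

  public void putEntity(TimelineEntity entity) throws IOException {
    // b) The client itself now has to verify the caller's access -- a check
    //    that today lives inside the timeline server.
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    checkAccess(caller, entity); // hypothetical ACL helper

    // c) The client also has to translate the entity into key/value pairs,
    //    i.e. it must know the row key layout and column families.
    byte[] rowKey = Bytes.toBytes(entity.getEntityType() + "!" + entity.getEntityId());
    long startTime = entity.getStartTime() == null ? 0L : entity.getStartTime();
    Put put = new Put(rowKey);
    put.add(INFO_FAMILY, Bytes.toBytes("starttime"), Bytes.toBytes(startTime));

    // a) Finally, the raw put goes straight to the store, coupling the
    //    application to the HBase client and to this particular schema.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "timeline_entity"); // hypothetical table name
    try {
      table.put(put);
    } finally {
      table.close();
    }
  }

  private void checkAccess(UserGroupInformation caller, TimelineEntity entity) {
    // Placeholder for the timeline ACL logic that would have to move here.
  }
}
{code}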
Misc. SLA of TimelineClient
The proposal reminds me of another interesting question: the SLA of the timeline
server and the client. The timeline server should be reliable: once a timeline
entity is accepted by the timeline server, it should not be lost, which is
ensured by a reliable data store (e.g. HBase): YARN-2032. It is worth asking
whether we also want the timeline client to be reliable: once a timeline entity
is handed to the timeline client, it should not be lost before being accepted
by the timeline server, which may be down for a while. Hence, a local
cache in HDFS may be a good idea. Or we can use an even more lightweight solution:
Leveldb. I’ll file a ticket about it as well.
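A minimal sketch of what such a client-side Leveldb cache could look like, assuming the leveldbjni bindings already used by the Leveldb timeline store and hypothetical serialize/deserialize helpers for TimelineEntity; entities survive a timeline server outage on local disk and are re-sent later:
{code:java}
import java.io.File;
import java.util.Map;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class LeveldbBufferedTimelineClientSketch {
  private final DB buffer;
  private final TimelineClient client;

  public LeveldbBufferedTimelineClientSketch(TimelineClient client, File dir)
      throws Exception {
    this.client = client;
    this.buffer = JniDBFactory.factory.open(dir, new Options().createIfMissing(true));
  }

  /** Persist the entity locally first, so it is not lost if the server is down. */
  public void putEntity(TimelineEntity entity) throws Exception {
    byte[] key = (entity.getEntityType() + "!" + entity.getEntityId()).getBytes("UTF-8");
    buffer.put(key, serialize(entity)); // serialize() is a hypothetical helper
    flush();
  }

  /** Try to drain the local buffer; anything not accepted stays cached. */
  public void flush() throws Exception {
    DBIterator it = buffer.iterator();
    try {
      for (it.seekToFirst(); it.hasNext(); ) {
        Map.Entry<byte[], byte[]> entry = it.next();
        try {
          client.putEntities(deserialize(entry.getValue())); // hypothetical helper
          buffer.delete(entry.getKey());
        } catch (Exception serverDown) {
          break; // server unreachable: keep the remaining entities buffered
        }
      }
    } finally {
      it.close();
    }
  }

  private byte[] serialize(TimelineEntity entity) {
    return new byte[0]; // placeholder; real code would use e.g. Jackson JSON
  }

  private TimelineEntity deserialize(byte[] bytes) {
    return new TimelineEntity(); // placeholder; real code would use e.g. Jackson JSON
  }
}
{code}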
> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework
> data all by itself as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t interpretation and serving.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)