[
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125238#comment-14125238
]
Zhijie Shen commented on YARN-1530:
-----------------------------------
[~rkanter], thanks for proposing the idea of pipeline writing, which sounds
interesting. With respect to the two proposed solutions, I have some general comments.
1. HDFS-based implementation:
Using HDFS to persist unpublished timeline entities sounds like an interesting
idea, but I’m not sure it’s going to solve the scalability problem. Though
each application can write its timeline entities into HDFS in a distributed
manner, there’s still a single timeline server that fetches the files of the
timeline entities written by ALL applications. The bottleneck is still there.
Essentially, I don’t see any difference between publishing entities via the HTTP
REST interface and via HDFS in terms of scalability. And given the same
timeline server, I’m afraid the HTTP REST interface is very likely to be
more efficient:
a) Less I/O against the secondary storage;
b) Built-in multithreading mechanism from the web server;
c) No delay of waiting until next round of file fetching.
The current writing channel makes the data available on the timeline
server immediately, so that we can support real-time or near-real-time
queries of application metrics. Because of c), I’m not sure this is still
feasible with the HDFS writing channel (wouldn’t too-frequent fetching cause
performance problems?).
IMHO, the ESSENTIAL problem is that a single instance of the timeline server is
not going to be able to take in a huge number of concurrent requests (no matter
whether they come via HTTP REST, RPC or HDFS). Let’s assume we’re going to have the HBase
timeline store (YARN-2032), which is scalable and reliable. Thanks to the
timeline server’s stateless nature, the proper way to scale up the timeline service
is to start a number of federated timeline servers and connect them all to the same
HBase timeline store. This design solves the high-availability problem as well. Moreover,
it scales not just the writing channel, but also the reading one. I’ll file a
separate ticket about the scalable and highly available timeline server.
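As a very rough sketch of what the federated deployment could look like from the client side: each stateless timeline server instance is configured with the same store (e.g. the HBase store from YARN-2032), and the client simply picks one of them. The host list and the hash-based pick below are hypothetical; there is no such federation support today.
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederatedTimelineClientSketch {
  public static TimelineClient createClient(String appId) {
    // Hypothetical pool of federated timeline servers, all pointing at the
    // same backing timeline store via yarn.timeline-service.store-class.
    List<String> servers = Arrays.asList(
        "ats-1.example.com:8188", "ats-2.example.com:8188", "ats-3.example.com:8188");
    String picked = servers.get((appId.hashCode() & Integer.MAX_VALUE) % servers.size());

    YarnConfiguration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS, picked);

    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    return client;
  }
}
{code}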
2. Direct-writing implementation:
The biggest concern here is that this solution breaks the current timeline
server architecture:
a) Accepting requests from users ->
b) Pre-processing the timeline entities and verifying users’ access ->
c) Translating the timeline entities into key/value pairs
If you want to open the option for the client to write into the data store
directly, all of this logic has to move to the client (see the sketch below). It’s a
much more complex stack than a few simple put calls to the data store. It not only adds
dependencies to the timeline client, but turns it from a thin client into a fat
one. This makes it more difficult to distribute the timeline client and to upgrade
it in the future, and puts a heavy burden on the user. More importantly, I’m not sure
we will be able to verify users’ access on the client side, or whether HBase access
control is enough for our specific timeline ACLs.
Another concern is that the client becomes strongly coupled to a particular type
of data store, such as HBase. If we choose to use Leveldb (or Rocksdb) instead, the
HBase client currently used on the application side is going to break. In
other words, you have to change the client and the data store simultaneously.
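To illustrate the fat-client concern, here is a rough sketch of what a direct-writing client would have to carry, assuming a hypothetical HBase table/schema and a hypothetical client-side ACL helper (none of these names come from an existing implementation). Steps a)–c) all end up in application code, and the whole thing is tied to the HBase client API:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

public class DirectHBaseWriterSketch {
  private static final byte[] INFO_FAMILY = Bytes.toBytes("i"); // hypothetical schema

  public void putEntity(TimelineEntity entity) throws IOException {
    // b) The client itself now has to verify the caller's access -- a check
    //    that today lives inside the timeline server.
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    checkAccess(caller, entity); // hypothetical ACL helper

    // c) The client also has to translate the entity into key/value pairs,
    //    i.e. it must know the row key layout and column families.
    byte[] rowKey = Bytes.toBytes(entity.getEntityType() + "!" + entity.getEntityId());
    long startTime = entity.getStartTime() == null ? 0L : entity.getStartTime();
    Put put = new Put(rowKey);
    put.add(INFO_FAMILY, Bytes.toBytes("starttime"), Bytes.toBytes(startTime));

    // a) Finally, the raw put goes straight to the store, coupling the
    //    application to the HBase client and to this particular schema.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "timeline_entity"); // hypothetical table name
    try {
      table.put(put);
    } finally {
      table.close();
    }
  }

  private void checkAccess(UserGroupInformation caller, TimelineEntity entity) {
    // Placeholder for the timeline ACL logic that would have to move here.
  }
}
{code}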
Misc. SLA of TimelineClient
The proposal reminds me of another interesting question: the SLA of the timeline
server and the client. The timeline server should be reliable: once a timeline
entity is accepted by the timeline server, it should not be lost, which is
ensured by a reliable data store (e.g. HBase): YARN-2032. It is worth asking
whether we also want the timeline client to be reliable: once a timeline entity
is handed to the timeline client, it should not be lost before being accepted
by the timeline server, which may be down for a while. Hence, a local
cache in HDFS may be a good idea. Or we can use an even more lightweight solution:
Leveldb. I’ll file a ticket about it as well.
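A minimal sketch of what such a client-side Leveldb cache could look like, assuming the leveldbjni bindings already used by the Leveldb timeline store and hypothetical serialize/deserialize helpers for TimelineEntity; entities survive a timeline server outage on local disk and are re-sent later:
{code:java}
import java.io.File;
import java.util.Map;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class LeveldbBufferedTimelineClientSketch {
  private final DB buffer;
  private final TimelineClient client;

  public LeveldbBufferedTimelineClientSketch(TimelineClient client, File dir)
      throws Exception {
    this.client = client;
    this.buffer = JniDBFactory.factory.open(dir, new Options().createIfMissing(true));
  }

  /** Persist the entity locally first, so it is not lost if the server is down. */
  public void putEntity(TimelineEntity entity) throws Exception {
    byte[] key = (entity.getEntityType() + "!" + entity.getEntityId()).getBytes("UTF-8");
    buffer.put(key, serialize(entity)); // serialize() is a hypothetical helper
    flush();
  }

  /** Try to drain the local buffer; anything not accepted stays cached. */
  public void flush() throws Exception {
    DBIterator it = buffer.iterator();
    try {
      for (it.seekToFirst(); it.hasNext(); ) {
        Map.Entry<byte[], byte[]> entry = it.next();
        try {
          client.putEntities(deserialize(entry.getValue())); // hypothetical helper
          buffer.delete(entry.getKey());
        } catch (Exception serverDown) {
          break; // server unreachable: keep the remaining entities buffered
        }
      }
    } finally {
      it.close();
    }
  }

  private byte[] serialize(TimelineEntity entity) {
    return new byte[0]; // placeholder; real code would use e.g. Jackson JSON
  }

  private TimelineEntity deserialize(byte[] bytes) {
    return new TimelineEntity(); // placeholder; real code would use e.g. Jackson JSON
  }
}
{code}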
> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework
> data all by itself as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t interpretation and serving.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)