[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143034#comment-14143034 ]

Zhijie Shen commented on YARN-1530:
-----------------------------------

bq. Scenario 1. ATS service goes down
bq. Scenario 2. ATS service partially down

In general, I agree that the concerns about the scenario where the timeline 
server is (partially) down make sense. However, if we change the subject from 
ATS to HDFS/Kafka, I'm afraid we reach a similar conclusion. For example, HDFS 
can be temporarily not writable (we have actually observed this issue around 
YARN log aggregation). I can see the judgement has an obvious implication that 
the timeline server can be down, but HDFS/Kafka will not. That's correct to 
some extent, based on the current timeline server SLA. Therefore, is making 
the timeline server reliable (or always-up) the essential solution? If the 
timeline server is reliable, it relaxes the requirement to persist entities in 
a third place (this is the basic benefit I can see in the HDFS/Kafka channel). 
While it may take a while to make the timeline server as reliable as 
HDFS/Kafka, we can make progress step by step; for example, YARN-2520 should 
realistically be achievable within a reasonable timeline.

Of course, there may still be a reliability gap between ATS/HBase and 
HDFS/Kafka (actually, I'm not experienced with the reliability of the latter 
components; please let me know what the exact gap would be). It's arguable 
that we still need to persist the entities in HDFS/Kafka when ATS/HBase is not 
available but HDFS/Kafka still is. However, if we need to improve the timeline 
server's reliability anyway, perhaps we should think carefully about the 
cost-effectiveness of implementing and maintaining another write channel to 
bridge the gap.

bq. Scenario 3. ATS backing store fails

In this scenario, the entities have already reached the timeline server, 
right? I'm considering this the internal reliability problem of the timeline 
server. As I mentioned in the previous threads, the requirement is that once 
an entity has reached the timeline server, the timeline server should take the 
responsibility of preventing it from being lost. I think it's a good point 
that the data store can suffer an outage (just as HDFS can be temporarily not 
writable). Keeping a local backup of the outstanding received requests should 
be an answer for this scenario; see the sketch below.
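For illustration only (this is a hypothetical sketch of the "local backup" 
idea, not existing ATS code; the class and method names are made up), the 
timeline server could spool each accepted put request to local disk before 
applying it to the backing store, and replay the pending files once the store 
recovers:

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Hypothetical local spool for put requests the backing store has not
 * yet acknowledged. Files that survive a store outage can be replayed.
 */
public class LocalSpool {
  private final File dir;

  public LocalSpool(File dir) {
    this.dir = dir;
    dir.mkdirs();
  }

  /** Persist the serialized put request before acking the client. */
  public File record(long seq, byte[] serializedEntities) throws IOException {
    File pending = new File(dir, seq + ".pending");
    FileOutputStream out = new FileOutputStream(pending);
    try {
      out.write(serializedEntities);
      out.getFD().sync(); // survive a crash of the timeline server itself
    } finally {
      out.close();
    }
    return pending;
  }

  /** Drop the backup once the backing store confirms the write. */
  public void ack(File pending) {
    pending.delete();
  }
}
{code}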

bq. However, with the HDFS channel, the ATS can essentially throttle the events.
bq. Suppose you have a cluster pushing X events/second to the ATS. With the REST 
implementation, the ATS must try to handle X events every second; if it can’t 
keep up, or if it gets too many incoming connections, there’s not too much we 
can do here.

This may not be an accurate judgement. I suppose you are comparing pushing 
each event in one request via the REST API with writing a batch of X events 
into HDFS. The REST API also allows you to batch X events and send them in a 
single request. Please refer to TimelineClient#putEntities for the details.
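For example, a minimal sketch of such a batched put (TimelineClient, 
TimelineEntity, and putEntities are the real YARN APIs; the entity type, IDs, 
and batch size below are made up for illustration):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class BatchedPut {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // Buffer X events locally...
      List<TimelineEntity> batch = new ArrayList<TimelineEntity>();
      for (int i = 0; i < 100; i++) {
        TimelineEntity entity = new TimelineEntity();
        entity.setEntityType("MY_APP_EVENT");      // hypothetical type
        entity.setEntityId("event_" + i);          // hypothetical ID
        entity.setStartTime(System.currentTimeMillis());
        TimelineEvent event = new TimelineEvent();
        event.setEventType("STATE_CHANGE");        // hypothetical event
        event.setTimestamp(System.currentTimeMillis());
        entity.addEvent(event);
        batch.add(entity);
      }
      // ...and ship the whole batch in one REST request.
      client.putEntities(batch.toArray(new TimelineEntity[batch.size()]));
    } finally {
      client.stop();
    }
  }
}
{code}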

bq. In making the write path pluggable, we’d have to have two pieces: one to do 
the writing from the TimelineClient and one to do the receiving in the ATS. 
These would have to be in pairs. We’ve already discussed some different 
implementations for this: REST, Kafka, and HDFS.
bq. The backing store is already pluggable. 

No problem; it's feasible to make the write path pluggable. However, though 
the store is pluggable, LevelDB and HBase are relatively similar to each other 
compared with the HTTP REST vs. HDFS/Kafka pair. The more important point is 
that it's more difficult to manage different write channels than to manage 
different stores, because one is client-side and the other is server-side. On 
the server side, the YARN cluster operator has full control of the servers and 
a limited set of hosts to deal with. On the client side, the YARN cluster 
operator may not have access, and doesn't know how many clients and how many 
types of apps he/she needs to deal with. TimelineClient is a generic tool (not 
for a particular application such as Spark), so it's good to keep it 
lightweight and portable. Again, it's still a cost-effectiveness question.
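To make the pairing concern concrete, a pluggable write path would imply a 
client-side abstraction like the following, with a matching server-side 
receiver per implementation (this interface is purely hypothetical; nothing 
like it exists in YARN today):

{code:java}
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

/**
 * Hypothetical client-side write channel. Each implementation (REST,
 * HDFS, Kafka) would need a matching server-side receiver in the ATS,
 * and both halves would have to be deployed and versioned as a pair on
 * every client host, which the operator may not control.
 */
public interface TimelineWriteChannel extends Closeable {
  void putEntities(TimelineEntity... entities) throws IOException;
}
{code}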

bq.  Though as bc pointed out before, it’s fine for more experienced users to 
use HBase, but “regular” users should have a solution as well that is hopefully 
more scalable and reliable than LevelDB. 

Right, and this is also my concern about the HDFS/Kafka channel, particularly 
about using it as the default. "Regular" users may not be experienced enough 
with HBase, nor with HDFS/Kafka. It very much depends on the users and the use 
cases.

[~bcwalrus] and [~rkanter], thanks for bringing new ideas into the timeline 
service. In general, the timeline service is still a young project. We have 
different problems to solve and multiple ways to solve them. An additional 
write channel is interesting, but given the whole roadmap of the timeline 
service, let's think critically about the work that can improve the timeline 
service most significantly. Hopefully you can understand my concern. Thanks!

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
> application timeline design-20140116.pdf, application timeline 
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework 
> data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
> to solve the storage, management and serving of per-framework data from 
> various applications, both running and finished. The aim is to change YARN to 
> collect and store data in a generic manner with plugin points for frameworks 
> to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
