[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143034#comment-14143034 ]
Zhijie Shen commented on YARN-1530:
-----------------------------------

bq. Scenario 1. ATS service goes down
bq. Scenario 2. ATS service partially down

In general, I agree that the concerns about the scenarios where the timeline server is (partially) down make sense. However, if we change the subject from ATS to HDFS/Kafka, I'm afraid we can reach a similar conclusion. For example, HDFS can be temporarily not writable (we have actually observed this issue around YARN log aggregation). I can see the judgement has an obvious implication that the timeline server can be down, but HDFS/Kafka will not. That's correct to some extent based on the current timeline server SLA. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, it's going to relax the requirement to persist entities in a third place (this is the basic benefit I can see with the HDFS/Kafka channel). While it may take a while to make the timeline server as reliable as HDFS/Kafka, we can make progress step by step; for example, YARN-2520 should be realistic to achieve within a reasonable timeline. Of course, there may still be a reliability gap between ATS/HBase and HDFS/Kafka (actually, I'm not experienced with the reliability of the latter components, so please let me know what the exact gap will be). It could be argued that we still need to persist the entities in HDFS/Kafka when ATS/HBase is not available but HDFS/Kafka still is. However, if we need to improve the timeline server's reliability anyway, perhaps we should think carefully about the cost-effectiveness of implementing and maintaining another writing channel to bridge the gap.

bq. Scenario 3. ATS backing store fails

In this scenario, the entities have already reached the timeline server, right? I'm considering this the internal reliability problem of the timeline server.
As I mentioned in the previous threads, it's the requirement that once an entity has reached the timeline server, the timeline server should take the responsibility to prevent it from being lost. I think it's a good point that the data store may suffer an outage (just as HDFS can be temporarily not writable). Having a local backup for the outstanding received requests should be an answer for this scenario.

bq. However, with the HDFS channel, the ATS can essentially throttle the events. Suppose you have a cluster pushing X events/second to the ATS. With the REST implementation, the ATS must try to handle X events every second; if it can’t keep up, or if it gets too many incoming connections, there’s not too much we can do here.

This may not be an accurate judgement. I suppose you are comparing pushing each event in one request via the REST API with writing a batch of X events into HDFS. The REST API allows you to batch X events and send them in one request. Please refer to TimelineClient#putEntities for the details.

bq. In making the write path pluggable, we’d have to have two pieces: one to do the writing from the TimelineClient and one to do the receiving in the ATS. These would have to be in pairs. We’ve already discussed some different implementations for this: REST, Kafka, and HDFS.
bq. The backing store is already pluggable.

No problem, it's feasible to make the write path pluggable. However, though the store is pluggable, LevelDB and HBase are relatively similar to each other compared to the HTTP REST vs. HDFS/Kafka pair. More importantly, it's more difficult to manage different writing channels than to manage different stores, because one is client-side and the other is server-side. At the server side, the YARN cluster operator has full control of the servers and a limited set of hosts to deal with. At the client side, the YARN cluster operator may not have access to the clients, and doesn't know how many clients and how many types of apps he/she needs to deal with.
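On the batching point above, here is a minimal conceptual sketch in plain Python (not the actual Hadoop API; `TimelineBatcher` and the `put_entities` callable are hypothetical names for illustration) of how a client can buffer events and send X of them in one request, which is what the varargs TimelineClient#putEntities allows on the Java side:

```python
# Conceptual sketch only: shows the batching trade-off the comment describes,
# i.e. X events costing one request instead of X requests.

class TimelineBatcher:
    """Buffers entities client-side and flushes them in one batched call
    (hypothetical helper; TimelineClient#putEntities is the real API)."""

    def __init__(self, put_entities, batch_size=100):
        self.put_entities = put_entities  # callable standing in for one REST request
        self.batch_size = batch_size
        self.buffer = []
        self.requests_sent = 0

    def emit(self, entity):
        self.buffer.append(entity)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.put_entities(list(self.buffer))  # one request for the whole batch
            self.requests_sent += 1
            self.buffer.clear()

sent_batches = []
batcher = TimelineBatcher(sent_batches.append, batch_size=100)
for i in range(250):
    batcher.emit({"entityId": f"entity_{i}", "entityType": "EXAMPLE_TYPE"})
batcher.flush()  # flush the final partial batch
print(batcher.requests_sent)  # → 3 requests for 250 events, not 250
```

The same idea applies unchanged with the real client: build the entities locally and pass them all to a single putEntities call rather than issuing one call per event.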
TimelineClient is a generic tool (not one for a particular application such as Spark), so it's good to keep it lightweight and portable. Again, it's still a cost-effectiveness question.

bq. Though as bc pointed out before, it’s fine for more experienced users to use HBase, but “regular” users should have a solution as well that is hopefully more scalable and reliable than LevelDB.

Right, and this is also my concern about the HDFS/Kafka channel, in particular using it as a default. "Regular" users may not be experienced enough with HBase, or with HDFS/Kafka either. It very much depends on the users and the use cases.

[~bcwalrus] and [~rkanter], thanks for putting new ideas into the timeline service. In general, the timeline service is still a young project. We have different problems to solve and multiple ways to solve them. An additional writing channel is interesting, but given the whole roadmap of the timeline service, let's think critically about the work that can improve the timeline service most significantly. Hopefully you can understand my concern. Thanks!

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework data all by itself, as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner, with plugin points for frameworks to do their own thing w.r.t. interpretation and serving.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)