[
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143034#comment-14143034
]
Zhijie Shen commented on YARN-1530:
-----------------------------------
bq. Scenario 1. ATS service goes down
bq. Scenario 2. ATS service partially down
In general, I agree that the concerns about the scenarios where the timeline
server is (partially) down make sense. However, if we change the subject from
ATS to HDFS/Kafka, I'm afraid we can reach a similar conclusion. For example,
HDFS can be temporarily not writable (we have actually observed this issue
around YARN log aggregation). I can see the judgement has an obvious
implication that the timeline server can be down, but HDFS/Kafka will not.
That's correct to some extent based on the current timeline server SLA.
Therefore, is making the timeline server reliable (or always-up) the essential
solution? If the timeline server is reliable, it relaxes the requirement to
persist entities in a third place (this is the basic benefit I can see in the
HDFS/Kafka channel). While it may take a while to make the timeline server as
reliable as HDFS/Kafka, we can make progress step by step; for example,
YARN-2520 should realistically be achievable within a reasonable timeline.
Of course, there may still be a reliability gap between ATS/HBase and
HDFS/Kafka (actually, I'm not experienced with the reliability of the latter
components; please let me know what the exact gap would be). It could be argued
that we still need to persist the entities in HDFS/Kafka when ATS/HBase is
unavailable but HDFS/Kafka is still available. However, if we need to improve
the timeline server's reliability anyway, perhaps we should think carefully
about the cost-effectiveness of implementing and maintaining another write
channel to bridge the gap.
bq. Scenario 3. ATS backing store fails
In this scenario, the entities have already reached the timeline server, right?
I'm considering it an internal reliability problem of the timeline server. As I
mentioned in the previous threads, it's a requirement that once an entity has
reached the timeline server, the timeline server should take responsibility for
preventing it from being lost. I think it's a good point that the data store
may suffer an outage (just as HDFS can be temporarily not writable). Having a
local backup for the outstanding received requests should be an answer for this
scenario.
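To make the "local backup" idea concrete, here is a minimal self-contained sketch (hypothetical class and method names, not the actual ATS code): the server appends each accepted entity to a local spool file before it reaches the backing store, so that a store outage leaves the already-accepted requests replayable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: spool each accepted entity to a local
// write-ahead file before handing it to the backing store, so a
// store outage does not lose already-accepted put requests.
public class LocalSpool {
    private final Path spoolFile;

    public LocalSpool(Path spoolFile) {
        this.spoolFile = spoolFile;
    }

    // Called when the server accepts a put request: persist locally first.
    public void accept(String serializedEntity) throws IOException {
        Files.write(spoolFile,
                (serializedEntity + System.lineSeparator())
                        .getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // After a store outage, replay everything still in the spool.
    public List<String> replay() throws IOException {
        if (!Files.exists(spoolFile)) {
            return Collections.emptyList();
        }
        return Files.readAllLines(spoolFile, StandardCharsets.UTF_8);
    }

    // Once the backing store confirms the writes, drop the spool.
    public void clear() throws IOException {
        Files.deleteIfExists(spoolFile);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ats-spool", ".log");
        Files.delete(tmp);
        LocalSpool spool = new LocalSpool(tmp);
        spool.accept("entity-1");
        spool.accept("entity-2");
        System.out.println(spool.replay().size()); // prints 2
        spool.clear();
    }
}
```

A real implementation would of course need serialization, rotation, and crash-safe acknowledgement, but the point is only that the durability window between "accepted" and "stored" can be covered locally.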
bq. However, with the HDFS channel, the ATS can essentially throttle the events
Suppose you have a cluster pushing X events/second to the ATS. With the REST
implementation, the ATS must try to handle X events every second; if it can’t
keep up, or if it gets too many incoming connections, there’s not too much we
can do here.
This may not be an accurate judgement. I'm supposing you are comparing pushing
each event in one request via the REST API with writing a batch of X events
into HDFS. The REST API allows you to batch X events and send them in one
request. Please refer to TimelineClient#putEntities for the details.
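To illustrate the batching point above, here is a minimal self-contained sketch (hypothetical names; the real API is the varargs TimelineClient#putEntities, which takes multiple entities in a single call): events are buffered and sent as one request per batch, so X events per second need not mean X requests per second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the batching idea behind
// TimelineClient#putEntities(TimelineEntity...): instead of one HTTP
// request per event, buffer events and send each batch as one request.
public class BatchingPublisher {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> sender; // stands in for one REST call
    private int requestsSent = 0;

    public BatchingPublisher(int batchSize, Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    public void publish(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer)); // one request carries the batch
        requestsSent++;
        buffer.clear();
    }

    public int getRequestsSent() {
        return requestsSent;
    }

    public static void main(String[] args) {
        List<List<String>> received = new ArrayList<>();
        BatchingPublisher p = new BatchingPublisher(100, received::add);
        for (int i = 0; i < 1000; i++) {
            p.publish("event-" + i);
        }
        p.flush();
        System.out.println(p.getRequestsSent()); // 1000 events, prints 10
    }
}
```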
bq. In making the write path pluggable, we’d have to have two pieces: one to do
the writing from the TimelineClient and one to the receiving in the ATS. These
would have to be in pairs. We’ve already discussed some different
implementations for this: REST, Kafka, and HDFS.
bq. The backing store is already pluggable.
No problem, it's feasible to make the write path pluggable. However, though the
store is pluggable, LevelDB and HBase are relatively similar to each other
compared to the HTTP REST vs. HDFS/Kafka pair. The more important thing is that
it's more difficult to manage different write channels than to manage different
stores, because one is client-side and the other is server-side. On the server
side, the YARN cluster operator has full control of the servers and a limited
number of hosts to deal with. On the client side, the YARN cluster operator may
not have access to the clients, and doesn't know how many clients and how many
types of apps he/she needs to deal with. TimelineClient is a generic tool (not
for a particular application such as Spark), so it's good to keep it
lightweight and portable. Again, it's still a cost-effectiveness question.
bq. Though as bc pointed out before, it’s fine for more experienced users to
use HBase, but “regular” users should have a solution as well that is hopefully
more scalable and reliable than LevelDB.
Right, and this is also my concern about the HDFS/Kafka channel, particularly
using it as the default. "Regular" users may not be experienced enough with
HBase, or with HDFS/Kafka for that matter. It very much depends on the users
and the use cases.
[~bcwalrus] and [~rkanter], thanks for bringing new ideas to the timeline
service. In general, the timeline service is still a young project. We have
different problems to solve and multiple ways to solve them. An additional
write channel is interesting, but given the whole roadmap of the timeline
service, let's think critically about the work that can improve the timeline
service most significantly. Hopefully you can understand my concern. Thanks!
> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to do store, and serve per-framework
> data all by itself as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t interpretation and serving.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)