[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141644#comment-14141644
 ] 

Robert Kanter commented on YARN-1530:
-------------------------------------

I also agree that providing reliability through an “always-up” ATS service is 
not the optimal solution here for the reasons already mentioned.  We should 
instead make the write path and backing store reliable (or at least somehow 
recoverable).  

{quote}
Though each application can write the timeline entities into HDFS in a 
distributed manner, there’s still a single timeline server that fetches the 
files of the timeline entities written by ALL applications. The bottleneck is 
still there. Essentially I don’t see any difference between publishing entities 
via HTTP REST interface and via HDFS in terms of scalability.{quote}
Technically yes, the same bottleneck is still there.  However, with the HDFS 
channel, the ATS can essentially throttle the events.  Suppose you have a 
cluster pushing X events/second to the ATS.  With the REST implementation, the 
ATS must try to handle X events every second; if it can’t keep up, or if it 
gets too many incoming connections, there’s not much we can do about it.  I 
suppose we could add active-active HA so we have more ATS servers running, but 
I’m not sure we want to make that a requirement, and we’d also have to come up 
with a good way of balancing the load.  With the HDFS implementation, the ATS 
has more control over how it ingests the events: for example, it could read a 
maximum of Y events per poll, or Y events per job, etc.  While this slows down 
how quickly events become available in the ATS, it lets the ATS keep running 
normally without requiring active-active HA.  And if we make this configurable 
enough, users with beefier ATS machines could increase Y.
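
To make the throttling idea a bit more concrete, here’s a rough sketch of what 
a pull-based, rate-capped ingestion loop on the ATS side could look like.  To 
be clear, this is only an illustration under my assumptions: the class name, 
the readNextBatch/writeToStore helpers, and the knobs are made up for this 
comment and don’t exist in the ATS today.

{code:java}
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// Hypothetical sketch only: none of these names exist in the current ATS.
public class ThrottledHdfsIngester implements Runnable {

  private final int maxEntitiesPerPoll;   // the "Y" knob discussed above
  private final long pollIntervalMs;
  private volatile boolean running = true;

  public ThrottledHdfsIngester(int maxEntitiesPerPoll, long pollIntervalMs) {
    this.maxEntitiesPerPoll = maxEntitiesPerPoll;
    this.pollIntervalMs = pollIntervalMs;
  }

  @Override
  public void run() {
    while (running) {
      // Pull at most Y entities per round, no matter how many applications
      // have written files into the HDFS staging area since the last poll.
      List<TimelineEntity> batch = readNextBatch(maxEntitiesPerPoll);
      writeToStore(batch);
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        running = false;
      }
    }
  }

  // Placeholder: scan the per-application files in HDFS and return up to
  // 'limit' entities.
  private List<TimelineEntity> readNextBatch(int limit) {
    return Collections.emptyList();
  }

  // Placeholder: hand the batch to the pluggable backing store.
  private void writeToStore(List<TimelineEntity> batch) {
    // no-op in this sketch
  }
}
{code}

The point is just that the server pulls at its own pace instead of having X 
events/second pushed at it, so overload degrades into lag rather than dropped 
connections.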

It sounds like there are two areas where we’re having difficulty coming to a 
consensus:
# The write path/communication channel from the TimelineClient to the ATS or 
backing store
# The backing store itself

I can see reasons for having different implementations for the backing store 
given that HBase is a “heavy” external service and we should have something 
that works out-of-the-box.  Ideally, I think it would be best if we could all 
agree on a single write path, though making it pluggable is certainly an 
option.  As for maintaining them, I think we should be fine as long as we don’t 
have too many implementations.  We already do that for other components, such 
as the scheduler; though we should be careful to make sure that the different 
implementations only implement what they need to and any shareable code is 
shared.  In making the write path pluggable, we’d have to have two pieces: one 
to do the writing from the TimelineClient and one to do the receiving in the 
ATS.  
These would have to be in pairs.  We’ve already discussed some different 
implementations for this: REST, Kafka, and HDFS.  
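
To sketch what “in pairs” might look like (all of the interface and class 
names here are invented for this comment, not existing APIs):

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// Client side: used by TimelineClient to publish entities over some channel.
interface TimelineChannelWriter {
  void write(List<TimelineEntity> entities) throws IOException;
}

// Server side: used by the ATS to consume entities from the same channel.
interface TimelineChannelReceiver {
  // Returning at most 'limit' entities is what lets the ATS throttle itself.
  List<TimelineEntity> poll(int limit) throws IOException;
}

// Each channel would ship as a matching pair, e.g.:
//   RestChannelWriter  <->  RestChannelReceiver
//   HdfsChannelWriter  <->  HdfsChannelReceiver
//   KafkaChannelWriter <->  KafkaChannelReceiver
{code}

The key constraint is just that the client-side writer and the server-side 
receiver for a given channel are configured together.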

The backing store is already pluggable.  Though, as bc pointed out before, 
while it’s fine for more experienced users to use HBase, “regular” users should 
also have a solution that is hopefully more scalable and reliable than LevelDB.  
It would be great if we could provide a backing store that sits in between 
LevelDB and HBase.  And I think it’s fine for it to be external to Hadoop as 
long as it’s relatively simple to set up and maintain.  Though I’ll admit I’m 
not sure which store would fit that description.  Does anyone have any 
suggestions on this?
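
For anyone following along, the pluggability above is the existing 
yarn.timeline-service.store-class knob; a quick sketch (the exact class names 
below are illustrative rather than authoritative, and the “in-between” store is 
purely hypothetical):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class TimelineStoreSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Out-of-the-box setup: the LevelDB-backed store (class name shown for
    // illustration; check your Hadoop version for the exact default).
    conf.set("yarn.timeline-service.store-class",
        "org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore");

    // A heavier deployment (or the hypothetical "in-between" store discussed
    // above) would only need to swap this one property, e.g.:
    // conf.set("yarn.timeline-service.store-class",
    //     "com.example.timeline.MidweightTimelineStore");  // hypothetical
  }
}
{code}

So whatever store we settle on, wiring it in shouldn’t require changes outside 
of that one property and the new store implementation itself.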

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, 
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, 
> application timeline design-20140116.pdf, application timeline 
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework 
> data all by itself as YARN doesn't have a common solution. This JIRA attempts 
> to solve the storage, management and serving of per-framework data from 
> various applications, both running and finished. The aim is to change YARN to 
> collect and store data in a generic manner with plugin points for frameworks 
> to do their own thing w.r.t interpretation and serving.


