[
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893022#comment-13893022
]
Patrick Wendell commented on YARN-1530:
---------------------------------------
Just gave this a read. Based on my understanding, the design here is basically
an indexing service for time-series metadata from applications. The major
design decisions are that the API is REST for both inserting and removing data
and that the data format will be fairly structured, include a first-class
notion of time, and support filtering based on some dimensional information.
Other questions like “how is the data persisted” and “what type of intermediate
aggregation do we support” seem to be undecided at this point or will be
pluggable.

I can give feedback from the perspective of Spark, which is an application that
runs on YARN but is not MapReduce. In Spark’s case, while we enthusiastically
support YARN, we also support other resource managers. So it’s unlikely we’d
ever add this indexing service as a dependency in the way we architect our UI
persistence. However, we are in the process of thinking about building a
history server component right now, so it would be nice to structure things in
a way where this can be leveraged in YARN environments. The fact that the API
is simple (REST) is a big +1 in that regard.

My biggest concern with this design is the notion of sending live data to a
single node rather than writing through HDFS. In Spark, tasks can easily be 100
milliseconds or less. This means that even a short Spark job can execute tens
of thousands of tasks and a large Spark job can execute hundreds of thousands of
tasks or more. It’s easily an order of magnitude more tasks per unit time than
MR, and we also track a large amount of instrumentation per task since users
tend to be very performance-conscious. So I might worry about the rate at which
events can be reported over REST vs over a bulk transfer through compressed
HDFS files.
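
To make the rate concern concrete, here is a rough back-of-envelope sketch. The
slot count and the events-per-task figure are hypothetical illustration values,
not measurements from Spark or YARN; only the 100 ms task duration comes from
the discussion above.

```python
# Back-of-envelope estimate of event volume for short-task workloads.
# All inputs are hypothetical illustration values, not measurements.


def tasks_per_second(concurrent_slots: int, task_duration_ms: float) -> float:
    """Steady-state task completions per second if every slot stays busy."""
    return concurrent_slots * 1000.0 / task_duration_ms


def events_per_second(concurrent_slots: int,
                      task_duration_ms: float,
                      events_per_task: int) -> float:
    """Events emitted per second if each task reports `events_per_task` events."""
    return tasks_per_second(concurrent_slots, task_duration_ms) * events_per_task


if __name__ == "__main__":
    # Assumed: 500 concurrent task slots, 100 ms tasks, and 2 events per task
    # (for example, one start event and one end event).
    rate = events_per_second(500, 100.0, 2)
    print(f"{rate:.0f} events/s, {rate * 60:.0f} events/min")
```

Under these assumed inputs a single application would emit about 10,000 events
per second, which is the kind of load a per-event REST call to a single node
would have to absorb unless events are batched.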

Another question - if we wanted to write an “approved” UI that would be served
from within the same JVM, what would be the interface between that UI and the
indexing service? Would it also speak REST just within a single process, or is
it some other interface?

A final question - what is the security reason why YARN can't link to a
framework-specific UI? It seems like whether the user has a link to the URL and
whether it's secure are unrelated. I’m not super familiar with the security
model around web UIs in YARN though...
> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Attachments: application timeline design-20140108.pdf, application
> timeline design-20140116.pdf, application timeline design-20140130.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework
> data all by itself as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t. interpretation and serving.