[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598577#comment-14598577 ]
Sangjin Lee commented on YARN-3815:
-----------------------------------
{quote}
About flow online aggregation, I am not quite sure about the requirement yet.
Do we really want real time for flow-aggregated data, or would some
fine-grained time interval (like 15 secs) be good enough? If we want to show
some nice metrics charts for flows, that should be fine.
{quote}
Yes, I agree with that. When I said "real time", I didn't mean real time in
the sense that every metric is accurate to the second. Most likely the raw data
themselves (e.g. container data) are written on an interval anyway, so some
type of time interval for aggregation is implied.
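To make the implied interval concrete, here is a minimal sketch of interval-based aggregation. It is not the YARN-2928 implementation: the class `IntervalAggregator`, its method names, the 15-second window (borrowed from the quoted question), and the choice to sum values within a window are all assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of YARN or the timeline service.
public class IntervalAggregator {
    // Assumed window size; the 15-second figure comes from the quoted question.
    public static final long WINDOW_MS = 15_000L;

    // Window start (ms) -> sum of the metric values written within that window.
    private final TreeMap<Long, Long> windows = new TreeMap<>();

    // Align a timestamp to the start of the window that contains it.
    public static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    // Fold one raw write into its window. Summing is an assumption;
    // a gauge-style metric might keep the latest or the maximum value instead.
    public void record(long timestampMs, long value) {
        windows.merge(windowStart(timestampMs), value, Long::sum);
    }

    // Copy of the aggregated windows, ordered by window start.
    public Map<Long, Long> snapshot() {
        return new TreeMap<>(windows);
    }

    public static void main(String[] args) {
        IntervalAggregator agg = new IntervalAggregator();
        agg.record(1_000L, 5);
        agg.record(14_999L, 7);
        agg.record(15_000L, 3);
        System.out.println(agg.snapshot()); // prints {0=12, 15000=3}
    }
}
```

The two writes that fall inside the first window are merged into one value and the third starts a new window, so a reader such as a UI sees at most one point per window no matter how often the raw data arrive.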
{quote}
Any special reason not to handle it in the same way as above, as an HBase
coprocessor? It just sounds like a coarse-grained time interval, doesn't it?
{quote}
I do see your point in that what I called the "real time" aggregation can be
considered the same type of aggregation as the "offline" aggregation only on a
shorter time interval. However, we also need to think about the use cases of
such aggregated data.
The former type of aggregation is very much something that can be plugged into
a UI such as the RM UI or Ambari to show more immediate data. These data may
change as the user refreshes the UI, so this is closer to the raw data.
On the other hand, the latter type of aggregation lends itself to more
analytical, ad-hoc analysis of the data. It can be used for calculating
chargebacks, usage trending, reporting, etc. Perhaps it could even contain more
detailed info than the "real time" aggregated data for reporting/data mining
purposes. And that's where we would like to consider using Phoenix to enable
arbitrary ad-hoc SQL queries.
One analogy [~jrottinghuis] brings up regarding this is OLTP vs. OLAP.
That's why we also think it makes sense to do only "offline" (time-based)
aggregation for users and queues. At least in our case with hRaven, there
hasn't been a compelling reason to show user- or queue-aggregated data in
semi-real time. It has been perfectly adequate to show time-based aggregation,
as data like this tend to be used more for reporting and analysis.
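As a rough illustration of the "offline" roll-up, the sketch below groups application-level rows by flow, user, or queue and sums one metric. Everything here is hypothetical: the `AppStatsRollup` and `AppStats` names, the fields, and the single `cpuMillis` metric are made up for the example, and an actual implementation would run against the stored application data (for example via a Phoenix query or a batch job) rather than an in-memory list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical roll-up helper, not part of YARN or hRaven.
public class AppStatsRollup {
    // Illustrative application-level row; field names are assumptions.
    public static final class AppStats {
        public final String appId;
        public final String flow;
        public final String user;
        public final String queue;
        public final long cpuMillis;

        public AppStats(String appId, String flow, String user, String queue, long cpuMillis) {
            this.appId = appId;
            this.flow = flow;
            this.user = user;
            this.queue = queue;
            this.cpuMillis = cpuMillis;
        }
    }

    // Sum cpuMillis over all applications that share the same key
    // (flow, user, or queue, depending on the key function passed in).
    public static Map<String, Long> rollUp(List<AppStats> apps, Function<AppStats, String> key) {
        return apps.stream().collect(
            Collectors.groupingBy(key, TreeMap::new, Collectors.summingLong(a -> a.cpuMillis)));
    }

    public static void main(String[] args) {
        List<AppStats> apps = Arrays.asList(
            new AppStats("app_1", "flowA", "alice", "default", 100L),
            new AppStats("app_2", "flowA", "alice", "default", 200L),
            new AppStats("app_3", "flowB", "bob", "research", 50L));

        System.out.println(rollUp(apps, a -> a.flow));  // prints {flowA=300, flowB=50}
        System.out.println(rollUp(apps, a -> a.user));  // prints {alice=300, bob=50}
        System.out.println(rollUp(apps, a -> a.queue)); // prints {default=300, research=50}
    }
}
```

The same application-level rows feed all three roll-ups, which is why a batch over a time range is a natural fit for user and queue aggregation.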
> [Aggregation] Application/Flow/User/Queue Level Aggregations
> ------------------------------------------------------------
>
> Key: YARN-3815
> URL: https://issues.apache.org/jira/browse/YARN-3815
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: Timeline Service Nextgen Flow, User, Queue Level
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic
> scenario is that queries for stats can happen at the following levels:
> - Application level, expected return: an application with aggregated stats
> - Flow level, expected return: aggregated stats for a flow_run, flow_version
> and flow
> - User level, expected return: aggregated stats for applications submitted by
> the user
> - Queue level, expected return: aggregated stats for applications within the
> queue
> Application states are the basic building block for all other levels of
> aggregation. We can provide Flow/User/Queue level aggregated statistics
> based on application states (a dedicated table for application states is
> needed, which is missing from previous design documents such as the
> HBase/Phoenix schema design).