[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598577#comment-14598577
 ] 

Sangjin Lee commented on YARN-3815:
-----------------------------------

{quote}
About flow online aggregation, I am not quite sure on requirement yet. Do we 
really want real time for flow aggregated data or some fine-grained time 
interval (like 15 secs) should be good enough - if we want to show some nice 
metrics chart for flow, this should be fine.
{quote}

Yes, I agree with that. When I said "real time", I didn't mean real time in 
the sense that every metric is accurate to the second. Most likely the raw data 
themselves (e.g. container data) are written on an interval anyway, so some type 
of time interval for aggregation is implied.
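
To make the "implied interval" concrete, here is a minimal sketch of interval-based aggregation. It is illustrative only: the function name, the (flow, timestamp, value) sample layout, and the 15-second default are assumptions for this example, not part of the timeline service design or API.

```python
from collections import defaultdict


def aggregate_by_interval(samples, interval_secs=15):
    """Sum metric values per (flow, window start) using fixed-size windows.

    `samples` is an iterable of (flow, timestamp_secs, value) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(int)
    for flow, timestamp, value in samples:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % interval_secs)
        totals[(flow, window_start)] += value
    return dict(totals)


# Made-up samples: three for flow_a, one for flow_b.
samples = [
    ("flow_a", 0, 10),
    ("flow_a", 7, 5),
    ("flow_a", 16, 3),
    ("flow_b", 14, 2),
]
result = aggregate_by_interval(samples)
```

With these samples, the two flow_a readings at 0s and 7s fall into the same window, while the reading at 16s starts a new one. A reader of the aggregate sees values that are at most one interval stale, which is the sense of "real time" meant here.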

{quote}
Any special reason not to handle it in the same way above - as HBase 
coprocessor? It just sounds like a coarse-grained time interval. Isn't it?
{quote}

I do see your point in that what I called the "real time" aggregation can be 
considered the same type of aggregation as the "offline" aggregation, only on a 
shorter time interval. However, we also need to think about the use cases of 
such aggregated data.

The former type of aggregation is very much something that can be plugged into 
a UI such as the RM UI or Ambari to show more immediate data. These data may 
change as the user refreshes the UI, so this is closer to the raw data.

On the other hand, the latter type of aggregation lends itself to more 
analytical and ad-hoc use of the data. It can be used for calculating 
chargebacks, usage trending, reporting, etc. Perhaps it could even contain more 
detailed info than the "real time" aggregated data for reporting/data 
mining purposes. And that's where we would like to consider using Phoenix to 
enable arbitrary ad-hoc SQL queries.
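
As a rough illustration of the kind of ad-hoc query meant here, the sketch below runs a chargeback-style rollup over a time-based aggregation table. Note the assumptions: Python's built-in sqlite3 is used purely as a stand-in for Phoenix, and the table name, column names, and values are made up for the example; none of them come from the actual schema design.

```python
import sqlite3

# In-memory SQLite database standing in for a Phoenix-backed table.
conn = sqlite3.connect(":memory:")

# Hypothetical daily aggregation table; not the real timeline service schema.
conn.execute(
    "CREATE TABLE app_daily_usage ("
    "usage_day TEXT, user_name TEXT, queue_name TEXT, vcore_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO app_daily_usage VALUES (?, ?, ?, ?)",
    [
        ("2015-06-22", "alice", "prod", 1200),
        ("2015-06-22", "bob", "adhoc", 300),
        ("2015-06-23", "alice", "prod", 800),
        ("2015-06-23", "alice", "adhoc", 100),
    ],
)

# Ad-hoc query: total usage per user across all days and queues.
rows = conn.execute(
    "SELECT user_name, SUM(vcore_seconds) "
    "FROM app_daily_usage GROUP BY user_name ORDER BY user_name"
).fetchall()
conn.close()
```

The same table could just as easily be grouped by queue_name or usage_day, which is the appeal of exposing the offline aggregates through SQL rather than through fixed UI views.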

One analogy [~jrottinghuis] brings up regarding this is OLTP vs. OLAP.

That's why we also think it makes sense to do only "offline" (time-based) 
aggregation for users and queues. At least in our case with hRaven, there 
hasn't been a compelling reason to show user- or queue-aggregated data in 
semi-real time. It has been perfectly adequate to show time-based aggregation, 
as data like this tend to be used more for reporting and analysis.

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> ------------------------------------------------------------
>
>                 Key: YARN-3815
>                 URL: https://issues.apache.org/jira/browse/YARN-3815
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version 
> and flow 
> - User level, expect return: aggregated stats for applications submitted by 
> user
> - Queue level, expect return: aggregated stats for applications within the 
> Queue
> Application states is the basic building block for all other level 
> aggregations. We can provide Flow/User/Queue level aggregated statistics info 
> based on application states (a dedicated table for application states is 
> needed which is missing from previous design documents like HBase/Phoenix 
> schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
