[ 
https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575358#comment-14575358
 ] 

Li Lu commented on YARN-2928:
-----------------------------

Hi [~jamestaylor]

Thank you very much for your suggestions and PHOENIX-2028! I wrote the 
experimental Phoenix writer code and have some follow-up questions about your 
comments. 

bq. The easiest is probably to create the HBase table the same way (through 
code or using the HBase shell) with the KeyPrefixRegionSplitPolicy specified at 
create time. Then, in Phoenix you can issue a CREATE TABLE statement against 
the existing HBase table and it'll just map to it. Then you'll have your split 
policy for your benchmark in both write paths.

If I understand this correctly, Phoenix will in this case inherit the pre-split 
settings from HBase? Will this alter the existing HBase table, including its 
schema and/or the data in it? In general, if one runs CREATE TABLE IF NOT EXISTS 
or simply CREATE TABLE over a pre-split existing HBase table, will Phoenix 
simply accept the existing table as-is? 

bq. An alternative to dynamic columns is to define views over your Phoenix 
table (http://phoenix.apache.org/views.html).

I looked at views earlier, but I'm not sure they fit our write path use case 
well. Let me briefly describe our use case in YARN first. In general, we would 
like to dynamically store the configuration and metrics for each YARN timeline 
entity in a Phoenix database, so that our timeline reader apps or users can use 
SQL to query historical data. Phoenix views may be a perfect solution for the 
reader use cases. However, we are hitting problems on the writer side. We store 
each configuration/metric key-value pair in a dynamic column. This causes two 
main problems. First, we need to use a dynamically generated SQL statement to 
write to the Phoenix table, which is cumbersome and error-prone. Second, when 
performing aggregations, we need to aggregate over all available metrics for an 
application (or a user or flow), but we cannot simply iterate over those dynamic 
columns because there is no API for that. I'm not sure how to resolve these two 
problems via Phoenix views, or via existing Phoenix APIs. Actually, I suspect 
that if it were possible to fall back to the HBase-style APIs, our writer path 
would be much simpler. 
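
To make the first problem concrete, here is a minimal sketch of the kind of 
statement generation involved. It is not our actual writer code: the class 
name, the table name ENTITY, the key column ENTITY_ID, and the choice of BIGINT 
for every metric are all assumptions made for illustration. It only builds the 
UPSERT text, declaring each metric as a dynamic column in the column list.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
class DynamicUpsertSketch {

    // Builds an UPSERT with one dynamic BIGINT column per metric name.
    // Values are left as '?' placeholders for a PreparedStatement.
    public static String buildUpsert(String table, String keyColumn,
                                     Map<String, Long> metrics) {
        StringJoiner columns = new StringJoiner(", ", "(", ")");
        StringJoiner params = new StringJoiner(", ", "(", ")");
        columns.add(keyColumn);
        params.add("?");
        for (String name : metrics.keySet()) {
            // Metric names come from user data, so refuse anything that
            // could break out of the double-quoted identifier.
            if (name.indexOf('"') >= 0) {
                throw new IllegalArgumentException("bad metric name: " + name);
            }
            columns.add("\"" + name + "\" BIGINT");
            params.add("?");
        }
        return "UPSERT INTO " + table + " " + columns + " VALUES " + params;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the column order matches
        // the order in which values would later be bound.
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("cpu", 42L);
        metrics.put("mem", 1024L);
        System.out.println(buildUpsert("ENTITY", "ENTITY_ID", metrics));
    }
}
```

Every entity with a different set of metrics yields a different statement, so 
statements cannot easily be prepared once and reused, and the set of column 
names has to be tracked on our side.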

bq. If you do end up going with a direct HBase write path, I'd encourage you to 
use the Phoenix serialization format (through PDataType and derived classes) to 
ensure you can do adhoc querying on the data.

We're currently looking into this approach for the aggregation part. We're doing 
our best to support SQL on the aggregated data by using Phoenix. One potential 
solution is to use HBase coprocessors to aggregate application data from the 
HBase storage, and then store the results in a Phoenix aggregation table. 
However, if we want to keep aggregating on the Phoenix table, can we also write 
an HBase coprocessor that reads the Phoenix PDataTypes and aggregates them into 
other Phoenix tables? If that's possible, are there any stable (or "safe") APIs 
for PDataTypes?

A slightly more general question: is SQL the _only_ API for Phoenix, or are 
there others? I ask because, from a YARN timeline service perspective, Phoenix 
is a nice tool through which we can easily offer SQL support to our end users, 
but we may not necessarily want to use SQL to program against it all the time. 

Thank you very much for your comments and help from the Phoenix side. Our 
current Phoenix writer is more of an experimental version, but we really hope 
to have something for our aggregators and readers in the near future. 


> YARN Timeline Service: Next generation
> --------------------------------------
>
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal 
> v1.pdf, Timeline Service Next Gen - Planning - ppt.pptx, 
> TimelineServiceStoragePerformanceTestSummaryYARN-2928.pdf
>
>
> We have the application timeline server implemented in yarn per YARN-1530 and 
> YARN-321. Although it is a great feature, we have recognized several critical 
> issues and features that need to be addressed.
> This JIRA proposes the design and implementation changes to address those. 
> This is phase 1 of this effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
