Junping Du commented on YARN-3411:

Thanks [~jrottinghuis] for the comments!
bq. I think it is reasonable that two implementations can differ in their 
backing schema as long as they both can write the data and retrieve the data 
with the same key information. Phoenix may need to add some things to the rowkey 
in order to work properly, it may have to add some things, and ditto for the 
raw HBase implementation, some additional secondary lookups may be needed etc. 
That is part of the performance comparison to see.
I agree with this overall: in the long term, each implementation should be 
optimized as far as it can be, which may lead to different schema design 
strategies. But for the short-term performance test plan, the ideal case is 
that the two implementations' schemas are as close as possible, so that the 
results are free of noise caused by schema differences. However, I agree that 
the current schema differences shouldn't imply much of a performance 
difference. It should be fine if we can tolerate this noise.

bq.  Junping Du with respect to adding the flow version in the key, I think the 
problem with that is that you now require the caller to know what the version 
is in order to query back. I don't think that is a natural requirement. I know 
that I ran the "ComputeUniqueUsers" flow on the cluster, so I have user cluster 
and flowname, but I don't need to know the version to just query the last few 
runs right? If you do have the version (for reducer estimation and you want the 
last runs of the same flow back) then it should be possible to query by flow 
and by version, but I don't think it should be mandatory. Therefore I don't 
think that flow version must per se be a rowkey in all implementations.
Thanks for sharing the use case here. The case for querying by cluster and 
flow name is pretty solid. However, querying by a specific flow version is 
also reasonable, since a user may be interested in understanding how different 
versions of a flow differ in execution time, resource consumption, task 
failures/exceptions, etc. In addition, I have a quick question here: does 
making flow_version a key mean it is mandatory in a query? We have flow_run as 
a key, but we still treat it as optional in a query (or search), don't we?
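
To make the question above concrete, here is a minimal sketch of why a trailing 
rowkey component need not be supplied by the caller. It is not the schema from 
the patch and does not use the HBase or Phoenix API: it models a 
lexicographically sorted key store in plain Python. The separator, the key 
order (user, cluster, flow, flow_version, flow_run), and the sample values are 
all assumptions made for illustration only.

```python
from bisect import bisect_left

# Hypothetical separator; assumed never to appear inside a key component.
SEP = "!"


def row_key(*parts):
    """Join key components into a single sortable row key string."""
    return SEP.join(parts)


def prefix_scan(sorted_keys, *prefix_parts):
    """Return all keys whose leading components equal prefix_parts.

    sorted_keys must be sorted lexicographically, which is how a sorted
    key-value store would hand rows back to a scanner.
    """
    prefix = row_key(*prefix_parts) + SEP
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


# Made-up sample rows: (user, cluster, flow, flow_version, flow_run).
keys = sorted([
    row_key("user1", "cluster1", "ComputeUniqueUsers", "v1", "run_0001"),
    row_key("user1", "cluster1", "ComputeUniqueUsers", "v1", "run_0002"),
    row_key("user1", "cluster1", "ComputeUniqueUsers", "v2", "run_0003"),
    row_key("user1", "cluster1", "OtherFlow", "v1", "run_0001"),
])

# The caller knows only user, cluster and flow name: every version matches.
all_versions = prefix_scan(keys, "user1", "cluster1", "ComputeUniqueUsers")

# The caller also knows the version: the scan narrows to that version.
only_v2 = prefix_scan(keys, "user1", "cluster1", "ComputeUniqueUsers", "v2")
```

In this model the first scan returns the runs grouped by version, because the 
version sorts ahead of the run in the key. Whether that ordering is acceptable 
for the "last few runs" use case is part of what the key-versus-column 
discussion would need to settle.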

bq. I think we'll find that with certain schema choices some things will be 
more performant while others will be somewhat slower.
That's true. We should keep the schema design discussion open even after the 
patch here gets in.

I would give the existing patch a +1 with:
1. open a JIRA to discuss whether flow version should be a key or a column, 
then either Phoenix or HBase implementation could update with conclusion we 
2. open a cleanup JIRA to address the minor issues mentioned above.

> [Storage implementation] explore the native HBase write schema for storage
> --------------------------------------------------------------------------
>                 Key: YARN-3411
>                 URL: https://issues.apache.org/jira/browse/YARN-3411
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Vrushali C
>            Priority: Critical
>         Attachments: ATSv2BackendHBaseSchemaproposal.pdf, 
> YARN-3411-YARN-2928.001.patch, YARN-3411-YARN-2928.002.patch, 
> YARN-3411-YARN-2928.003.patch, YARN-3411-YARN-2928.004.patch, 
> YARN-3411-YARN-2928.005.patch, YARN-3411-YARN-2928.006.patch, 
> YARN-3411-YARN-2928.007.patch, YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, 
> YARN-3411.poc.4.txt, YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, 
> YARN-3411.poc.7.txt, YARN-3411.poc.txt
> There is work that's in progress to implement the storage based on a Phoenix 
> schema (YARN-3134).
> In parallel, we would like to explore an implementation based on a native 
> HBase schema for the write path. Such a schema does not exclude using 
> Phoenix, especially for reads and offline queries.
> Once we have basic implementations of both options, we could evaluate them in 
> terms of performance, scalability, usability, etc. and make a call.
