Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.007.patch

Thanks [~zjshen] and [~sjlee0]! I've uploaded the v007 patch to address your 
comments. Specifically:

bq. However, it's better to have config such as "blah.blah.backend.type". When 
backend.type = hbase, we user can access HBase both directly and via Phoenix, 
and we allow aggregation. This may not need to part of this jira, but just 
think it out loudly.
Yes, this proposal makes sense. It would actually simplify deployment 
significantly, since users would no longer need to know the exact class name of 
the backend. However, if we decide to move in this direction, there is some 
foreseeable nontrivial work, such as changing class loading strategies. To keep 
the current workload separate, I propose to address this issue in a separate 
JIRA (we may not address it immediately).
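To make the idea concrete, a configuration along these lines could select the backend by type rather than by class name. This is only a sketch; the property name and values below are made up for illustration and are not actual YARN configuration keys:

```xml
<!-- Hypothetical property name, for illustration only. -->
<property>
  <name>yarn.timeline-service.writer.backend.type</name>
  <value>hbase</value>
  <description>Selects the timeline storage backend by type (e.g. hbase),
  instead of requiring the fully qualified class name of the writer.
  When the type is hbase, HBase could be accessed both directly and via
  Phoenix, and aggregation could be enabled.</description>
</property>
```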

bq. Make sense, but can we still make table creation centralized? I think we 
can make some option to create raw entity tables and aggregation tables 
bq. As for createTables(), I'm also of the opinion that it might be better if 
we moved it to a dedicated creator class.
I agree it is appealing to centralize table creation. After giving it some 
thought, I think what we really want is a centralized _workflow_ for storage 
schema creation. That is to say, when setting up a v2 timeline server, users 
can simply run the data schema creator once to create all necessary data 
storage schemas. With this in mind, I added Phoenix schema creation to the 
existing data schema creator, behind a separate {{-p}} option. However, I'm 
keeping the SQL statements for table creation inside the writer file so that we 
also have a centralized place for the Phoenix storage schema. 

bq. Actually I'd like to ask whether this needs to be a service. Note that it 
is possible (or likely) that the writer will be executed in a mapreduce task.
We implement offline writers as {{AbstractService}}s to reuse the existing 
logic for service initialization, start, and stop. This pattern matches our 
offline writer use cases nicely. I admit it sounds a little awkward to call 
something running inside a mapreduce job a "service". However, the Hadoop 
{{Service}} is just a lightweight package for service lifecycle management; it 
is not strongly tied to server-side or non-application use cases. Therefore I 
modified the writer into an {{AbstractService}} per [~zjshen]'s suggestion. 
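To illustrate the lifecycle reuse, here is a minimal sketch of a writer extending Hadoop's {{AbstractService}}. The {{serviceInit}}/{{serviceStart}}/{{serviceStop}} hooks are the real Hadoop API; the class name and the comments describing what each hook might do are hypothetical and do not reflect the actual patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

// Hypothetical offline aggregation writer reusing the Hadoop service
// lifecycle. Only the AbstractService hooks below are real API; the
// class itself is an illustrative stand-in for the patch's writer.
public class OfflineAggregationWriterSketch extends AbstractService {

  public OfflineAggregationWriterSketch() {
    super(OfflineAggregationWriterSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // e.g. read storage connection settings from the configuration
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // e.g. open the connection to the backend storage
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // e.g. close statements and the storage connection
    super.serviceStop();
  }
}
```

Even inside a mapreduce task, a caller would just drive the standard lifecycle: {{init(conf)}}, {{start()}}, do the writes, then {{stop()}} -- no server context is required.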

bq. For the user aggregation tables, I believe the cluster needs to be included 
in the row key.
Yes. Fixed. 

bq. l.156: My JDBC knowledge is bit outdated, but do you want to prepare the 
statement every time write is done? Don't you want to prepare it once and reuse 
it? That optimization will follow later?
Nice catch. We can definitely reuse this PreparedStatement (as well as the 
connections) once we integrate the aggregation writer with the aggregator. My 
plan is to use this (relatively) stable writer to unblock the future patch on 
flow- and user-level offline aggregation. Once we have the whole workflow, we 
can gradually add optimizations. Thoughts? 
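For reference, the prepare-once, reuse-many pattern being discussed would look roughly like the sketch below. The table, column names, and the UPSERT statement are placeholders for illustration, not the actual Phoenix schema in the patch:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sketch: prepare the statement once per connection and
// reuse it for every row, instead of re-preparing on each write.
public class PreparedStatementReuse {
  // Phoenix uses UPSERT; the exact statement here is illustrative.
  static final String UPSERT_SQL =
      "UPSERT INTO FLOW_AGGREGATION (CLUSTER, USER_ID, FLOW, METRIC) "
      + "VALUES (?, ?, ?, ?)";

  static void writeRows(Connection conn, List<String[]> rows)
      throws SQLException {
    // One prepare for the whole batch of writes.
    try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
      for (String[] row : rows) {
        for (int i = 0; i < row.length; i++) {
          ps.setString(i + 1, row[i]);  // JDBC parameters are 1-indexed
        }
        ps.executeUpdate();
      }
      conn.commit();  // Phoenix buffers upserts until commit
    }
  }
}
```

Holding the PreparedStatement (and the connection) for the lifetime of the writer, rather than per write call, is the optimization that could follow once the aggregator integration lands.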

bq. I would enforce the notion that this is a read-only object by making the 
members final
Yes. Fixed. 
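For clarity, the read-only pattern being applied is simply: all members final, set once in the constructor, exposed only through getters. The class and field names below are illustrative, not the actual object in the patch:

```java
// Hypothetical read-only context object: final members guarantee the
// state cannot change after construction.
public final class AggregationContext {
  private final String clusterId;
  private final String userId;

  public AggregationContext(String clusterId, String userId) {
    this.clusterId = clusterId;
    this.userId = userId;
  }

  public String getClusterId() { return clusterId; }

  public String getUserId() { return userId; }
}
```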

bq. Should the primary key user and then cluster, or cluster and user? I think 
it might be better if it is cluster and user although it is different than the 
entity table. Vrushali C?
I'm OK with either. Any suggestions, [~vrushalic]? In any case, we can decide 
this similarly to TimelineEvents in the HBase storage, so I don't think it 
blocks this JIRA? 

> Refactor timelineservice.storage to add support to online and offline 
> aggregation writers
> -----------------------------------------------------------------------------------------
>                 Key: YARN-3904
>                 URL: https://issues.apache.org/jira/browse/YARN-3904
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Li Lu
>            Assignee: Li Lu
>         Attachments: YARN-3904-YARN-2928.001.patch, 
> YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, 
> YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch, 
> YARN-3904-YARN-2928.006.patch, YARN-3904-YARN-2928.007.patch
> After we finished the design for time-based aggregation, we can adapt our 
> existing Phoenix storage to store the aggregated data. In this 
> JIRA, I'm proposing to refactor writers to add support for aggregation 
> writers. Offline aggregation writers typically have less contextual 
> information. We can distinguish these writers by special naming. We can also 
> use CollectorContexts to model all contextual information and use it in our 
> writer interfaces. 

This message was sent by Atlassian JIRA