FWIW, if you develop any standard SQL built-ins that are missing, we're happy to pull these into Phoenix. Thanks, James
On Tuesday, February 3, 2015, Jaime Solano <[email protected]> wrote:

Thanks, guys, for your insights. We'll try to stay away from custom functions for now and use plain SQL to do the transformation. For custom functionality, we'll go with Pig.

On Jan 28, 2015 1:34 PM, "Ravi Kiran" <[email protected]> wrote:

I second James's opinion of going with Pig. You can use the PhoenixHBaseLoader to load the data, apply the various transformations with custom Pig UDFs, and then finally upsert the tuples back to Phoenix tables using our PhoenixHBaseStorage.

My two cents :)

Regards,
Ravi

On Wed, Jan 28, 2015 at 9:35 AM, James Taylor <[email protected]> wrote:

If you could share some examples of the "calculated values" case, that'd be helpful. When you say you need to "transform the whole data set", I'm not sure I understand. If you find yourself having to write 50 built-in functions, each with specific business logic, I'd definitely gravitate toward using our Pig integration.

Thanks,
James

On Wed, Jan 28, 2015 at 6:51 AM, Jaime Solano <[email protected]> wrote:

Hi James,

First of all, thank you for your quick response. Let me give you more details of our scenario:

We want to load daily bulks of data into a Phoenix table. This table contains normalized data, with around 60 columns. Then we need to enrich (transform) the data. Enriching means adding more columns, of basically two types:
- Values obtained from joins with other tables (reference data). This covers a small part of the process.
- Calculated values, with particular business logic that is hard to implement in SQL. This is the case where we're thinking of building our own functions. However, I'm concerned about this approach, since (1) we need to transform the whole data set and (2) we might end up creating a function per transformed/added column (we expect around 50 additional columns to be added after the enrichment process).

Thank you for your time, and I'd appreciate your thoughts about this.

-Jaime

On Jan 27, 2015 11:51 PM, "James Taylor" <[email protected]> wrote:

Hi Jaime,

Would it be possible to see a few examples of the kind of transformations you're doing? I think the tool you use depends on whether you're transforming all of your data or a smaller subset. It also depends on the complexity of the transformation. If you're transforming every row of data in your HBase table to create a new row in a different HBase table, Phoenix is not going to be an ideal choice. Also, if you're transforming data such that you need a new built-in function for each kind of transformation, Phoenix would not be the right choice.

Have you seen our map-reduce [1] and Pig integration [2] support? Pig is very good at ETL. You may be able to leverage Pig to do the transformation such that the resulting table is queryable through Phoenix as well.

HTH. Thanks,
James

[1] http://phoenix.apache.org/phoenix_mr.html
[2] http://phoenix.apache.org/pig_integration.html
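[A minimal sketch of the load / transform / store flow Ravi and James outline above, following the Pig integration page linked at [2]. The table names (DETAILS, DETAILS_ENRICHED), the UDF (myudfs.Enrich), the jar, the columns, and the ZooKeeper quorum are placeholders, and the exact location-string and option syntax should be verified against the docs for your Phoenix version.]

    REGISTER my-enrichment-udfs.jar;  -- placeholder jar holding the custom business-logic UDFs

    -- Load the normalized rows from the Phoenix DETAILS table.
    raw = LOAD 'hbase://table/DETAILS'
          USING org.apache.phoenix.pig.PhoenixHBaseLoader('zk-host:2181');

    -- Apply the custom business logic in a Pig UDF instead of a Phoenix built-in function.
    enriched = FOREACH raw GENERATE *, myudfs.Enrich(COL_A, COL_B) AS RISK_SCORE;

    -- Upsert the enriched tuples back into a Phoenix table with matching columns.
    STORE enriched INTO 'hbase://DETAILS_ENRICHED'
          USING org.apache.phoenix.pig.PhoenixHBaseStorage('zk-host:2181', '-batchSize 1000');

This keeps the per-column business logic in ordinary Pig UDFs while the resulting table remains queryable through Phoenix.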
On Tue, Jan 27, 2015 at 11:50 AM, Jaime Solano <[email protected]> wrote:

Hi guys,

The company I work for wants to use Phoenix for data transformation. Basically, the idea is to denormalize a details table and add calculated data to it using UPSERT SELECT statements (joins with other tables and specific functions). This has proven challenging, since plain SQL is sometimes not enough, which has led us to try implementing our own built-in Phoenix functions (following this post: http://phoenix-hbase.blogspot.com/2013/04/how-to-add-your-own-built-in-function.html).

I feel this is not the right direction, and maybe we should be using other tools like Pig, MR or Storm (for near-real-time).

What are your thoughts about this? Would you recommend Phoenix for complex data transformation? What are the drawbacks you see in this approach?

Thanks in advance,
-Jaime
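[For the join-based "reference data" half of the enrichment Jaime describes, a plain UPSERT SELECT is usually enough. Below is a minimal sketch with hypothetical table and column names (DETAILS, REGION_REF, REGION_CODE, REGION_NAME), not a tested statement.]

    -- Enrich DETAILS with a column looked up from a reference table (all names hypothetical).
    UPSERT INTO DETAILS (ID, REGION_NAME)
    SELECT d.ID, r.REGION_NAME
    FROM DETAILS d
    JOIN REGION_REF r ON d.REGION_CODE = r.REGION_CODE;

The calculated-value half, where each of the ~50 added columns would otherwise need its own built-in function, is the part the replies above steer toward Pig UDFs instead.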
