Re: Getting dimension values for Facts

Something Something Fri, 19 Jul 2013 16:34:38 -0700

>> what is the purpose/deliverable of coalescing
Here's the use case.  Basically, Oracle can no longer handle the load!  The
aggregations are currently done using PL/SQL based stored procedures & as
the data size is increasing these are running slower & slower on Oracle.
 The same aggregations are running faster on Hadoop so we are thinking of
switching.  Now the challenge is to make the Pig script more maintainable.


>>Hive
Some of the Facts come from other teams.  We will have to first load their
data onto a Hive table, then use UDFs to perform transformation.
 Experimenting with Hive & Shark is on our plan, but the question is - is
there a better way to do this in Pig?




On Fri, Jul 19, 2013 at 7:33 AM, Dan DeCapria, CivicScience <
[email protected]> wrote:

> It seems like your fact table and its corresponding dimension tables follow
> a traditional data warehousing star topology relational diagram.  I would
> have to ask what is the purpose/deliverable of coalescing your star
> topology into a 1NF of fields in Pig?  In this case, I'd agree with
> Pradeep; that the use of MR for queries against the warehouse using Hive
> might be more appropriate.  If you really do need the coalesced 1NF of your
> warehouse, I'd recommend a parent bash/perl/php script to dynamically
> generate a pig script, with your table names and paths as parameter inputs
> to the script.
>
> -Dan
>
>
> On Fri, Jul 19, 2013 at 10:10 AM, Sajid Raza <[email protected]> wrote:
>
> > Could you potentially store all of your facts in one table, join it
> agains
> > your dimension table, and filter as needed?
> >
> > On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[email protected]>
> > wrote:
> >
> > > Unfortunately I can't think of any good way of doing this (other than
> > what
> > > Bertrand suggested with using a different language to generate the
> > script).
> > >
> > > I'd also recommend Hive... it may be easier to do this in Hive since
> you
> > > have SQL like syntax. (Haven't used Hive, but it looks like this type
> of
> > > thing would be far more natural in Hive)
> > >
> > >
> > > On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> > > [email protected]> wrote:
> > >
> > >> I don't think this is macro-able, Pradeep.  Every step of the way a
> > >> different column gets updated.  For example, for FACT_TABLE3 we update
> > >> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from
> > DIMENSION2 &
> > >> so on.
> > >>
> > >> Feel free to correct me if I am wrong.  Thanks.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <
> > [email protected]
> > >>> wrote:
> > >>
> > >>> Looks like this might be macroable. Not entirely sure how that can be
> > >> done
> > >>> yet... but I'd look into that if I were you.
> > >>>
> > >>>
> > >>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
> > >>> [email protected]> wrote:
> > >>>
> > >>>> Wow, Bertrand, on the Pig mailing list you're recommending not to
> use
> > >>>> Pig... LOL!  Jokes apart, I would think this would be a common use
> > case
> > >>> for
> > >>>> Pig, no?  Generating a Pig script on the fly is a decent idea, but
> > >> we're
> > >>>> hoping to avoid that - unless there's no other way.  Thanks for the
> > >>>> pointers.
> > >>>>
> > >>>>
> > >>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <
> [email protected]
> > >>>>> wrote:
> > >>>>
> > >>>>> I would say either generate the script using another language (eg
> > >>> Python)
> > >>>>> or use a true programming language with an API having the same
> level
> > >> of
> > >>>>> abstraction (eg Java and Cascading).
> > >>>>>
> > >>>>> Bertrand
> > >>>>>
> > >>>>>
> > >>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> > >>>>> [email protected]> wrote:
> > >>>>>
> > >>>>>> There must be a better way to do this in Pig.  Here's how my
> script
> > >>>> looks
> > >>>>>> like right now:  (omitted some snippet for saving space, but you
> > >> will
> > >>>> get
> > >>>>>> the idea).
> > >>>>>>
> > >>>>>> FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
> > >> chararray);
> > >>>>>>
> > >>>>>> FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
> > >>> col2,…..
> > >>>>>> udf10(col30) as col30;
> > >>>>>>
> > >>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
> > >>>>>>
> > >>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
> > >> key;
> > >>>>>>
> > >>>>>> FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
> > >>>> col1,…….
> > >>>>>> FACT_TABLE1::col30 as col30;
> > >>>>>>
> > >>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
> > >>>>>>
> > >>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
> > >> key;
> > >>>>>>
> > >>>>>> FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> > >>>>>> col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> > >>>>>>
> > >>>>>> & so on!  There are 10 more such dimension tables to join.
> > >>>>>>
> > >>>>>> In short, each row on the fact table needs to be joined to a key
> > >>> field
> > >>>>> on a
> > >>>>>> dimension table to get it's associated value.
> > >>>>>>
> > >>>>>> This is beginning to look ugly.  Plus it's maintenance nightmare
> > >> when
> > >>>> it
> > >>>>>> comes to adding new fields.  What's the best way to code this in
> > >> Pig?
> > >>>>>>
> > >>>>>> Thanks in advance.
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>

Re: Getting dimension values for Facts

Reply via email to