>> what is the purpose/deliverable of coalescing Here's the use case. Basically, Oracle can no longer handle the load! The aggregations are currently done using PL/SQL based stored procedures & as the data size is increasing these are running slower & slower on Oracle. The same aggregations are running faster on Hadoop so we are thinking of switching. Now the challenge is to make the Pig script more maintainable.
>>Hive Some of the Facts come from other teams. We will have to first load their data onto a Hive table, then use UDFs to perform transformation. Experimenting with Hive & Shark is on our plan, but the question is - is there a better way to do this in Pig? On Fri, Jul 19, 2013 at 7:33 AM, Dan DeCapria, CivicScience < [email protected]> wrote: > It seems like your fact table and its corresponding dimension tables follow > a traditional data warehousing star topology relational diagram. I would > have to ask what is the purpose/deliverable of coalescing your star > topology into a 1NF of fields in Pig? In this case, I'd agree with > Pradeep; that the use of MR for queries against the warehouse using Hive > might be more appropriate. If you really do need the coalesced 1NF of your > warehouse, I'd recommend a parent bash/perl/php script to dynamically > generate a pig script, with your table names and paths as parameter inputs > to the script. > > -Dan > > > On Fri, Jul 19, 2013 at 10:10 AM, Sajid Raza <[email protected]> wrote: > > > Could you potentially store all of your facts in one table, join it > agains > > your dimension table, and filter as needed? > > > > On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[email protected]> > > wrote: > > > > > Unfortunately I can't think of any good way of doing this (other than > > what > > > Bertrand suggested with using a different language to generate the > > script). > > > > > > I'd also recommend Hive... it may be easier to do this in Hive since > you > > > have SQL like syntax. (Haven't used Hive, but it looks like this type > of > > > thing would be far more natural in Hive) > > > > > > > > > On Thu, Jul 18, 2013 at 12:09 PM, Something Something < > > > [email protected]> wrote: > > > > > >> I don't think this is macro-able, Pradeep. Every step of the way a > > >> different column gets updated. For example, for FACT_TABLE3 we update > > >> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from > > DIMENSION2 & > > >> so on. > > >> > > >> Feel free to correct me if I am wrong. Thanks. > > >> > > >> > > >> > > >> > > >> > > >> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota < > > [email protected] > > >>> wrote: > > >> > > >>> Looks like this might be macroable. Not entirely sure how that can be > > >> done > > >>> yet... but I'd look into that if I were you. > > >>> > > >>> > > >>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something < > > >>> [email protected]> wrote: > > >>> > > >>>> Wow, Bertrand, on the Pig mailing list you're recommending not to > use > > >>>> Pig... LOL! Jokes apart, I would think this would be a common use > > case > > >>> for > > >>>> Pig, no? Generating a Pig script on the fly is a decent idea, but > > >> we're > > >>>> hoping to avoid that - unless there's no other way. Thanks for the > > >>>> pointers. > > >>>> > > >>>> > > >>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux < > [email protected] > > >>>>> wrote: > > >>>> > > >>>>> I would say either generate the script using another language (eg > > >>> Python) > > >>>>> or use a true programming language with an API having the same > level > > >> of > > >>>>> abstraction (eg Java and Cascading). > > >>>>> > > >>>>> Bertrand > > >>>>> > > >>>>> > > >>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something < > > >>>>> [email protected]> wrote: > > >>>>> > > >>>>>> There must be a better way to do this in Pig. Here's how my > script > > >>>> looks > > >>>>>> like right now: (omitted some snippet for saving space, but you > > >> will > > >>>> get > > >>>>>> the idea). > > >>>>>> > > >>>>>> FACT_TABLE = LOAD 'XYZ' as (col1 :chararray,………. col30: > > >> chararray); > > >>>>>> > > >>>>>> FACT_TABLE1 = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as > > >>> col2,….. > > >>>>>> udf10(col30) as col30; > > >>>>>> > > >>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value); > > >>>>>> > > >>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY > > >> key; > > >>>>>> > > >>>>>> FACT_TABLE3 = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as > > >>>> col1,……. > > >>>>>> FACT_TABLE1::col30 as col30; > > >>>>>> > > >>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value); > > >>>>>> > > >>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY > > >> key; > > >>>>>> > > >>>>>> FACT_TABLE5 = FOREACH FACT_TABLE4 GENERATE FACT_TABLE3::col1 as > > >>>>>> col1, DIMENSION2::value as col2,……. FACT_TABLE3::col30 as col30; > > >>>>>> > > >>>>>> & so on! There are 10 more such dimension tables to join. > > >>>>>> > > >>>>>> In short, each row on the fact table needs to be joined to a key > > >>> field > > >>>>> on a > > >>>>>> dimension table to get it's associated value. > > >>>>>> > > >>>>>> This is beginning to look ugly. Plus it's maintenance nightmare > > >> when > > >>>> it > > >>>>>> comes to adding new fields. What's the best way to code this in > > >> Pig? > > >>>>>> > > >>>>>> Thanks in advance. > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > >
