Could you potentially store all of your facts in one table, join it against your dimension table, and filter as needed?

On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[email protected]> wrote:

> Unfortunately I can't think of any good way of doing this (other than what
> Bertrand suggested with using a different language to generate the script).
>
> I'd also recommend Hive... it may be easier to do this in Hive since you
> have SQL like syntax. (Haven't used Hive, but it looks like this type of
> thing would be far more natural in Hive)
>
> On Thu, Jul 18, 2013 at 12:09 PM, Something Something <[email protected]> wrote:
>
>> I don't think this is macro-able, Pradeep. Every step of the way a
>> different column gets updated. For example, for FACT_TABLE3 we update
>> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from DIMENSION2 &
>> so on.
>>
>> Feel free to correct me if I am wrong. Thanks.
>>
>> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <[email protected]> wrote:
>>
>>> Looks like this might be macroable. Not entirely sure how that can be done
>>> yet... but I'd look into that if I were you.
>>>
>>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <[email protected]> wrote:
>>>
>>>> Wow, Bertrand, on the Pig mailing list you're recommending not to use
>>>> Pig... LOL! Jokes apart, I would think this would be a common use case for
>>>> Pig, no? Generating a Pig script on the fly is a decent idea, but we're
>>>> hoping to avoid that - unless there's no other way. Thanks for the
>>>> pointers.
>>>>
>>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[email protected]> wrote:
>>>>
>>>>> I would say either generate the script using another language (eg Python)
>>>>> or use a true programming language with an API having the same level of
>>>>> abstraction (eg Java and Cascading).
>>>>>
>>>>> Bertrand
>>>>>
>>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <[email protected]> wrote:
>>>>>
>>>>>> There must be a better way to do this in Pig. Here's how my script looks
>>>>>> like right now: (omitted some snippet for saving space, but you will get
>>>>>> the idea).
>>>>>>
>>>>>> FACT_TABLE = LOAD 'XYZ' as (col1 :chararray,………. col30: chararray);
>>>>>>
>>>>>> FACT_TABLE1 = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as col2,….. udf10(col30) as col30;
>>>>>>
>>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
>>>>>>
>>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY key;
>>>>>>
>>>>>> FACT_TABLE3 = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as col1,……. FACT_TABLE1::col30 as col30;
>>>>>>
>>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
>>>>>>
>>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY key;
>>>>>>
>>>>>> FACT_TABLE5 = FOREACH FACT_TABLE4 GENERATE FACT_TABLE3::col1 as col1, DIMENSION2::value as col2,……. FACT_TABLE3::col30 as col30;
>>>>>>
>>>>>> & so on! There are 10 more such dimension tables to join.
>>>>>>
>>>>>> In short, each row on the fact table needs to be joined to a key field on a
>>>>>> dimension table to get it's associated value.
>>>>>>
>>>>>> This is beginning to look ugly. Plus it's maintenance nightmare when it
>>>>>> comes to adding new fields. What's the best way to code this in Pig?
>>>>>>
>>>>>> Thanks in advance.
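
For reference, here is a rough sketch of what Bertrand's suggestion in the quoted thread (generating the script from another language such as Python) might look like. Nothing here comes from the original script beyond the JOIN / FOREACH pattern: the function name generate_pig_joins, the alias names JOINEDn and FACT_DIMn, and the three-column example are made up for illustration, and it assumes every dimension table has the same (key, value) layout.

```python
def generate_pig_joins(columns, lookups, fact_alias="FACT_TABLE1"):
    """Build Pig Latin text for a chain of dimension lookups.

    columns    -- ordered list of column names in the fact relation
    lookups    -- ordered list of (column, dimension_path) pairs; each named
                  column is replaced by the matching dimension value
    fact_alias -- alias of the relation the chain starts from (assumed to
                  be defined earlier in the Pig script)
    """
    lines = []
    current = fact_alias
    for i, (col, path) in enumerate(lookups, start=1):
        if col not in columns:
            raise ValueError(f"unknown fact column: {col}")
        dim = f"DIMENSION{i}"      # hypothetical alias names
        joined = f"JOINED{i}"
        projected = f"FACT_DIM{i}"
        lines.append(f"{dim} = LOAD '{path}' AS (key, value);")
        lines.append(
            f"{joined} = JOIN {current} BY {col} LEFT OUTER, {dim} BY key;"
        )
        # Take the looked-up column from the dimension table and carry
        # every other column through unchanged from the previous relation.
        fields = [
            f"{dim}::value AS {c}" if c == col else f"{current}::{c} AS {c}"
            for c in columns
        ]
        lines.append(
            f"{projected} = FOREACH {joined} GENERATE " + ", ".join(fields) + ";"
        )
        current = projected
    return "\n".join(lines)


# Example with three columns and two dimension tables.
script = generate_pig_joins(
    ["col1", "col2", "col3"],
    [("col1", "DIM1"), ("col2", "DIM2")],
)
print(script)
```

With this approach, adding a field or another dimension table means editing the two lists rather than every FOREACH statement. The generated text would still need the LOAD and UDF steps from the original script placed in front of it.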
