Could you potentially store all of your facts in one table, join it agains your 
dimension table, and filter as needed?

On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[email protected]> wrote:

> Unfortunately I can't think of any good way of doing this (other than what
> Bertrand suggested with using a different language to generate the script).
> 
> I'd also recommend Hive... it may be easier to do this in Hive since you
> have SQL like syntax. (Haven't used Hive, but it looks like this type of
> thing would be far more natural in Hive)
> 
> 
> On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> [email protected]> wrote:
> 
>> I don't think this is macro-able, Pradeep.  Every step of the way a
>> different column gets updated.  For example, for FACT_TABLE3 we update
>> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from DIMENSION2 &
>> so on.
>> 
>> Feel free to correct me if I am wrong.  Thanks.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <[email protected]
>>> wrote:
>> 
>>> Looks like this might be macroable. Not entirely sure how that can be
>> done
>>> yet... but I'd look into that if I were you.
>>> 
>>> 
>>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
>>> [email protected]> wrote:
>>> 
>>>> Wow, Bertrand, on the Pig mailing list you're recommending not to use
>>>> Pig... LOL!  Jokes apart, I would think this would be a common use case
>>> for
>>>> Pig, no?  Generating a Pig script on the fly is a decent idea, but
>> we're
>>>> hoping to avoid that - unless there's no other way.  Thanks for the
>>>> pointers.
>>>> 
>>>> 
>>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[email protected]
>>>>> wrote:
>>>> 
>>>>> I would say either generate the script using another language (eg
>>> Python)
>>>>> or use a true programming language with an API having the same level
>> of
>>>>> abstraction (eg Java and Cascading).
>>>>> 
>>>>> Bertrand
>>>>> 
>>>>> 
>>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> There must be a better way to do this in Pig.  Here's how my script
>>>> looks
>>>>>> like right now:  (omitted some snippet for saving space, but you
>> will
>>>> get
>>>>>> the idea).
>>>>>> 
>>>>>> FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
>> chararray);
>>>>>> 
>>>>>> FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
>>> col2,…..
>>>>>> udf10(col30) as col30;
>>>>>> 
>>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
>>>>>> 
>>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
>> key;
>>>>>> 
>>>>>> FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
>>>> col1,…….
>>>>>> FACT_TABLE1::col30 as col30;
>>>>>> 
>>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
>>>>>> 
>>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
>> key;
>>>>>> 
>>>>>> FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
>>>>>> col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
>>>>>> 
>>>>>> & so on!  There are 10 more such dimension tables to join.
>>>>>> 
>>>>>> In short, each row on the fact table needs to be joined to a key
>>> field
>>>>> on a
>>>>>> dimension table to get it's associated value.
>>>>>> 
>>>>>> This is beginning to look ugly.  Plus it's maintenance nightmare
>> when
>>>> it
>>>>>> comes to adding new fields.  What's the best way to code this in
>> Pig?
>>>>>> 
>>>>>> Thanks in advance.
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 

Reply via email to