Re: About Multiple Join in Pig

Daniel Dai Tue, 01 Nov 2016 21:21:01 -0700

Yes, you need to dump/store xxx_OrderRes to kick off the job. You will see two 
MapReduce jobs corresponding to the first and second join.


Thanks,
Daniel



On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:

>Dear Dai,
>
>Thanks for your reply.
>What I want to do is compare two different join orders. The queries
>are as follows:
>
>Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>catalog_returns BY (cr_item_sk, cr_order_number);
>Dump or Store Bad_OrderRes;
>
>Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>catalog_sales BY (cs_item_sk, cs_order_number);
>Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY
>inv_item_sk;
>Dump or Store Good_OrderRes;
>
>Since Pig executes the query lazily, I think I can only learn the time of the
>MapReduce job by dumping or storing the result. Is that right? If so, then I
>need to count the time to Dump or Store the result as the time for
>each join order.
>
>Bests,
>Mingda
>
>
>
>On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <da...@hortonworks.com> wrote:
>
>> Hi, Mingda,
>>
>> Pig does not do join reordering and will execute the query the way it
>> is written. Note that you can join multiple relations in one join statement.
>>
>> Do you want the execution time for each join in your statement? Assuming you
>> are using a regular join and running on MapReduce, every join statement
>> will be a separate MapReduce job, and the join runtime is the runtime of
>> its MapReduce job.
>>
>> Thanks,
>> Daniel
>>
>>
>>
>> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com> wrote:
>>
>> >Dear all,
>> >
>> >I am working on optimizing multiple joins. I am not sure whether Pig can decide
>> >the join order in its optimization layer. Does anyone know about this? Or does
>> >Pig just execute the query the way it is written?
>> >
>> >Also, I want to do a multi-way join on different keys. Can the
>> >following query work?
>> >
>> >Res =
>> >JOIN
>> >(JOIN catalog_sales BY cs_item_sk, inventory BY  inv_item_sk) BY
>> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>> >cr_order_number);
>> >
>> >BTW, each time I run the query, it finishes in one second. Is there a
>> >way to see the execution time? I have set pig.udf.profile=true. Where
>> >can I find the time?
>> >
>> >Bests,
>> >Mingda
>>
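
A minimal sketch of the approach discussed in this thread, for readers who want to time one of the join orders themselves. The LOAD paths, the '|' delimiter, the output path, and the trimmed schemas are assumptions for illustration and do not come from the original messages; only the two JOIN statements are taken from the thread. The STORE at the end is what triggers execution, and with regular joins on MapReduce each JOIN should show up as its own job.

```pig
-- Hypothetical input paths and simplified schemas (assumptions, not from the thread).
-- Assumes each input file holds only the columns declared here.
inventory       = LOAD 'input/inventory' USING PigStorage('|')
                  AS (inv_item_sk:long, inv_quantity_on_hand:int);
catalog_sales   = LOAD 'input/catalog_sales' USING PigStorage('|')
                  AS (cs_item_sk:long, cs_order_number:long);
catalog_returns = LOAD 'input/catalog_returns' USING PigStorage('|')
                  AS (cr_item_sk:long, cr_order_number:long);

-- First join: inventory with catalog_sales on the item key.
Bad_OrderIn  = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;

-- Second join: the intermediate result with catalog_returns on (item, order).
Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
               catalog_returns BY (cr_item_sk, cr_order_number);

-- Nothing runs until this point; STORE kicks off the MapReduce jobs.
-- The output path is hypothetical and must not already exist.
STORE Bad_OrderRes INTO 'output/bad_order';
```

Running the other order (Good_OrderIn / Good_OrderRes) as a separate script with its own STORE keeps the two measurements apart. The per-job runtimes can then be read from the job statistics Pig prints when the script finishes.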
