Re: About Multiple Join in Pig

mingda li Wed, 02 Nov 2016 21:42:38 -0700

My steps are as follows.

I start Pig with:

pig -Dpig.additional.jars=/home/hadoop-user/pig-branch-0.lib/datafu-pig-incubating-1.3.1.jar

Then I enter:


REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar

data = LOAD 'hdfs://SCAI01.CS.UCLA.EDU:9000/clash/datasets/1.txt' USING PigStorage() AS (val:int);

DEFINE MurmurH32 datafu.pig.hash.Hasher('murmur3-32');

dat = FOREACH data GENERATE MurmurH32(val);
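
As a side note, once the UDF resolves, its output could be cross-checked
against an independent implementation. Below is a minimal pure-Python sketch
of the standard 32-bit MurmurHash3 (the x86_32 variant). The function name
murmur3_32, the seed of 0, and the 4-byte little-endian encoding of val are
my assumptions; nothing in this thread confirms how datafu.pig.hash.Hasher
encodes its input or which seed it uses, so the values may not match the UDF
byte for byte.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit MurmurHash3 (x86_32 variant) of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        # Rotate a 32-bit value left by r bits.
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


if __name__ == "__main__":
    # Assumed encoding: val as a signed 4-byte little-endian integer.
    val = 42
    print(hex(murmur3_32(val.to_bytes(4, "little", signed=True))))
```

If the UDF hashes the string form of val instead, the same function can be
called as murmur3_32(str(val).encode("utf-8")) for comparison.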

On Wed, Nov 2, 2016 at 9:35 PM, mingda li <limingda1...@gmail.com> wrote:

> Thanks, Debabrata, but I forgot to mention that I already register the
> jar each time before I run the commands. I use:
> REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
> But that does not help.
>
> Any other reason?
>
> Thanks
>
> On Wed, Nov 2, 2016 at 8:03 PM, Debabrata Pani <android.p...@gmail.com>
> wrote:
>
>> It says that pig could not find the class Hasher. Start grunt with
>> -Dpig.additional.jars (before other pig arguments) or do a "register" of
>> individual jars before typing in your scripts.
>>
>> Regards,
>> Debabrata
>>
>> On Nov 3, 2016 07:09, "mingda li" <limingda1...@gmail.com> wrote:
>>
>> > Thanks. I installed DataFu and finished the quick start successfully:
>> > http://datafu.incubator.apache.org/docs/quick-start.html
>> >
>> > But when I use the murmur hash, it fails, and I do not know why.
>> >
>> > grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
>> >
>> > grunt> data_out = FOREACH data GENERATE val;
>> >
>> > grunt> dat= FOREACH data GENERATE MurmurH32(val);
>> >
>> > 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> > ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [,
>> > java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
>> >
>> > Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
>> >
>> >
>> > The log file is attached.
>> >
>> >
>> > Bests,
>> >
>> > Mingda
>> >
>> >
>> > On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <da...@hortonworks.com>
>> wrote:
>> >
>> >> I see datafu has a patch for the UDF:
>> >> https://issues.apache.org/jira/browse/DATAFU-47
>> >>
>> >>
>> >>
>> >>
>> >> On 11/2/16, 11:45 AM, "mingda li" <limingda1...@gmail.com> wrote:
>> >>
>> >> >Dear all,
>> >> >
>> >> >Hi, I want to import a UDF into Pig. Has anyone done this before? I
>> >> >want to use Google Guava's murmur3_32 in Pig. Could anyone suggest
>> >> >some useful materials or advice?
>> >> >
>> >> >Bests,
>> >> >Mingda
>> >> >
>> >> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <limingda1...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Yeah, I see. Thanks for your reply.
>> >> >>
>> >> >> Bests,
>> >> >> Mingda
>> >> >>
>> >> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <da...@hortonworks.com>
>> >> wrote:
>> >> >>
>> >> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You
>> >> >>> will see two MapReduce jobs corresponding to the first and second
>> >> >>> join.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Daniel
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:
>> >> >>>
>> >> >>> >Dear Dai,
>> >> >>> >
>> >> >>> >Thanks for your reply.
>> >> >>> >What I want to do is compare two different join orders. The
>> >> >>> >queries are as follows:
>> >> >>> >
>> >> >>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>> >> >>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);
>> >> >>> >Dump or Store Bad_OrderRes;
>> >> >>> >
>> >> >>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number), catalog_sales BY (cs_item_sk, cs_order_number);
>> >> >>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>> >> >>> >Dump or Store Good_OrderRes;
>> >> >>> >
>> >> >>> >Since Pig executes queries lazily, I think I can only learn the
>> >> >>> >MapReduce job time by dumping or storing the result. Is that right?
>> >> >>> >If so, I need to count the time to dump or store the result as
>> >> >>> >the time for each join order.
>> >> >>> >
>> >> >>> >Bests,
>> >> >>> >Mingda
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <
>> da...@hortonworks.com>
>> >> >>> wrote:
>> >> >>> >
>> >> >>> >> Hi, Mingda,
>> >> >>> >>
>> >> >>> >> Pig does not do join reordering and will execute the query the
>> >> >>> >> way it is written. Note that you can join multiple relations in
>> >> >>> >> one join statement.
>> >> >>> >>
>> >> >>> >> Do you want the execution time for each join in your statement? I
>> >> >>> >> assume you are using a regular join and running with MapReduce.
>> >> >>> >> In that case every join statement will be a separate MapReduce
>> >> >>> >> job, and the join runtime is the runtime of its MapReduce job.
>> >> >>> >>
>> >> >>> >> Thanks,
>> >> >>> >> Daniel
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com>
>> wrote:
>> >> >>> >>
>> >> >>> >> >Dear all,
>> >> >>> >> >
>> >> >>> >> >I am working on optimizing multi-way joins. I am not sure whether
>> >> >>> >> >Pig can choose the join order in its optimization layer. Does
>> >> >>> >> >anyone know? Or does Pig just execute the query the way it is
>> >> >>> >> >written?
>> >> >>> >> >
>> >> >>> >> >Also, I want to do a multi-way join on different keys. Can the
>> >> >>> >> >following query work?
>> >> >>> >> >
>> >> >>> >> >Res =
>> >> >>> >> >JOIN
>> >> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
>> >> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>> >> >>> >> >cr_order_number);
>> >> >>> >> >
>> >> >>> >> >BTW, each time I run the query, it finishes in one second. Is
>> >> >>> >> >there a way to see the execution time? I have set
>> >> >>> >> >pig.udf.profile=true. Where can I find the time?
>> >> >>> >> >
>> >> >>> >> >Bests,
>> >> >>> >> >Mingda
>> >> >>> >>
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >
>> >
>>
>
>
