My query is as follows. I start Pig with:

pig -Dpig.additional.jars=/home/hadoop-user/pig-branch-0.lib/datafu-pig-incubating-1.3.1.jar

Then I enter:

REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
data = LOAD 'hdfs://SCAI01.CS.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
dat= FOREACH data GENERATE MurmurH32(val);

On Wed, Nov 2, 2016 at 9:35 PM, mingda li <limingda1...@gmail.com> wrote:

> En, thanks Debabrata, but actually, I register each time ( forget to tell
> you) before i run the commands.
> I use REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar.
> But cannot help me.
>
> Any other reason?
>
> Thanks
>
> On Wed, Nov 2, 2016 at 8:03 PM, Debabrata Pani <android.p...@gmail.com> wrote:
>
>> It says that pig could not find the class Hasher. Start grunt with
>> -Dpig.additional.jars (before other pig arguments) or do a "register" of
>> individual jars before typing in your scripts.
>>
>> Regards,
>> Debabrata
>>
>> On Nov 3, 2016 07:09, "mingda li" <limingda1...@gmail.com> wrote:
>>
>> > Thanks. I have tried to install the datafu and finish quickstart
>> > successfully http://datafu.incubator.apache.org/docs/quick-start.html
>> >
>> > But when i use the murmur hash, it failed. I do not know why.
>> >
>> > grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
>> >
>> > grunt> data_out = FOREACH data GENERATE val;
>> >
>> > grunt> dat= FOREACH data GENERATE MurmurH32(val);
>> >
>> > 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> > ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [,
>> > java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
>> >
>> > Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
>> >
>> > The log file is in attachment.
>> >
>> > Bests,
>> >
>> > Mingda
>> >
>> > On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <da...@hortonworks.com> wrote:
>> >
>> >> I see datafu has a patch for the UDF: https://issues.apache.org/jira/browse/DATAFU-47
>> >>
>> >> On 11/2/16, 11:45 AM, "mingda li" <limingda1...@gmail.com> wrote:
>> >>
>> >> >Dear all,
>> >> >
>> >> >Hi, now I wants to import a UDF function to pig command. Has anyone ever
>> >> >done so? I want to import google's guava/murmur3_32 to pig. Could anyone
>> >> >give some useful materials or suggestion?
>> >> >
>> >> >Bests,
>> >> >Mingda
>> >> >
>> >> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <limingda1...@gmail.com> wrote:
>> >> >
>> >> >> Yeah, I see. Thanks for your reply.
>> >> >>
>> >> >> Bests,
>> >> >> Mingda
>> >> >>
>> >> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <da...@hortonworks.com> wrote:
>> >> >>
>> >> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
>> >> >>> see two MapReduce jobs corresponding to the first and second join.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Daniel
>> >> >>>
>> >> >>> On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:
>> >> >>>
>> >> >>> >Dear Dai,
>> >> >>> >
>> >> >>> >Thanks for your reply.
>> >> >>> >What I want to do is to compare the two different order of join.
>> >> >>> >The query is as following:
>> >> >>> >
>> >> >>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>> >> >>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>> >> >>> >catalog_returns BY (cr_item_sk, cr_order_number);
>> >> >>> >Dump or Store Bad_OrderRes;
>> >> >>> >
>> >> >>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>> >> >>> >catalog_sales BY (cs_item_sk, cs_order_number);
>> >> >>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>> >> >>> >Dump or Store Good_OrderRes;
>> >> >>> >
>> >> >>> >Since Pig execute the query lazily, I think only by Dump or Store the
>> >> >>> >result, I can know the time of MapReduce Job, is it right? If it is, then I
>> >> >>> >need to count the time to Dump or Store the result as the time for the
>> >> >>> >different orders' join.
>> >> >>> >
>> >> >>> >Bests,
>> >> >>> >Mingda
>> >> >>> >
>> >> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <da...@hortonworks.com> wrote:
>> >> >>> >
>> >> >>> >> Hi, Mingda,
>> >> >>> >>
>> >> >>> >> Pig does not do join reordering and will execute the query as the way it
>> >> >>> >> is written. Note you can join multiple relations in one join statement.
>> >> >>> >>
>> >> >>> >> Do you want execution time for each join in your statement? I assume you
>> >> >>> >> are using regular join and running with MapReduce, every join statement
>> >> >>> >> will be a separate MapReduce job and the join runtime is the runtime for
>> >> >>> >> its MapReduce job.
>> >> >>> >>
>> >> >>> >> Thanks,
>> >> >>> >> Daniel
>> >> >>> >>
>> >> >>> >> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com> wrote:
>> >> >>> >>
>> >> >>> >> >Dear all,
>> >> >>> >> >
>> >> >>> >> >I am doing optimization for multiple join. I am not sure if Pig can decide
>> >> >>> >> >the join order in optimization layer. Does anyone know about this? Or Pig
>> >> >>> >> >just execute the query as the way it is written.
>> >> >>> >> >
>> >> >>> >> >And, I want to do the multiple way Join on different keys. Can the
>> >> >>> >> >following query work?
>> >> >>> >> >
>> >> >>> >> >Res =
>> >> >>> >> >JOIN
>> >> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
>> >> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>> >> >>> >> >cr_order_number);
>> >> >>> >> >
>> >> >>> >> >BTW, each time, I run the query, it is finished in one second. Is there a
>> >> >>> >> >way to see the execution time? I have set the pig.udf.profile=true. Where
>> >> >>> >> >can I find the time?
>> >> >>> >> >
>> >> >>> >> >Bests,
>> >> >>> >> >Mingda