Hmm, thanks Debabrata, but actually I do register the jar each time before I run the commands (I forgot to tell you). I use *REGISTER* /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar, but it does not help.
Any other reason?

Thanks

On Wed, Nov 2, 2016 at 8:03 PM, Debabrata Pani <[email protected]> wrote:

> It says that pig could not find the class Hasher. Start grunt with
> -Dpig.additional.jars (before other pig arguments) or do a "register" of
> individual jars before typing in your scripts.
>
> Regards,
> Debabrata
>
> On Nov 3, 2016 07:09, "mingda li" <[email protected]> wrote:
>
> > Thanks. I have tried to install the datafu and finish quickstart
> > successfully http://datafu.incubator.apache.org/docs/quick-start.html
> >
> > But when i use the murmur hash, it failed. I do not know why.
> >
> > grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
> >
> > grunt> data_out = FOREACH data GENERATE val;
> >
> > grunt> dat= FOREACH data GENERATE MurmurH32(val);
> >
> > 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [,
> > java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >
> > Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
> >
> > The log file is in attachment.
> >
> > Bests,
> > Mingda
> >
> > On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <[email protected]> wrote:
> >
> >> I see datafu has a patch for the UDF:
> >> https://issues.apache.org/jira/browse/DATAFU-47
> >>
> >> On 11/2/16, 11:45 AM, "mingda li" <[email protected]> wrote:
> >>
> >> >Dear all,
> >> >
> >> >Hi, now I wants to import a UDF function to pig command. Has anyone ever
> >> >done so? I want to import google's guava/murmur3_32 to pig. Could anyone
> >> >give some useful materials or suggestion?
> >> >
> >> >Bests,
> >> >Mingda
> >> >
> >> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <[email protected]> wrote:
> >> >
> >> >> Yeah, I see. Thanks for your reply.
> >> >>
> >> >> Bests,
> >> >> Mingda
> >> >>
> >> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <[email protected]> wrote:
> >> >>
> >> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
> >> >>> see two MapReduce jobs corresponding to the first and second join.
> >> >>>
> >> >>> Thanks,
> >> >>> Daniel
> >> >>>
> >> >>> On 11/1/16, 10:52 AM, "mingda li" <[email protected]> wrote:
> >> >>>
> >> >>> >Dear Dai,
> >> >>> >
> >> >>> >Thanks for your reply.
> >> >>> >What I want to do is to compare the two different order of join. The
> >> >>> >query is as following:
> >> >>> >
> >> >>> >*Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;*
> >> >>> >*Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);*
> >> >>> >*Dump or Store Bad_OrderRes;*
> >> >>> >
> >> >>> >*Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number), catalog_sales BY (cs_item_sk, cs_order_number);*
> >> >>> >*Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;*
> >> >>> >*Dump or Store Good_OrderRes;*
> >> >>> >
> >> >>> >Since Pig execute the query lazily, I think only by Dump or Store the
> >> >>> >result, I can know the time of MapReduce Job, is it right? If it is,
> >> >>> >then I need to count the time to Dump or Store the result as the time
> >> >>> >for the different orders' join.
> >> >>> >
> >> >>> >Bests,
> >> >>> >Mingda
> >> >>> >
> >> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <[email protected]> wrote:
> >> >>> >
> >> >>> >> Hi, Mingda,
> >> >>> >>
> >> >>> >> Pig does not do join reordering and will execute the query as the way
> >> >>> >> it is written. Note you can join multiple relations in one join
> >> >>> >> statement.
> >> >>> >>
> >> >>> >> Do you want execution time for each join in your statement? I assume
> >> >>> >> you are using regular join and running with MapReduce, every join
> >> >>> >> statement will be a separate MapReduce job and the join runtime is
> >> >>> >> the runtime for its MapReduce job.
> >> >>> >>
> >> >>> >> Thanks,
> >> >>> >> Daniel
> >> >>> >>
> >> >>> >> On 10/31/16, 8:21 PM, "mingda li" <[email protected]> wrote:
> >> >>> >>
> >> >>> >> >Dear all,
> >> >>> >> >
> >> >>> >> >I am doing optimization for multiple join. I am not sure if Pig can
> >> >>> >> >decide the join order in optimization layer. Does anyone know about
> >> >>> >> >this? Or Pig just execute the query as the way it is written.
> >> >>> >> >
> >> >>> >> >And, I want to do the multiple way Join on different keys. Can the
> >> >>> >> >following query work?
> >> >>> >> >
> >> >>> >> >Res =
> >> >>> >> >JOIN
> >> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
> >> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
> >> >>> >> >cr_order_number);
> >> >>> >> >
> >> >>> >> >BTW, each time, I run the query, it is finished in one second. Is
> >> >>> >> >there a way to see the execution time? I have set the
> >> >>> >> >pig.udf.profile=true. Where can I find the time?
> >> >>> >> >
> >> >>> >> >Bests,
> >> >>> >> >Mingda
