> -----Original Message-----
> From: Dexin Wang [mailto:[email protected]]
> Sent: Wednesday, June 27, 2012 11:00 PM
> To: [email protected]
> Subject: Re: Passing a BAG to Pig UDF constructor?
>
> That's a good idea (to pass the bag to UDF and initialize it on first
> UDF invocation). Thanks.
>
> Why do you think it is expensive Mridul?

You will be passing the bag with each tuple, but using it only for the
first invocation per mapper/reducer.
If other computations are more expensive, then it will get amortized
over time; but it is a cost nonetheless ... only a perf test will tell
you if it is small enough to ignore !

Regards,
Mridul

>
> On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan
> <[email protected]> wrote:
>
> >
> > > -----Original Message-----
> > > From: Jonathan Coveney [mailto:[email protected]]
> > > Sent: Wednesday, June 27, 2012 3:12 AM
> > > To: [email protected]
> > > Subject: Re: Passing a BAG to Pig UDF constructor?
> > >
> > > You can also just pass the bag to the UDF, and have a lazy
> > > initializer in exec that loads the bag into memory.
> >
> >
> > Can you elaborate what you mean by pass the bag to the UDF ?
> > Pass it as part of the input to the udf in exec and initialize it only
> > once (first time) ? (If yes, this is expensive) Or something else ?
> >
> >
> > Regards,
> > Mridul
> >
> >
> > >
> > > 2012/6/26 Mridul Muralidharan <[email protected]>
> > >
> > > > You could dump the data in a dfs file and pass the location of the
> > > > file as param to your udf in define - so that it initializes
> > > > itself using that data ...
> > > >
> > > >
> > > > - Mridul
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Dexin Wang [mailto:[email protected]]
> > > > > Sent: Tuesday, June 26, 2012 10:58 PM
> > > > > To: [email protected]
> > > > > Subject: Passing a BAG to Pig UDF constructor?
> > > > >
> > > > > Is it possible to pass a bag to a Pig UDF constructor?
> > > > >
> > > > > Basically in the constructor I want to initialize some hash map
> > > > > so that on every exec operation, I can use the hashmap to do a
> > > > > lookup and find the value I need, and apply some algorithm to it.
> > > > >
> > > > > I realize I could just do a replicated join to achieve similar
> > > > > things but the algorithm is more than a few lines and there are some
> > > > > edge cases so I would rather wrap that logic inside a UDF function.
> > > > > I also realize I could just pass a file path to the constructor
> > > > > and read the files to initialize the hashmap but my files are on
> > > > > Amazon's S3 and I don't want to deal with
> > > > > S3 API to read the file.
> > > > >
> > > > > Is this possible or is there some alternative ways to achieve
> > > > > the same thing?
> > > > >
> > > > > Thanks.
> > > > > Dexin
> > > >
> > >
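
For reference, the lazy-initializer idea discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: plain Java collections stand in for Pig's Tuple and DataBag so that it compiles without Pig on the classpath, and the class name LazyLookup, the field initCount, and the [key, value] layout of the bag are all hypothetical. In a real UDF the class would extend org.apache.pig.EvalFunc and exec would take a Tuple.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Pig EvalFunc. Plain collections replace Pig's
// Tuple and DataBag (an assumption made so the sketch runs without Pig).
class LazyLookup {
    // Lookup table built from the bag on the first exec call; null until then.
    private Map<String, String> lookup;

    // Counts how often the bag was actually scanned (for demonstration only).
    int initCount = 0;

    // input.get(0): the key to look up for the current record.
    // input.get(1): the "bag", modelled here as a list of [key, value] pairs.
    String exec(List<Object> input) {
        if (input == null || input.size() < 2) {
            return null;
        }
        if (lookup == null) {
            // First invocation in this instance: load the bag into memory.
            lookup = new HashMap<>();
            @SuppressWarnings("unchecked")
            List<List<String>> bag = (List<List<String>>) input.get(1);
            for (List<String> pair : bag) {
                lookup.put(pair.get(0), pair.get(1));
            }
            initCount++;
        }
        // Later invocations ignore the bag and only use the cached map.
        return lookup.get((String) input.get(0));
    }
}
```

Note that the bag is still handed to exec on every call even though it is read only once, which is the per-tuple cost Mridul points out above; whether that cost matters is something only a perf test on the actual job can answer.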
