> -----Original Message-----
> From: Dexin Wang [mailto:[email protected]]
> Sent: Wednesday, June 27, 2012 11:00 PM
> To: [email protected]
> Subject: Re: Passing a BAG to Pig UDF constructor?
> 
> That's a good idea (to pass the bag to UDF and initialize it on first
> UDF invocation). Thanks.
> 
> Why do you think it is expensive, Mridul?


You will be passing the bag with each tuple, but using it only on the first
invocation per mapper/reducer.
If the other computations are more expensive, that cost gets amortized over
time, but it is a cost nonetheless. Only a perf test will tell you whether it
is small enough to ignore.
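
A minimal sketch of the lazy-initialization pattern being discussed, written in
plain Java so it stands alone without Pig on the classpath. The class, field,
and method names are hypothetical; in a real UDF this logic would presumably
sit in the exec method of an EvalFunc subclass, with the bag arriving as a
DataBag field of the input tuple rather than a List.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Pig UDF; this is not Pig's actual API.
class LazyLookupSketch {
    // Built once per instance, which roughly corresponds to once per mapper/reducer task.
    private Map<String, String> lookup;

    // Counts how often the map was built, to show that it happens only once.
    int initCount = 0;

    // 'bag' stands in for the bag passed along with every tuple; each row is {key, value}.
    String exec(String key, List<String[]> bag) {
        if (lookup == null) {
            // First invocation: load the bag into an in-memory hash map.
            lookup = new HashMap<>();
            for (String[] row : bag) {
                lookup.put(row[0], row[1]);
            }
            initCount++;
        }
        // Later invocations ignore the bag argument, although it was still passed in.
        return lookup.get(key);
    }
}
```

Every call still receives the bag, which is the per-tuple cost described above;
only the construction of the map is skipped after the first call.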


Regards,
Mridul


> 
> On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan
> <[email protected]> wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: Jonathan Coveney [mailto:[email protected]]
> > > Sent: Wednesday, June 27, 2012 3:12 AM
> > > To: [email protected]
> > > Subject: Re: Passing a BAG to Pig UDF constructor?
> > >
> > > You can also just pass the bag to the UDF, and have a lazy
> > > initializer in exec that loads the bag into memory.
> >
> >
> > Can you elaborate on what you mean by passing the bag to the UDF?
> > Pass it as part of the input to the UDF in exec and initialize it only
> > once (the first time)? (If yes, this is expensive.) Or something else?
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> > >
> > > 2012/6/26 Mridul Muralidharan <[email protected]>
> > >
> > > > You could dump the data in a DFS file and pass the location of the
> > > > file as a param to your UDF in define, so that it initializes
> > > > itself using that data ...
> > > >
> > > >
> > > > - Mridul
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Dexin Wang [mailto:[email protected]]
> > > > > Sent: Tuesday, June 26, 2012 10:58 PM
> > > > > To: [email protected]
> > > > > Subject: Passing a BAG to Pig UDF constructor?
> > > > >
> > > > > Is it possible to pass a bag to a Pig UDF constructor?
> > > > >
> > > > > > Basically, in the constructor I want to initialize a hash map
> > > > > > so that on every exec operation I can use the hash map to do a
> > > > > > lookup, find the value I need, and apply some algorithm to it.
> > > > > >
> > > > > > I realize I could just do a replicated join to achieve similar
> > > > > > things, but the algorithm is more than a few lines and there are
> > > > > > some edge cases, so I would rather wrap that logic inside a UDF.
> > > > > > I also realize I could just pass a file path to the constructor
> > > > > > and read the files to initialize the hash map, but my files are
> > > > > > on Amazon's S3 and I don't want to deal with the S3 API to read
> > > > > > the file.
> > > > > >
> > > > > > Is this possible, or are there alternative ways to achieve
> > > > > > the same thing?
> > > > >
> > > > > Thanks.
> > > > > Dexin
> > > >
> >
