Daniel,

Awesome, thank you. I will try that out.

--Eric

On Wed, Jan 5, 2011 at 1:14 AM, Daniel Dai <[email protected]> wrote:
> You are right. setLocation is called in frontend, however, it is in the
> context of InputFormat.getSplits() and it is too late to save anything in
> UDFContext. Your best bet is relativeToAbsolutePath, which is called in
> frontend and you can save your stuff in UDFContext.
>
> Daniel
>
> -----Original Message-----
> From: Eric Tschetter
> Sent: Tuesday, January 04, 2011 11:52 AM
> To: [email protected]
> Subject: UDFContext in 0.8 LoadFunc?
>
> I have a custom LoadFunc (I'm actually just extending PigStorage) that
> has some added logic to spider a given path and pick out the paths
> that I want. I am currently doing the spidering in setLocation
> because that seemed like the place to do it. It appears as if this is
> getting called on both the client and the cluster side, though, so my
> mappers are spidering a path that was already spidered on the client
> (wasted effort). Whenever the spidering is over a lot of directories
> this is adding a significant amount of unneeded overhead to my jobs.
>
> I looked into using UDFContext to save the paths and then try to get
> the cluster-side processes to look up the paths in UDF context and
> just use them if they exist. But, it looks like the actual job
> configuration object is being created before the calls to
> setLocation() so the stuff that I set in UDFContext is not making it
> across the wire.
>
> Is there a method that is called before setLocation that I can use to
> set the value in UDFContext (I'd prefer something that is given a
> Context/Configuration object). Or, is my only option to build a
> Configuration object in the constructor, do the crawl and set the
> UDFContext there (will that even work)?
>
> --Eric
>
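
For reference, here is a rough sketch of the caching pattern discussed in this thread: crawl once on the frontend, stash the result in a properties object, and have the backend reuse it instead of crawling again. It is plain Java so that it runs standalone and it does not use any Pig classes. The class name, the `spidered.paths` key, the `spider` helper and the comma-joined path format are all made up for illustration. In a real LoadFunc the frontend step would presumably sit in relativeToAbsolutePath, and the Properties object would be the one handed back by UDFContext rather than a fresh one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class SpiderCacheSketch {
    // Hypothetical property key; any name unique to the loader would do.
    static final String PATHS_KEY = "spidered.paths";

    // Counts how often the expensive crawl actually runs.
    static int crawlCount = 0;

    // Stand-in for the expensive directory crawl (returns fixed children).
    static List<String> spider(String root) {
        crawlCount++;
        return Arrays.asList(root + "/a", root + "/b");
    }

    // Frontend step: crawl once and save the result in the properties.
    // Assumption: in Pig this would be done from relativeToAbsolutePath.
    static String resolveOnFrontend(String location, Properties udfProps) {
        String joined = String.join(",", spider(location));
        udfProps.setProperty(PATHS_KEY, joined);
        return joined;
    }

    // Backend step: reuse the saved paths, crawl only if nothing was saved.
    // Assumption: in Pig this check would be done from setLocation.
    static String resolveOnBackend(String location, Properties udfProps) {
        String cached = udfProps.getProperty(PATHS_KEY);
        return cached != null ? cached : String.join(",", spider(location));
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // stand-in for the UDF properties
        String front = resolveOnFrontend("/data/logs", props);
        String back = resolveOnBackend("/data/logs", props);
        // The backend sees the same paths without a second crawl.
        System.out.println(front.equals(back) + " crawls=" + crawlCount);
    }
}
```

The fallback crawl in `resolveOnBackend` keeps the loader working even if the saved value never arrives, at the cost of the duplicated work the thread is trying to avoid.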
