I have a custom LoadFunc (actually I'm just extending PigStorage) with some added logic to spider a given path and pick out the paths I want. I currently do the spidering in setLocation() because that seemed like the right place for it. However, setLocation() appears to be called on both the client and the cluster side, so my mappers re-spider a path that was already spidered on the client, which is wasted effort. When the spidering covers a lot of directories, this adds a significant amount of unnecessary overhead to my jobs.
I looked into using UDFContext to save the paths on the client, so that the cluster-side processes could look them up in UDFContext and just use them if they exist. But it looks like the actual job configuration object is created before the calls to setLocation(), so what I set in UDFContext does not make it across the wire. Is there a method that is called before setLocation() where I can set the value in UDFContext? (I'd prefer one that is given a Context/Configuration object.) Or is my only option to build a Configuration object in the constructor, do the crawl, and set the UDFContext there? Would that even work?

--Eric
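
For concreteness, the crawl-once-and-reuse pattern described above might look roughly like the sketch below. It is not the actual LoadFunc and does not touch Pig at all: a plain java.util.Properties stands in for the Properties object that UDFContext hands back, and the class name, property key, and crawler function are all hypothetical. It also assumes that paths contain no commas, since the list is stored as one comma-separated string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.function.Function;

/** Hypothetical helper: crawl a location once and cache the result in a Properties object. */
public class CachedPathResolver {
    // Hypothetical property key; any name unique to the loader would do.
    static final String KEY = "example.loader.resolved.paths";

    // The expensive crawl, injected so the sketch stays independent of any file system API.
    private final Function<String, List<String>> crawler;
    private int crawlCount = 0;

    public CachedPathResolver(Function<String, List<String>> crawler) {
        this.crawler = crawler;
    }

    /** Returns the cached paths if the key is present; otherwise crawls and stores them. */
    public List<String> resolve(String location, Properties props) {
        String cached = props.getProperty(KEY);
        if (cached != null) {
            // Already resolved (for example on the client): reuse without crawling.
            return cached.isEmpty()
                    ? new ArrayList<>()
                    : new ArrayList<>(Arrays.asList(cached.split(",")));
        }
        crawlCount++;
        List<String> paths = crawler.apply(location);
        // Assumption: no path contains a comma.
        props.setProperty(KEY, String.join(",", paths));
        return paths;
    }

    /** Number of times the crawler was actually invoked. */
    public int getCrawlCount() {
        return crawlCount;
    }
}
```

The sketch only helps if the Properties object populated on the client is the same one the cluster-side code later reads, which is exactly the part that is not working in the situation described above.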
