Re: Question about bags and UDFs

Xiaomeng Wan Mon, 25 Apr 2011 09:09:06 -0700

I used something like this before, and it worked:

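In the Pig script:

set mapred.cache.archives /user/shawn/share/xxx.dat#cached

In the UDF:

// Value set in the script: "/user/shawn/share/xxx.dat#cached"
String cachedfiles = UDFContext.getUDFContext().getJobConf()
        .get("mapred.cache.archives");
int endoffilename = cachedfiles.lastIndexOf("#");
// Symlink name after the '#': "cached"
String cachepath = cachedfiles.substring(endoffilename + 1);
// File name including the leading slash: "/xxx.dat"
String cachedfile =
        cachedfiles.substring(cachedfiles.lastIndexOf("/"), endoffilename);

// Relative path in the task's working directory: "cached/xxx.dat"
String localpath = cachepath + cachedfile;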
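
Once you have localpath, reading the file is plain Java I/O. Below is a minimal sketch of how the pieces could fit together, not tested against a cluster. The class CachedPatterns and its method names are made up for illustration and are not part of Pig or Hadoop. It assumes mapred.cache.archives holds a single path#symlink entry, that the file has one regular expression per line, and that a full match (matches()) is what you want; use find() if you need substring matching.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Pig or Hadoop. */
public final class CachedPatterns {
    private CachedPatterns() {}

    /** Turns "/hdfs/dir/name.ext#link" into "link/name.ext" (single entry assumed). */
    public static String localPath(String cacheEntry) {
        int hash = cacheEntry.lastIndexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("expected path#symlink, got: " + cacheEntry);
        }
        // Last '/' before the '#'; -1 if the entry has no directory part.
        int slash = cacheEntry.lastIndexOf('/', hash);
        String link = cacheEntry.substring(hash + 1);
        String name = cacheEntry.substring(slash + 1, hash);
        return link + "/" + name;
    }

    /** Reads one regular expression per line, skipping blank lines. */
    public static List<Pattern> loadPatterns(Path file) throws IOException {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    /** Returns "something" if a or b fully matches any pattern, else "something else". */
    public static String classify(String a, String b, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    // Null fields never match, so a null input cannot cause an exception here.
    private static boolean matches(Pattern p, String value) {
        return value != null && p.matcher(value).matches();
    }
}
```

In a real UDF you would probably pass the value from getJobConf().get("mapred.cache.archives") to localPath, load the patterns once (for example on the first call to exec) and reuse the list for every tuple.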

Shawn

On Fri, Apr 22, 2011 at 6:10 AM, Mark Laczin <[email protected]> wrote:
> Follow-up question, how do you add it to the cache in a pig script, and once
> it's in there can you access it from the UDF using regular Java file I/O?
> That is, is it as simple as saying:
>
> copyFromLocal $localFilePath udfFile.txt
> DEFINE someudf org.someudf CACHE('udfFile.txt#udfFile.txt');
>
> And then the UDF can read it using regular Java file streams/etc?
>
> Thanks for your help so far - the mailing list has been fairly kind to me in
> this regard, especially considering my lack of Pig experience.
>
> -Mark
>
> On Fri, Apr 22, 2011 at 7:40 AM, Mark Laczin <[email protected]> wrote:
>
>> I think I may have to go with your second option - but thanks for the info,
>> I'll keep an eye on 0.9.0.
>>
>>
>> On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates <[email protected]> wrote:
>>
>>> Starting with Pig 0.9 (not yet released but you can build it off the
>>> branch) a UDF can specify a file to put in the distributed cache.  You could
>>> thus have your UDF pick up the file locally on your box and put it in the
>>> distributed cache, and then read it from the distributed cache on the back
>>> end.  If running with an un-released version isn't an option for you, you
>>> could manually load the file into the distributed cache and then read it
>>> from your UDF.
>>>
>>> Alan.
>>>
>>>
>>> On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:
>>>
>>>> Does anyone know how to ship the config file in this situation?
>>>> I'm encountering problems with file not found exceptions when trying to
>>>> run this over a cluster.
>>>>
>>>> On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <[email protected]>
>>>> wrote:
>>>>
>>>>> I kind of solved it by reading in the data from my UDF constructor (it's
>>>>> just a file with a list of like 10 regular expressions, so I did manual
>>>>> file I/O), by passing the path (provided as a parameter), and then just
>>>>> storing it (and then, looping over it and testing a, b by hand).  It's not
>>>>> the MapReduce way, but it will work for this application, considering the
>>>>> small size of the file.
>>>>>
>>>>> If anyone knows how my "patch" might fail, or if there is a better way -
>>>>> feel free to speak up.
>>>>>
>>>>> -Mark
>>>>>
>>>>>
>>>>> On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <[email protected]> wrote:
>>>>>
>>>>>> You could try doing GROUP ALL on the contents of M, which would
>>>>>> produce a single bag containing each record, and then joining M with
>>>>>> data using a surrogate constant key. Or, I suspect, CROSS would also
>>>>>> work instead of the join. Then you'd have a tuple like this to work with:
>>>>>>
>>>>>> (a, b, M:bag)
>>>>>>
>>>>>> I'm not sure if things would blow up if M is too large to fit into
>>>>>> memory in your UDF though.
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm trying to do something like this:
>>>>>>> (if 'data' is a set of tuples loaded from a file containing fields a,
>>>>>>> b and c)
>>>>>>> (if 'M' is another set of tuples loaded from a file)
>>>>>>>
>>>>>>> data = FOREACH data GENERATE *, someUDF(a, b, M);
>>>>>>>
>>>>>>> What I'm looking for is to generate (in this case, a string) based on
>>>>>>> a and b, using the contents of M inside the UDF.
>>>>>>>
>>>>>>> The UDF looks like this, in pseudocode:
>>>>>>>
>>>>>>> foreach element x in M {
>>>>>>> if a matches x or b matches x {
>>>>>>>  return "something"
>>>>>>> }
>>>>>>> }
>>>>>>> return "something else"
>>>>>>>
>>>>>>> Is this possible?  I keep getting errors related to "Scalars can only
>>>>>>> be used with projections" and the like.
>>>>>>> The thing holding me back from using filters is that I won't know
>>>>>>> what's in M until it's read, and since (in this case) they'll be
>>>>>>> regular expressions, I'd need to be able to join/group with regex
>>>>>>> matching which I don't think Pig can do.
>>>>>>>
>>>>>>> -Mark
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>
