I would just use the HDFS interfaces directly; that is much easier.  For an 
example of a UDF that opens an HDFS file, take a look at 
https://github.com/alanfgates/programmingpig/blob/master/udfs/java/com/acme/marketing/MetroResolver.java
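
For illustration only, here is a rough sketch of the core logic in plain Java. It is not the MetroResolver code and it deliberately leaves out the Pig and Hadoop classes so it can be read and run on its own. The class name FirstRecordComparer, the ReaderSource hook, and the field-counting comparison are all made up for this example. In a real UDF the reader would presumably wrap the stream returned by Hadoop's FileSystem.open(), and matchingFields() would be called from the UDF's exec() method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/**
 * Plain-Java stand-in for the UDF's core logic: read the first record of a
 * side file once, cache it, and compare every input record against it.
 * All names here are hypothetical.
 */
final class FirstRecordComparer {

    /** Hypothetical hook that opens the side file (an HDFS stream in a real UDF). */
    interface ReaderSource {
        Reader open() throws IOException;
    }

    private final ReaderSource source;
    private String[] first; // fields of the first record, loaded lazily

    FirstRecordComparer(ReaderSource source) {
        this.source = source;
    }

    /** Reads only the first line of the side file and caches its fields. */
    private void loadFirstRecord() {
        try (BufferedReader in = new BufferedReader(source.open())) {
            String line = in.readLine();
            if (line == null) {
                throw new IllegalStateException("side file is empty");
            }
            first = line.split(",", -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Placeholder comparison (an assumption, since the real comparison is
     * application-specific): counts the comma-separated fields of the record
     * that equal the corresponding field of the first record.
     */
    int matchingFields(String record) {
        if (first == null) {
            loadFirstRecord(); // happens once per instance, not once per record
        }
        String[] fields = record.split(",", -1);
        int n = Math.min(first.length, fields.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (first[i].equals(fields[i])) {
                matches++;
            }
        }
        return matches;
    }
}
```

The point of the lazy load is that the side file is opened once per instance rather than once per input record, and only one record is ever held in memory.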

Alan.

On Jan 20, 2012, at 12:28 AM, Michael Lok wrote:

> Hi Alan,
> 
> Quick question.  Do I use HDataStorage and HFile to read files from
> HDFS within the UDF?
> 
> Thanks.
> 
> On Fri, Jan 20, 2012 at 10:12 AM, Michael Lok <[email protected]> wrote:
>> Hi Alan,
>> 
>> Missed your suggestion earlier :)  After having a sample size of just
>> 30k records, performing a cross join totally killed the disk space I
>> have :(
>> 
>> Will try your suggestion next.
>> 
>> Thanks!
>> 
>> On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <[email protected]> wrote:
>>> 
>>> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>>> 
>>>> Hi Dmitriy,
>>>> 
>>>> Am I correct to say that all rows in "results" are inside a bag when
>>>> passed into the UDF?
>>> 
>>> Yes.  The other issue you'll face here is that if you have more than one 
>>> map task each map task will be comparing against a different first record, 
>>> which probably isn't what you want.
>>> 
>>> The best way to do this would probably be to write a UDF that opens the 
>>> file directly in HDFS and reads the first record.  It can then compare each 
>>> input record against the first record without needing to hold all of the 
>>> records in memory and with every map seeing the same first record.
>>> 
>>> So your script would look like:
>>> 
>>> A = load 'file';
>>> B = foreach A generate yourudf('file', *);
>>> ...
>>> 
>>> Ideally the UDF should store the side file in the distributed cache to 
>>> avoid too many maps opening the file at once, but you can add that once you 
>>> get the base feature working.
>>> 
>>> Alan.
>>> 
>>> 
>>>> 
>>>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> results = foreach (group raw all) generate MyUdf(raw)
>>>>> 
>>>>> input to the udf will be a tuple with a single field. This field will be a
>>>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>>> 
>>>>> Note that this forces everything into memory and isn't scalable...
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>>>> 
>>>>>> Hi folks,
>>>>>> 
>>>>>> I've got one resultset which I need to run a comparison with all the
>>>>>> rows within the same resultset.  For example:
>>>>>> 
>>>>>> R1
>>>>>> R2
>>>>>> R3
>>>>>> R4
>>>>>> R5
>>>>>> 
>>>>>> Take R1, I'll need to compare R1 with all rows from R2-R5.  The
>>>>>> comparison will be written in a UDF.  Here's what I have so far:
>>>>>> 
>>>>>> ============================================
>>>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>>> 
>>>>>> RAW_2 = foreach RAW generate *;
>>>>>> 
>>>>>> PROCESSED = foreach RAW {
>>>>>>    /* perform comparo here */
>>>>>> };
>>>>>> ============================================
>>>>>> 
>>>>>> I'm stuck at the filtering inside the nested block.  How should I go
>>>>>> about comparing the rows there?
>>>>>> 
>>>>>> Any help is greatly appreciated.
>>>>>> 
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>> 
