Hi Alan,

Quick question.  Do I use HDataStorage and HFile to read files from
HDFS within the UDF?

Thanks.

On Fri, Jan 20, 2012 at 10:12 AM, Michael Lok <[email protected]> wrote:
> Hi Alan,
>
> Missed your suggestion earlier :)  After having a sample size of just
> 30k records, performing a cross join totally killed the disk space I
> have :(
>
> Will try your suggestion next.
>
> Thanks!
>
> On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <[email protected]> wrote:
>>
>> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>>
>>> Hi Dmitriy,
>>>
>>> Am I correct to say that all rows in "results" are inside a bag when
>>> passed into the UDF?
>>
>> Yes.  The other issue you'll face here is that if you have more than one map
>> task, each map task will be comparing against a different first record, which
>> probably isn't what you want.
>>
>> The best way to do this would probably be to write a UDF that opens the file 
>> directly in HDFS and reads the first record.  It can then compare each input 
>> record against the first record without needing to hold all of the records 
>> in memory and with every map seeing the same first record.
>>
>> So your script would look like:
>>
>> A = load 'file';
>> B = foreach A generate yourudf('file', *);
>> ...
>>
>> Ideally the UDF should store the side file in the distributed cache to avoid 
>> too many maps opening the file at once, but you can add that once you get 
>> the base feature working.
>>
>> Alan.
>>
>>
>>>
>>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>> results = foreach (group raw all) generate MyUdf(raw)
>>>>
>>>> input to the udf will be a tuple with a single field. This field will be a
>>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>>
>>>> Note that this forces everything into memory and isn't scalable...
>>>>
>>>>
>>>>
>>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I've got one resultset which I need to run a comparison with all the
>>>>> rows within the same resultset.  For example:
>>>>>
>>>>> R1
>>>>> R2
>>>>> R3
>>>>> R4
>>>>> R5
>>>>>
>>>>> Take R1, I'll need to compare R1 with all rows from R2-R5.  The
>>>>> comparison will be written in a UDF.  Here's what I have so far:
>>>>>
>>>>> ============================================
>>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>>
>>>>> RAW_2 = foreach RAW generate *;
>>>>>
>>>>> PROCESSED = foreach RAW {
>>>>>    /* perform comparo here */
>>>>> };
>>>>> ============================================
>>>>>
>>>>> I'm stuck at the filtering inside the nested block.  How should I go
>>>>> about comparing the rows there?
>>>>>
>>>>> Any help is greatly appreciated.
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>
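
Alan's suggestion in the thread (read the first record of the side file once, then compare every input record against it without holding all records in memory) could be sketched roughly as below. This is only an illustration under stated assumptions: the class name FirstRecordComparator, the delimiter handling, and the "count matching fields" comparison are all hypothetical, since the thread never says what the real comparison is. The sketch uses only the Java standard library; in an actual Pig UDF the same logic would presumably live in a class extending org.apache.pig.EvalFunc, with the Reader wrapped around a stream opened from HDFS (for example via Hadoop's FileSystem.open, which is an assumption here and not something the thread confirms).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Pig): caches the first record of a
 * side file and compares later records against it, one at a time.
 */
final class FirstRecordComparator {
    // Fields of the first record, read once in the constructor.
    private final String[] firstRecord;

    /** Reads only the first line of the side file and splits it on the literal delimiter. */
    FirstRecordComparator(Reader sideFile, String delimiter) throws IOException {
        try (BufferedReader in = new BufferedReader(sideFile)) {
            String line = in.readLine();
            if (line == null) {
                throw new IOException("side file is empty");
            }
            // Limit -1 keeps trailing empty fields instead of dropping them.
            this.firstRecord = line.split(Pattern.quote(delimiter), -1);
        }
    }

    /**
     * Placeholder comparison (an assumption): counts positions where the
     * record's field equals the first record's field at the same index.
     */
    int matchingFields(String[] record) {
        int n = Math.min(record.length, firstRecord.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (firstRecord[i].equals(record[i])) {
                matches++;
            }
        }
        return matches;
    }
}
```

Because the first record is read once per instance, each call to matchingFields touches only the current input record, which is the property Alan describes: no bag of all rows in memory, and every map task sees the same first record.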
