Hi Alan,

Missed your suggestion earlier :)  Even with a sample size of just
30k records, performing a cross join totally killed the disk space I
have :(

Will try your suggestion next.

Thanks!

On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <[email protected]> wrote:
>
> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>
>> Hi Dmitriy,
>>
>> Am I correct to say that all rows in "results" are inside a bag when
>> passed into the UDF?
>
> Yes.  The other issue you'll face here is that if you have more than one map 
> task, each map task will be comparing against a different first record, which 
> probably isn't what you want.
>
> The best way to do this would probably be to write a UDF that opens the file 
> directly in HDFS and reads the first record.  It can then compare each input 
> record against the first record without needing to hold all of the records in 
> memory and with every map seeing the same first record.
>
> So your script would look like:
>
> A = load 'file';
> B = foreach A generate yourudf('file', *);
> ...
>
> Ideally the UDF should store the side file in the distributed cache to avoid 
> too many maps opening the file at once, but you can add that once you get the 
> base feature working.
>
> Alan.
>
>
>>
>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> results = foreach (group raw all) generate MyUdf(raw);
>>>
>>> input to the udf will be a tuple with a single field. This field will be a
>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>
>>> Note that this forces everything into memory and isn't scalable...
>>>
>>>
>>>
>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I've got one resultset which I need to run a comparison with all the
>>>> rows within the same resultset.  For example:
>>>>
>>>> R1
>>>> R2
>>>> R3
>>>> R4
>>>> R5
>>>>
>>>> Take R1, I'll need to compare R1 with all rows from R2-R5.  The
>>>> comparison will be written in a UDF.  Here's what I have so far:
>>>>
>>>> ============================================
>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>
>>>> RAW_2 = foreach RAW generate *;
>>>>
>>>> PROCESSED = foreach RAW {
>>>>    /* perform comparison here */
>>>> };
>>>> ============================================
>>>>
>>>> I'm stuck at the filtering inside the nested block.  How should I go
>>>> about comparing the rows there?
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>>
>>>> Thanks!
>>>>
>
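A minimal sketch of the idea Alan describes in this thread: read the first
record of the side file once, then compare each incoming record against it
one at a time, so nothing needs to be held in a bag. It is written in plain
Python for illustration and is not an actual Pig UDF. The class name, the
comma delimiter, the local-file read (standing in for an HDFS read), and the
field-matching comparison are all assumptions, not anything from the thread.

```python
import os
import tempfile


class FirstRecordComparator:
    """Compares each incoming record against the first record of a side file.

    Hypothetical stand-in for the UDF discussed in the thread; a real Pig UDF
    would read from HDFS (or the distributed cache) instead of a local path.
    """

    def __init__(self, side_file, delimiter=","):
        self.side_file = side_file
        self.delimiter = delimiter
        self._first = None  # loaded lazily, on the first call to compare()

    def _load_first(self):
        # Read only the first line; the rest of the file is never loaded.
        with open(self.side_file) as handle:
            line = handle.readline().rstrip("\n")
        return tuple(line.split(self.delimiter))

    def compare(self, record):
        if self._first is None:
            self._first = self._load_first()
        # Assumed comparison: count the fields that match position by position.
        return sum(1 for a, b in zip(self._first, record) if a == b)


def demo():
    # Write a tiny sample file, then stream it row by row through the comparator.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("a,b,c\na,x,c\nz,y,w\n")
    comparator = FirstRecordComparator(tmp.name)
    with open(tmp.name) as rows:
        scores = [comparator.compare(tuple(r.rstrip("\n").split(","))) for r in rows]
    os.remove(tmp.name)
    return scores
```

Because the first record is cached after the first call, every later call
costs one comparison and no extra file reads, which is the property that
keeps this approach from running out of memory or disk the way the bag and
cross-join approaches can.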
