Re: Comparing each row with the same resultset

Scott Carey Thu, 19 Jan 2012 10:33:36 -0800

If your goal is to compare all rows with all other rows, you can do a
distributed CROSS self-join.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#CROSS


Something like:

exploded = CROSS data, data;

which will produce n^2 rows, where n is the number of rows in the alias
'data'.

Then you would have each row paired with each other row in your result.

I haven't tried this myself on a larger dataset -- the n^2 data explosion
is something to be wary of.
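
A slightly fuller sketch of the same idea, equally untested. The Pig docs
recommend loading the data under two aliases for self-joins, and the sketch
assumes the same precaution is sensible for CROSS. The field names (id, val)
and the assumption that the first column is a unique key are made up for
illustration; only the file name comes from the original question.

```pig
-- Load the same file twice so each side of the CROSS has its own alias.
-- The schema (id, val) is hypothetical; adjust it to the real columns.
A = LOAD 'raw_data.txt' USING PigStorage(',') AS (id:chararray, val:chararray);
B = LOAD 'raw_data.txt' USING PigStorage(',') AS (id:chararray, val:chararray);

-- Every row of A paired with every row of B: n^2 rows in total.
pairs = CROSS A, B;

-- Keep each unordered pair once and drop self-pairs, leaving n*(n-1)/2 rows.
-- This only works if id is unique per row.
unique_pairs = FILTER pairs BY A::id < B::id;
```

The filter is optional. If the comparison is not symmetric, or a row should
also be compared with itself, use 'pairs' directly and accept the full n^2
rows.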

On 1/19/12 5:57 AM, "Michael Lok" <[email protected]> wrote:

>Hi Dmitriy,
>
>Am I correct in saying that all rows in "results" are inside a bag when
>passed into the UDF?
>
>On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]>
>wrote:
>> results = foreach (group raw all) generate MyUdf(raw);
>>
>> Input to the UDF will be a tuple with a single field. This field will
>> be a bag of tuples. Each of those tuples is one of your raw rows.
>>
>> Note that this forces everything into memory and isn't scalable...
>>
>>
>>
>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I've got one resultset in which I need to compare each row with all
>>> the other rows within the same resultset.  For example:
>>>
>>> R1
>>> R2
>>> R3
>>> R4
>>> R5
>>>
>>> Take R1, I'll need to compare R1 with all rows from R2-R5.  The
>>> comparison will be written in a UDF.  Here's what I have so far:
>>>
>>> ============================================
>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>
>>> RAW_2 = foreach RAW generate *;
>>>
>>> PROCESSED = foreach RAW {
>>>    /* perform comparison here */
>>> };
>>> ============================================
>>>
>>> I'm stuck at the filtering inside the nested block.  How should I go
>>> about comparing the rows there?
>>>
>>> Any help is greatly appreciated.
>>>
>>>
>>> Thanks!
>>>
