Hi Scott,

I think the cross join approach will work, but I don't think my HDFS storage
has sufficient space to hold the result of the join, as my data already has
12 million rows.
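
For a sense of scale, here is a back-of-envelope calculation in Python. The
row count is the one above; the 100 bytes per joined row is purely an assumed
figure for illustration, not a measurement of the real data.

```python
# Rough size of a CROSS self-join on n rows.
# ASSUMPTION: bytes_per_row is a hypothetical figure, not measured.

def cross_join_rows(n):
    """Rows produced by a full self-cross: every row paired with every row."""
    return n * n

def unordered_pairs(n):
    """Distinct pairs when (a, b) and (b, a) count once and no row pairs with itself."""
    return n * (n - 1) // 2

n = 12_000_000        # row count of the dataset
bytes_per_row = 100   # hypothetical size of one joined row

full = cross_join_rows(n)
half = unordered_pairs(n)

print(f"full cross join: {full:,} rows")
print(f"unordered pairs: {half:,} rows")
print(f"full cross join at {bytes_per_row} B/row: {full * bytes_per_row / 1e15:.1f} PB")
```

Even keeping only one of each pair halves the output at best, so the result
stays far larger than the input.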

I will probably have to process the records in chunks.
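
A minimal sketch of the chunking idea, written in Python rather than Pig
purely to show the logic. The helper names `chunks` and `pairs_in_chunks` and
the chunk size are hypothetical; the actual comparison would still live in the
UDF, and on a cluster each pair of blocks would more likely be a separate pass
or job than a loop in one process.

```python
from itertools import combinations, product

def chunks(rows, size):
    """Split a list of rows into consecutive blocks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def pairs_in_chunks(rows, size):
    """Yield every unordered pair of rows exactly once, one block pair at a time.

    Only two blocks are compared at any moment, so the full n^2 result
    never has to be materialised at once.
    """
    blocks = list(chunks(rows, size))
    for i, left in enumerate(blocks):
        # Pairs whose two rows fall inside the same block.
        yield from combinations(left, 2)
        # Pairs between this block and each later block.
        for right in blocks[i + 1:]:
            yield from product(left, right)

rows = ["R1", "R2", "R3", "R4", "R5"]
for a, b in pairs_in_chunks(rows, size=2):
    # The real comparison (the UDF logic) would go here.
    print(a, b)
```

With the five example rows this visits each of the 10 distinct pairs once,
regardless of the chunk size chosen.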


Thanks 



On Jan 20, 2012, at 2:36, Scott Carey <[email protected]> wrote:

> If your goal is to compare all rows with all other rows, you can do a
> distributed CROSS self-join.
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#CROSS
> 
> Something like 
> 
> exploded = CROSS data, data;
> 
> which will produce n^2 rows, where n is the number of rows in the alias
> 'data'.
> 
> Then you would have each row paired with each other row in your result.
> 
> I haven't tried this myself on a larger dataset -- the n^2 data explosion
> is something to be wary of.
> 
> On 1/19/12 5:57 AM, "Michael Lok" <[email protected]> wrote:
> 
>> Hi Dmitriy,
>> 
>> Am I correct to say that all rows in "results" are inside a bag when
>> passed into the UDF?
>> 
>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>>> results = foreach (group raw all) generate MyUdf(raw);
>>> 
>>> Input to the UDF will be a tuple with a single field. This field will be
>>> a bag of tuples. Each of those tuples is one of your raw rows.
>>> 
>>> Note that this forces everything into memory and isn't scalable...
>>> 
>>> 
>>> 
>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>> 
>>>> Hi folks,
>>>> 
>>>> I've got one resultset which I need to run a comparison with all the
>>>> rows within the same resultset.  For example:
>>>> 
>>>> R1
>>>> R2
>>>> R3
>>>> R4
>>>> R5
>>>> 
>>>> Take R1, I'll need to compare R1 with all rows from R2-R5.  The
>>>> comparison will be written in a UDF.  Here's what I have so far:
>>>> 
>>>> ============================================
>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>> 
>>>> RAW_2 = foreach RAW generate *;
>>>> 
>>>> PROCESSED = foreach RAW {
>>>>   /* perform comparo here */
>>>> };
>>>> ============================================
>>>> 
>>>> I'm stuck at the filtering inside the nested block.  How should I go
>>>> about comparing the rows there?
>>>> 
>>>> Any help is greatly appreciated.
>>>> 
>>>> 
>>>> Thanks!
>>>> 
> 
