If your goal is to compare every row with every other row, you can do a distributed CROSS self-join.
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#CROSS
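To get a feel for what a cross self-join yields before running one on a cluster, here is a minimal local sketch in plain Python (not Pig). It assumes five placeholder rows named R1 to R5, as in the original question. The helper names cross_self_join and distinct_pairs are made up for illustration and are not Pig or library functions.

```python
from itertools import product


def cross_self_join(rows):
    """Pair every row with every row (itself included), like crossing a relation with itself."""
    return list(product(rows, rows))


def distinct_pairs(rows):
    """Keep each unordered pair once and drop self-pairs (row i with rows after i)."""
    return [(a, b) for i, a in enumerate(rows) for b in rows[i + 1:]]


# Placeholder rows standing in for the real records (assumption for illustration).
rows = ["R1", "R2", "R3", "R4", "R5"]

exploded = cross_self_join(rows)
print(len(exploded))              # 25 = n^2 pairs, including (R1, R1)
print(len(distinct_pairs(rows)))  # 10 = n*(n-1)/2 pairs, e.g. (R1, R2) but not (R2, R1)
```

If the comparison is symmetric, keeping only the unordered pairs roughly halves the work, but the growth is still quadratic in n.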
Something like

    exploded = CROSS data, data;

which will produce n^2 rows, where n is the number of rows in the alias 'data'. Depending on the Pig version, you may have to load the data twice under two aliases rather than cross an alias with itself. Each row then ends up paired with every row in the result, including itself. I haven't tried this myself on a larger dataset, and the n^2 data explosion is something to be wary of.

On 1/19/12 5:57 AM, "Michael Lok" <[email protected]> wrote:

>Hi Dmitriy,
>
>Am I correct to say that all rows in "results" is inside a bag when
>passed into the UDF?
>
>On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]>
>wrote:
>> results = foreach (group raw all) generate MyUdf(raw)
>>
>> input to the udf will be a tuple with a single field. This field will be a
>> bag of tuples. Each of those tuples is one of your raw rows.
>>
>> Note that this forces everything into memory and isn't scalable...
>>
>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I've got one resultset which I need to run a comparison with all the
>>> rows within the same resultset. For example:
>>>
>>> R1
>>> R2
>>> R3
>>> R4
>>> R5
>>>
>>> Take R1, I'll need to compare R1 with all rows from R2-R5. The
>>> comparison will be written in a UDF. Here's what I have so far:
>>>
>>> ============================================
>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>
>>> RAW_2 = foreach RAW generate *;
>>>
>>> PROCESSED = foreach RAW {
>>>     /* perform comparo here */
>>> };
>>> ============================================
>>>
>>> I'm stuck at the filtering inside the nested block. How should I go
>>> about the comparing the rows there?
>>>
>>> Any help is greatly appreciated.
>>>
>>> Thanks!
>>>
