Hi Dmitriy,

Am I correct to say that all rows in "results" are inside a bag when passed into the UDF?

On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
> results = foreach (group raw all) generate MyUdf(raw)
>
> input to the udf will be a tuple with a single field. This field will be a
> bag of tuples. Each of those tuples is one of your raw rows.
>
> Note that this forces everything into memory and isn't scalable...
>
>
>
> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've got one resultset which I need to run a comparison with all the
>> rows within the same resultset. For example:
>>
>> R1
>> R2
>> R3
>> R4
>> R5
>>
>> Take R1, I'll need to compare R1 with all rows from R2-R5. The
>> comparison will be written in a UDF. Here's what I have so far:
>>
>> ============================================
>> RAW = load 'raw_data.txt' using PigStorage(',');
>>
>> RAW_2 = foreach RAW generate *;
>>
>> PROCESSED = foreach RAW {
>>     /* perform comparo here */
>> };
>> ============================================
>>
>> I'm stuck at the filtering inside the nested block. How should I go
>> about the comparing the rows there?
>>
>> Any help is greatly appreciated.
>>
>>
>> Thanks!
>>
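
For illustration, the pairwise comparison that a UDF could run over the bag described above might be sketched as follows. This is plain Python, not the Pig UDF API: the bag is modeled as a list of row tuples, and compare_rows and compare_all_pairs are hypothetical names, with compare_rows standing in for whatever the real comparison is. Like the group-all approach it mirrors, it holds every row in memory at once.

```python
from itertools import combinations


def compare_rows(a, b):
    """Hypothetical comparison: count the positions where two rows hold equal values."""
    return sum(1 for x, y in zip(a, b) if x == y)


def compare_all_pairs(bag):
    """Compare every row in the bag with every row that comes after it.

    `bag` stands in for the single bag field the UDF receives:
    an iterable of row tuples.
    """
    rows = list(bag)  # materializes the whole bag in memory
    return [
        (i, j, compare_rows(rows[i], rows[j]))
        for i, j in combinations(range(len(rows)), 2)
    ]


# Three made-up rows in place of the contents of raw_data.txt.
raw = [("a", 1), ("a", 2), ("b", 2)]
pairs = compare_all_pairs(raw)
```

Each result triple holds the two row positions and the comparison value. For n rows this performs n*(n-1)/2 comparisons, so R1 is compared with R2-R5, R2 with R3-R5, and so on.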
