Hi Alan,

Missed your suggestion earlier :) Even with a sample of just 30k records, the cross join completely filled up the disk space I have :(
Will try your suggestion next.

Thanks!

On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <[email protected]> wrote:
>
> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>
>> Hi Dmitriy,
>>
>> Am I correct to say that all rows in "results" is inside a bag when
>> passed into the UDF?
>
> Yes. The other issue you'll face here is that if you have more than one map
> task each map task will be comparing against a different first record, which
> probably isn't what you want.
>
> The best way to do this would probably be to write a UDF that opens the file
> directly in HDFS and reads the first record. It can then compare each input
> record against the first record without needing to hold all of the records in
> memory and with every map seeing the same first record.
>
> So your script would look like:
>
> A = load 'file';
> B = foreach A generate yourudf('file', *);
> ...
>
> Ideally the UDF should store the side file in the distributed cache to avoid
> too many maps opening the file at once, but you can add that once you get the
> base feature working.
>
> Alan.
>
>
>>
>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> results = foreach (group raw all) generate MyUdf(raw)
>>>
>>> input to the udf will be a tuple with a single field. This field will be a
>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>
>>> Note that this forces everything into memory and isn't scalable...
>>>
>>>
>>>
>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I've got one resultset which I need to run a comparison with all the
>>>> rows within the same resultset. For example:
>>>>
>>>> R1
>>>> R2
>>>> R3
>>>> R4
>>>> R5
>>>>
>>>> Take R1, I'll need to compare R1 with all rows from R2-R5. The
>>>> comparison will be written in a UDF. Here's what I have so far:
>>>>
>>>> ============================================
>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>
>>>> RAW_2 = foreach RAW generate *;
>>>>
>>>> PROCESSED = foreach RAW {
>>>>     /* perform comparo here */
>>>> };
>>>> ============================================
>>>>
>>>> I'm stuck at the filtering inside the nested block. How should I go
>>>> about the comparing the rows there?
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>>
>>>> Thanks!
>>>>
>
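
For reference, the approach Alan describes (read the first record once from a side file, then compare each incoming record against it while streaming) can be sketched roughly as below. This is plain Python showing only the logic, not an actual Pig UDF: the function names are hypothetical, the local file stands in for the HDFS file, and the comparison (counting equal fields) is just a placeholder for whatever the real comparison does.

```python
import csv
from functools import lru_cache


@lru_cache(maxsize=None)
def load_first_record(path):
    """Read and cache only the first comma-separated record of the side file."""
    with open(path, newline="") as handle:
        return tuple(next(csv.reader(handle)))


def compare_to_first(path, record):
    """Placeholder comparison: count the fields that equal the first record's."""
    first = load_first_record(path)
    return sum(1 for a, b in zip(first, record) if a == b)


def process(path):
    """Stream the file one row at a time, yielding (record, score) pairs."""
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            yield record, compare_to_first(path, record)
```

Because the first record is cached after a single read, every call sees the same reference row and nothing beyond the current row needs to be held in memory. A cross join of 30k rows with itself, by contrast, produces 30,000 x 30,000 = 900 million pairs, which is consistent with the disk usage blowing up.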
