Hi Alan,

Quick question: do I use HDataStorage and HFile to read files from HDFS within the UDF?
Thanks.

On Fri, Jan 20, 2012 at 10:12 AM, Michael Lok wrote:
> Hi Alan,
>
> Missed your suggestion earlier :) After having a sample size of just
> 30k records, performing a cross join totally killed the disk space I
> have :(
>
> Will try your suggestion next.
>
> Thanks!
>
> On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates wrote:
>>
>> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>>
>>> Hi Dmitriy,
>>>
>>> Am I correct to say that all rows in "results" are inside a bag when
>>> passed into the UDF?
>>
>> Yes. The other issue you'll face here is that if you have more than one map
>> task each map task will be comparing against a different first record, which
>> probably isn't what you want.
>>
>> The best way to do this would probably be to write a UDF that opens the file
>> directly in HDFS and reads the first record. It can then compare each input
>> record against the first record without needing to hold all of the records
>> in memory and with every map seeing the same first record.
>>
>> So your script would look like:
>>
>> A = load 'file';
>> B = foreach A generate yourudf('file', *);
>> ...
>>
>> Ideally the UDF should store the side file in the distributed cache to avoid
>> too many maps opening the file at once, but you can add that once you get
>> the base feature working.
>>
>> Alan.
>>
>>>
>>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy wrote:
>>>> results = foreach (group raw all) generate MyUdf(raw)
>>>>
>>>> input to the udf will be a tuple with a single field. This field will be a
>>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>>
>>>> Note that this forces everything into memory and isn't scalable...
>>>>
>>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I've got one resultset which I need to run a comparison with all the
>>>>> rows within the same resultset. For example:
>>>>>
>>>>> R1
>>>>> R2
>>>>> R3
>>>>> R4
>>>>> R5
>>>>>
>>>>> Take R1, I'll need to compare R1 with all rows from R2-R5. The
>>>>> comparison will be written in a UDF. Here's what I have so far:
>>>>>
>>>>> ============================================
>>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>>
>>>>> RAW_2 = foreach RAW generate *;
>>>>>
>>>>> PROCESSED = foreach RAW {
>>>>>     /* perform comparo here */
>>>>> };
>>>>> ============================================
>>>>>
>>>>> I'm stuck at the filtering inside the nested block. How should I go
>>>>> about comparing the rows there?
>>>>>
>>>>> Any help is greatly appreciated.
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
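For reference, the approach Alan describes in the thread (read the first record of the side file once, then compare every input record against it without holding the whole data set in memory) could be sketched roughly as below. This is only an illustration under assumptions, not code from the thread: the class name FirstRecordComparer, the method names, and the "count matching fields" comparison are all hypothetical, and the sketch uses plain java.io so it runs without Pig or Hadoop on the classpath. In an actual UDF the same logic would presumably live in the exec(Tuple) method of a class extending org.apache.pig.EvalFunc, with the Reader obtained from an HDFS input stream instead of a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of "compare every record against the first record".
 * Deliberately free of Pig and Hadoop classes so it can run on its own.
 */
class FirstRecordComparer {
    private final String delimiter;
    private String[] firstRecord; // cached so the side file is read only once

    FirstRecordComparer(String delimiter) {
        this.delimiter = delimiter;
    }

    /** Reads only the first line of the side file and caches its fields. */
    void loadFirstRecord(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine();
            if (line == null) {
                throw new IllegalStateException("side file is empty");
            }
            // Quote the delimiter so it is treated literally, not as a regex.
            firstRecord = line.split(Pattern.quote(delimiter), -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Placeholder comparison (an assumption, the thread does not say what the
     * real comparison is): counts fields equal to the first record's field at
     * the same position.
     */
    int matchingFields(String[] record) {
        if (firstRecord == null) {
            throw new IllegalStateException("call loadFirstRecord first");
        }
        int limit = Math.min(firstRecord.length, record.length);
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (firstRecord[i].equals(record[i])) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        FirstRecordComparer comparer = new FirstRecordComparer(",");
        // Stand-in for the side file; only the first line "a,b,c" is read.
        comparer.loadFirstRecord(new StringReader("a,b,c\nx,y,z\n"));
        System.out.println(comparer.matchingFields(new String[] {"a", "q", "c"}));
    }
}
```

Caching the first record in a field means each map task reads the side file once rather than once per input record, which is in the spirit of Alan's note about not having too many maps open the file at once.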
