I would just use the HDFS interfaces directly; this is much easier. For an example of a UDF that opens an HDFS file, take a look at https://github.com/alanfgates/programmingpig/blob/master/udfs/java/com/acme/marketing/MetroResolver.java
Alan.

On Jan 20, 2012, at 12:28 AM, Michael Lok wrote:

> Hi Alan,
>
> Quick question. Do I use HDataStorage and HFile to read files from
> HDFS within the UDF?
>
> Thanks.
>
> On Fri, Jan 20, 2012 at 10:12 AM, Michael Lok <[email protected]> wrote:
>> Hi Alan,
>>
>> Missed your suggestion earlier :) After having a sample size of just
>> 30k records, performing a cross join totally killed the disk space I
>> have :(
>>
>> Will try your suggestion next.
>>
>> Thanks!
>>
>> On Fri, Jan 20, 2012 at 12:01 AM, Alan Gates <[email protected]> wrote:
>>>
>>> On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
>>>
>>>> Hi Dmitriy,
>>>>
>>>> Am I correct to say that all rows in "results" are inside a bag when
>>>> passed into the UDF?
>>>
>>> Yes. The other issue you'll face here is that if you have more than one
>>> map task, each map task will be comparing against a different first record,
>>> which probably isn't what you want.
>>>
>>> The best way to do this would probably be to write a UDF that opens the
>>> file directly in HDFS and reads the first record. It can then compare each
>>> input record against the first record without needing to hold all of the
>>> records in memory, and with every map seeing the same first record.
>>>
>>> So your script would look like:
>>>
>>> A = load 'file';
>>> B = foreach A generate yourudf('file', *);
>>> ...
>>>
>>> Ideally the UDF should store the side file in the distributed cache to
>>> avoid too many maps opening the file at once, but you can add that once you
>>> get the base feature working.
>>>
>>> Alan.
>>>
>>>> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> results = foreach (group raw all) generate MyUdf(raw);
>>>>>
>>>>> The input to the UDF will be a tuple with a single field. This field will be a
>>>>> bag of tuples. Each of those tuples is one of your raw rows.
>>>>>
>>>>> Note that this forces everything into memory and isn't scalable...
>>>>>
>>>>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok <[email protected]> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I've got one resultset which I need to run a comparison with all the
>>>>>> rows within the same resultset. For example:
>>>>>>
>>>>>> R1
>>>>>> R2
>>>>>> R3
>>>>>> R4
>>>>>> R5
>>>>>>
>>>>>> Take R1: I'll need to compare R1 with all rows from R2-R5. The
>>>>>> comparison will be written in a UDF. Here's what I have so far:
>>>>>>
>>>>>> ============================================
>>>>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>>>>
>>>>>> RAW_2 = foreach RAW generate *;
>>>>>>
>>>>>> PROCESSED = foreach RAW {
>>>>>>     /* perform comparo here */
>>>>>> };
>>>>>> ============================================
>>>>>>
>>>>>> I'm stuck at the filtering inside the nested block. How should I go
>>>>>> about comparing the rows there?
>>>>>>
>>>>>> Any help is greatly appreciated.
>>>>>>
>>>>>> Thanks!
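[Editor's note] Dmitriy's group-all approach hands the UDF a single bag holding every raw row, so comparing each row against all the others becomes a nested loop over that bag held in memory. A minimal sketch of that comparison loop, using a plain Java `List` as a stand-in for Pig's `DataBag` (the class and method names here are illustrative, not Pig API):

```java
import java.util.ArrayList;
import java.util.List;

public class AllPairsCompare {

    // Compare every row against every other row, the way a UDF over a
    // grouped-all bag would. Returns one "a|b" marker per ordered pair;
    // a real UDF would run its comparison logic here instead.
    public static List<String> comparePairs(List<String> rows) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            for (int j = 0; j < rows.size(); j++) {
                if (i != j) {
                    pairs.add(rows.get(i) + "|" + rows.get(j));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("R1", "R2", "R3");
        // 3 rows produce 3 * 2 = 6 ordered pairs; the work and the
        // memory footprint grow with n * (n - 1), which is why this
        // doesn't scale past small inputs.
        System.out.println(comparePairs(rows).size()); // 6
    }
}
```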

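[Editor's note] Alan's recommended pattern, read the side file's first record once, cache it, and compare each incoming row against it, can be sketched roughly as below. This is a hypothetical stand-in, not Alan's actual UDF: a real implementation would extend Pig's `EvalFunc` and open the path through Hadoop's `FileSystem`, while here `java.io` substitutes for the HDFS stream so the caching logic is runnable on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class FirstRecordComparator {

    private final BufferedReader source; // stand-in for an HDFS input stream
    private String firstRecord;          // cached after the first call

    public FirstRecordComparator(BufferedReader source) {
        this.source = source;
    }

    // Lazily read the side file's first record on the first call, then
    // compare each row against that cached record. Nothing else from
    // the input is ever held in memory.
    public boolean matchesFirst(String row) throws IOException {
        if (firstRecord == null) {
            firstRecord = source.readLine();
        }
        return firstRecord.equals(row);
    }

    public static void main(String[] args) throws IOException {
        // Fake side file with the same rows the script loads.
        BufferedReader fake =
            new BufferedReader(new StringReader("R1\nR2\nR3\n"));
        FirstRecordComparator cmp = new FirstRecordComparator(fake);
        System.out.println(cmp.matchesFirst("R1")); // true
        System.out.println(cmp.matchesFirst("R2")); // false, compared to R1
    }
}
```

Because each call compares against the same cached record, every map task invoking the UDF on `'file'` sees an identical first record, which is the property Alan points out the group-all approach lacks.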