On Jan 19, 2012, at 5:57 AM, Michael Lok wrote:
> Hi Dmitriy,
>
> Am I correct to say that all rows in "results" are inside a bag when
> passed into the UDF?
Yes. The other issue you'll face here is that if you have more than one map
task, each map task will be comparing against a different first record, which
probably isn't what you want.
The best way to do this would probably be to write a UDF that opens the file
directly in HDFS and reads the first record. It can then compare each input
record against the first record without needing to hold all of the records in
memory and with every map seeing the same first record.
So your script would look like:
A = load 'file';
B = foreach A generate yourudf('file', *);
...
Ideally the UDF should store the side file in the distributed cache to avoid
too many maps opening the file at once, but you can add that once you get the
base feature working.
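
A minimal sketch of the comparison logic such a UDF would carry, in plain Java
so it runs with only the JDK. Everything named here is hypothetical:
FirstRecordComparer, fromReader, fromFile and matchingFields are illustration
names, the records are assumed to be comma-delimited text, and "count the
fields that match" is only a placeholder for whatever comparison you actually
need. In a real Pig UDF this logic would sit in the exec method of a class
extending org.apache.pig.EvalFunc, and the file would be opened through
Hadoop's FileSystem API rather than java.nio; both are left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for the UDF's core logic: read the first record once,
// then compare every later record against it without buffering the data set.
final class FirstRecordComparer {
    // Fields of the first record, read once and reused for every comparison.
    private final String[] first;

    private FirstRecordComparer(String[] first) {
        this.first = first;
    }

    // Reads only the first line of the source, then closes it.
    static FirstRecordComparer fromReader(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line = in.readLine();
            if (line == null) {
                throw new IllegalArgumentException("side file is empty");
            }
            // Limit -1 keeps trailing empty fields.
            return new FirstRecordComparer(line.split(",", -1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Local-file variant; assumption: a real UDF would open the HDFS path instead.
    static FirstRecordComparer fromFile(String path) {
        try {
            return fromReader(
                Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Placeholder comparison: number of positions where the record's field
    // equals the corresponding field of the first record.
    int matchingFields(String record) {
        String[] fields = record.split(",", -1);
        int n = Math.min(first.length, fields.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (first[i].equals(fields[i])) {
                matches++;
            }
        }
        return matches;
    }
}
```

The point of the shape is that the first record is read once per task and
cached in a field, so each call only touches the record it was handed. Every
map task reads the same file, so they all compare against the same first record.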
Alan.
>
> On Thu, Jan 19, 2012 at 7:23 PM, Dmitriy Ryaboy wrote:
>> results = foreach (group raw all) generate MyUdf(raw)
>>
>> input to the udf will be a tuple with a single field. This field will be a
>> bag of tuples. Each of those tuples is one of your raw rows.
>>
>> Note that this forces everything into memory and isn't scalable...
>>
>>
>>
>> On Thu, Jan 19, 2012 at 12:54 AM, Michael Lok wrote:
>>
>>> Hi folks,
>>>
>>> I've got one resultset which I need to run a comparison with all the
>>> rows within the same resultset. For example:
>>>
>>> R1
>>> R2
>>> R3
>>> R4
>>> R5
>>>
>>> Take R1, I'll need to compare R1 with all rows from R2-R5. The
>>> comparison will be written in a UDF. Here's what I have so far:
>>>
>>> ============================================
>>> RAW = load 'raw_data.txt' using PigStorage(',');
>>>
>>> RAW_2 = foreach RAW generate *;
>>>
>>> PROCESSED = foreach RAW {
>>> /* perform comparo here */
>>> };
>>> ============================================
>>>
>>> I'm stuck at the filtering inside the nested block. How should I go
>>> about comparing the rows there?
>>>
>>> Any help is greatly appreciated.
>>>
>>>
>>> Thanks!
>>>