Adam,
Take a look at the CountEach UDF in the DataFu library
(http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/CountEach.html).
E.g.:
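-- Assumes the DataFu jar has already been registered with REGISTER.
define CountEach datafu.pig.bags.CountEach();

res = foreach raw3 {
    -- project each field of interest into its own bag
    field4bag = foreach raw2 generate field4;
    field5bag = foreach raw2 generate field5;
    -- count the occurrences of each distinct value
    field4cnts = CountEach(field4bag);
    field5cnts = CountEach(field5bag);
    -- keep the single tuple with the highest count (column 1)
    field4max = TOP(1, 1, field4cnts);
    field5max = TOP(1, 1, field5cnts);
    generate
        flatten(group) as (field1, field2, field3),
        flatten(field4max.tuple_schema.$0) as field4max,
        flatten(field5max.tuple_schema.$0) as field5max;
};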
This generates (1,2,3,(a),(x)) for your input. You can add further projections
downstream to rearrange the fields however you like.
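
If it helps to see the logic outside of Pig, here is a minimal plain-Python sketch of the same count-then-take-the-top idea, run on the sample bag from your question. It is only an illustration, not Pig or DataFu code: the helper name most_frequent is hypothetical, and ties are broken by Python's Counter (first value seen wins), which may not match how TOP orders tied tuples.

```python
from collections import Counter

# Sample bag from the question; columns are
# (field1, field2, field4, field3, field5), as in the raw2 schema.
raw2 = [
    (1, 2, "a", 3, "x"),
    (1, 2, "b", 3, "x"),
    (1, 2, "a", 3, "x"),
    (1, 2, "v", 3, "x"),
    (1, 2, "f", 3, "z"),
    (1, 2, "a", 3, "z"),
]


def most_frequent(bag, index):
    """Hypothetical helper: count each distinct value in one column
    (the CountEach step), then keep the value with the highest count
    (the TOP(1, ...) step)."""
    counts = Counter(row[index] for row in bag)
    value, _count = counts.most_common(1)[0]
    return value


group = (1, 2, 3)  # (field1, field2, field3)
result = group + (most_frequent(raw2, 2), most_frequent(raw2, 4))
print(result)  # (1, 2, 3, 'a', 'x')
```

Here field4 has a=3, b=1, v=1, f=1 and field5 has x=4, z=2, so the most frequent values are a and x.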
Best of luck.
--jacob
@thedatachef
On Mar 20, 2014, at 5:59 AM, Adamantios Corais <[email protected]>
wrote:
> Hi,
>
> I have the following schema: raw3: {group: (field1: int,field2:
> chararray,field3: chararray),raw2: {(field1: int,field2: chararray,field4:
> chararray,field3: chararray,field5: boolean)}}
>
> What I am trying to do is extract the most frequent value of field4 and the
> most frequent value of field5 along with the group fields (field1, field2,
> field3).
>
> I know that GROUP BY is not allowed (yet) within FOREACH statements. How can I
> accomplish the same functionality without writing a UDF?
>
> Example:
>
> input: ((1,2,3),
> {(1,2,a,3,x),(1,2,b,3,x),(1,2,a,3,x),(1,2,v,3,x),(1,2,f,3,z),(1,2,a,3,z)})
>
> output: (1,2,a,3,x)
>
> Thank you,
> Adam.