Hi all,
I need some help with Pig. The requirement is to generate the top X
records for each group. I can easily do this in a Pig script by ordering
DESC and then limiting to X. If there are more than X records in a
group, I need to aggregate the rest into a single record. How can I achieve
this?
I am generating the top X as below:
kwgroup = GROUP kws BY (type, category);
topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    ltd = LIMIT sorted 5;
    GENERATE FLATTEN(ltd);
}
For aggregating the rest, I was thinking of sorting ASC and truncating at
(TotalCount - X), then aggregating these records. How can I get the
TotalCount of records in a group? I tried the below, but it fails:
bottomkws = FOREACH kwgroup_cnt_gt_top {
    sorted_asc = ORDER kws BY visits ASC;
    ltd_bottom = LIMIT sorted_asc (COUNT(kws) - 5);
    GENERATE FLATTEN(ltd_bottom);
}
This fails with an error message saying that an INTEGER should be used
instead of COUNT(kws).
Is it better to do this using a UDF? In that case the UDF will have to sort,
limit, and aggregate. Could you point me to some samples that take a group of
records and return a group (bag)?
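To make the intended behaviour concrete, here is a rough sketch in plain Python of what such a function would have to do for one group. It is not a working Pig UDF: the function name, the (keyword, visits) tuple layout, and the "OTHER" label are placeholders of mine, and I am assuming that "aggregate" means summing the visits of the remaining records.

```python
def top_x_with_rest(records, x=5, rest_label="OTHER"):
    """Return the top-x records by visits, plus one summed record for the rest.

    records: list of (keyword, visits) tuples for a single group
             (hypothetical layout, assumed for illustration).
    """
    # Sort by visits, highest first (same as ORDER ... BY visits DESC).
    ordered = sorted(records, key=lambda r: r[1], reverse=True)

    # Keep the first x records (same as LIMIT x).
    top = ordered[:x]

    # Everything past position x is collapsed into a single record.
    rest = ordered[x:]
    if rest:
        top.append((rest_label, sum(visits for _, visits in rest)))
    return top


# Sample group with 7 records and x = 5: the two smallest are merged.
group = [("a", 10), ("b", 40), ("c", 5), ("d", 30), ("e", 20), ("f", 1), ("g", 2)]
result = top_x_with_rest(group, x=5)
```

With the sample group above, the five largest records are kept as they are and the remaining two (2 and 1 visits) are merged into one record with 3 visits. A group with X or fewer records comes back unchanged.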
Any help in this regard is appreciated.
Thanks
Sheeba