Hello everyone,
I have a FOREACH statement with a nested ORDER BY inside it, and I pass the
ordered bag to a UDF. For example:
logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
service_flavors = FOREACH logs_g {
    t = ORDER logs BY status;
    GENERATE group.date AS dates, group.site AS site, group.profile AS profile,
             FLATTEN(MY_UDF(t)) AS (generic_status);
};
The problem is that I get duplicate results. As I understand it, MY_UDF runs on
the mappers, but shouldn't each mapper take one group from logs_g? Is something
wrong with the ORDER BY? I tried adding PARALLEL to the ORDER BY, but I get
syntax errors.
The problem goes away if I change the grouping to
GROUP logs BY (date, site, profile) PARALLEL 1, but this is not a scalable
solution. Can someone help me, please? I am using Pig 0.11.
Cheers,
Anastasis