Where exactly are you getting duplicates? I'm not sure I understand your
question. Can you give an example please?
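
If it helps, a small check along these lines would show which keys come out
more than once. It is only a sketch: it assumes the alias service_flavors and
the field names dates, site and profile from your script, and dup_check,
dup_counts and dups are names made up for illustration.

```pig
-- Assumes service_flavors has already been defined as in the quoted script.
-- dup_check, dup_counts and dups are hypothetical aliases.

-- Regroup the output by the original grouping key.
dup_check = GROUP service_flavors BY (dates, site, profile);

-- Count how many output rows each key produced.
dup_counts = FOREACH dup_check GENERATE group, COUNT(service_flavors) AS n;

-- Keep only the keys that appear more than once.
dups = FILTER dup_counts BY n > 1;

DUMP dups;
```

Keep in mind that FLATTEN emits one row per tuple when MY_UDF returns a bag,
so n > 1 is not by itself proof of a bug. Comparing this output for PARALLEL 1
and PARALLEL 2 would show whether the extra rows depend on the reducer count.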


On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
andronat_...@hotmail.com> wrote:

> Hello everyone,
>
> I have a FOREACH statement with a nested ORDER BY inside it, and I pass
> the ordered bag to a UDF. For example:
>
>
> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>
> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>
> service_flavors = FOREACH logs_g {
>         t = ORDER logs BY status;
>         GENERATE group.date as dates, group.site as site,
>                  group.profile as profile,
>                  FLATTEN(MY_UDF(t)) as (generic_status);
> };
>
> The problem is that I get duplicate results. I know that MY_UDF is
> running on mappers, but shouldn't each mapper take one group from logs_g?
> Is something wrong with the ORDER BY? I tried adding PARALLEL to the
> ORDER BY, but I get syntax errors.
>
> The problem goes away if I write GROUP logs BY (date, site, profile)
> PARALLEL 1; but this is not a scalable solution. Can someone help me,
> please? I am using Pig 0.11.
>
> Cheers,
> Anastasis
