I am trying to aggregate over the cross product of two relations. This can
be done in a single M/R job, but Pig is using two. The Pig code looks like
this:
C = cross A, B;
C = filter C by ...;
G = group C by x;
G = foreach G generate group, COUNT(C);
The resulting M/R plan is this:
Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless) ->
Reduce2 (COUNT)
Of course, the I/O between Reduce1 and Map2 is massive. The job is only
efficient if it runs like this:
Map1 (LOAD, CROSS) -> Combine1(FILTER, COUNT) -> Reduce1(COUNT)
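To make the plan I have in mind concrete, here is a minimal Python sketch of it. It is not Pig and says nothing about how Pig actually plans jobs. It assumes B is small enough to be handed whole to every mapper, and the data, the `keep` predicate, and the `key` function are hypothetical stand-ins for my real relations, filter condition, and group key `x`.

```python
from collections import Counter
from itertools import product


def map_cross_filter_count(a_split, b_rows, keep, key):
    """Map + combine for one split of A.

    Crosses the split with all of B, drops pairs that fail the filter,
    and emits only one partial count per key instead of every pair.
    """
    partial = Counter()
    for a, b in product(a_split, b_rows):
        if keep(a, b):
            partial[key(a, b)] += 1
    return partial


def reduce_counts(partials):
    """Reduce: sum the partial counts coming from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical data standing in for relations A and B.
A = [1, 2, 3, 4]
B = [10, 20, 30]

# Two input splits of A, as two mappers would see them.
splits = [A[:2], A[2:]]

# Hypothetical filter condition and group key.
keep = lambda a, b: (a + b) % 2 == 0
key = lambda a, b: b

partials = [map_cross_filter_count(s, B, keep, key) for s in splits]
counts = reduce_counts(partials)
```

The point is that only the small per-key partial counts cross the shuffle. The full cross product never leaves the mapper that produced it.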
Is there some way to force Pig to use this M/R plan? Or do I have to
write my own M/R job?
Thanks!