What really happens after a group?

Rodrigo Ferreira Sat, 12 Jul 2014 11:24:55 -0700

Hi everyone,

I have a question.


As far as I understand from the book "Programming Pig", after a GROUP
all the records with the same key go to the same reducer. So far so
good.

This allows us to write a statement like this:

*foreach grpd generate group, COUNT(input)*

which should count the elements *per* key.

Then comes my issue. I have a script like this:

B = GROUP A BY key PARALLEL p;
C = FILTER B BY NOT IsEmpty(A);
D = FOREACH C GENERATE FLATTEN(MyFunction(A)) AS (mySchema);

If I go through all the tuples in the bag handed to *MyFunction*, I see
elements with different keys (although they are sorted)! Am I doing
something wrong? What am I missing here?

So far, I'm managing this by checking when the key changes and then
computing my results on a per-key basis. But I'm not sure whether this is
OK or a kind of hack.
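
For illustration only, the key-change workaround described above can be
sketched in Python. This is not the actual UDF: the function name
process_per_key, the per-key computation (a simple count), and the
assumption that the key is the first field of each tuple are all
hypothetical. It only relies on the tuples arriving sorted by key, as
observed above.

```python
from itertools import groupby
from operator import itemgetter


def process_per_key(bag, key_index=0):
    """Split a key-sorted sequence of tuples into runs of equal keys.

    Hypothetical helper: assumes the tuples are already sorted by key
    and that the key sits at position key_index in each tuple.
    """
    results = []
    # groupby starts a new run every time the key value changes,
    # which is the same "detect the key change" logic done by hand.
    for key, run in groupby(bag, key=itemgetter(key_index)):
        # Placeholder per-key computation: count the tuples in the run.
        results.append((key, sum(1 for _ in run)))
    return results


if __name__ == "__main__":
    bag = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
    print(process_per_key(bag))  # [('a', 2), ('b', 1), ('c', 2)]
```

Note that this approach is only correct while the input stays sorted by
key; if the same key could appear in two separate runs, it would be
reported twice.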

Thank you!

Rodrigo Ferreira.
