Hi everyone, I have a question.
As far as I understood from the book "Programming Pig", after a GROUP all the records with the same key go to the same reducer. So far so good. This allows us to write a statement like *foreach grpd generate group, COUNT(input)*, which should count the elements *per* key.

Here is my issue. I have a script like this:

    B = GROUP A BY key PARALLEL p;
    C = FILTER B BY NOT IsEmpty(A);
    D = FOREACH C GENERATE FLATTEN(MyFunction(A)) AS (mySchema);

When I iterate over all the tuples in the bag handed to *MyFunction*, I see elements with different keys (although they are sorted)! Am I doing something wrong? What am I missing here?

So far, I'm working around this by checking when the key changes and then computing my results on a per-key basis. But I'm not sure whether this is OK or a kind of hack.

Thank you!

Rodrigo Ferreira.