Hrm, this is obviously my bad. The right-hand dataset (B) simply had multiple
rows for the same key, so the join emitted one row per match and the count
grew. Sorry if someone has taken the time to read the garbage.
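
For anyone hitting the same thing, here is a minimal sketch of how the
duplicate keys could be spotted before joining. It assumes the same
'/data/userdb' layout as in the quoted message; the aliases B_by_uid,
B_counts and B_dups are made up for illustration.

```pig
-- Load the right-hand side of the join (same schema as in the quoted message).
B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray, birth_year:int);

-- Count how many rows each uid has.
B_by_uid = GROUP B BY uid;
B_counts = FOREACH B_by_uid GENERATE group AS uid, COUNT(B) AS n;

-- Any uid listed here matches more than once in the join, so every
-- row of A with that userid is repeated n times in the join output.
B_dups = FILTER B_counts BY n > 1;
DUMP B_dups;
```

If B_dups comes back empty, the extra rows have some other cause. If it
does not, which of the duplicate rows to keep depends on the data.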

Cheers,
-Marco

On Tue, Jun 26, 2012 at 3:35 PM, Marco Cadetg <[email protected]> wrote:

> Hi there,
>
> I'm doing a join like this:
>
> A = LOAD '/data/sessions' USING PigStorage(',') AS
> (userid:chararray, client_type:chararray, flag:long);
>
> A1 = GROUP A ALL;
> A1 = FOREACH A1 GENERATE COUNT(A);
> DUMP A1;
> (543872)
>
> B = LOAD '/data/userdb'  USING PigStorage(',') AS (uid:chararray,
> birth_year:int);
> A = JOIN A by userid, B by uid;
> A1 = GROUP A ALL;
> A1 = FOREACH A1 GENERATE COUNT(A);
> DUMP A1;
> (1079122)
>
> Now the dataset has more rows after the join than before, which is basically
> the opposite of what I expected, since not all userids in A have a matching
> uid in B.
>
> Does anyone have a hint as to what the problem is?
>
> Thanks,
> -Marco
>
