Hi there,
I'm doing a join like this:
A = LOAD '/data/sessions' USING PigStorage(',') AS
(userid:chararray, client_type:chararray, flag:long);
A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(543872)
B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray,
birth_year:int);
A = JOIN A by userid, B by uid;
A1 = GROUP A ALL;
A1 = FOREACH A1 GENERATE COUNT(A);
DUMP A1;
(1079122)
The joined dataset now has more rows than A had before the join (1079122
vs. 543872), which is the opposite of what I expected: not every userid in
A has a matching uid in B, so I expected the join to return fewer rows,
not more.
Does anyone have a hint as to what the problem is?
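One thing I have not ruled out yet is duplicate uids in B: a join emits one
output row for every matching pair, so a uid that appears more than once in
B would multiply the matching rows of A. Below is a minimal sketch of how I
could check this, assuming B is still the relation loaded above. The aliases
B_by_uid, B_counts and B_dups are just names I made up for this check.

```pig
-- Group the user table by its join key and count the rows per uid.
B_by_uid = GROUP B BY uid;
B_counts = FOREACH B_by_uid GENERATE group AS uid, COUNT(B) AS n;
-- Keep only uids that occur more than once in B.
B_dups = FILTER B_counts BY n > 1;
DUMP B_dups;
```

If this prints anything, B is not unique on uid, which would be one possible
explanation for a row count above 543872.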
Thanks,
-Marco