Hrm, this is obviously my bad. The right-hand dataset (B) simply had multiple rows per key, so each matching row in A was emitted once per duplicate. Sorry if anyone took the time to read the garbage.

Cheers,
-Marco

On Tue, Jun 26, 2012 at 3:35 PM, Marco Cadetg <[email protected]> wrote:

> Hi there,
>
> I'm doing a join like this:
>
> A = LOAD '/data/sessions' USING PigStorage(',') AS
> (userid:chararray, client_type:chararray, flag:long);
>
> A1 = GROUP bettyy_sessions ALL;
> A1 = FOREACH A1 GENERATE COUNT(A);
> DUMP A1
> (543872)
>
> B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray,
> birth_year:int);
> A = JOIN A by userid, B by uid;
> A1 = GROUP bettyy_sessions ALL;
> A1 = FOREACH A1 GENERATE COUNT(A);
> DUMP A1
> (1079122)
>
> Now the dataset has more rows than before the join which is basically the
> opposite of what I'm expecting as not all userids on A do have a uid on the
> B dataset.
>
> Does anyone of you do have a hint what the problem here is?
>
> Thanks,
> -Marco
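
For anyone who runs into the same symptom: an inner JOIN emits one row per matching pair, so the output can be larger than the left input whenever the right-hand relation repeats a key. Below is a minimal sketch of how one might check for that, reusing the B relation and uid field from the quoted script. The aliases B_grp, B_cnt and B_dup are names made up here for illustration, not part of the original job.

```pig
-- Load the user table with the same schema as in the quoted script.
B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray, birth_year:int);

-- Count how many rows share each uid.
B_grp = GROUP B BY uid;
B_cnt = FOREACH B_grp GENERATE group AS uid, COUNT(B) AS n;

-- Keep only the uids that occur more than once; these are the keys
-- that multiply rows when B is joined against another relation.
B_dup = FILTER B_cnt BY n > 1;
DUMP B_dup;
```

If B_dup comes back non-empty, every row of A with one of those userids appears n times in the join result, which would account for a row count higher than the original 543872.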
