Re: many-to-many join
If I understand this correctly, you could join area_code_user and area_code_state and then flatMap to get (user, area code, state). Then group by / reduce by user. You can also try some join optimizations, like partitioning on area code or broadcasting the smaller table, depending on the size of area_code_state.

On Jul 22, 2015 10:15 AM, John Berryman jo...@eventbrite.com wrote:
[snip]
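The pairing Sonal describes can be sketched without a Spark cluster. This is a plain-Python stand-in for the join → flatMap → reduce-by-user pipeline, using toy values (the dataset names come from the thread; the concrete rows are illustrative). The key move is keeping one (area_code, user_id) pair per row instead of pre-grouped lists, so the flatMap step emits clean (user_id, state) pairs:

```python
from collections import defaultdict

# Toy versions of the two datasets (illustrative values).
area_code_user = [(615, 1234567), (615, 7654321), (916, 1234567)]
area_code_state = [(615, "Tennessee"), (916, "California")]

# Join side: index states by area code.
states_by_code = defaultdict(list)
for code, state in area_code_state:
    states_by_code[code].append(state)

# "flatMap" step: one (user_id, state) pair per matching row.
user_state_pairs = [(uid, state)
                    for code, uid in area_code_user
                    for state in states_by_code[code]]

# Group/reduce by user.
user_to_states = defaultdict(list)
for uid, state in user_state_pairs:
    user_to_states[uid].append(state)

print(dict(user_to_states))
# {1234567: ['Tennessee', 'California'], 7654321: ['Tennessee']}
```

Because each row is a single (user_id, state) pair, no helper ever has to untangle user ids from state names inside one row.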
Re: many-to-many join
Hi,

An RDD solution:

    u = [(615,1),(720,1),(615,2)]
    urdd = sc.parallelize(u, 1)
    a1 = [(615,'T'),(720,'C')]
    ardd = sc.parallelize(a1, 1)

    def addString(s1, s2):
        return s1 + ',' + s2

    j = urdd.join(ardd).map(lambda t: t[1]).reduceByKey(addString)
    print j.collect()
    # [(2, 'T'), (1, 'C,T')]

However, if you can assume the number of users is far, far greater than the number of distinct area codes, you may want to broadcast the area-code-to-state mapping as a dict and look it up in a map. Like this:

    u = [(1,615),(1,720),(2,615)]
    a = {615:'T', 720:'C'}
    urdd = sc.parallelize(u)
    bc = sc.broadcast(a)

    def usr_area_state(tup):
        uid, aid = tup
        sid = bc.value[aid]
        return uid, (sid,)

    usrdd = urdd.map(usr_area_state)

    def addTuple(t1, t2):
        return t1 + t2

    out = usrdd.reduceByKey(addTuple)
    print out.collect()
    # [(1, ('T', 'C')), (2, ('T',))]

Best,
Ayan

On Wed, Jul 22, 2015 at 5:14 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
[snip]

--
Best Regards,
Ayan Guha
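Ayan's broadcast-dict lookup can also be sketched in plain Python. One small tweak worth noting: collecting states into a set instead of concatenating tuples deduplicates repeats, such as the double Tennessee in John's sample output (the set-based dedup is an assumed extension, not something in the thread):

```python
# Plain-Python sketch of the broadcast-dict lookup, same data as Ayan's example.
u = [(1, 615), (1, 720), (2, 615)]   # (user_id, area_code)
a = {615: 'T', 720: 'C'}             # "broadcast" side: area_code -> state

# Map each user row through the dict, collecting states in a set so a
# user with two Tennessee numbers is listed under Tennessee only once.
out = {}
for uid, aid in u:
    out.setdefault(uid, set()).add(a[aid])

print({uid: sorted(states) for uid, states in out.items()})
# {1: ['C', 'T'], 2: ['T']}
```

Each user row does one dict lookup, so the cost is linear in the number of user/area-code pairs and no shuffle of the small table is needed.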
many-to-many join
Quick example problem that's stumping me:

* Users have 1 or more phone numbers and therefore one or more area codes.
* There are 100M users.
* States have one or more area codes.
* I would like to get the states for the users (as indicated by phone area code).

I was thinking about something like this: if area_code_user looks like (area_code, [user_id]), e.g. (615, [1234567]), and area_code_state looks like (area_code, [state]), e.g. (615, [Tennessee]), then we could do

    states_and_users_mixed = area_code_user.join(area_code_state) \
        .reduceByKey(lambda a,b: a+b) \
        .values()
    user_state_pairs = states_and_users_mixed.flatMap(
        emit_cartesian_prod_of_userids_and_states)
    user_to_states = user_state_pairs.reduceByKey(lambda a,b: a+b)
    user_to_states.first()
    # (1234567, [Tennessee, Tennessee, California])

This would work, but user_state_pairs is just a list of user_ids and state names mixed together, and emit_cartesian_prod_of_userids_and_states has to correctly pair them. This is problematic because 1) it's weird and sloppy, and 2) there will be lots of users per state, and having so many users in a single row is going to make emit_cartesian_prod_of_userids_and_states work extra hard to first locate the states and then emit all user_id-state pairs.

How should I be doing this?

Thanks,
-John
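The helper John names is never defined in the thread. A hypothetical version, under the assumption that each mixed row is a flat list like [user_id, user_id, ..., 'Tennessee'], makes the "weird and sloppy" complaint concrete: the function must first separate ids from state names by type before it can emit any pairs:

```python
from itertools import product

def emit_cartesian_prod_of_userids_and_states(row):
    # Hypothetical reconstruction: `row` mixes user ids (ints) and state
    # names (strings), e.g. [1234567, 7654321, 'Tennessee']. The helper
    # must first locate the states, then emit every (user_id, state) pair.
    user_ids = [x for x in row if isinstance(x, int)]
    states = [x for x in row if isinstance(x, str)]
    return [(uid, s) for uid, s in product(user_ids, states)]

print(emit_cartesian_prod_of_userids_and_states([1234567, 7654321, 'Tennessee']))
# [(1234567, 'Tennessee'), (7654321, 'Tennessee')]
```

The type-sniffing is exactly the sloppiness John objects to, and a row holding every user in a populous state makes the cartesian emit a single hot, oversized task, which is what the replies above avoid by keeping one (user, area code) pair per row.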