Re: many-to-many join

2015-07-22 Thread Sonal Goyal
If I understand this correctly, you could join area_code_user and
area_code_state and then flat map to get
user, areacode, state. Then groupby/reduce by user.

You can also try some join optimizations like partitioning on area code or
broadcasting smaller table depending on size of area_code_state.
On Jul 22, 2015 10:15 AM, John Berryman jo...@eventbrite.com wrote:

 Quick example problem that's stumping me:

 * Users have 1 or more phone numbers and therefore one or more area codes.
 * There are 100M users.
 * States have one or more area codes.
 * I would like to the states for the users (as indicated by phone area
 code).

 I was thinking about something like this:

 If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
 and area_code_state looks like (area_code,state) ex: (615, [Tennessee])
 then we could do

 states_and_users_mixed = area_code_user.join(area_code_state) \
 .reduceByKey(lambda a,b: a+b) \
 .values()

 user_state_pairs = states_and_users_mixed.flatMap(
 emit_cartesian_prod_of_userids_and_states )
 user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)

 user_to_states.first(1)

  (1234567,[Tennessee,Tennessee,California])

 This would work, but the user_state_pairs is just a list of user_ids and
 state names mixed together and emit_cartesian_prod_of_userids_and_states
 has to correctly pair them. This is problematic because 1) it's weird and
 sloppy and 2) there will be lots of users per state and having so many
 users in a single row is going to make
 emit_cartesian_prod_of_userids_and_states work extra hard to first locate
 states and then emit all userid-state pairs.

 How should I be doing this?

 Thanks,
 -John



Re: many-to-many join

2015-07-22 Thread ayan guha
Hi

RDD solution:
 u = [(615,1),(720,1),(615,2)]
 urdd=sc.parallelize(u,1)
 a1 = [(615,'T'),(720,'C')]
 ardd=sc.parallelize(a1,1)
 def addString(s1,s2):
... return s1+','+s2
 j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString)
 print j.collect()
[(2, 'T'), (1, 'C,T')]

However, if you can assume number of users  is far far greater than
number of distinct area codes, you may think to broadcast variable in a
dict format and look up in the map. Like this

 u = [(1,615),(1,720),(2,615)]
 a = {615:'T',720:'C'}
 urdd=sc.parallelize(u)
 def usr_area_state(tup):
... uid=tup[0]
... aid=tup[1]
... sid=bc.value[aid]
... return uid,(sid,)
...
 bc=sc.broadcast(a)
 usrdd=urdd.map(usr_area_state)
 def addTuple(t1,t2):
... return t1+t2
...
 out=usrdd.reduceByKey(addTuple)
 print out.collect()
[(1, ('T', 'C')), (2, ('T',))]

Best
Ayan

On Wed, Jul 22, 2015 at 5:14 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 If I understand this correctly, you could join area_code_user and
 area_code_state and then flat map to get
 user, areacode, state. Then groupby/reduce by user.

 You can also try some join optimizations like partitioning on area code or
 broadcasting smaller table depending on size of area_code_state.
 On Jul 22, 2015 10:15 AM, John Berryman jo...@eventbrite.com wrote:

 Quick example problem that's stumping me:

 * Users have 1 or more phone numbers and therefore one or more area codes.
 * There are 100M users.
 * States have one or more area codes.
 * I would like to the states for the users (as indicated by phone area
 code).

 I was thinking about something like this:

 If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
 and area_code_state looks like (area_code,state) ex: (615, [Tennessee])
 then we could do

 states_and_users_mixed = area_code_user.join(area_code_state) \
 .reduceByKey(lambda a,b: a+b) \
 .values()

 user_state_pairs = states_and_users_mixed.flatMap(
 emit_cartesian_prod_of_userids_and_states )
 user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)

 user_to_states.first(1)

  (1234567,[Tennessee,Tennessee,California])

 This would work, but the user_state_pairs is just a list of user_ids and
 state names mixed together and emit_cartesian_prod_of_userids_and_states
 has to correctly pair them. This is problematic because 1) it's weird and
 sloppy and 2) there will be lots of users per state and having so many
 users in a single row is going to make
 emit_cartesian_prod_of_userids_and_states work extra hard to first locate
 states and then emit all userid-state pairs.

 How should I be doing this?

 Thanks,
 -John




-- 
Best Regards,
Ayan Guha


many-to-many join

2015-07-21 Thread John Berryman
Quick example problem that's stumping me:

* Users have 1 or more phone numbers and therefore one or more area codes.
* There are 100M users.
* States have one or more area codes.
* I would like to the states for the users (as indicated by phone area
code).

I was thinking about something like this:

If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
and area_code_state looks like (area_code,state) ex: (615, [Tennessee])
then we could do

states_and_users_mixed = area_code_user.join(area_code_state) \
.reduceByKey(lambda a,b: a+b) \
.values()

user_state_pairs = states_and_users_mixed.flatMap(
emit_cartesian_prod_of_userids_and_states )
user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)

user_to_states.first(1)

 (1234567,[Tennessee,Tennessee,California])

This would work, but the user_state_pairs is just a list of user_ids and
state names mixed together and emit_cartesian_prod_of_userids_and_states
has to correctly pair them. This is problematic because 1) it's weird and
sloppy and 2) there will be lots of users per state and having so many
users in a single row is going to make
emit_cartesian_prod_of_userids_and_states work extra hard to first locate
states and then emit all userid-state pairs.

How should I be doing this?

Thanks,
-John