Hi,

I am processing an RDD of key-value pairs. The key is an user_id, and the
value is an website url the user has ever visited.

Since I need to know all the urls each user has visited, I am  tempted to
call the groupByKey on this RDD. However, since there could be millions of
users and urls, the shuffling caused by groupByKey proves to be a major
bottleneck to get the job done. Is there any workaround? I want to end up
with an RDD of key-value pairs, where the key is an user_id, the value is a
list of all the urls visited by the user.

Thanks,

Jianguo

Reply via email to