It all depends on what you need to do with the pages. If you’re just 
going to collect them, then it’s really not much different from a 
groupByKey. If instead you’re looking to derive some other value from the 
series of pages, then you could partition by user id and run a 
mapPartitions, or use one of the other combineByKey-style APIs.
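
For example, something along these lines (a rough sketch with hypothetical 
names; here the derived value is a distinct-url count per user):

import scala.collection.mutable
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val userPages: RDD[(String, String)] = ... // (user_id, url) pairs

// partitionBy still shuffles the raw pairs, but each user's state is then
// built locally within a single partition rather than as grouped lists
val distinctUrlCounts: RDD[(String, Int)] = userPages
  .partitionBy(new HashPartitioner(userPages.partitions.length))
  .mapPartitions { iter =>
    val urlsByUser = mutable.Map.empty[String, mutable.Set[String]]
    iter.foreach { case (user, url) =>
      urlsByUser.getOrElseUpdate(user, mutable.Set.empty[String]) += url
    }
    urlsByUser.iterator.map { case (user, urls) => (user, urls.size) }
  }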


From: Jianguo Li
Date: Tuesday, June 23, 2015 at 9:46 AM
To: Silvio Fiorito
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: workaround for groupByKey

Thanks. Yes, unfortunately, they all need to be grouped. I guess I can 
partition the records by user id. However, I have millions of users; do you 
think partitioning by user id will help?

Jianguo

On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito 
<silvio.fior...@granturing.com> wrote:
You’re right of course, I’m sorry. I was typing before thinking about what you 
actually asked!

On second thought, what is the ultimate outcome you want from the sequence of 
pages? Do they all actually need to be grouped? Could you instead partition by 
user id and then use mapPartitions, perhaps?

From: Jianguo Li
Date: Monday, June 22, 2015 at 6:21 PM
To: Silvio Fiorito
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: workaround for groupByKey

Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. 
I read in Learning Spark:

We can disable map-side aggregation in combineByKey() if we know that our data 
won’t benefit from it. For example, groupByKey() disables map-side aggregation 
as the aggregation function (appending to a list) does not save any space. If 
we want to disable map-side combines, we need to specify the partitioner; for 
now you can just use the partitioner on the source RDD by passing rdd.partitioner

It seems that when the map-side aggregation function appends something to a 
list (as opposed to, say, summing numbers), the map-side aggregation offers no 
benefit, since appending to a list does not save any space. Is my 
understanding correct?
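
For concreteness, this is the kind of call I have in mind (a rough sketch; 
pairs and the partition count are placeholders):

import scala.collection.mutable.ListBuffer
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, String)] = ... // (user_id, url)

// group the urls per user via combineByKey, with map-side combine turned
// off since appending to a list saves no space before the shuffle
val urlLists: RDD[(String, ListBuffer[String])] = pairs.combineByKey(
  (url: String) => ListBuffer(url),                          // createCombiner
  (buf: ListBuffer[String], url: String) => buf += url,      // mergeValue
  (a: ListBuffer[String], b: ListBuffer[String]) => a ++= b, // mergeCombiners
  new HashPartitioner(pairs.partitions.length),
  mapSideCombine = false)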

Thanks,

Jianguo

On Mon, Jun 22, 2015 at 4:43 PM, Silvio Fiorito 
<silvio.fior...@granturing.com> wrote:
You can use aggregateByKey as one option:

import scala.collection.mutable.ListBuffer

val input: RDD[(Int, String)] = ...

// seqOp appends within a partition; combOp merges buffers across partitions
val test = input.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v, (a, b) => a ++= b)
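
On toy data (hypothetical, just to illustrate the shape of the result):

val input = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (2, "c.com")))
val test = input.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v, (a, b) => a ++= b)
test.collect()
// e.g. Array((1, ListBuffer(a.com, b.com)), (2, ListBuffer(c.com)))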

From: Jianguo Li
Date: Monday, June 22, 2015 at 5:12 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: workaround for groupByKey

Hi,

I am processing an RDD of key-value pairs. The key is a user_id, and the value 
is a website url the user has visited.

Since I need to know all the urls each user has visited, I am tempted to call 
groupByKey on this RDD. However, since there could be millions of users and 
urls, the shuffling caused by groupByKey proves to be a major bottleneck. Is 
there any workaround? I want to end up with an RDD of key-value pairs, where 
the key is a user_id and the value is a list of all the urls visited by that 
user.

Thanks,

Jianguo

