There is no way to avoid the shuffle if you use combineByKey, even if
your data is cached in memory, because the shuffle write must write the
data to disk. It also seems that Spark cannot guarantee that the same
key (K1) always goes to Container_X.
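
To make the distinction between a partition and a container concrete, here is a minimal pure-Python model (not Spark code). It assumes a hash partitioner with a fixed number of partitions, similar in spirit to Spark's HashPartitioner, which picks a partition from the key's hash code modulo the partition count. The names partition_for, partition_records and combine_day, and the sample data, are made up for illustration. The model only shows that the same key gets the same partition index every hour; which container ends up hosting that partition is still up to the scheduler.

```python
import zlib

NUM_PARTITIONS = 4  # assumed; must be identical for every hourly dataset


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash partitioner: same key and partition count give the same index."""
    # zlib.crc32 is used because Python's built-in hash() for str varies per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions: int = NUM_PARTITIONS):
    """Split (key, value) pairs into num_partitions buckets by key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


def combine_day(hourly_buckets, combine):
    """Combine values per key, reading partition i of every hour together."""
    result = {}
    for i in range(NUM_PARTITIONS):
        merged = {}
        for hour in hourly_buckets:
            for key, value in hour[i]:
                merged[key] = combine(merged[key], value) if key in merged else value
        # Partitions hold disjoint key sets, so update() never overwrites a key.
        result.update(merged)
    return result


# Hypothetical hourly data: two hours instead of 24, to keep the example short.
hour0 = partition_records([("K1", 1), ("K2", 10)])
hour1 = partition_records([("K1", 2), ("K3", 100)])
daily = combine_day([hour0, hour1], lambda a, b: a + b)
print(daily)
```

Because every hour uses the same function and the same partition count, all values for K1 sit in the same partition index, so the end-of-day combine only ever reads matching partitions together.
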
You can use tmpfs for your shuffle directory, which can improve shuffle
write speed.
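
As a rough sketch of what that could look like: Spark's scratch space, including map output files, is controlled by the spark.local.dir property. The mount point below is a hypothetical path and assumes a tmpfs filesystem is already mounted there on every worker. Note that cluster managers can override this setting (SPARK_LOCAL_DIRS on standalone and Mesos, the NodeManager local directories on YARN), so treat this as a starting point rather than a recipe.

```
# spark-defaults.conf
# /mnt/spark-tmpfs is a hypothetical tmpfs mount point, not a Spark default
spark.local.dir    /mnt/spark-tmpfs
```

Keep in mind that tmpfs is backed by RAM, so shuffle files stored there compete with executor memory on the same node.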
If you have enough worker nodes, hundreds of GB is not too much data to
deal with.
On Wed, Jan 14, 2015 at 5:30 AM, Puneet Kapoor puneet.cse.i...@gmail.com
wrote:
Hi,
I have a use case in which an hourly Spark job creates hourly RDDs,
which are partitioned by key.
At the end of the day I need to access all of these RDDs and combine the
Key/Value pairs over the day.
If there is a key K1 in RDD0 (1st hour of day), RDD1 ... RDD23 (last hour
of the day), we need to combine all the values of this K1 using some logic.
What I want to do is to avoid the shuffling at the end of the day, since
the data is huge (~ hundreds of GB).
Questions
---
1.) Is there a way that I can persist hourly RDDs with partition
information, so that the partition information is restored when reading
the RDDs back?
2.) Can I ensure that partitioning is the same for different hours? For
example, if K1 goes to container_X, would it go to the same container in
the next hour, and so on?
Regards
Puneet