Re: Save RDD with partition information

2015-01-13 Thread Raghavendra Pandey
I believe Spark's default hash partitioner sends all records with the
same key to the same partition, and so to the same machine.
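
To illustrate the point, here is a minimal sketch of hash partitioning written in plain Python, not Spark code. It assumes string keys and the usual JVM behaviour of taking the key's hashCode modulo the number of partitions; the helper names (java_string_hash, hash_partition) and the partition count of 8 are made up for the example.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: keys contain only characters in the Basic Multilingual
    Plane, so one Python character equals one UTF-16 code unit.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    # Python's % is already non-negative for a positive divisor,
    # which matches a non-negative modulo of the hash code.
    return java_string_hash(key) % num_partitions


# The same key maps to the same partition index in every hourly run,
# provided the number of partitions does not change between runs.
hourly_indices = [hash_partition("K1", 8) for _hour in range(24)]
assert len(set(hourly_indices)) == 1
```

The partition index is stable only while the partition count stays the same; which machine hosts a given partition in a later job is a separate question.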

On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote:

 Hi,

 I have a use case in which an hourly Spark job creates hourly
 RDDs, which are partitioned by key.

 At the end of the day I need to access all of these RDDs and combine the
 Key/Value pairs over the day.

 If there is a key K1 in RDD0 (first hour of the day), RDD1 ... RDD23 (last
 hour of the day), we need to combine all the values of this K1 using some logic.

 What I want to do is avoid the shuffle at the end of the day, since
 the data is huge (~ hundreds of GB).

 Questions
 ---
 1.) Is there a way that I can persist hourly RDDs with partition
 information, so that the partition information is restored when the RDDs
 are read back?
 2.) Can I ensure that partitioning is the same for different hours? For
 example, if K1 goes to container_X, would it go to the same container in
 the next hour, and so on?

 Regards
 Puneet




Re: Save RDD with partition information

2015-01-13 Thread lihu
There is no way to avoid the shuffle if you use combineByKey, no matter
whether your data is cached in memory, because the shuffle write must write
the data to disk. It also seems that Spark cannot guarantee that the same
key (K1) goes to Container_X.

You can use tmpfs for your shuffle directory; this can improve your shuffle
write speed.

If you have enough worker nodes, hundreds of GB is not too much data
to deal with.






Re: Save RDD with partition information

2015-01-13 Thread lihu
By the way, I am not sure whether the same shuffle key can go to the
same container.
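
To make the question in this thread concrete, the following is a rough sketch in plain Python of what an end-of-day combine could look like if every hour had been bucketed with the same function and the same bucket count. It is not Spark code and does not show that Spark preserves such a layout across save and load. NUM_BUCKETS, bucket_of, partition_hour and combine_bucket are hypothetical names, and zlib.crc32 stands in for whatever deterministic hash is actually used.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; must be identical for every hour


def bucket_of(key: str) -> int:
    """Deterministic bucket index for a key (same result in every run)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def partition_hour(records):
    """Split one hour's (key, value) pairs into NUM_BUCKETS lists."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    return buckets


def combine_bucket(hourly_buckets, index, combine):
    """Merge bucket `index` across all hours without looking at other buckets."""
    merged = {}
    for hour in hourly_buckets:
        for key, value in hour[index]:
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Three toy "hours"; addition stands in for the unspecified combine logic.
hours = [
    partition_hour([("K1", 1), ("K2", 10)]),
    partition_hour([("K1", 2), ("K3", 5)]),
    partition_hour([("K1", 3), ("K2", 20)]),
]
daily = {}
for i in range(NUM_BUCKETS):
    daily.update(combine_bucket(hours, i, lambda a, b: a + b))
```

Because a key always lands in the same bucket index, each bucket can be combined independently of the others, which is the property the original question is trying to obtain.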