Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-15 Thread Carlos J. Gil Bellosta
Thanks for your reply.

I think the problem was that SparkR tried to serialize the whole
environment, and the large dataframe was part of it. So every
worker received its slice / partition (which is very small) plus the
whole thing!

So I deleted the large dataframe and the list before parallelizing,
and the cluster ran without memory issues.
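
For reference, a minimal sketch of that fix, assuming the 2014
SparkR-pkg API (sparkR.init, parallelize, lapply, collect); the object
names (df, id, foo) come from the thread, and the master string and
slice count are purely illustrative:

library(SparkR)
sc <- sparkR.init(master = "local[4]")   # illustrative master

pieces <- split(df, df$id)    # list of small per-id data frames
rm(df)                        # drop the big data frame so it is not
                              # captured in the environment shipped to workers
rdd <- parallelize(sc, pieces, numSlices = 2000)
rm(pieces); gc()              # the data now lives on the cluster
res <- lapply(rdd, foo)       # foo should only touch its argument
out <- do.call(rbind, collect(res))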

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com

2014-08-15 3:53 GMT+02:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
 Could you try increasing the number of slices with the large data set?
 SparkR assumes that each slice (or partition, in Spark terminology) can fit
 in the memory of a single machine. Also, is the error happening when you do
 the map, or when you combine the results?

 Thanks
 Shivaram


 On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta
 gilbello...@gmail.com wrote:

 Hello,

 I am having problems trying to apply the split-apply-combine strategy
 for dataframes using SparkR.

 I have a largish dataframe and I would like to achieve something similar
 to what

 ddply(df, .(id), foo)

 would do, only using SparkR as the computing engine. My df has a few
 million records and I would like to split it by id and operate on
 the pieces. The pieces are quite small: just a few hundred records each.

 I do something along the following lines (sketched in code below):

 1) Use split to transform df into a list of small dfs.
 2) Parallelize the resulting list as an RDD (using a few thousand slices).
 3) Map my function over the pieces using Spark.
 4) Recombine the results (do.call, rbind, etc.).
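
 In code, roughly (using the 2014 SparkR-pkg functions sparkR.init,
 parallelize, lapply and collect; the numSlices value is illustrative):

 sc <- sparkR.init()
 pieces <- split(df, df$id)                         # 1) list of per-id dfs
 rdd <- parallelize(sc, pieces, numSlices = 2000)   # 2) a few thousand slices
 res <- lapply(rdd, foo)                            # 3) apply foo to each piece
 out <- do.call(rbind, collect(res))                # 4) recombine on the driver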

 My cluster works and I can run medium-sized batch jobs.

 However, it fails with my full df: I get a heap-space out-of-memory
 error. This is odd, as the slices themselves are very small.

 Should I send smaller batches to my cluster? Is there any recommended
 general approach to this kind of split-apply-combine problem?

 Best,

 Carlos J. Gil Bellosta
 http://www.datanalytics.com




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


