Hello,

I am having problems trying to apply the split-apply-combine strategy
for dataframes using SparkR.

I have a largish dataframe and I would like to achieve something similar to what

ddply(df, .(id), foo)

would do, but using SparkR as the computing engine. My df has a few
million records, and I would like to split it by "id" and operate on
the pieces. These pieces are quite small: just a few hundred
records each.

I do something along the following lines (sketched in code below):

1) Use split() to turn df into a list of small data frames (one per "id").
2) Parallelize the resulting list as an RDD (using a few thousand slices).
3) Map my function over the pieces using Spark.
4) Recombine the results (do.call, rbind, etc.).

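In code, it is roughly the following (just a sketch: the master URL, the
number of slices and foo, my per-group function, are placeholders; I am
using the SparkR RDD primitives parallelize, lapply and collect):

library(SparkR)

sc <- sparkR.init(master = "spark://master:7077")  # placeholder master URL

pieces <- split(df, df$id)                  # 1) list of small data frames, one per id
rdd    <- parallelize(sc, pieces, 2000L)    # 2) RDD with a few thousand slices
mapped <- lapply(rdd, foo)                  # 3) apply foo to each piece on the workers
result <- do.call(rbind, collect(mapped))   # 4) bring the results back and recombine
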
My cluster works, and I can run medium-sized batch jobs.

However, it fails with my full df: I get a heap-space out-of-memory
error. This is odd, as the individual slices are very small.

Should I send smaller batches to my cluster? Is there a recommended
general approach to this kind of split-apply-combine problem?

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
