Hello, I am having problems trying to apply the split-apply-combine strategy to dataframes using SparkR.
I have a largish dataframe and I would like to achieve something similar to what ddply(df, .(id), foo) does, only using SparkR as the computing engine. My df has a few million records and I would like to split it by "id" and operate on the pieces. These pieces are quite small: just a few hundred records each.

I do something along the following lines (there is a sketch of the code at the end of this message):

1) Use split to transform df into a list of data frames.
2) Parallelize the resulting list as an RDD (using a few thousand slices).
3) Map my function over the pieces using Spark.
4) Recombine the results (do.call, rbind, etc.).

My cluster works and I can run medium-sized batch jobs. However, it fails with my full df: I get a heap space out-of-memory error. This is odd, as the individual slices are very small.

Should I send smaller batches to my cluster? Is there a recommended general approach to this kind of split-apply-combine problem?

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
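P.S. For reference, this is roughly what I am doing. It is only a minimal sketch, not my actual code: I am assuming the RDD-level SparkR API (sparkR.init, parallelize, lapply, collect), and the master URL, app name, number of slices, df and foo are placeholders for my real setup.

    library(SparkR)

    # placeholder init; the real master URL and app name differ
    sc <- sparkR.init(master = "local", appName = "split-apply-combine")

    # 1) split the local data frame by id into a list of small data frames
    pieces <- split(df, df$id)

    # 2) parallelize the list as an RDD with a few thousand slices
    rdd <- parallelize(sc, pieces, numSlices = 2000)

    # 3) apply foo() to each piece on the workers
    resultRDD <- lapply(rdd, function(piece) foo(piece))

    # 4) bring the results back to the driver and recombine them
    results <- collect(resultRDD)
    combined <- do.call(rbind, results)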