Hi, I am running a group by over a dataset of 2 billion rows, held as an RDD[Row] with the schema (id, time, value), in Spark 1.3, as follows: "select id, time, first(value) from data group by id, time".
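For context, here is a simplified version of how I register the table and issue the query (the column types and the rowRDD variable name are illustrative, not my exact code):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    // Illustrative schema for the (id, time, value) rows;
    // rowRDD: RDD[Row] is built elsewhere from the source data.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("time", LongType, nullable = false),
      StructField("value", DoubleType, nullable = true)))

    val sqlContext = new SQLContext(sc)
    sqlContext.createDataFrame(rowRDD, schema).registerTempTable("data")

    val result = sqlContext.sql(
      "select id, time, first(value) from data group by id, time")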
My cluster has 8 nodes with 16GB of RAM each and one worker per node. Each executor is allocated 5GB of memory. However, all executors are lost during query execution and the job fails with "ExecutorLostFailure". Could you suggest what the reason might be? Could it be that "group by" is implemented as RDD.groupBy, so that it holds all values of each group in memory? If so, what is the workaround? I have sketched below the kind of rewrite I have in mind.
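This is the sort of RDD-level workaround I am considering: replacing the group by with reduceByKey, which combines values map-side and never materializes a whole group (again, rowRDD and the column getters are simplified from my actual code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // reduceByKey aggregates map-side, so unlike RDD.groupBy it never
    // holds all values of one key in memory at once.
    val firstPerKey: RDD[((Long, Long), Double)] = rowRDD
      .map(row => ((row.getLong(0), row.getLong(1)), row.getDouble(2)))  // ((id, time), value)
      .reduceByKey((a, _) => a)  // keep one arbitrary "first" value per key

Like first() in the SQL query, the value kept per key is arbitrary, since combine order is not deterministic. Would this be a reasonable approach, or is there a better way?

Best regards, Alexander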