We are experiencing problems with the groupBy operation when we use it to group together data that will be written to the same file. What we want to do is the following: given some data with a timestamp, sort it by timestamp, group it by hour, and write one file per hour. One could do something like

rdd.groupBy(hour).foreach { case (h, group) =>
    // one (hour, records) pair at a time; hour, timestamp and writerForHour are
    // sketch helpers that extract the hour / timestamp of a record and open the
    // output file for a given hour
    val writer = writerForHour(h)
    group.toSeq.sortBy(timestamp).foreach(writer.write)  // sort the hour's records by timestamp
    writer.close()
}

but this loads all the data for one hour into memory and easily runs out of memory. Originally we thought the problem was the toSeq, which materializes (makes strict) the Iterable you obtain as the value from the groupBy, but apparently it is not. We removed the toSeq.sortBy(timestamp) call, but we still get an OOM when the data in a group is huge. I saw that there has been a discussion on the mailing list about groupBy requiring everything to stay in memory, at http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html#a11487, but I found no solution to my problem.
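For reference, this is the variant without the toSeq.sortBy call (same sketch helpers as above), and it still fails the same way on large groups:

rdd.groupBy(hour).foreach { case (h, group) =>
    val writer = writerForHour(h)
    group.foreach(writer.write)   // group is still a fully materialized Iterable for the whole hour
    writer.close()
}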

So my questions are the following:

1) Is this groupBy problem still present in Spark 1.3?

2) Why does groupBy require everything to stay in memory? In my ignorance, I was convinced that groupBy worked with lazy Iterators rather than a strict Iterable, which I believe is how mapPartitions works. The operation after the groupBy would then decide whether the iterator needs to be strict or not: groupBy.foreach would stay lazy, and every record produced by the groupBy could be passed directly to the foreach without waiting for the others. Is this not possible for some reason?

3) Is there another way to do what I want to do? Keep in mind that I can't simply repartition, because the number of partitions depends dynamically on the number of years/days/hours. One option could be to work at the minute level when there is too much data, but we still want to create one file per hour. A rough sketch of the kind of record-at-a-time processing I have in mind follows below.
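For concreteness, here is a rough sketch of that record-at-a-time approach, using sortBy plus foreachPartition instead of groupBy. The Record type and the hourOf/writerForHour helpers are placeholders for our real code, and the sketch ignores the case where a single hour spans a partition boundary (which would produce more than one file for that hour):

import java.io.{BufferedWriter, FileWriter}
import org.apache.spark.rdd.RDD

object HourlyWriter {
  case class Record(timestamp: Long, payload: String)          // placeholder record type

  def hourOf(r: Record): Long = r.timestamp / 3600000L          // hour bucket of a record

  def writerForHour(h: Long) =                                  // one output file per hour
    new BufferedWriter(new FileWriter(s"hour-$h.txt"))

  def writeOneFilePerHour(rdd: RDD[Record]): Unit =
    rdd
      .sortBy(_.timestamp)                     // global sort by timestamp
      .foreachPartition { records =>           // records is a lazy Iterator, nothing is buffered
        var currentHour = -1L
        var writer: BufferedWriter = null
        records.foreach { r =>
          val h = hourOf(r)
          if (h != currentHour) {              // hour boundary: close the old file, open the next
            if (writer != null) writer.close()
            writer = writerForHour(h)
            currentHour = h
          }
          writer.write(r.payload + "\n")       // write one record at a time
        }
        if (writer != null) writer.close()
      }
}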

Thanks!
