Thank you. Actually, I have numbers for the merged byte[] size per group for my test input data: the maximum merged byte[] size per group is about 16 KB, and the average is about 1 KB. That is not very large. Since executor memory was set to 4 GB and spark.storage.memoryFraction to 0.3, I think it is unlikely that groupByKey() is the cause. (Or is there something I have misunderstood?)
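(For reference, a minimal sketch of how those settings could be applied in driver code, assuming the job uses the Spark Java API; the app name "BulkLoadToHBase" and the variable names are placeholders, not taken from the original program:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical configuration matching the test setup described above:
// 4 GB per executor, with 30% of the heap reserved for cached/persisted RDD blocks.
SparkConf conf = new SparkConf()
    .setAppName("BulkLoadToHBase")               // placeholder app name
    .set("spark.executor.memory", "4g")          // executor heap size
    .set("spark.storage.memoryFraction", "0.3"); // storage cache fraction (default 0.6 in Spark 1.0)
JavaSparkContext sc = new JavaSparkContext(conf);

The same settings can equivalently be passed on the command line instead of being hard-coded in the driver.)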
Regarding executor memory size: I actually observed that the tasks ran slower when I assigned each executor more memory and reduced the number of executors accordingly. I guess this was due to GC overhead. (At that time, my program did not include the HBase export task.) BTW, I use Spark 1.0.0.

Thank you.

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, September 22, 2014 6:26 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: Bulk-load to HBase

On Mon, Sep 22, 2014 at 10:21 AM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:
> I have to merge the byte[]s that have the same key.
> If merging is done with reduceByKey(), a lot of intermediate byte[]
> allocation and System.arraycopy() is executed, and it is too slow. So I had
> to resort to groupByKey(), and in the callback allocate the byte[] that has
> the total size of the byte[]s, and arraycopy() into it.
> groupByKey() works for this, since the size of the group is manageable in my
> application.

The problem is that you will first collect and allocate many small byte[]s in memory, and only then merge them. If the total size of the byte[]s is very large, you run out of memory, as you observe. If you want to do this, use more executor memory. You may find it's not worth the tradeoff of having more, smaller executors merging pieces of the overall byte[] array.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
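(For reference, a minimal sketch of the groupByKey()-based merge described in the quoted message, assuming the input is a JavaPairRDD<String, byte[]> named pairs; the key type and variable names are placeholders, not the original program's:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// Merge all byte[]s of a key into one array that is allocated once at its
// final size, instead of repeatedly reallocating and copying as a
// reduceByKey()-based merge would.
JavaPairRDD<String, byte[]> merged = pairs
    .groupByKey()
    .mapValues(new Function<Iterable<byte[]>, byte[]>() {
      @Override
      public byte[] call(Iterable<byte[]> chunks) {
        // First pass: compute the total size of all chunks for this key.
        int total = 0;
        for (byte[] chunk : chunks) {
          total += chunk.length;
        }
        // Second pass: copy each chunk into the single pre-sized array.
        byte[] result = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
          System.arraycopy(chunk, 0, result, offset, chunk.length);
          offset += chunk.length;
        }
        return result;
      }
    });

As Sean notes, all of a key's small byte[]s are held in memory before the merge, so while the callback runs both the chunks and the merged result for that group exist at the same time.)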