Thank you. Actually, I have numbers for the merged byte[] size per group for my test input data: the maximum merged byte[] size per group is about 16 KB, and the average is about 1 KB. That is not very large. Since executor memory was set to 4 GB and spark.storage.memoryFraction to 0.3, I think it is unlikely that groupByKey() is the cause. (Or is there something I have misunderstood?)
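(For reference, a minimal sketch of how those settings could be applied in driver code, assuming the job uses the Spark Java API; the app name "BulkLoadToHBase" and the variable names are placeholders, not taken from the original program:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical configuration matching the test setup described above:
// 4 GB per executor, with 30% of the heap reserved for cached/persisted RDD blocks.
SparkConf conf = new SparkConf()
    .setAppName("BulkLoadToHBase")               // placeholder app name
    .set("spark.executor.memory", "4g")          // executor heap size
    .set("spark.storage.memoryFraction", "0.3"); // storage cache fraction (default 0.6 in Spark 1.0)
JavaSparkContext sc = new JavaSparkContext(conf);

The same settings can equivalently be passed on the command line instead of being hard-coded in the driver.)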
Regarding executor memory size: I actually observed that the tasks ran slower when I assigned each executor more memory and reduced the number of executors accordingly. I guess this was due to GC overhead. (At that time, my program did not include the HBase export task.) BTW, I use Spark 1.0.0.

Thank you.

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, September 22, 2014 6:26 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: Bulk-load to HBase

On Mon, Sep 22, 2014 at 10:21 AM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:
> I have to merge the byte[]s that have the same key.
> If merging is done with reduceByKey(), a lot of intermediate byte[]
> allocation and System.arraycopy() is executed, and it is too slow. So I had
> to resort to groupByKey(), and in the callback allocate the byte[] that has
> the total size of the byte[]s, and arraycopy() into it.
> groupByKey() works for this, since the size of the group is manageable in my
> application.

The problem is that you will first collect and allocate many small byte[]s in memory, and only then merge them. If the total size of the byte[]s is very large, you run out of memory, as you observe. If you want to do this, use more executor memory. You may find it's not worth the tradeoff of having more, smaller executors merging pieces of the overall byte[] array.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
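(For reference, a minimal sketch of the groupByKey()-based merge described in the quoted message, assuming the input is a JavaPairRDD<String, byte[]> named pairs; the key type and variable names are placeholders, not the original program's:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// Merge all byte[]s of a key into one array that is allocated once at its
// final size, instead of repeatedly reallocating and copying as a
// reduceByKey()-based merge would.
JavaPairRDD<String, byte[]> merged = pairs
    .groupByKey()
    .mapValues(new Function<Iterable<byte[]>, byte[]>() {
      @Override
      public byte[] call(Iterable<byte[]> chunks) {
        // First pass: compute the total size of all chunks for this key.
        int total = 0;
        for (byte[] chunk : chunks) {
          total += chunk.length;
        }
        // Second pass: copy each chunk into the single pre-sized array.
        byte[] result = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
          System.arraycopy(chunk, 0, result, offset, chunk.length);
          offset += chunk.length;
        }
        return result;
      }
    });

As Sean notes, all of a key's small byte[]s are held in memory before the merge, so while the callback runs both the chunks and the merged result for that group exist at the same time.)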