rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
I'm trying to understand two things about how Spark is working.

(1) When I try to cache an RDD that fits well within memory (about 60 GB of
data with about 600 GB of memory available), I get seemingly random levels of
caching, from around 60% to 100%, given the same tuning parameters.  What
governs how much of an RDD gets cached when there is enough memory?
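
For context, by "cache" I just mean the standard persist call from the shell;
the path and names below are placeholders, not our real setup:

import org.apache.spark.storage.StorageLevel

// placeholder input; the real data set is about 60 GB total
val data = sc.textFile("hdfs:///some/path")
val cached = data.persist(StorageLevel.MEMORY_ONLY)
cached.count()   // force evaluation so the blocks actually get cached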

(2) Even when the data is cached, running tasks over it gives me varying
locality levels.  Sometimes it works perfectly, with everything
PROCESS_LOCAL, and sometimes 10-20% of the data is processed at locality ANY
(and the job takes minutes instead of seconds); this often varies even when I
run the same task twice in a row in the same shell.  Is there anything I can
do to affect this?  I tried caching with replication, but that caused
everything to run out of memory almost instantly (with the same 60 GB data
set in 4-600 GB of memory).
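
By "caching with replication" I mean switching to a replicated storage level,
roughly like this (again with a placeholder path):

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_2 keeps two in-memory copies of each block
val data = sc.textFile("hdfs:///some/path")
val replicated = data.persist(StorageLevel.MEMORY_ONLY_2)
replicated.count()   // this is the variant that ran out of memory almost immediately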

Thanks for the help,

-Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
Oh, I forgot - I've set the following parameters at the moment (besides the
standard location, memory, and core setup):

spark.logConf  true
spark.shuffle.consolidateFiles true
spark.ui.port  4042
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb   500
spark.speculation  true
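
For clarity, in programmatic form those settings would correspond to roughly
the following sketch (the SparkConf/SparkContext boilerplate here is just
illustrative, not our actual launch code):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.logConf", "true")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.ui.port", "4042")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .set("spark.shuffle.file.buffer.kb", "500")
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)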



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com