Oh, I forgot - I've set the following parameters at the moment (besides the standard location, memory, and core setup):
spark.logConf true
spark.shuffle.consolidateFiles true
spark.ui.port 4042
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb 500
spark.speculation true

On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:

> I'm trying to understand two things about how Spark is working.
>
> (1) When I try to cache an RDD that fits well within memory (about 60g
> with about 600g of memory), I get seemingly random levels of caching, from
> around 60% to 100%, given the same tuning parameters. What governs how
> much of an RDD gets cached when there is enough memory?
>
> (2) Even when cached, when I run some tasks over the data, I get various
> locality states. Sometimes it works perfectly, with everything
> PROCESS_LOCAL, and sometimes I get 10-20% of the data at locality ANY (and
> the task then takes minutes instead of seconds); often this varies between
> two runs of the same task in the same shell. Is there anything I can do to
> affect this? I tried caching with replication, but that made everything
> run out of memory almost instantly (with the same 60g data set in 400-600g
> of memory).
>
> Thanks for the help,
>
> -Nathan

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
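
[Editor's note: for reference, a minimal sketch of the equivalent programmatic setup. This assumes Spark 1.x, where spark.shuffle.consolidateFiles and spark.shuffle.file.buffer.kb still exist; the app name is hypothetical, and the same values could just as well live in conf/spark-defaults.conf as shown above.]

import org.apache.spark.{SparkConf, SparkContext}

// The same settings applied through SparkConf instead of spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("cache-locality-test")  // hypothetical name
  .set("spark.logConf", "true")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.ui.port", "4042")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .set("spark.shuffle.file.buffer.kb", "500")  // renamed spark.shuffle.file.buffer in later releases
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)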
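[Editor's note: on question (1), one way to see how much of the RDD actually stuck in the cache, and to separate eviction from other causes, is to persist to memory-and-disk and inspect storage afterwards. A minimal sketch under those assumptions; the input path is hypothetical, and getRDDStorageInfo is a developer API in Spark 1.x.]

import org.apache.spark.storage.StorageLevel

// Hypothetical input; MEMORY_AND_DISK spills rather than silently dropping
// partitions, which makes partial caching easier to diagnose.
val data = sc.textFile("hdfs:///some/path").persist(StorageLevel.MEMORY_AND_DISK)
data.count()  // force materialization of the cache

// Report how many partitions were actually cached, and where they landed.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
}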
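[Editor's note: on question (2), the setting that controls how long the scheduler holds out for a PROCESS_LOCAL slot before downgrading a task is spark.locality.wait, and "caching with replication" corresponds to the _2 storage levels. A sketch of both, continuing from the code above; the wait value is illustrative, not a recommendation.]

// Wait longer (value in ms in Spark 1.x) for a process-local slot before the
// scheduler downgrades a task toward NODE_LOCAL/RACK_LOCAL/ANY.
// Must be set before the SparkContext is created to take effect.
conf.set("spark.locality.wait", "10000")

// Replicated caching: each partition is stored on two executors, so the
// 60g data set costs roughly 120g of storage memory, which is consistent
// with the near-instant OOMs described in the original message.
val replicated = sc.textFile("hdfs:///some/path")
  .persist(StorageLevel.MEMORY_ONLY_2)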