Oh, I forgot - I've set the following parameters at the moment (besides the standard location, memory, and core setup):
spark.logConf true
spark.shuffle.consolidateFiles true
spark.ui.port 4042
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb 500
spark.speculation true

On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:

> I'm trying to understand two things about how Spark is working.
>
> (1) When I try to cache an RDD that fits well within memory (about 60g
> with about 600g of memory), I get seemingly random levels of caching, from
> around 60% to 100%, given the same tuning parameters. What governs how
> much of an RDD gets cached when there is enough memory?
>
> (2) Even when cached, when I run some tasks over the data, I get various
> locality states. Sometimes it works perfectly, with everything
> PROCESS_LOCAL, and sometimes I get 10-20% of the data at locality ANY (and
> the task then takes minutes instead of seconds); often this varies between
> two runs of the same task in the same shell. Is there anything I can do to
> affect this? I tried caching with replication, but that made everything
> run out of memory almost instantly (with the same 60g data set in 400-600g
> of memory).
>
> Thanks for the help,
>
> -Nathan

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
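
[Editor's note: for reference, a minimal sketch of the equivalent programmatic setup. This assumes Spark 1.x, where spark.shuffle.consolidateFiles and spark.shuffle.file.buffer.kb still exist; the app name is hypothetical, and the same values could just as well live in conf/spark-defaults.conf as shown above.]

import org.apache.spark.{SparkConf, SparkContext}

// The same settings applied through SparkConf instead of spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("cache-locality-test")  // hypothetical name
  .set("spark.logConf", "true")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.ui.port", "4042")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .set("spark.shuffle.file.buffer.kb", "500")  // renamed spark.shuffle.file.buffer in later releases
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)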
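[Editor's note: on question (1), one way to see how much of the RDD actually stuck in the cache, and to separate eviction from other causes, is to persist to memory-and-disk and inspect storage afterwards. A minimal sketch under those assumptions; the input path is hypothetical, and getRDDStorageInfo is a developer API in Spark 1.x.]

import org.apache.spark.storage.StorageLevel

// Hypothetical input; MEMORY_AND_DISK spills rather than silently dropping
// partitions, which makes partial caching easier to diagnose.
val data = sc.textFile("hdfs:///some/path").persist(StorageLevel.MEMORY_AND_DISK)
data.count()  // force materialization of the cache

// Report how many partitions were actually cached, and where they landed.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
}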
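[Editor's note: on question (2), the setting that controls how long the scheduler holds out for a PROCESS_LOCAL slot before downgrading a task is spark.locality.wait, and "caching with replication" corresponds to the _2 storage levels. A sketch of both, continuing from the code above; the wait value is illustrative, not a recommendation.]

// Wait longer (value in ms in Spark 1.x) for a process-local slot before the
// scheduler downgrades a task toward NODE_LOCAL/RACK_LOCAL/ANY.
// Must be set before the SparkContext is created to take effect.
conf.set("spark.locality.wait", "10000")

// Replicated caching: each partition is stored on two executors, so the
// 60g data set costs roughly 120g of storage memory, which is consistent
// with the near-instant OOMs described in the original message.
val replicated = sc.textFile("hdfs:///some/path")
  .persist(StorageLevel.MEMORY_ONLY_2)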