Hi,

When running Spark on an EC2 cluster, I find that setting spark.local.dir in the driver program doesn't take effect.

Setup info:
- standalone mode
- cluster launched via the Python script that ships with Spark
- instance type: r3.large
- EBS attached (using persistent-hdfs)
- Spark version: 1.0.0, prebuilt for Hadoop 1, downloaded; application built with sbt
- program run with: sbt package run

Here is my setting:

    val conf = new SparkConf()
      .setAppName("RecoSys")
      .setMaster(masterURL)
      .set("spark.local.dir", "/mnt")
      .set("spark.executor.memory", "10g")
      .set("spark.logConf", "true")
      .setJars(Seq("target/scala-2.10/recosys_2.10-0.1.jar"))
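For context, the rest of the driver program boils down to roughly the following (a simplified sketch, not my real code; the input path and the name "ratings" are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // conf is the SparkConf shown above
    val sc = new SparkContext(conf)

    // placeholder input path, not my real data
    val ratings = sc.textFile("hdfs:///user/root/ratings.csv")

    // MEMORY_AND_DISK: partitions that don't fit in memory are written
    // to the executors' local directories (spark.local.dir)
    ratings.persist(StorageLevel.MEMORY_AND_DISK)

    ratings.count()   // force materialization so the spill actually happens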
After checking the log, I find this:

    14/07/31 08:46:04 INFO spark.SparkContext: Spark configuration:
    spark.app.name=RecoSys
    spark.executor.memory=10g
    spark.jars=target/scala-2.10/recosys_2.10-0.1.jar
    spark.local.dir=/mnt
    spark.logConf=true

The Environment tab on port 4040 shows the same values, so it looks like "spark.local.dir=/mnt" is being picked up.

Because my program stores an RDD with StorageLevel.MEMORY_AND_DISK, some blocks are written to disk under spark.local.dir, and they are supposed to be stored *ONLY* under /mnt. However, I find a big spark/ directory in /mnt2:

    [root@ip-10-186-147-175 mnt2]$ du -ah --max-depth=1 | sort -n
    2.4G    .
    2.4G    ./spark
    32K     ./ephemeral-hdfs

Since /mnt/spark and /mnt2/spark are the default local dirs set in spark-env.sh, I am quite sure that my local.dir setting in the driver program is not being used; it looks like spark-env.sh overwrites the settings from the driver program. (Can anyone confirm this?)

So I changed spark-env.sh like this:

    # export SPARK_LOCAL_DIRS="/mnt/spark, /mnt2/spark"
    export SPARK_LOCAL_DIRS="/mnt/spark"

After re-running the program, nothing changed: /mnt2/spark still fills up with data. It seems that editing spark-env.sh alone cannot change environment variables that were already loaded when the cluster was booted.

My workaround is: change spark-env.sh -> restart all Spark daemons in the cluster -> re-run the program. This time it works, and the RDD is stored only on /mnt.

This is quite different from what I read at http://spark.apache.org/docs/latest/configuration.html:

    "In Standalone and Mesos modes, this file can give machine specific
    information such as hostnames. *It is also sourced when running local
    Spark applications or submission scripts*."

From what I observed, spark-env.sh is not sourced when running local Spark applications. So here are my questions:

1) When exactly is spark-env.sh loaded? (Maybe show me the relevant code in the block manager.)
2) Does the config loaded from spark-env.sh overwrite config set in the driver program?
3) Which config is used: the one in the driver program or the one in spark-env.sh?
4) When should config go in the driver program, and when is spark-env.sh the right place?

Thank you.

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-local-dir-on-driver-program-doesn-t-take-effect-tp11040.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.