In my spark-env.sh I append to the SPARK_CLASSPATH variable rather than overriding it, because I want to support both adding a jar for every shell instance (in spark-env.sh) and adding a jar for a single shell instance (SPARK_CLASSPATH=/path/to/my.jar /path/to/spark-shell).
That looks like this:

# spark-env.sh
export SPARK_CLASSPATH+=":/path/to/hadoop-lzo.jar"

However, when my Master and workers run, they have duplicates of the SPARK_CLASSPATH jars: there are 3 copies of hadoop-lzo on the classpath, 2 of which are unnecessary. The resulting command line in ps looks like this:

/path/to/java -cp :/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:[core spark jars] ... -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://my-host:7077

I tracked it down, and the problem is that spark-env.sh is sourced 3 times: in spark-daemon.sh, in compute-classpath.sh, and in spark-class. Each of those appends to SPARK_CLASSPATH until its contents are in triplicate.

Are all of those calls necessary? Is it possible to edit the daemon scripts so spark-env.sh is only sourced once?

FYI, I'm starting the daemons with ./bin/start-master.sh and ./bin/start-slave.sh 1 $SPARK_URL.

Thanks,
Andrew
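P.S. As a possible workaround (just a sketch, assuming the repeated sourcing itself can't easily be avoided), I could make the append idempotent so that re-sourcing spark-env.sh is a no-op. LZO_JAR below is just a placeholder variable name:

# spark-env.sh -- only append the jar if it isn't already on SPARK_CLASSPATH
LZO_JAR="/path/to/hadoop-lzo.jar"
case ":$SPARK_CLASSPATH:" in
  *":$LZO_JAR:"*) ;;                                           # already present, do nothing
  *) export SPARK_CLASSPATH="$SPARK_CLASSPATH:$LZO_JAR" ;;     # first time, append it
esac

But that feels like papering over the real issue, so I'd still like to know whether the three source calls are all needed.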
