Hello, I have a few questions about configuring memory usage on standalone clusters. Can someone help me out?
1) The terms "slave" in ./bin/start-slaves.sh and "worker" in the docs seem to be used interchangeably. Are they the same thing?

2) On a worker/slave, is there a single JVM running that holds all the data, or is a separate JVM spun up for each application (Hadoop style)? (Ignoring the SPARK_WORKER_INSTANCES setting.)

3) There are lots of configuration options for defining memory usage. What do they each mean? Is my summary below correct?

a) SPARK_WORKER_MEMORY -- the maximum amount of memory the Spark worker will ever use, regardless of which applications are started; this sets the maximum heap size of the Spark worker JVM.

b) ./bin/start-worker.sh --memory -- the same as SPARK_WORKER_MEMORY? If both are set, which takes priority?

c) SPARK_JAVA_OPTS="-Xmx512m -Xms512m" -- the same as (a) and (b)?

d) -Dspark.executor.memory -- the maximum amount of memory this particular application will use on a single worker.

If that's correct, I'll send a PR to the docs that would have clarified these points for me.

http://spark.incubator.apache.org/docs/latest/configuration.html#system-properties
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

4) If I set SPARK_WORKER_MEMORY really high, I think I then also have to set spark.executor.memory really high to take advantage of it, since spark.executor.memory defaults to 512m. Is there a better way to optimize a cluster for the "one big application" scenario than manually keeping these two settings in sync? Ideally, I think, spark.executor.memory would default to SPARK_WORKER_MEMORY when it's not set.

Thanks!
Andrew
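P.S. For concreteness, here's the sort of conf/spark-env.sh I'm imagining for the one-big-application case in question 4. The values are placeholders, and whether manually mirroring the worker memory into spark.executor.memory like this is actually the right approach is exactly what I'm asking:

```shell
# conf/spark-env.sh -- illustrative values, not recommendations

# (a) Total memory this node's worker can hand out to executors
export SPARK_WORKER_MEMORY=16g

# (d) Per-application executor memory, kept in sync by hand with the
# worker memory above -- this duplication is what I'd like to avoid
export SPARK_JAVA_OPTS="-Dspark.executor.memory=16g"
```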