Hello,

I have a few questions about configuring memory usage on standalone
clusters.  Can someone help me out?

1) The terms "slave" in ./bin/start-slaves.sh and "worker" in the docs seem
to be used interchangeably.  Are they the same?

2) On a worker/slave (ignoring the SPARK_WORKER_INSTANCES setting), is
there a single JVM running that holds all the data, or is a separate
JVM spun up for each application (Hadoop style)?

3) There are lots of configuration options for defining memory usage.
What do they all mean?  Is the summary below correct?
a) SPARK_WORKER_MEMORY -- maximum amount of memory that the Spark
worker will ever use, regardless of which applications are started.
This sets the maximum heap size of the Spark worker JVM.
b) ./bin/start-worker.sh --memory -- same as SPARK_WORKER_MEMORY?  If both
are set, which takes priority?
c) SPARK_JAVA_OPTS="-Xmx512m -Xms512m" -- same as (a) and (b)?
d) -Dspark.executor.memory -- maximum amount of memory this particular
application will use on a single worker
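
For concreteness, here's roughly how I'd expect to set these -- a
sketch only, with placeholder values, and assuming conf/spark-env.sh
is the right place for (a) and (c):

    # conf/spark-env.sh (sketch; values are placeholders)
    export SPARK_WORKER_MEMORY=8g               # (a) total memory the worker hands out?
    export SPARK_JAVA_OPTS="-Xmx512m -Xms512m"  # (c) raw JVM heap flags

    # per application, e.g. as a JVM option on the driver:
    #   -Dspark.executor.memory=2g              # (d) this app's share on each worker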

If that's correct, I'll send a PR to the docs with the clarifications
that would have helped me.

http://spark.incubator.apache.org/docs/latest/configuration.html#system-properties
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts


4) If I set SPARK_WORKER_MEMORY really high, I think I also have to
set spark.executor.memory really high to take advantage of it, since
spark.executor.memory defaults to 512m.  Is there a better way to tune
a cluster for the "one big application" scenario than manually keeping
these two in sync?  Ideally spark.executor.memory would fall back to
SPARK_WORKER_MEMORY when it's not set, I think.
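
As a stopgap, I guess something like the following in
conf/spark-env.sh could keep them in sync -- a sketch only, and it
assumes SPARK_JAVA_OPTS is also picked up by the driver JVM, which I
haven't verified:

    # Sketch: default spark.executor.memory to SPARK_WORKER_MEMORY
    # when no explicit value has been given.
    if [ -n "$SPARK_WORKER_MEMORY" ]; then
      case "$SPARK_JAVA_OPTS" in
        *spark.executor.memory*) ;;  # already set explicitly; leave it alone
        *) SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.executor.memory=$SPARK_WORKER_MEMORY"
           export SPARK_JAVA_OPTS ;;
      esac
    fi

A built-in fallback would obviously be nicer, though.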


Thanks!
Andrew
