Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
…ll=true --conf spark.yarn.executor.memoryOverhead=512M. Additionally, executor and driver memory have dedicated options: pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M --driver-memory 3G --executor-memory 5G -Sandy On Tue, Sep 16,
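The invocation suggested in the reply can be laid out one flag per line for readability. All flag names and values below come from the thread itself; this is a sketch, not a verified command:

```shell
# Arbitrary Spark properties go through repeated --conf flags;
# driver and executor memory have dedicated options instead.
pyspark --master yarn-client \
  --conf spark.shuffle.spill=true \
  --conf spark.yarn.executor.memoryOverhead=512M \
  --driver-memory 3G \
  --executor-memory 5G
```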

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
Hello friends: Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. Everything went fine, and everything seems to work, but for the following. Following are two invocations of the 'pyspark' script, one with enclosing quotes around the options passed to '--driver-java-op
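A sketch of the two invocation styles the message contrasts. The `-D` options are hypothetical placeholders (the original snippet is truncated before the actual values); `spark.driver.extraJavaOptions` is the standard Spark 1.x property equivalent to `--driver-java-options`:

```shell
# Style 1: quote the whole string so the shell delivers it to
# pyspark as a single argument (the thread reports one of the two
# forms being truncated).
pyspark --master yarn-client \
  --driver-java-options "-Dkey1=value1 -Dkey2=value2"

# Style 2: pass the same JVM options via --conf, avoiding the
# script's own argument handling entirely.
pyspark --master yarn-client \
  --conf spark.driver.extraJavaOptions="-Dkey1=value1 -Dkey2=value2"
```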

If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Dimension Data, LLC.
Hello friends: It was mentioned in another (Y.A.R.N.-centric) email thread that 'SPARK_JAR' was deprecated, and to use the 'spark.yarn.jar' property instead for YARN submission. For example: user$ pyspark [some-options] --driver-java-options spark.yarn.jar=hdfs://namenode:8020/path/to/spa
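Note that `spark.yarn.jar` is a Spark configuration property, so it would normally be set with `--conf` (or in `spark-defaults.conf`) rather than inside `--driver-java-options` as the snippet shows. A sketch, with a hypothetical assembly-jar path standing in for the truncated one above:

```shell
# Point YARN submissions at a pre-staged Spark assembly on HDFS,
# replacing the deprecated SPARK_JAR environment variable.
pyspark --master yarn-client \
  --conf spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly.jar
```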

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Dimension Data, LLC.
Hi: Curious... is there any reason not to use one of the below pyspark options (in red)? Assuming each file is, say, 10k in size, is 50 files too many? Does that touch on some practical limitation? Usage: ./bin/pyspark [options] Options: --master MASTER_URL spark://host:port, mesos://h
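One common way to avoid listing dozens of individual scripts is to bundle them into a single archive for `--py-files`, which accepts `.zip` and `.egg` files. A minimal sketch of building such a bundle; the directory and archive names (`scripts/`, `helpers.zip`) and the 50 stand-in files are hypothetical:

```python
# Bundle many small helper scripts into one zip so a single
# --py-files argument ships them all to the executors.
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
scripts = os.path.join(workdir, "scripts")
os.makedirs(scripts)

# Stand-ins for the ~50 small (~10 KB) project scripts.
for i in range(50):
    with open(os.path.join(scripts, "helper_%02d.py" % i), "w") as f:
        f.write("VALUE = %d\n" % i)

archive = os.path.join(workdir, "helpers.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for name in sorted(os.listdir(scripts)):
        zf.write(os.path.join(scripts, name), arcname=name)

# The archive would then be shipped once, e.g.:
#   pyspark --master yarn-client --py-files helpers.zip
with zipfile.ZipFile(archive) as zf:
    print(len(zf.namelist()))
```

On the worker side the zip is added to `sys.path`, so the bundled modules are importable as usual.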

Re: Spark on YARN question

2014-09-02 Thread Dimension Data, LLC.
Hello friends: I have a follow-up to Andrew's well-articulated answer below (thank you for that). (1) I've seen both of these invocations in various places: (a) '--master yarn' (b) '--master yarn-client' the latter of which doesn't appear in 'pyspark|spark-submit|spark-