Here I wanted to ask a different thing, though. Let me put it this way: what is the relationship between the performance of a Spark job and the number of cores in a standalone single-node Spark cluster?
Thank You

On Tue, Feb 24, 2015 at 8:39 AM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> You mean SPARK_WORKER_CORES in /conf/spark-env.sh?
>
> On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui <same...@databricks.com> wrote:
>
>> In Standalone mode, a Worker JVM starts an Executor. Inside the Executor
>> there are slots for task threads. The slot count is configured by the
>> num_cores setting. Generally, oversubscribe this: if you have 10 free
>> CPU cores, set num_cores to 20.
>>
>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>
>>> How is a task slot different from the # of Workers?
>>>
>>> > so don't read into any performance metrics you've collected to
>>> > extrapolate what may happen at scale.
>>> I did not understand this part.
>>>
>>> Thank You
>>>
>>> On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <same...@databricks.com> wrote:
>>>
>>>> In general, you should first figure out how many task slots are in the
>>>> cluster and then repartition the RDD to maybe 2x that number. So if you
>>>> have 100 slots, then RDDs with a partition count of 100-300 would be
>>>> normal.
>>>>
>>>> But the size of each partition can also matter. You want a task to
>>>> operate on a partition for at least 200 ms, but no longer than around
>>>> 20 seconds.
>>>>
>>>> Even if you have 100 slots, it could be okay to have an RDD with
>>>> 10,000 partitions if you've read in a large file.
>>>>
>>>> So don't repartition your RDD to match the # of Worker JVMs; rather,
>>>> align it to the total # of task slots in the Executors.
>>>>
>>>> If you're running on a single node, shuffle operations become almost
>>>> free (because there's no network movement), so don't read into any
>>>> performance metrics you've collected to extrapolate what may happen at
>>>> scale.
>>>>
>>>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If I repartition my data by a factor equal to the number of worker
>>>>> instances, will the performance be better or worse?
>>>>> As far as I understand, the performance should be better, but in my
>>>>> case it is becoming worse.
>>>>> I have a single-node standalone cluster; is it because of this?
>>>>> Am I guaranteed to get better performance if I do the same thing in a
>>>>> multi-node cluster?
>>>>>
>>>>> Thank You
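
To make the repartitioning advice above concrete, here is a minimal Scala
sketch that sizes an RDD to roughly 2x the total task-slot count rather than
the number of Worker JVMs. The master URL, app name, input path, and 2x
multiplier are all illustrative, and sc.defaultParallelism is used as a
stand-in for the cluster's total slot count (in standalone mode it reflects
the total core count):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative master URL; on a real standalone cluster this would
        // be spark://<host>:7077, and the slot count per Worker would come
        // from SPARK_WORKER_CORES in conf/spark-env.sh.
        val conf = new SparkConf()
          .setMaster("local[4]")
          .setAppName("RepartitionSketch")
        val sc = new SparkContext(conf)

        // Total task slots across the Executors.
        val totalSlots = sc.defaultParallelism

        // Repartition to ~2x the slot count, per the thread above.
        // "input.txt" is a hypothetical input file.
        val rdd = sc.textFile("input.txt").repartition(totalSlots * 2)

        println(s"slots = $totalSlots, partitions = ${rdd.partitions.length}")
        sc.stop()
      }
    }

Packaged as a jar and run via spark-submit against a standalone master, the
same logic would pick up the real slot count, so the partitioning scales with
the cluster instead of being hard-coded.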