Here I wanted to ask a different thing, though. Let me put it this way: what is the relationship between the performance of a Spark job and the number of cores in a standalone single-node Spark cluster?
Thank You

On Tue, Feb 24, 2015 at 8:39 AM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> You mean SPARK_WORKER_CORES in /conf/spark-env.sh?
>
> On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui <same...@databricks.com> wrote:
>
>> In Standalone mode, a Worker JVM starts an Executor. Inside the Executor
>> there are slots for task threads. The slot count is configured by the
>> num_cores setting. Generally, oversubscribe this: if you have 10 free
>> CPU cores, set num_cores to 20.
>>
>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>
>>> How is a task slot different from the # of Workers?
>>>
>>> > so don't read into any performance metrics you've collected to
>>> > extrapolate what may happen at scale.
>>> I did not understand this part.
>>>
>>> Thank You
>>>
>>> On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <same...@databricks.com> wrote:
>>>
>>>> In general, you should first figure out how many task slots are in the
>>>> cluster and then repartition the RDD to maybe 2x that number. So if you
>>>> have 100 slots, then RDDs with a partition count of 100-300 would be
>>>> normal.
>>>>
>>>> But the size of each partition can also matter. You want a task to
>>>> operate on a partition for at least 200 ms, but no longer than around
>>>> 20 seconds.
>>>>
>>>> Even if you have 100 slots, it could be okay to have an RDD with
>>>> 10,000 partitions if you've read in a large file.
>>>>
>>>> So don't repartition your RDD to match the # of Worker JVMs; rather,
>>>> align it to the total # of task slots in the Executors.
>>>>
>>>> If you're running on a single node, shuffle operations become almost
>>>> free (because there's no network movement), so don't read into any
>>>> performance metrics you've collected to extrapolate what may happen at
>>>> scale.
>>>>
>>>> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> If I repartition my data by a factor equal to the number of worker
>>>>> instances, will the performance be better or worse?
>>>>> As far as I understand, the performance should be better, but in my
>>>>> case it is becoming worse.
>>>>> I have a single-node standalone cluster; is it because of this?
>>>>> Am I guaranteed to get better performance if I do the same thing in a
>>>>> multi-node cluster?
>>>>>
>>>>> Thank You
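
To make the repartitioning advice above concrete, here is a minimal Scala
sketch that sizes an RDD to roughly 2x the total task-slot count rather than
the number of Worker JVMs. The master URL, app name, input path, and 2x
multiplier are all illustrative, and sc.defaultParallelism is used as a
stand-in for the cluster's total slot count (in standalone mode it reflects
the total core count):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative master URL; on a real standalone cluster this would
        // be spark://<host>:7077, and the slot count per Worker would come
        // from SPARK_WORKER_CORES in conf/spark-env.sh.
        val conf = new SparkConf()
          .setMaster("local[4]")
          .setAppName("RepartitionSketch")
        val sc = new SparkContext(conf)

        // Total task slots across the Executors.
        val totalSlots = sc.defaultParallelism

        // Repartition to ~2x the slot count, per the thread above.
        // "input.txt" is a hypothetical input file.
        val rdd = sc.textFile("input.txt").repartition(totalSlots * 2)

        println(s"slots = $totalSlots, partitions = ${rdd.partitions.length}")
        sc.stop()
      }
    }

Packaged as a jar and run via spark-submit against a standalone master, the
same logic would pick up the real slot count, so the partitioning scales with
the cluster instead of being hard-coded.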