How is task slot different from # of Workers?
>> So don't read into any performance metrics you've collected to
>> extrapolate what may happen at scale.

I did not quite follow this part. Thank You

On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <same...@databricks.com> wrote:

> In general you should first figure out how many task slots are in the
> cluster and then repartition the RDD to maybe 2x that #. So if you have
> 100 slots, then RDDs with a partition count of 100-300 would be normal.
>
> But the size of each partition also matters. You want a task to operate
> on a partition for at least 200 ms, but no longer than around 20 seconds.
>
> Even if you have 100 slots, it can be fine to have an RDD with 10,000
> partitions if you've read in a large file.
>
> So don't repartition your RDD to match the # of Worker JVMs; instead,
> align it with the total # of task slots across the Executors.
>
> If you're running on a single node, shuffle operations become almost
> free (because there's no network movement), so don't read into any
> performance metrics you've collected to extrapolate what may happen at
> scale.
>
>
> On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com>
> wrote:
>
>> Hi,
>> If I repartition my data by a factor equal to the number of worker
>> instances, will the performance be better or worse?
>> As far as I understand, the performance should be better, but in my
>> case it is becoming worse.
>> I have a single-node standalone cluster; is it because of this?
>> Am I guaranteed better performance if I do the same thing in a
>> multi-node cluster?
>>
>> Thank You
>>
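
To make the advice above concrete, here is a minimal Scala sketch of
repartitioning an RDD to roughly 2x the total number of task slots rather
than the number of Worker JVMs. It assumes sc.defaultParallelism reflects
the total core (task slot) count across the Executors, which is typical
for standalone and YARN deployments; the input path is hypothetical and
only for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("repartition-sketch")
        val sc = new SparkContext(conf)

        // On a standalone or YARN cluster, defaultParallelism is usually
        // the total number of cores (task slots) across all Executors.
        val totalTaskSlots = sc.defaultParallelism

        // Hypothetical input path, for illustration only.
        val lines = sc.textFile("hdfs:///path/to/large/input")

        // Aim for roughly 2x the task-slot count, not the # of Worker JVMs.
        val repartitioned = lines.repartition(totalTaskSlots * 2)

        println(s"task slots: $totalTaskSlots, " +
          s"partitions: ${repartitioned.partitions.length}")

        sc.stop()
      }
    }

Note that this is only a starting point: as mentioned above, if each
partition would take less than ~200 ms or more than ~20 seconds to
process, adjust the partition count accordingly.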