So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase?
Thank You

On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> So, with the increase in the number of worker instances, if I also
> increase the degree of parallelism, will it make any difference?
> I can use this model even the other way round, right? I can always predict
> the performance of an app with the increase in number of worker instances,
> the deterioration in performance, right?
>
> Thank You
>
> On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan <pradhandeep1...@gmail.com>
> wrote:
>
>> Yes, I have decreased the executor memory.
>> But, if I have to do this, then I have to tweak around with the code
>> corresponding to each configuration, right?
>>
>> On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> "Workers" has a specific meaning in Spark. You are running many on one
>>> machine? That's possible but not usual.
>>>
>>> Each worker's executors have access to a fraction of your machine's
>>> resources then. If you're not increasing parallelism, maybe you're not
>>> actually using additional workers, so are using less resource for your
>>> problem.
>>>
>>> Or because the resulting executors are smaller, maybe you're hitting
>>> GC thrashing in these executors with smaller heaps.
>>>
>>> Or if you're not actually configuring the executors to use less
>>> memory, maybe you're over-committing your RAM and swapping?
>>>
>>> Bottom line, you wouldn't use multiple workers on one small standalone
>>> node. This isn't a good way to estimate performance on a distributed
>>> cluster either.
>>>
>>> On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan <pradhandeep1...@gmail.com>
>>> wrote:
>>> > No, I just have a single node standalone cluster.
>>> >
>>> > I am not tweaking around with the code to increase parallelism. I am just
>>> > running SparkKMeans that is there in Spark-1.0.0
>>> > I just wanted to know if this behavior is natural. And if so, what causes
>>> > this?
>>> >
>>> > Thank you
>>> >
>>> > On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <so...@cloudera.com> wrote:
>>> >>
>>> >> What's your storage like? Are you adding worker machines that are
>>> >> remote from where the data lives? I wonder if it just means you are
>>> >> spending more and more time sending the data over the network as you
>>> >> try to ship more of it to more remote workers.
>>> >>
>>> >> To answer your question: no, in general more workers means more
>>> >> parallelism and therefore faster execution. But that depends on a lot
>>> >> of things. For example, if your process isn't parallelized to use all
>>> >> available execution slots, adding more slots doesn't do anything.
>>> >>
>>> >> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <pradhandeep1...@gmail.com>
>>> >> wrote:
>>> >> > Yes, I am talking about standalone single node cluster.
>>> >> >
>>> >> > No, I am not increasing parallelism. I just wanted to know if it is
>>> >> > natural.
>>> >> > Does message passing across the workers account for what is happening?
>>> >> >
>>> >> > I am running SparkKMeans, just to validate one prediction model. I am
>>> >> > using several data sets. I have a standalone mode. I am varying the
>>> >> > workers from 1 to 16
>>> >> >
>>> >> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <so...@cloudera.com> wrote:
>>> >> >>
>>> >> >> I can imagine a few reasons. Adding workers might cause fewer tasks to
>>> >> >> execute locally (?) So you may be executing more remotely.
>>> >> >>
>>> >> >> Are you increasing parallelism? For trivial jobs, chopping them up
>>> >> >> further may cause you to pay more overhead of managing so many small
>>> >> >> tasks, for no speed up in execution time.
>>> >> >>
>>> >> >> Can you provide any more specifics though? You haven't said what
>>> >> >> you're running, what mode, how many workers, how long it takes, etc.
>>> >> >>
>>> >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
>>> >> >> <pradhandeep1...@gmail.com> wrote:
>>> >> >> > Hi,
>>> >> >> > I have been running some jobs in my local single node standalone
>>> >> >> > cluster. I am varying the worker instances for the same job, and the
>>> >> >> > time taken for the job to complete increases with increase in the
>>> >> >> > number of workers. I repeated some experiments varying the number of
>>> >> >> > nodes in a cluster too and the same behavior is seen.
>>> >> >> > Can the idea of worker instances be extrapolated to the nodes in a
>>> >> >> > cluster?
>>> >> >> >
>>> >> >> > Thank You
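
Two points raised in the thread are that extra execution slots do nothing unless the job is split into enough tasks to fill them, and that chopping a trivial job into very many small tasks adds management overhead. Both can be illustrated with a rough model. This is a minimal sketch in plain Python, not Spark code: the function name `estimated_time`, the `per_task_overhead` parameter, and all the numbers are assumptions made up for illustration, and the model ignores GC pressure, swapping, and data locality.

```python
import math


def estimated_time(num_tasks, slots, total_work, per_task_overhead):
    """Rough wall-clock estimate for a job split into equal-sized tasks.

    Hypothetical model: tasks run in waves of at most `slots` at a time,
    and each task pays a fixed overhead on top of its share of the work.
    """
    if num_tasks < 1 or slots < 1:
        raise ValueError("num_tasks and slots must be positive")
    # Number of rounds needed to run all tasks on the available slots.
    waves = math.ceil(num_tasks / slots)
    # Each task does an equal share of the work plus a fixed overhead.
    time_per_task = total_work / num_tasks + per_task_overhead
    return waves * time_per_task


if __name__ == "__main__":
    work, overhead = 160.0, 0.5  # made-up units, for illustration only
    for tasks, slots in [(2, 2), (2, 16), (16, 16), (1600, 16)]:
        t = estimated_time(tasks, slots, work, overhead)
        print(f"tasks={tasks:5d} slots={slots:3d} -> {t:.1f}")
```

In this model, going from 2 to 16 slots while the job still has only 2 tasks changes nothing, raising the task count to 16 fills the slots and shortens the run, and raising it to 1600 makes the per-task overhead dominate again.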