Re: Worker and Nodes
Also, if I take SparkPageRank (org.apache.spark.examples) as an example, there are various RDDs that are created and transformed in the code. If I want to find the optimum number of partitions, the one that gives the best performance, I have to change the number of partitions on each run, right? Since there are several RDDs in that program, which RDD do I partition? In other words, if I partition the first RDD, the one created from the data in HDFS, am I assured that the RDDs transformed from it will be partitioned in the same way?

Thank You
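For reference, a minimal sketch of how partition counts propagate through transformations, assuming the standard RDD API (the input path and the partition numbers here are made up): narrow transformations such as `map` keep the parent's partition count, while shuffle transformations accept an explicit partitioner or partition count.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch"))

    // Ask for at least 64 partitions when reading from HDFS (hypothetical path).
    val lines = sc.textFile("hdfs:///data/pagerank-input.txt", 64)

    // A narrow transformation preserves the parent's partition count...
    val pairs = lines.map { l =>
      val cols = l.split("\\s+")
      (cols(0), cols(1))
    }
    println(pairs.partitions.length) // same as lines.partitions.length

    // ...while a shuffle transformation can change it explicitly.
    val grouped = pairs.groupByKey(new HashPartitioner(128))
    println(grouped.partitions.length) // 128

    sc.stop()
  }
}
```

Note that only the partition count is inherited this way; a pair RDD's `Partitioner` survives `mapValues` but is dropped by a plain `map`, which matters for PageRank-style iterative joins.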
Re: Worker and Nodes
> So increasing Executors without increasing physical resources

If I have a 16 GB RAM system and I allocate 1 GB to each executor, with the number of executors set to 8, then I am increasing the resources, right? How do you explain it in that case?

Thank You
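As an illustration of the 8 × 1 GB scenario above, the per-executor memory can be set from the application's configuration rather than edited into the code; a sketch, assuming Spark standalone mode and Spark 1.x property names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorSizing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ExecutorSizing")
      .set("spark.executor.memory", "1g") // heap per executor
      .set("spark.cores.max", "8")        // cap on total cores for this app
    val sc = new SparkContext(conf)
    // ... job ...
    sc.stop()
  }
}
```

In standalone mode an application gets at most one executor per worker, so getting 8 executors on one 16 GB machine also requires 8 worker instances on that machine, which is exactly the case Aaron describes: more aggregate memory is reserved for Spark, but the physical resources underneath are unchanged.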
Re: Worker and Nodes
Note that the parallelism (i.e., number of partitions) is just an upper bound on how much of the work can be done in parallel. If you have 200 partitions, then you can divide the work among between 1 and 200 cores and all resources will remain utilized. If you have more than 200 cores, though, then some will not be used, so you would want to increase parallelism further. (There are other rules of thumb -- for instance, it's generally good to have at least 2x more partitions than cores for straggler mitigation -- but these are essentially just optimizations.)

Further note that when you increase the number of Executors for the same set of resources (i.e., starting 10 Executors on a single machine instead of 1), you make Spark's job harder. Spark has to communicate in an all-to-all manner across Executors for shuffle operations, and it uses TCP sockets to do so whether or not the Executors happen to be on the same machine. So increasing Executors without increasing physical resources means Spark has to do more communication to do the same work.

We expect that increasing the number of Executors by a factor of 10, given an increase in the number of physical resources by the same factor, would also improve performance by 10x. This is not always the case, for the precise reason above (increased communication overhead), but typically we can get close. The actual observed improvement is very algorithm-dependent, though; for instance, some ML algorithms become hard to scale out past a certain point because the increase in communication overhead outweighs the increase in parallelism.
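The 2x rule of thumb above can be applied mechanically in the driver; a sketch, assuming the total core count is known to the application (the values and path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StragglerMitigation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StragglerMitigation"))

    val totalCores = 16 // assumed cluster-wide core count
    val data = sc.textFile("hdfs:///data/input.txt")

    // Aim for roughly 2x as many partitions as cores, so a straggling
    // task holds up one small slice of the work rather than a large one.
    val tuned =
      if (data.partitions.length < 2 * totalCores) data.repartition(2 * totalCores)
      else data
    println(tuned.partitions.length)

    sc.stop()
  }
}
```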
Re: Worker and Nodes
So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase?

Thank You
Re: Worker and Nodes
So, with the increase in the number of worker instances, if I also increase the degree of parallelism, will it make any difference?
I can use this model the other way round too, right? I can always predict the deterioration in performance of an app as the number of worker instances increases, right?

Thank You
Re: Worker and Nodes
Yes, I have decreased the executor memory.
But if I have to do this, then I have to tweak the code corresponding to each configuration, right?
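On the "tweak the code for each configuration" point: the usual way to avoid editing the source per run is to take such knobs as command-line arguments or configuration properties; a sketch (the argument handling and input path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParamSweep {
  def main(args: Array[String]): Unit = {
    // Partition count comes from the command line, defaulting to 8,
    // so a sweep over settings needs no code changes between runs.
    val numPartitions = if (args.nonEmpty) args(0).toInt else 8

    val sc = new SparkContext(new SparkConf().setAppName("ParamSweep"))
    val data = sc.textFile("hdfs:///data/kmeans-input.txt", numPartitions)
    println(s"partitions = ${data.partitions.length}")
    sc.stop()
  }
}
```

Executor memory similarly need not be hard-coded: `spark-submit --executor-memory 512m ...` overrides it per run.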
Re: Worker and Nodes
There could be many different things causing this. For example, if you only have a single partition of data, increasing the number of tasks will only increase execution time due to higher scheduling overhead. Additionally, how large is a single partition in your application relative to the amount of memory on the machine? If you are running on a machine with a small amount of memory, increasing the number of executors per machine may increase GC/memory pressure. On a single node, since your executors share a memory and I/O system, you could just thrash everything.

In any case, you can't normally generalize between increased parallelism on a single node and increased parallelism across a cluster. If you are purely limited by CPU, then yes, you can normally make that generalization. However, when you increase the number of workers in a cluster, you are providing your app with more resources (memory capacity and bandwidth, and disk bandwidth). When you increase the number of tasks executing on a single node, you do not increase the pool of available resources.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
Re: Worker and Nodes
"Workers" has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan wrote: > No, I just have a single node standalone cluster. > > I am not tweaking around with the code to increase parallelism. I am just > running SparkKMeans that is there in Spark-1.0.0 > I just wanted to know, if this behavior is natural. And if so, what causes > this? > > Thank you > > On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen wrote: >> >> What's your storage like? are you adding worker machines that are >> remote from where the data lives? I wonder if it just means you are >> spending more and more time sending the data over the network as you >> try to ship more of it to more remote workers. >> >> To answer your question, no in general more workers means more >> parallelism and therefore faster execution. But that depends on a lot >> of things. For example, if your process isn't parallelize to use all >> available execution slots, adding more slots doesn't do anything. >> >> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan >> wrote: >> > Yes, I am talking about standalone single node cluster. >> > >> > No, I am not increasing parallelism. I just wanted to know if it is >> > natural. >> > Does message passing across the workers account for the happenning? 
>> > >> > I am running SparkKMeans, just to validate one prediction model. I am >> > using >> > several data sets. I have a standalone mode. I am varying the workers >> > from 1 >> > to 16 >> > >> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen wrote: >> >> >> >> I can imagine a few reasons. Adding workers might cause fewer tasks to >> >> execute locally (?) So you may be executing more tasks remotely. >> >> >> >> Are you increasing parallelism? for trivial jobs, chopping them up >> >> further may cause you to pay more overhead of managing so many small >> >> tasks, for no speed up in execution time. >> >> >> >> Can you provide any more specifics though? you haven't said what >> >> you're running, what mode, how many workers, how long it takes, etc. >> >> >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan >> >> >> >> wrote: >> >> > Hi, >> >> > I have been running some jobs in my local single node standalone >> >> > cluster. I >> >> > am varying the worker instances for the same job, and the time taken >> >> > for >> >> > the >> >> > job to complete increases with increase in the number of workers. I >> >> > repeated >> >> > some experiments varying the number of nodes in a cluster too and the >> >> > same >> >> > behavior is seen. >> >> > Can the idea of worker instances be extrapolated to the nodes in a >> >> > cluster? >> >> > >> >> > Thank You >> > >> > > > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
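Sean's over-commit point can be sanity-checked with simple arithmetic. The sketch below is not Spark code, just a hypothetical back-of-the-envelope helper; every figure in it (the 16 GB machine, the 1 GB heaps, the 1 GB OS allowance) is illustrative only:

```python
def overcommitted(machine_ram_gb, num_executors, executor_heap_gb,
                  os_and_overhead_gb=1.0):
    """Rough check: do the executor heaps plus an allowance for the OS
    and per-JVM overhead exceed physical RAM (risking swap)?"""
    total = num_executors * executor_heap_gb + os_and_overhead_gb
    return total > machine_ram_gb

print(overcommitted(16, 8, 1))    # 8 x 1 GB heaps on a 16 GB box -> False
print(overcommitted(16, 16, 2))   # 16 x 2 GB heaps on the same box -> True
```

Note that even when this check passes, many small heaps can still perform worse than a few large ones because of the GC and shuffle effects described above.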
Re: Worker and Nodes
No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running the SparkKMeans example that comes with Spark 1.0.0. I just wanted to know if this behavior is natural, and if so, what causes it? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen wrote: > What's your storage like? are you adding worker machines that are > remote from where the data lives? I wonder if it just means you are > spending more and more time sending the data over the network as you > try to ship more of it to more remote workers. > > To answer your question, no in general more workers means more > parallelism and therefore faster execution. But that depends on a lot > of things. For example, if your process isn't parallelized to use all > available execution slots, adding more slots doesn't do anything. > > On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan > wrote: > > Yes, I am talking about standalone single node cluster. > > > > No, I am not increasing parallelism. I just wanted to know if it is > natural. > > Does message passing across the workers account for the happening? > > > > I am running SparkKMeans, just to validate one prediction model. I am > using > > several data sets. I have a standalone mode. I am varying the workers > from 1 > > to 16 > > > > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen wrote: > >> > >> I can imagine a few reasons. Adding workers might cause fewer tasks to > >> execute locally (?) So you may be executing more tasks remotely. > >> > >> Are you increasing parallelism? for trivial jobs, chopping them up > >> further may cause you to pay more overhead of managing so many small > >> tasks, for no speed up in execution time. > >> > >> Can you provide any more specifics though? you haven't said what > >> you're running, what mode, how many workers, how long it takes, etc. 
> >> > >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan < > pradhandeep1...@gmail.com> > >> wrote: > >> > Hi, > >> > I have been running some jobs in my local single node standalone > >> > cluster. I > >> > am varying the worker instances for the same job, and the time taken > for > >> > the > >> > job to complete increases with increase in the number of workers. I > >> > repeated > >> > some experiments varying the number of nodes in a cluster too and the > >> > same > >> > behavior is seen. > >> > Can the idea of worker instances be extrapolated to the nodes in a > >> > cluster? > >> > > >> > Thank You > > > > >
Re: Worker and Nodes
In this case, I just wanted to know whether a single node cluster with various workers acts like a simulator of a multi-node cluster with various nodes. That is, if we have a single node cluster with, say, 10 workers, can we tell that the same behavior will take place on a cluster of 10 nodes? In other words, without having the 10-node cluster, can I learn the behavior of the application on a 10-node cluster by having a single node with 10 workers? The time taken may vary, but I am talking about the behavior. Can we say that? On Sat, Feb 21, 2015 at 8:21 PM, Deep Pradhan wrote: > Yes, I am talking about standalone single node cluster. > > No, I am not increasing parallelism. I just wanted to know if it is > natural. Does message passing across the workers account for the happening? > > I am running SparkKMeans, just to validate one prediction model. I am > using several data sets. I have a standalone mode. I am varying the workers > from 1 to 16 > > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen wrote: > >> I can imagine a few reasons. Adding workers might cause fewer tasks to >> execute locally (?) So you may be executing more tasks remotely. >> >> Are you increasing parallelism? for trivial jobs, chopping them up >> further may cause you to pay more overhead of managing so many small >> tasks, for no speed up in execution time. >> >> Can you provide any more specifics though? you haven't said what >> you're running, what mode, how many workers, how long it takes, etc. >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan >> wrote: >> > Hi, >> > I have been running some jobs in my local single node standalone >> cluster. I >> > am varying the worker instances for the same job, and the time taken >> for the >> > job to complete increases with increase in the number of workers. I >> repeated >> > some experiments varying the number of nodes in a cluster too and the >> same >> > behavior is seen. >> > Can the idea of worker instances be extrapolated to the nodes in a >> cluster? 
>> > >> > Thank You >> > >
Re: Worker and Nodes
What's your storage like? Are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question: no, in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelized to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan wrote: > Yes, I am talking about standalone single node cluster. > > No, I am not increasing parallelism. I just wanted to know if it is natural. > Does message passing across the workers account for the happening? > > I am running SparkKMeans, just to validate one prediction model. I am using > several data sets. I have a standalone mode. I am varying the workers from 1 > to 16 > > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen wrote: >> >> I can imagine a few reasons. Adding workers might cause fewer tasks to >> execute locally (?) So you may be executing more tasks remotely. >> >> Are you increasing parallelism? for trivial jobs, chopping them up >> further may cause you to pay more overhead of managing so many small >> tasks, for no speed up in execution time. >> >> Can you provide any more specifics though? you haven't said what >> you're running, what mode, how many workers, how long it takes, etc. >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan >> wrote: >> > Hi, >> > I have been running some jobs in my local single node standalone >> > cluster. I >> > am varying the worker instances for the same job, and the time taken for >> > the >> > job to complete increases with increase in the number of workers. I >> > repeated >> > some experiments varying the number of nodes in a cluster too and the >> > same >> > behavior is seen. >> > Can the idea of worker instances be extrapolated to the nodes in a >> > cluster? 
>> > >> > Thank You > >
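Sean's "execution slots" point can be made concrete with a toy model. The sketch below is not Spark code; it is an idealized calculation with made-up numbers. Partitions execute in "waves" of at most one partition per slot, so slots beyond the partition count buy nothing:

```python
import math

def wall_clock_time(num_partitions, num_slots, time_per_partition=1.0):
    """Idealized run time: partitions execute in waves of at most
    num_slots at a time; skew, overhead and I/O are ignored."""
    waves = math.ceil(num_partitions / num_slots)
    return waves * time_per_partition

# 8 partitions of work with a growing number of slots:
for slots in (2, 4, 8, 16):
    print(slots, wall_clock_time(8, slots))
```

In this model time halves until slots reach the partition count (8), then flattens, which is why adding more slots (or workers) without also adding partitions does nothing.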
Re: Worker and Nodes
Yes, I am talking about a standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for what is happening? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16. On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen wrote: > I can imagine a few reasons. Adding workers might cause fewer tasks to > execute locally (?) So you may be executing more tasks remotely. > > Are you increasing parallelism? for trivial jobs, chopping them up > further may cause you to pay more overhead of managing so many small > tasks, for no speed up in execution time. > > Can you provide any more specifics though? you haven't said what > you're running, what mode, how many workers, how long it takes, etc. > > On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan > wrote: > > Hi, > > I have been running some jobs in my local single node standalone > cluster. I > > am varying the worker instances for the same job, and the time taken for > the > > job to complete increases with increase in the number of workers. I > repeated > > some experiments varying the number of nodes in a cluster too and the > same > > behavior is seen. > > Can the idea of worker instances be extrapolated to the nodes in a > cluster? > > > > Thank You >
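For reference, in Spark standalone mode the number of worker processes per machine is usually controlled from conf/spark-env.sh rather than from application code. A sketch, with illustrative values only:

```shell
# conf/spark-env.sh (Spark standalone mode; values are illustrative)
SPARK_WORKER_INSTANCES=4   # worker daemons to launch per machine
SPARK_WORKER_CORES=2       # cores each worker may hand out to executors
SPARK_WORKER_MEMORY=2g     # memory each worker may hand out to executors
```

After editing this file, restart the workers (e.g. sbin/stop-all.sh followed by sbin/start-all.sh) so the new values take effect; note that if you raise the instance count you should lower cores and memory per worker accordingly, or the workers will try to claim more than the machine has.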
Re: Worker and Nodes
I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?), so you may be executing more tasks remotely. Are you increasing parallelism? For trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed-up in execution time. Can you provide any more specifics though? You haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan wrote: > Hi, > I have been running some jobs in my local single node standalone cluster. I > am varying the worker instances for the same job, and the time taken for the > job to complete increases with increase in the number of workers. I repeated > some experiments varying the number of nodes in a cluster too and the same > behavior is seen. > Can the idea of worker instances be extrapolated to the nodes in a cluster? > > Thank You
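The overhead point above lends itself to a toy cost model. The sketch below is not Spark code; the per-task overhead constant and all other numbers are invented for illustration. It shows how, for a fixed amount of real work, splitting it into ever more tasks can make the total time go up:

```python
import math

def job_time(total_work, num_tasks, num_slots, overhead_per_task=0.05):
    """Toy model: a fixed total amount of work split into num_tasks
    equal tasks, each paying a scheduling overhead; tasks run in
    waves of at most num_slots at a time."""
    work_per_task = total_work / num_tasks
    waves = math.ceil(num_tasks / num_slots)
    return waves * (work_per_task + overhead_per_task)

# One unit of real work on 4 slots, chopped ever finer:
for tasks in (4, 16, 64, 256):
    print(tasks, round(job_time(1.0, tasks, 4), 3))
```

With these made-up constants the total time grows as the task count outruns the slot count, because each extra wave pays overhead without adding useful parallelism; on a real cluster the crossover point depends on task size and scheduler cost.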
Re: Worker and Nodes
Hi, I have experienced the same behavior. You are talking about standalone cluster mode, right? BR On 21 February 2015 at 14:37, Deep Pradhan wrote: > Hi, > I have been running some jobs in my local single node standalone cluster. > I am varying the worker instances for the same job, and the time taken for > the job to complete increases with increase in the number of workers. I > repeated some experiments varying the number of nodes in a cluster too and > the same behavior is seen. > Can the idea of worker instances be extrapolated to the nodes in a cluster? > > Thank You >
Worker and Nodes
Hi, I have been running some jobs in my local single node standalone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases as the number of workers increases. I repeated some experiments varying the number of nodes in a cluster too, and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You