Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone
I mean that local[*] uses all cores on the machine, whereas in your example you seem to be choosing 8 cores per executor in the distributed case. You'd have 12 cores in your local case - which is still less than 2x8, but just the kind of thing to check when comparing these setups.

Indeed, how well something parallelizes depends wholly on the code, the input, and the cluster. There are trivial examples that parallelize perfectly, like SparkPi, but they are equally not useful to you. You can also construct jobs that will never be faster on a cluster (very small computations). What matters is understanding how your real problem executes.
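The SparkPi reference above is worth unpacking: the example estimates pi by throwing random darts at a unit square in each partition and counting hits inside the quarter circle - work with no data movement between tasks, which is why it parallelizes perfectly. A minimal pure-Python sketch of the same estimator (the partition loop below only stands in for Spark tasks; nothing here uses the Spark API):

```python
import random

def sample_partition(n, seed):
    """Count darts that land inside the unit quarter circle (one task's work)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_samples, partitions):
    per_part = total_samples // partitions
    # In Spark, each sample_partition call would run as an independent
    # task on an executor core; here they just run sequentially.
    hits = sum(sample_partition(per_part, seed=i) for i in range(partitions))
    return 4.0 * hits / (per_part * partitions)

print(estimate_pi(1_000_000, partitions=8))
```

With a million samples the estimate lands within a few hundredths of pi; because each partition is independent, the real SparkPi scales almost linearly with cores, which is exactly why it proves little about a real workload.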
On Fri, Sep 25, 2020 at 10:26 AM javaguy Java wrote:

Thanks - that's great, I'll check out both spark-bench and SparkPi.

I do have more than 8 cores in the local setup: 24 cores in total (12 per machine).

However, on AWS with the same cluster setup that is not the case; I chose Medium-size instances hoping that a much smaller instance size would show me the benefits of the Spark cluster.

Perhaps I'm not making it clear, but I'm not too interested in understanding and optimising someone else's code that has no material value to me; I'm interested in seeing a simple example of something working that I can then carry across to my own datasets, with a view to adopting the platform.

Thx
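For an apples-to-apples comparison it helps to write both submissions down explicitly, since local[*] sizes itself automatically while a cluster run uses only what you ask for. A sketch of the two invocations (the master host, jar, class name, and memory figures are placeholders, not taken from this thread; the point is making the requested totals visible):

```shell
# Local: one JVM, all cores of this machine, one memory pool for the whole app.
spark-submit --master "local[*]" \
  --driver-memory 16g \
  --class com.example.YourJob your-job.jar

# Standalone cluster: spell out the totals. 2 workers x 8 cores = 16 cores.
spark-submit --master spark://master-host:7077 \
  --executor-cores 8 --total-executor-cores 16 \
  --executor-memory 16g \
  --conf spark.default.parallelism=16 \
  --class com.example.YourJob your-job.jar
```

Note that Spark standalone uses --total-executor-cores to cap the cluster-wide core count; if that flag is omitted, the app grabs every core the workers offer, which can make two runs harder to compare than they look.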
On Fri, Sep 25, 2020 at 2:29 PM Sean Owen wrote:

Maybe the better approach is to understand why your job isn't scaling. What does the UI show? Are the resources actually the same? For example, do you have more than 8 cores in the local setup?

Is there enough parallelism? For example, it doesn't look like the small input is repartitioned to at least the cluster parallelism / default parallelism.

Something that should trivially parallelize? The SparkPi example.

You can try tools like https://codait.github.io/spark-bench/ to generate large workloads.
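The repartitioning point above is the one most likely to explain a small input running slower on a cluster: a stage can never use more cores than it has partitions, while too many tiny partitions drown the work in per-task overhead. A toy scheduling model (an illustration of task waves only, not Spark's actual scheduler; the 50 ms overhead per task is an assumed constant):

```python
import math

def stage_wall_time(total_work_s, n_partitions, n_cores, task_overhead_s=0.05):
    """Toy model: one task per partition, tasks run in waves of n_cores."""
    per_task = total_work_s / n_partitions + task_overhead_s
    waves = math.ceil(n_partitions / n_cores)
    return waves * per_task

# 60s of total work on a 16-core cluster:
print(stage_wall_time(60, n_partitions=1, n_cores=16))     # ~60s: 15 cores sit idle
print(stage_wall_time(60, n_partitions=16, n_cores=16))    # ~3.8s: every core busy
print(stage_wall_time(60, n_partitions=1000, n_cores=16))  # ~6.9s: overhead dominates
```

In the job itself the fix is something along the lines of `rdd.repartition(sc.defaultParallelism)`, or raising `spark.default.parallelism`, so that all cluster cores actually receive work.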
On Fri, Sep 25, 2020 at 1:03 AM javaguy Java wrote:

Hi Sean,

Thanks for your reply.

I understand distribution and parallelism very well and have used them with other products like GridGain and various master/worker patterns; I just don't have a simple example working with Apache Spark, which is what I am looking for. I know Spark doesn't follow those products' parallelism paradigm, so I'm looking for a distributed example that illustrates Spark's distribution capabilities well - and correct, I want the total wall-clock completion time to go down.

I think you misunderstood one thing re: the several-machines blurb in my post. My Spark cluster has 2 identical machines, so it's not a split of cores and memory - it's a doubling of cores and memory.

To recap, the Spark cluster on my home network is running 2x Macs with 32GB RAM EACH (so 64GB RAM in total) and the same processor on each; however, when I run this code example on just one Mac + Spark Standalone + local[*], it is faster.

I have subsequently moved my example to AWS, where I'm running two identical EC2 instances (so again double the RAM and cores) co-located in the same AZ, and the Spark cluster is still slower compared to Spark standalone on one of these EC2 instances :(

Hence my posts to the Spark user group.

I'm not wedded to this Udemy course example; I wish someone could just point me at an example with some quick code and a large public data set and say "this runs faster on a cluster than standalone". I'd be happy to make a post myself for any new people interested in Spark.

Thanks
On Thu, Sep 24, 2020 at 9:58 PM Sean Owen wrote:

If you have the same amount of resources (cores, memory, etc.) on one machine, that is pretty much always going to be faster than using those same resources split across several machines. Even if you have somewhat more resources available on a cluster, the distributed version could be slower if you are, for example, bottlenecking on network I/O and leaving some resources underutilized.

Distributing isn't going to make the workload consume less resource; on the contrary, it makes it consume more. However, it might make the total wall-clock completion time go way down through parallelism. How much you benefit from parallelism really depends on the problem, the cluster, the input, etc. You may not see a speedup in this problem until you hit more scale or modify the job to distribute a little better.
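The trade-off described above can be put in rough numbers with an Amdahl-style estimate: a distributed run adds fixed costs a single machine never pays (network shuffle, serialization, coordination), so extra cores only win once the parallelizable work dwarfs that overhead. A back-of-the-envelope model (all the constants are invented for illustration):

```python
def wall_time(serial_s, parallel_s, cores, distribution_overhead_s=0.0):
    """Amdahl-style estimate: serial part, plus parallel part spread over
    cores, plus fixed overhead that only the distributed run pays."""
    return serial_s + parallel_s / cores + distribution_overhead_s

# Small job: 5s serial + 60s parallelizable work.
small_local = wall_time(5, 60, cores=12)                              # one 12-core machine
small_cluster = wall_time(5, 60, cores=16, distribution_overhead_s=20)  # 2 machines + shuffle
print(small_local, small_cluster)  # -> 10.0 28.75: the cluster loses

# Big job: same shape, 100x the parallelizable work.
big_local = wall_time(5, 6000, cores=12)
big_cluster = wall_time(5, 6000, cores=16, distribution_overhead_s=20)
print(big_local, big_cluster)  # -> 505.0 400.0: 16 cores now beat 12
```

The crossover point depends entirely on how large the overhead term is relative to the work, which is why a course-sized example can lose on a cluster that would win handily at scale.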
On Thu, Sep 24, 2020 at 1:43 PM javaguy Java wrote:

Hi,

I made a post on Stack Overflow that I can't seem to make any headway on: https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster

Before someone starts making suggestions on changing the code: note that the code and example in the above post are from a Udemy course and are not my code. I am looking to take this dataset and code and execute the same on a cluster; I am looking to see the value of Spark by seeing results, so that the job submitted to the Spark cluster runs in a faster time compared to standalone.

I am currently evaluating Spark and have thus far spent about a month of weekends of my free time trying to get a Spark cluster to show me improved performance in comparison to Spark standalone, but I am not having success. After spending so much time on this, I am now looking for help, as I'm time-constrained (in general I'm time-constrained, not for a project or deadline re: Spark).

If anyone can comment on what I need to make my example work faster on a Spark cluster vs standalone, I'd appreciate it.

Alternatively, if someone can point me to a simple code example + dataset that works better and illustrates the power of distributed Spark, I'd be happy to use that instead; I'm not wedded to this example that I got from the course. I'm just looking for the simple 5- to 30-minute quick-start example that shows the power of Spark distributed clusters.

There's a higher-level question here, and one for which an answer is not obvious to find. There are many examples on Spark out there, but there is not a simple large dataset + code example that illustrates the performance gain of Spark's cluster and distributed computing benefits vs just a single local standalone; which is what someone in my position is looking for (someone who makes architectural and platform decisions, is bandwidth/time constrained, and wants to see the power and advantages of Spark clusters and distributed computing without spending weeks on the problem).

I'm also willing to open this up to a consulting engagement if anyone is interested, as I'd expect it to be quick (either you have a simple example that just needs to be set up, or it's easy for you to demonstrate cluster performance > standalone for this dataset).

Thx
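Until such a canonical example surfaces, the shape of the answer can be previewed with nothing but the Python standard library: parallelism pays only when the per-chunk compute outweighs the cost of shipping work to the workers. A stand-in sketch using `multiprocessing` pools in place of executors (the workload, a sum of squares, is arbitrary; it is chosen only because it is CPU-bound and trivially partitioned):

```python
from multiprocessing import Pool

def chunk_sum_of_squares(bounds):
    """Work for one 'partition': sum of i^2 over the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=4):
    # Split [0, n) into one chunk per worker, like partitions across executors.
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(chunk_sum_of_squares, chunks))

if __name__ == "__main__":
    n = 1_000_000
    assert parallel_sum_of_squares(n) == sum(i * i for i in range(n))
```

Timing both paths shows the pool winning only once `n` is large; for small `n` the pool start-up cost dominates, which is the single-machine analogue of the cluster result reported in this thread.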