Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-25 Thread Sean Owen
I mean that local[*] = all cores on the machine, whereas in your
example you seem to be choosing 8 cores per executor in the
distributed case. You'd have 12 cores in your local case - still less
than 2x8, but exactly the kind of thing to consider when comparing
these setups.
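
(A minimal sketch of the two configurations being compared - the
master URL and the core counts are assumptions drawn from this
thread, not the OP's actual settings; the two sessions are shown side
by side for contrast, though in practice you would build one or the
other:)

    import org.apache.spark.sql.SparkSession

    // Local mode: one JVM using every core on this one machine.
    val localSpark = SparkSession.builder()
      .master("local[*]")
      .appName("local-run")
      .getOrCreate()

    // Standalone cluster: you get the cores you request, not
    // "everything available" - e.g. 2 executors x 8 cores = 16.
    val clusterSpark = SparkSession.builder()
      .master("spark://master-host:7077")  // hypothetical master URL
      .config("spark.executor.cores", "8")
      .config("spark.cores.max", "16")
      .appName("cluster-run")
      .getOrCreate()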

Indeed, how well something parallelizes depends wholly on the code,
the input, and the cluster. There are trivial examples, like SparkPi,
that parallelize perfectly but are equally unhelpful to you. You can
also construct jobs that will never be faster on a cluster (a very
small computation, for instance). What matters is understanding how
your real problem executes.
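
(For reference, a SparkPi-style job - adapted from the well-known
Spark example, not from this thread's code - is pure CPU with no
shuffle and no I/O, which is exactly why it parallelizes perfectly
and proves little about real workloads:)

    import org.apache.spark.sql.SparkSession

    object TrivialPi {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("TrivialPi").getOrCreate()
        val n = 100000000  // number of random darts to throw
        val slices = spark.sparkContext.defaultParallelism
        val inCircle = spark.sparkContext
          .parallelize(1 until n, slices)
          .map { _ =>
            val x = java.lang.Math.random() * 2 - 1
            val y = java.lang.Math.random() * 2 - 1
            if (x * x + y * y <= 1) 1 else 0
          }
          .reduce(_ + _)
        println(s"Pi is roughly ${4.0 * inCircle / n}")
        spark.stop()
      }
    }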

On Fri, Sep 25, 2020 at 10:26 AM javaguy Java  wrote:
>
> Thanks - that's great. I'll check out both spark-bench and SparkPi.
>
> I do have more than 8 cores in the local setup: 24 cores in total (12 per
> machine).
>
> However, on AWS with the same cluster setup, that is not the case; I chose
> medium-size instances, hoping that much smaller instances would still show
> me the benefits of the Spark cluster.
>
> Perhaps I'm not making it clear, but I'm not interested in understanding
> and optimising someone else's code that has no material value to me; I'm
> interested in seeing a simple working example that I can then carry across
> to my own datasets, with a view to adopting the platform.
>
> Thx
>
> On Fri, Sep 25, 2020 at 2:29 PM Sean Owen  wrote:
>>
>> Maybe the better approach is to understand why your job isn't scaling.
>> What does the UI show? Are the resources actually the same? For
>> example, do you have more than 8 cores in the local setup?
>> Is there enough parallelism? For example, it doesn't look like the
>> small input is repartitioned to at least the cluster's default
>> parallelism (see the sketch below).
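
(A minimal sketch of the repartitioning Sean is describing, assuming a
small DataFrame read from some input - the path and variable names are
illustrative, not from the thread:)

    // A small input may arrive in just a few partitions, leaving most
    // executors idle; spreading it to the default parallelism lets the
    // whole cluster participate.
    val df = spark.read.csv("/path/to/input.csv")  // hypothetical path
    val spread = df.repartition(spark.sparkContext.defaultParallelism)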
>>
>> Something that should trivially parallelize? The SparkPi example.
>>
>> You can try tools like https://codait.github.io/spark-bench/ to
>> generate large workloads.
>>
>> On Fri, Sep 25, 2020 at 1:03 AM javaguy Java  wrote:
>> >
>> > Hi Sean,
>> >
>> > Thanks for your reply.
>> >
>> > I understand distribution and parallelism very well and have used them
>> > with other products like GridGain and various master-worker patterns,
>> > etc.; I just don't have a simple example working with Apache Spark,
>> > which is what I am looking for.  I know Spark doesn't follow those
>> > products' parallelism paradigm, so I'm looking for a distributed
>> > example that illustrates Spark's distribution capabilities well - and,
>> > correct, I want the total wall-clock completion time to go down.
>> >
>> > I think you misunderstood one thing re: the several-machines blurb in
>> > my post.  My Spark cluster has 2 identical machines, so it's not a
>> > split of cores and memory - it's a doubling of cores and memory.
>> >
>> > To recap, the Spark cluster on my home network is running 2x Macs with
>> > 32GB RAM each (so 64GB RAM in total) and the same processor on each;
>> > however, when I run this code example on just one Mac + Spark
>> > Standalone + local[*], it is faster.
>> >
>> > I have subsequently moved my example to AWS, where I'm running two
>> > identical EC2 instances (so, again, double the RAM and cores)
>> > co-located in the same AZ, and the Spark cluster is still slower
>> > compared to Spark standalone on one of these EC2 instances :(
>> >
>> > Hence my posts to Spark user group.
>> >
>> > I'm not wedded to this Udemy course example; I wish someone could just
>> > point me at an example with some quick code and a large public data set
>> > and say "this runs faster on a cluster than standalone."  I'd be happy
>> > to make a post myself for any new people interested in Spark.
>> >
>> > Thanks
>> >
>> > On Thu, Sep 24, 2020 at 9:58 PM Sean Owen  wrote:
>> >>
>> >> If you have the same amount of resource (cores, memory, etc) on one
>> >> machine, that is pretty much always going to be faster than using
>> >> those same resources split across several machines.
>> >> Even if you have somewhat more resource available on a cluster, the
>> >> distributed version could be slower if you, for example, are
>> >> bottlenecking on network I/O and leaving some resources underutilized.
>> >>
>> >> Distributing isn't really going to make the workload consume less
>> >> resource; on the contrary, it makes it take more. However, it might
>> >> make the total wall-clock completion time go way down through
>> >> parallelism.
>> >> How much you benefit from parallelism really depends on the problem,
>> >> the cluster, the input, etc. You may not see a speedup in this problem
>> >> until you hit more scale or modify the job to distribute a little
>> >> better, etc.
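
(One way to make the local-vs-cluster comparison concrete, sketched
with a hypothetical input path: SparkSession.time runs an action and
prints the elapsed milliseconds, so the identical action can be timed
on local[*] and on the cluster:)

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.csv("/path/to/input.csv")  // hypothetical path
    spark.time {
      df.count()  // same action, timed in both setups
    }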
>> >>
>> >> On Thu, Sep 24, 2020 at 1:43 PM javaguy Java  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I made a post on stackoverflow that I can't seem to make any headway
>> >> > on:
>> >> > https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
>> >> >
>> >> > Before someone starts making suggestions on changing the code, note
>> >> > that the code and example in the above post are from a Udemy course
>> >> > and are not my code. I am looking to take this dataset and code and
>> >> > execute the same on a cluster; I want to see the value of Spark by
>> >> > seeing the job submitted to the Spark cluster run in a faster time
>> >> > compared to standalone.
>> >> >
>> >> > I am currently evaluating Spark, and I've thus far spent about a
>> >> > month of weekends of my free time trying to get a Spark cluster to
>> >> > show me improved performance in comparison to Spark Standalone, but I
>> >> > am not having success. After spending so much time on this, I am now
>> >> > looking for help, as I'm time constrained (in general I'm time
>> >> > constrained, not for a project or deadline re: Spark).
>> >> >
>> >> > If anyone can comment on what I need to make my example work faster
>> >> > on a Spark cluster vs standalone, I'd appreciate it.
>> >> >
>> >> > Alternatively, if someone can point me to a simple code example +
>> >> > dataset that works better and illustrates the power of distributed
>> >> > Spark, I'd be happy to use that instead - I'm not wedded to this
>> >> > example that I got from the course. I'm just looking for the simple
>> >> > 5-to-30-minute quick start that shows the power of Spark distributed
>> >> > clusters.
>> >> >
>> >> > There's a higher-level question here, and one for which an answer is
>> >> > not obvious to find. There are many examples on Spark out there, but
>> >> > there is not a simple large dataset + code example that illustrates
>> >> > the performance gain of Spark's cluster and distributed computing
>> >> > benefits vs just a single local standalone, which is what someone in
>> >> > my position is looking for (someone who makes architectural and
>> >> > platform decisions, is bandwidth/time constrained, and wants to see
>> >> > the power and advantages of Spark cluster and distributed computing
>> >> > without spending weeks on the problem).
>> >> >
>> >> > I'm also willing to open this up to a consulting engagement if anyone
>> >> > is interested, as I'd expect it to be quick (either you have a simple
>> >> > example that just needs to be set up, or it's easy for you to
>> >> > demonstrate cluster performance > standalone for this dataset).
>> >> >
>> >> > Thx
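
(No canonical example surfaced in this thread, but a sketch of the
kind of job that tends to show a cluster win: a large synthetic input,
so no dataset download is needed, and CPU-heavy per-row work, so extra
machines contribute usable cores. The row count and loop depth are
assumptions to tune until a single machine is saturated:)

    import org.apache.spark.sql.SparkSession

    object ScaleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ScaleDemo").getOrCreate()
        val n = 500000000L  // assumed size; raise it until local[*] is pegged
        val total = spark.sparkContext
          .range(0L, n)
          .map { i =>
            // deliberately expensive per-row work (CPU-bound, no shuffle)
            var x = i.toDouble
            var k = 0
            while (k < 200) { x = math.sqrt(x + k); k += 1 }
            x
          }
          .sum()
        println(s"sum = $total")
        spark.stop()
      }
    }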

https://issues.apache.org/jira/browse/SPARK-18381

2020-09-25 Thread ayan guha
Anyone aware of any workaround for
https://issues.apache.org/jira/browse/SPARK-18381

Other than upgrading to Spark 3, I mean.

-- 
Best Regards,
Ayan Guha

