Hi,

I posted a question on Stack Overflow that I can't seem to make any
headway on:
https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster

Before anyone suggests changing the code, note that the code and example
in the above post are from a Udemy course and are not mine. I want to take
this dataset and code and run them on a cluster. I am looking to see the
value of Spark through results: the job submitted to the Spark cluster
should run faster than it does in standalone.

I am currently evaluating Spark and have so far spent about a month of
weekends of my free time trying to get a Spark cluster to show improved
performance over Spark standalone, without success. After spending so much
time on this, I am now looking for help, as I'm time-constrained (in
general, not because of a project or deadline related to Spark).

If anyone can comment on what I need to do to make my example run faster
on a Spark cluster than on standalone, I'd appreciate it.

Alternatively, if someone can point me to a simple code example + dataset
that works better and illustrates the power of distributed Spark, I'd be
happy to use that instead - I'm not wedded to this example that I got from
the course - I'm just looking for a simple 5-to-30-minute quick-start
example that shows the power of distributed Spark clusters.

There's a higher-level question here, and one whose answer is not easy to
find. There are many Spark examples out there, but there is no simple
large dataset + code example that illustrates the performance gain of
Spark's cluster and distributed computing over a single local standalone
setup. That is what someone in my position is looking for (someone who
makes architectural and platform decisions, is bandwidth / time
constrained, and wants to see the power and advantages of Spark cluster and
distributed computing without spending weeks on the problem).

I'm also willing to open this up to a consulting engagement if anyone is
interested, as I'd expect it to be quick (either you have a simple example
that just needs to be set up, or it's easy for you to demonstrate cluster
performance > standalone for this dataset).

Thx
