Hi, I posted a question on Stack Overflow that I can't seem to make any headway on: https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
Before anyone suggests changing the code: note that the code and example in the above post come from a Udemy course and are not mine. I want to take this dataset and code, run them on a cluster, and see the value of Spark in the results, meaning that the job submitted to the Spark cluster finishes faster than the same job on standalone.
I am currently evaluating Spark. So far I have spent about a month of weekends of my free time trying to get a Spark cluster to show improved performance over Spark standalone, without success. After spending so much time on this, I am now asking for help, as I'm time constrained (in general, not because of a project or deadline related to Spark).
If anyone can comment on what I need to do to make my example run faster on a Spark cluster than on standalone, I'd appreciate it. Alternatively, if someone can point me to a simple code example plus dataset that works better and illustrates the power of distributed Spark, I'd be happy to use that instead. I'm not wedded to the example from the course; I'm just looking for a simple quick start, taking 5 to 30 minutes, that shows the power of distributed Spark clusters.
There's a higher-level question here, and one whose answer is not easy to find. There are many Spark examples out there, but there is no simple large dataset plus code example that illustrates the performance gain of Spark's cluster and distributed computing over a single local standalone run. That is what someone in my position is looking for: someone who makes architectural and platform decisions, is bandwidth and time constrained, and wants to see the advantages of Spark cluster and distributed computing without spending weeks on the problem.
I'm also willing to open this up to a consulting engagement if anyone is interested, as I'd expect it to be quick (either you have a simple example that just needs to be set up, or it's easy for you to demonstrate the cluster outperforming standalone for this dataset). Thanks.