Hi Xiao,

Thanks, I didn't know that. This announcement, https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/, implies that their fork is not used in EMR 5.27. I tried that release and it has the same issue. Then again, their article compares EMR 5.27 against 5.16, so I can't be sure... Maybe I'll try getting the latest version of Spark locally and make the comparison that way.
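For the comparison itself, here is a minimal sketch of how the stage counts in two event logs could be summarized. It assumes each log is an uncompressed file with one JSON event per line, and that completed stages show up as "SparkListenerStageCompleted" events carrying a "Stage Info" object. The function name stage_summary is made up for illustration and is not part of Spark.

```python
import json
from collections import Counter


def stage_summary(lines):
    """Summarize completed stages from an iterable of event-log lines.

    Returns (stage_count, summed_stage_ms, names), where names is a
    Counter of stage names. Assumes one JSON object per line.
    """
    stage_count = 0
    summed_stage_ms = 0
    names = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("Event") != "SparkListenerStageCompleted":
            continue  # only completed stages are counted
        info = event.get("Stage Info", {})
        stage_count += 1
        names[info.get("Stage Name", "?")] += 1
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        if start is not None and end is not None:
            # Stages can overlap, so this sum is not wall-clock time.
            summed_stage_ms += end - start
    return stage_count, summed_stage_ms, names
```

Passing an open file for each log (for example one log from the Spark 2.4.2 run and one from the 2.4.4 run; the file names are up to you) and subtracting the two Counters should show which stage names appear more often in the slower run.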
Regards,
Kalin

On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <gatorsm...@gmail.com> wrote:

> EMR has its own fork of Spark, called EMR runtime. It is not
> Apache Spark. You might need to talk with them instead of posting questions
> in the Apache Spark community.
>
> Cheers,
>
> Xiao
>
> On Wed, Jan 15, 2020 at 9:53 AM Kalin Stoyanov <kgs.v...@gmail.com> wrote:
>
>> Hi all,
>>
>> First of all let me say that I am pretty new to Spark so this could be
>> entirely my fault somehow...
>> I noticed this when I was running a job on an Amazon EMR cluster with
>> Spark 2.4.4, and it finished more slowly than when I had run it locally (on
>> Spark 2.4.1). I checked the event logs, and the one from the newer
>> version had more stages.
>> Then I decided to do a comparison in the same environment, so I created
>> two versions of the same cluster with the only difference being the EMR
>> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
>> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
>> the same thing happened, with the newer version having more stages and
>> taking almost twice as long to finish.
>> So I am pretty much at a loss here - could it be that it is not because
>> of Spark itself, but because of some difference introduced in the EMR
>> releases? At the moment I can't think of any other alternative besides it
>> being a bug...
>>
>> Here are the two event logs:
>>
>> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
>> and my code is here:
>> https://github.com/kgskgs/stars-spark3d
>>
>> I ran it like so on the clusters (after putting it on S3):
>> spark-submit --deploy-mode cluster --py-files
>> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>
>> So yeah, I was considering submitting a bug report, but the guide
>> said it's better to ask here first, so any ideas on what's going on? Maybe
>> I am missing something?
>>
>> Regards,
>> Kalin
>>