Hi all,

@Enrico, I've added just the SQL query pages (+ js dependencies etc.) to the Google Drive - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are indeed different. (For some reason, after I saved them off the history server the graphs get drawn twice, but that shouldn't matter.)
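As a side note, in case the saved HTML pages turn out to be awkward to compare: the history server also exposes jobs and stages as JSON over its REST API on the default port 18080 (I'm not sure the SQL tab itself is covered in 2.4, so this is only a partial substitute). A minimal sketch - the host and application ID below are placeholders, not the real ones:

    import json
    import urllib.request

    HISTORY_SERVER = "http://localhost:18080"   # assumed history server address
    APP_ID = "application_1579000000000_0001"   # placeholder application ID

    def get_json(path):
        # Fetch one REST endpoint from the history server and decode the JSON.
        with urllib.request.urlopen(f"{HISTORY_SERVER}/api/v1/{path}") as resp:
            return json.load(resp)

    # Completed stages for one run, as reported by the history server.
    stages = get_json(f"applications/{APP_ID}/stages?status=complete")
    print(len(stages), "completed stages")
    for s in stages:
        print(s["stageId"], s["numCompleteTasks"], "tasks", s["name"][:60])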
@Gourav Thanks, but emr 5.28.1 is not appearing for me when creating a cluster, so I can't check that for now; also, I am using just s3://

@Xiao Yes, I will try to run this locally as well, but installing new versions of Spark won't be very quick or easy for me, so I won't be doing it right away.

Regards,
Kalin

On Wed, Jan 15, 2020 at 10:20 PM Xiao Li <gatorsm...@gmail.com> wrote:

> If you can confirm that this is caused by Apache Spark, feel free to open
> a JIRA. In each release, I do not expect your queries to hit such a
> major performance regression. Also, please try the 3.0 preview releases.
>
> Thanks,
>
> Xiao
>
> On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov <kgs.v...@gmail.com> wrote:
>
>> Hi Xiao,
>>
>> Thanks, I didn't know that. This
>> https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
>> implies that their fork is not used in emr 5.27. I tried that and it has
>> the same issue. But then again, in their article they were comparing emr
>> 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
>> version of Spark locally and make the comparison that way.
>>
>> Regards,
>> Kalin
>>
>> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> EMR has its own fork of Spark, called the EMR runtime. It is not
>>> Apache Spark. You might need to talk with them instead of posting questions
>>> in the Apache Spark community.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov <kgs.v...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> First of all, let me say that I am pretty new to Spark, so this could be
>>>> entirely my fault somehow...
>>>> I noticed this when I was running a job on an Amazon EMR cluster with
>>>> Spark 2.4.4, and it finished slower than when I had run it locally (on
>>>> Spark 2.4.1). I checked the event logs, and the one from the newer
>>>> version had more stages.
>>>> Then I decided to do a comparison in the same environment, so I created
>>>> two versions of the same cluster with the only difference being the EMR
>>>> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
>>>> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
>>>> the same thing happened: the newer version had more stages and took
>>>> almost twice as long to finish.
>>>> So I am pretty much at a loss here - could it be that it is not because
>>>> of Spark itself, but because of some difference introduced in the EMR
>>>> releases? At the moment I can't think of any other alternative besides it
>>>> being a bug...
>>>>
>>>> Here are the two event logs:
>>>>
>>>> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
>>>> and my code is here:
>>>> https://github.com/kgskgs/stars-spark3d
>>>>
>>>> I ran it like so on the clusters (after putting it on S3):
>>>> spark-submit --deploy-mode cluster --py-files
>>>> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
>>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
>>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>>>
>>>> So yeah, I was considering submitting a bug report, but the guide says
>>>> it's better to ask here first, so any ideas on what's going on? Maybe
>>>> I am missing something?
>>>>
>>>> Regards,
>>>> Kalin
>>>>
>>>
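P.S. In case anyone wants to check the stage counts in the two runs without loading the logs into a history server: the event logs in that Drive folder are newline-delimited JSON, so something like this rough sketch should work once they are downloaded and uncompressed locally (the file names below are placeholders):

    import json

    def stage_summary(path):
        # Count completed stages and their tasks in one Spark event log.
        # Each line of the log is a JSON object with an "Event" field.
        stages, tasks = 0, 0
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event.get("Event") == "SparkListenerStageCompleted":
                    stages += 1
                    tasks += event["Stage Info"].get("Number of Tasks", 0)
        return stages, tasks

    for log in ["event_log_spark_2.4.2", "event_log_spark_2.4.4"]:
        stages, tasks = stage_summary(log)
        print(f"{log}: {stages} stages, {tasks} tasks")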