Hi all,

@Enrico, I've added just the SQL query pages (+ js dependencies etc.) to the Google Drive - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are indeed different. (For some reason, after I saved them off the history server the graphs get drawn twice, but that shouldn't matter.)
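As a side note, in case the saved HTML pages turn out to be awkward to compare: the history server also exposes jobs and stages as JSON over its REST API on the default port 18080 (I'm not sure the SQL tab itself is covered in 2.4, so this is only a partial substitute). A minimal sketch - the host and application ID below are placeholders, not the real ones:

    import json
    import urllib.request

    HISTORY_SERVER = "http://localhost:18080"   # assumed history server address
    APP_ID = "application_1579000000000_0001"   # placeholder application ID

    def get_json(path):
        # Fetch one REST endpoint from the history server and decode the JSON.
        with urllib.request.urlopen(f"{HISTORY_SERVER}/api/v1/{path}") as resp:
            return json.load(resp)

    # Completed stages for one run, as reported by the history server.
    stages = get_json(f"applications/{APP_ID}/stages?status=complete")
    print(len(stages), "completed stages")
    for s in stages:
        print(s["stageId"], s["numCompleteTasks"], "tasks", s["name"][:60])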
@Gourav Thanks, but emr 5.28.1 is not appearing for me when creating a cluster, so I can't check that for now; also, I am using just s3://

@Xiao Yes, I will try to run this locally as well, but installing new versions of Spark won't be very quick or easy for me, so I won't be doing it right away.

Regards,
Kalin

On Wed, Jan 15, 2020 at 10:20 PM Xiao Li <gatorsm...@gmail.com> wrote:

> If you can confirm that this is caused by Apache Spark, feel free to open
> a JIRA. In each release, I do not expect your queries to hit such a
> major performance regression. Also, please try the 3.0 preview releases.
>
> Thanks,
>
> Xiao
>
> On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov <kgs.v...@gmail.com> wrote:
>
>> Hi Xiao,
>>
>> Thanks, I didn't know that. This
>> https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
>> implies that their fork is not used in emr 5.27. I tried that and it has
>> the same issue. But then again, in their article they were comparing emr
>> 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
>> version of Spark locally and make the comparison that way.
>>
>> Regards,
>> Kalin
>>
>> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> EMR has its own fork of Spark, called the EMR runtime. It is not
>>> Apache Spark. You might need to talk with them instead of posting questions
>>> in the Apache Spark community.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov <kgs.v...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> First of all, let me say that I am pretty new to Spark, so this could be
>>>> entirely my fault somehow...
>>>> I noticed this when I was running a job on an Amazon EMR cluster with
>>>> Spark 2.4.4, and it finished slower than when I had run it locally (on
>>>> Spark 2.4.1). I checked the event logs, and the one from the newer
>>>> version had more stages.
>>>> Then I decided to do a comparison in the same environment, so I created
>>>> two versions of the same cluster with the only difference being the EMR
>>>> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
>>>> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
>>>> the same thing happened: the newer version had more stages and took
>>>> almost twice as long to finish.
>>>> So I am pretty much at a loss here - could it be that it is not because
>>>> of Spark itself, but because of some difference introduced in the EMR
>>>> releases? At the moment I can't think of any other alternative besides it
>>>> being a bug...
>>>>
>>>> Here are the two event logs:
>>>>
>>>> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
>>>> and my code is here:
>>>> https://github.com/kgskgs/stars-spark3d
>>>>
>>>> I ran it like so on the clusters (after putting it on S3):
>>>> spark-submit --deploy-mode cluster --py-files
>>>> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
>>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
>>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>>>
>>>> So yeah, I was considering submitting a bug report, but the guide says
>>>> it's better to ask here first, so any ideas on what's going on? Maybe
>>>> I am missing something?
>>>>
>>>> Regards,
>>>> Kalin
>>>>
>>>
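P.S. In case anyone wants to check the stage counts in the two runs without loading the logs into a history server: the event logs in that Drive folder are newline-delimited JSON, so something like this rough sketch should work once they are downloaded and uncompressed locally (the file names below are placeholders):

    import json

    def stage_summary(path):
        # Count completed stages and their tasks in one Spark event log.
        # Each line of the log is a JSON object with an "Event" field.
        stages, tasks = 0, 0
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event.get("Event") == "SparkListenerStageCompleted":
                    stages += 1
                    tasks += event["Stage Info"].get("Number of Tasks", 0)
        return stages, tasks

    for log in ["event_log_spark_2.4.2", "event_log_spark_2.4.4"]:
        stages, tasks = stage_summary(log)
        print(f"{log}: {stages} stages, {tasks} tasks")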