Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Gourav Sengupta Wed, 15 Jan 2020 11:15:41 -0800

Hi,

I am pretty sure that AWS released EMR 5.28.1 with some bug fixes the day
before yesterday.


Also, please ensure that you are using s3:// rather than s3a:// or any
similar scheme.

On another note, Xiao is not entirely right in saying that EMR issues should
not be posted here. A large group of users run Spark on Databricks, GCP,
Azure, native installations, and of course on EMR and Glue. I have always
found that the Apache Spark community takes care of each other and answers
questions from the largest user base, just like I did now. I think that only
Matei Zaharia can make such a sweeping call on what this entire community is
about.


Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 5:53 PM Kalin Stoyanov <kgs.v...@gmail.com> wrote:

> Hi all,
>
> First of all let me say that I am pretty new to Spark so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it finished more slowly than when I had run it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
> Spark 2.4.2, and the second one was emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> Spark itself, but because of some difference introduced in the EMR
> releases? At the moment I can't think of any other explanation besides it
> being a bug...
>
> Here are the two event logs:
>
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah I was considering submitting a bug report, but in the guide it
> said it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
>
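For anyone who wants to put numbers on the "more stages" observation, here is
a minimal sketch of one way to compare the two event logs. It assumes the logs
are uncompressed, in Spark's usual one-JSON-object-per-line format, and that
completed stages appear as "SparkListenerStageCompleted" events carrying a
"Stage Info" object. The function name summarize_stages and the file names in
the usage note are made up for illustration and do not come from the thread.

```python
import json
from collections import Counter


def summarize_stages(lines):
    """Summarize completed stages from an iterable of event-log lines.

    Returns the number of completed stages, the sum of per-stage wall
    times in milliseconds, and a count of stages grouped by stage name.
    """
    by_name = Counter()
    total_ms = 0
    stages = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only stage-completion events are of interest here.
        if event.get("Event") != "SparkListenerStageCompleted":
            continue
        info = event.get("Stage Info", {})
        stages += 1
        by_name[info.get("Stage Name", "?")] += 1
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        # Timestamps may be missing for stages that were never submitted.
        if start is not None and end is not None:
            total_ms += end - start
    return {"stages": stages, "total_ms": total_ms, "by_name": by_name}
```

To use it, open each downloaded log (for example, hypothetical local copies
named eventlog_242 and eventlog_244) and pass the file object to
summarize_stages, then compare the "stages" counts and the "by_name" counters
to see which stage names occur only, or more often, in the slower run. Stages
can overlap in time, so "total_ms" is a rough indicator rather than the job's
wall-clock duration.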
