Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi Xiao,

That is the right attitude, thanks a ton :)

Hi Kalin,
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes
The latest EMR version should be available right out of the box; perhaps you
can raise a quick AWS ticket to find out whether its release is being delayed
in your region. The release notes do mention that it fixes a few Spark
compatibility issues. Also, working with the latest version of Spark takes
less than ten seconds once you have downloaded and unzipped the release from
Apache Spark. Besides that, I am fairly confident that starting the Spark
session in EMR with the following statement will always give the same
performance and predictability. As Xiao mentions, it might be better to first
isolate the cause and replicate it before raising issues.

spark = SparkSession.builder.getOrCreate()
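
For instance, a minimal sketch (the app name and the printed checks are just
illustrative) to confirm which build and master the session actually runs on:

    from pyspark.sql import SparkSession

    # Reuses an existing session if one is running (e.g. on EMR), else creates one
    spark = SparkSession.builder.appName("version-check").getOrCreate()
    print(spark.version)              # e.g. 2.4.2 vs 2.4.4
    print(spark.sparkContext.master)  # 'yarn' on EMR, 'local[*]' locally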

Thanks and Regards,
Gourav Sengupta



Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all,

@Enrico, I've added just the SQL query pages (plus JS dependencies etc.) to
the Google Drive -
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are different indeed. (For some
reason, after I saved them off the history server the graphs get drawn
twice, but that shouldn't matter.)

@Gourav Thanks, but EMR 5.28.1 is not appearing for me when creating a
cluster, so I can't check that for now; also, I am using just s3://.

@Xiao, Yes, I will try to run this locally as well, but installing new
versions of Spark won't be very fast and easy for me, so I won't be doing
it right away.
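
(If it helps, a quick way to stand up both versions side by side locally,
assuming pip access and that both builds are published on PyPI, is something
like:

    python -m venv spark242 && spark242/bin/pip install pyspark==2.4.2
    python -m venv spark244 && spark244/bin/pip install pyspark==2.4.4

and then rerunning the job under each to compare the stage counts.)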

Regards,
Kalin




Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
If you can confirm that this is caused by Apache Spark, feel free to open a
JIRA. Your queries should not hit such a major performance regression from
one release to the next. Also, please try the 3.0 preview releases.

Thanks,

Xiao



Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi,

I am pretty sure that AWS released 5.28.1 with some bug fixes the day
before yesterday.

Also, please ensure that you are using s3:// instead of s3a:// or anything
like that.
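
For example, a sketch only (reusing the bucket from the original post, and
with Parquet purely as a placeholder format), the scheme is simply part of
the input URI:

    # On EMR, s3:// is backed by EMRFS; s3a:// goes through the Hadoop S3A client
    df = spark.read.parquet("s3://kgs-s3/input/")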

On another note, Xiao is not entirely right in saying that EMR issues should
not be posted here: a large group of users runs Spark in Databricks, GCP,
Azure, native installations, and of course in EMR and Glue. I have always
found that the Apache Spark community takes care of each other and answers
questions for the widest user base, just as I did now. I think that only
Matei Zaharia could make such a sweeping call on what this entire community
is about.


Thanks and Regards,
Gourav Sengupta



Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi Xiao,

Thanks, I didn't know that. This
https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
implies that their fork is not used in EMR 5.27. I tried that, and it has
the same issue. But then again, in their article they were comparing EMR
5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
version of Spark locally and make the comparison that way.

Regards,
Kalin



Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks Xiao, a more up-to-date publication in a conference like VLDB would
certainly turn the tide for many of us trying to defend Spark's Optimizer.

>>>


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
EMR has its own fork of Spark, called the EMR runtime. It is not
Apache Spark. You might need to talk with them instead of posting questions
to the Apache Spark community.

Cheers,

Xiao



Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all,

First of all, let me say that I am pretty new to Spark, so this could be
entirely my fault somehow...
I noticed this when I was running a job on an Amazon EMR cluster with Spark
2.4.4, and it finished more slowly than when I had run it locally (on Spark
2.4.1). I checked the event logs, and the one from the newer version
had more stages.
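
As a rough way to compare the two logs (assuming they are uncompressed; the
file names below are placeholders), completed stages can be counted directly,
since Spark event logs are newline-delimited JSON:

    import json

    def count_stages(path):
        with open(path) as f:
            # One JSON record per line; keep only stage-completion events
            return sum(1 for line in f
                       if json.loads(line).get("Event") == "SparkListenerStageCompleted")

    print(count_stages("eventlog_242"), count_stages("eventlog_244"))
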
Then I decided to do a comparison in the same environment, so I created two
versions of the same cluster with the only difference being the EMR
release, and hence the Spark version(?): the first was emr-5.24.1 with
Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough,
the same thing happened, with the newer version having more stages and
taking almost twice as long to finish.
So I am pretty much at a loss here: could it be that it is not because of
Spark itself, but because of some difference introduced in the EMR
releases? At the moment I can't think of any other alternative besides it
being a bug...

Here are the two event logs:
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
and my code is here:
https://github.com/kgskgs/stars-spark3d

I ran it like so on the clusters (after putting it on s3):
spark-submit --deploy-mode cluster --py-files
s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
--name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
--outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
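
To run the same job locally for comparison, a sketch of the equivalent
invocation (with the scripts and data copied to local paths; those paths are
just placeholders) would be:

    spark-submit --master local[*] \
      --py-files utils.py,interactions.py,schemas.py \
      --name sim100_dt100_local main.py 100 100 \
      --outputDir ./output/ --inputDir ./input/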

So I was considering submitting a bug report, but the contribution guide
says it's better to ask here first. Any ideas on what's going on? Maybe I am
missing something?

Regards,
Kalin


Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Xiao Li
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query
Execution in Catalyst. It can adjust plans based on runtime statistics.
Based on my understanding, this is missing in Calcite.

Catalyst is also very easy to enhance. We use the dynamic programming
approach in our cost-based join reordering as well. If needed, we can also
improve the existing CBO in the future and make it more general. The Spark
SQL paper was published five years ago, and a lot of great contributions
have been made in the past five years.
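
For anyone who wants to experiment with these, a minimal sketch against a
3.0 preview build (these are the standard SQL config switches; defaults may
differ between preview releases):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.adaptive.enabled", "true")         # adaptive query execution
             .config("spark.sql.cbo.enabled", "true")              # cost-based optimizer
             .config("spark.sql.cbo.joinReorder.enabled", "true")  # DP-based join reordering
             .getOrCreate())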

Cheers,

Xiao



Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst[1] tries to mitigate the learning curve and
maintenance burden, it lacks the dynamic programming approach used by
Calcite[2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares
optimizer frameworks apples to apples (excluding execution).

References:
[1] -
https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
[2] - https://arxiv.org/pdf/1802.10233.pdf

On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia 
wrote:

> I’m pretty sure that Catalyst was built before Calcite, or at least in
> parallel. Calcite 1.0 was only released in 2015. From a technical
> standpoint, building Catalyst in Scala also made it more concise and easier
> to extend than an optimizer written in Java (you can find various
> presentations about how Catalyst works).
>
> Matei
>
> > On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
> >
> > It's fairly common for adapters (Calcite's abstraction of a data
> > source) to push down predicates. However, the API certainly looks a
> > lot different than Catalyst's.
> > --
> > Michael Mior
> > mm...@apache.org
> >
> > On Mon, Jan 13, 2020 at 9:45 AM, Jason Nerothin
> >  wrote:
> >>
> >> The implementation they chose supports push down predicates, Datasets
> >> and other features that are not available in Calcite:
> >>
> >> https://databricks.com/glossary/catalyst-optimizer
> >>
> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
> >>>
> >>> Was there a qualitative or quantitative benchmark done before a design
> >>> decision was made not to use Calcite?
> >>>
> >>> Are there limitations (for heuristic-based, cost-based, *-aware
> >>> optimizers) in Calcite, and frameworks built on top of Calcite? In the
> >>> context of big data / TPC-H benchmarks.
> >>>
> >>> I was unable to dig up anything concrete from the user group / Jira.
> >>> I'd appreciate it if any Catalyst veteran here can give me pointers.
> >>> Trying to defend Spark/Catalyst.
> >>
> >> --
> >> Thanks,
> >> Jason