Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Hi Xiao, that is the right attitude, thanks a ton :)

Hi Kalin,

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes

The latest EMR version should be available right out of the box; perhaps you can raise a quick AWS ticket and find out whether its release is getting delayed in your region. The release notes do mention that it fixes a few Spark compatibility issues. Also, getting the latest version of Spark working takes less than ten seconds once you have downloaded and unzipped the file from Apache Spark. Besides that, I am almost always sure that starting a Spark session in EMR using the following statement is always going to give the same performance and predictability:

spark = SparkSession.builder.getOrCreate()

As Xiao mentions, it might be better to first isolate the cause and replicate it before raising issues.

Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 9:10 PM Kalin Stoyanov wrote:

> Hi all,
>
> @Enrico, I've added just the SQL query pages (+js dependencies etc.) in
> the google drive -
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> That is what you had in mind, right? They are different indeed. (For some
> reason, after I saved them off the history server the graphs get drawn
> twice, but that shouldn't matter.)
>
> @Gourav Thanks, but emr 5.28.1 is not appearing for me when creating a
> cluster, so I can't check that for now; also I am using just s3://
>
> @Xiao, Yes, I will try to run this locally as well, but installing new
> versions of Spark won't be very fast and easy for me, so I won't be doing
> it right away.
>
> Regards,
> Kalin
Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Hi all,

@Enrico, I've added just the SQL query pages (+js dependencies etc.) in the google drive -
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are different indeed. (For some reason, after I saved them off the history server the graphs get drawn twice, but that shouldn't matter.)

@Gourav Thanks, but emr 5.28.1 is not appearing for me when creating a cluster, so I can't check that for now; also I am using just s3://

@Xiao, Yes, I will try to run this locally as well, but installing new versions of Spark won't be very fast and easy for me, so I won't be doing it right away.

Regards,
Kalin

On Wed, Jan 15, 2020 at 10:20 PM Xiao Li wrote:

> If you can confirm that this is caused by Apache Spark, feel free to open
> a JIRA. I do not expect your queries to hit such a major performance
> regression in any release. Also, please try the 3.0 preview releases.
>
> Thanks,
>
> Xiao
Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I do not expect your queries to hit such a major performance regression in any release. Also, please try the 3.0 preview releases.

Thanks,

Xiao

On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov wrote:

> Hi Xiao,
>
> Thanks, I didn't know that. This
> https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
> implies that their fork is not used in emr 5.27. I tried that and it has
> the same issue. But then again, in their article they were comparing emr
> 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
> version of Spark locally and make the comparison that way.
>
> Regards,
> Kalin
Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Hi,

I am pretty sure that AWS released 5.28.1, with some bug fixes, the day before yesterday. Also, please ensure that you are using s3:// instead of s3a:// or anything like that.

On another note, Xiao is not entirely right in saying that EMR issues should not be posted here: a large group of users run Spark in Databricks, GCP, Azure, native installations, and of course in EMR and Glue. I have always found that the Apache Spark community takes care of each other and answers questions for the largest possible user base, just as I did now. I think that only Matei Zaharia can make such a sweeping call on what this entire community is about.

Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 5:53 PM Kalin Stoyanov wrote:

> Hi all,
>
> First of all, let me say that I am pretty new to Spark, so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it got done slower than when I had run it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> Spark itself, but because of some difference introduced in the EMR
> releases? At the moment I can't think of any other alternative besides it
> being a bug...
>
> Here are the two event logs:
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah, I was considering submitting a bug report, but the guide said
> it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Hi Xiao,

Thanks, I didn't know that. This
https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
implies that their fork is not used in emr 5.27. I tried that and it has the same issue. But then again, in their article they were comparing emr 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest version of Spark locally and make the comparison that way.

Regards,
Kalin

On Wed, Jan 15, 2020 at 7:58 PM Xiao Li wrote:

> EMR has its own fork of Spark, called the EMR runtime. It is not
> Apache Spark. You might need to talk with them instead of posting
> questions in the Apache Spark community.
>
> Cheers,
>
> Xiao
Re: Why Apache Spark doesn't use Calcite?
Thanks Xiao, a more up-to-date publication in a conference like VLDB would certainly turn the tide for many of us trying to defend Spark's optimizer.

On Wed, Jan 15, 2020 at 9:39 AM Xiao Li wrote:

> In the upcoming Spark 3.0, we introduced a new framework for Adaptive
> Query Execution in Catalyst. It can adjust plans based on runtime
> statistics. This is missing in Calcite, based on my understanding.
>
> Catalyst is also very easy to enhance. We also use a dynamic programming
> approach in our cost-based join reordering. If needed, we can also improve
> the existing CBO in the future and make it more general. The Spark SQL
> paper was published five years ago; a lot of great contributions have been
> made in the past five years.
>
> Cheers,
>
> Xiao
Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
EMR has its own fork of Spark, called the EMR runtime. It is not Apache Spark. You might need to talk with them instead of posting questions in the Apache Spark community.

Cheers,

Xiao

On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov wrote:

> Hi all,
>
> First of all, let me say that I am pretty new to Spark, so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it got done slower than when I had run it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> Spark itself, but because of some difference introduced in the EMR
> releases? At the moment I can't think of any other alternative besides it
> being a bug...
>
> Here are the two event logs:
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah, I was considering submitting a bug report, but the guide said
> it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]
Hi all,

First of all, let me say that I am pretty new to Spark, so this could be entirely my fault somehow...

I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4, and it got done slower than when I had run it locally (on Spark 2.4.1). I checked the event logs, and the one from the newer version had more stages.

Then I decided to do a comparison in the same environment, so I created two versions of the same cluster with the only difference being the EMR release, and hence the Spark version(?) - the first one was emr-5.24.1 with Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough, the same thing happened, with the newer version having more stages and taking almost twice as long to finish.

So I am pretty much at a loss here - could it be that it is not because of Spark itself, but because of some difference introduced in the EMR releases? At the moment I can't think of any other alternative besides it being a bug...

Here are the two event logs:
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
and my code is here:
https://github.com/kgskgs/stars-spark3d

I ran it like so on the clusters (after putting it on s3):

spark-submit --deploy-mode cluster --py-files s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100 --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

So yeah, I was considering submitting a bug report, but the guide said it's better to ask here first, so any ideas on what's going on? Maybe I am missing something?

Regards,
Kalin
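(A side note for anyone wanting to quantify the "more stages" observation: the sketch below counts completed stages in a Spark event log, assuming the standard JSON-lines event log format; the file paths in the commented usage are placeholders, not the actual log names.)

```python
import json

def count_stages(event_log_path):
    """Count completed stages in a Spark event log (one JSON object per line)."""
    stages = 0
    with open(event_log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Each finished stage emits a SparkListenerStageCompleted event.
            if event.get("Event") == "SparkListenerStageCompleted":
                stages += 1
    return stages

# Hypothetical usage on local copies of the two logs:
# print(count_stages("log_spark242"), count_stages("log_spark244"))
```

Running this over both logs would give a concrete stage count per version to attach to any JIRA.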
Re: Why Apache Spark doesn't use Calcite?
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query Execution in Catalyst. It can adjust plans based on runtime statistics. This is missing in Calcite, based on my understanding.

Catalyst is also very easy to enhance. We also use a dynamic programming approach in our cost-based join reordering. If needed, we can also improve the existing CBO in the future and make it more general. The Spark SQL paper was published five years ago; a lot of great contributions have been made in the past five years.

Cheers,

Xiao

On Wed, Jan 15, 2020 at 9:23 AM, Debajyoti Roy wrote:

> Thanks all, and Matei.
>
> TL;DR of the conclusion for my particular case:
> Qualitatively, while Catalyst [1] tries to mitigate the learning curve and
> maintenance burden, it lacks the dynamic programming approach used by
> Calcite [2] and risks falling into local minima.
> Quantitatively, there is no reproducible benchmark that fairly compares
> optimizer frameworks apples to apples (excluding execution).
>
> References:
> [1] https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
> [2] https://arxiv.org/pdf/1802.10233.pdf
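(For anyone trying the 3.0 preview releases: adaptive execution is controlled by a SQL conf. A minimal sketch follows; `your_app.py` is a placeholder, and the exact conf surface may still evolve before the final 3.0 release.)

```shell
# Sketch: enable Adaptive Query Execution on a Spark 3.0 preview build.
spark-submit \
  --conf spark.sql.adaptive.enabled=true \
  your_app.py
```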
Re: Why Apache Spark doesn't use Calcite?
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst [1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite [2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares optimizer frameworks apples to apples (excluding execution).

References:
[1] https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
[2] https://arxiv.org/pdf/1802.10233.pdf

On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia wrote:

> I'm pretty sure that Catalyst was built before Calcite, or at least in
> parallel. Calcite 1.0 was only released in 2015. From a technical
> standpoint, building Catalyst in Scala also made it more concise and easier
> to extend than an optimizer written in Java (you can find various
> presentations about how Catalyst works).
>
> Matei
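(The local-minima point above can be made concrete with a toy sketch. This is not Catalyst's or Calcite's actual algorithm; the cost model, cardinalities, and selectivities are invented for illustration. It contrasts an exhaustive search over left-deep join orders, the kind of space a dynamic-programming optimizer explores, with a greedy heuristic; by construction the exhaustive plan is never costlier than the greedy one.)

```python
from itertools import permutations

# Toy cardinalities and pairwise join selectivities (hypothetical numbers).
SIZES = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset("AB"): 0.01, frozenset("AC"): 0.5, frozenset("BC"): 0.1}

def plan_cost(order):
    """Cost of a left-deep join order = total size of intermediate results."""
    joined = [order[0]]
    size = SIZES[order[0]]
    cost = 0.0
    for rel in order[1:]:
        sel = 1.0
        for prev in joined:
            # Apply the selectivity of every predicate now joinable.
            sel *= SEL.get(frozenset((prev, rel)), 1.0)
        size = size * SIZES[rel] * sel
        cost += size
        joined.append(rel)
    return cost

def exhaustive_best():
    """Search every order, as an exhaustive/DP-style optimizer would."""
    return min(permutations(SIZES), key=plan_cost)

def greedy_order():
    """Heuristic: start from the smallest relation, always add the cheapest next join."""
    remaining = set(SIZES)
    order = [min(remaining, key=SIZES.get)]
    remaining.discard(order[0])
    while remaining:
        nxt = min(remaining, key=lambda r: plan_cost(order + [r]))
        order.append(nxt)
        remaining.discard(nxt)
    return order
```

On unlucky instances the greedy order is strictly worse, which is the local minimum the TL;DR refers to; on this tiny instance both approaches happen to find a cheap plan.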