External Hive metastore (remote) managed tables
Hi, does anyone know the behavior of dropping managed tables when using an external Hive metastore: does deletion of the data (e.g., from the object store) happen from Spark SQL, or from the external Hive metastore? I'm confused by the local-mode and remote-mode code paths.
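While waiting for a definitive answer on which side performs the delete, one thing you can verify from the Spark side is whether the table is actually managed before dropping it, since only managed tables lose their data on DROP. A minimal sketch (mydb.events is a placeholder name):

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() makes Spark talk to the metastore configured
    // in hive-site.xml (embedded/local or remote).
    val spark = SparkSession.builder()
      .appName("managed-table-drop")
      .enableHiveSupport()
      .getOrCreate()

    // tableType is "MANAGED" or "EXTERNAL"; a managed table's data files
    // go away on DROP TABLE, an external table's are left in place.
    val tableType = spark.catalog.getTable("mydb.events").tableType
    println(s"mydb.events is $tableType")

    // For a managed table this removes both the metastore entry and the data.
    spark.sql("DROP TABLE IF EXISTS mydb.events")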
Re: Why Apache Spark doesn't use Calcite?
Thanks Xiao, a more up-to-date publication in a conference like VLDB would certainly turn the tide for many of us trying to defend Spark's optimizer.

On Wed, Jan 15, 2020 at 9:39 AM Xiao Li wrote:

> In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query Execution in Catalyst. This can adjust plans based on runtime statistics. This is missing in Calcite, to my understanding.
>
> Catalyst is also very easy to enhance. We also use a dynamic programming approach in our cost-based join reordering. If needed, we can also improve the existing CBO in the future and make it more general. The Spark SQL paper was published 5 years ago; a lot of great contributions have been made in the past 5 years.
>
> Cheers,
>
> Xiao
>
> On Wed, Jan 15, 2020 at 9:23 AM, Debajyoti Roy wrote:
>
>> Thanks all, and Matei.
>>
>> TL;DR of the conclusion for my particular case:
>> Qualitatively, while Catalyst [1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite [2] and risks falling into local minima.
>> Quantitatively, there is no reproducible benchmark that fairly compares optimizer frameworks apples to apples (excluding execution).
>>
>> References:
>> [1] https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
>> [2] https://arxiv.org/pdf/1802.10233.pdf
>>
>> On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia wrote:
>>
>>> I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations about how Catalyst works).
>>>
>>> Matei
>>>
>>>> On Jan 13, 2020, at 8:41 AM, Michael Mior wrote:
>>>>
>>>> It's fairly common for adapters (Calcite's abstraction of a data source) to push down predicates. However, the API certainly looks a lot different than Catalyst's.
>>>> --
>>>> Michael Mior
>>>> mm...@apache.org
>>>>
>>>> On Mon, Jan 13, 2020 at 9:45 AM, Jason Nerothin wrote:
>>>>
>>>>> The implementation they chose supports pushdown predicates, Datasets, and other features that are not available in Calcite:
>>>>>
>>>>> https://databricks.com/glossary/catalyst-optimizer
>>>>>
>>>>> On Mon, Jan 13, 2020 at 8:24 AM newroyker wrote:
>>>>>
>>>>>> Was there a qualitative or quantitative benchmark done before the design decision was made not to use Calcite?
>>>>>>
>>>>>> Are there limitations (for heuristic-based, cost-based, *-aware optimizers) in Calcite, or in frameworks built on top of Calcite, in the context of big data / TPC-H benchmarks?
>>>>>>
>>>>>> I was unable to dig up anything concrete from the user group / Jira. I'd appreciate it if any Catalyst veteran here could give me pointers. Trying to defend Spark/Catalyst.
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Jason
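For readers who want to try the features Xiao describes, both AQE and the cost-based join reordering are switched on through configuration. A minimal sketch (Spark 3.x property names; mydb.sales is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("aqe-cbo")
      // Adaptive Query Execution: re-optimizes the plan at runtime
      // using statistics from completed shuffle stages.
      .config("spark.sql.adaptive.enabled", "true")
      // Cost-based optimizer plus its dynamic-programming join reordering.
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .enableHiveSupport()
      .getOrCreate()

    // CBO relies on table/column statistics collected ahead of time.
    spark.sql("ANALYZE TABLE mydb.sales COMPUTE STATISTICS FOR ALL COLUMNS")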
Re: Why Apache Spark doesn't use Calcite?
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst [1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite [2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares optimizer frameworks apples to apples (excluding execution).

References:
[1] https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
[2] https://arxiv.org/pdf/1802.10233.pdf

On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia wrote:

> I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations about how Catalyst works).
>
> Matei
>
>> On Jan 13, 2020, at 8:41 AM, Michael Mior wrote:
>>
>> It's fairly common for adapters (Calcite's abstraction of a data source) to push down predicates. However, the API certainly looks a lot different than Catalyst's.
>> --
>> Michael Mior
>> mm...@apache.org
>>
>> On Mon, Jan 13, 2020 at 9:45 AM, Jason Nerothin wrote:
>>
>>> The implementation they chose supports pushdown predicates, Datasets, and other features that are not available in Calcite:
>>>
>>> https://databricks.com/glossary/catalyst-optimizer
>>>
>>> On Mon, Jan 13, 2020 at 8:24 AM newroyker wrote:
>>>
>>>> Was there a qualitative or quantitative benchmark done before the design decision was made not to use Calcite?
>>>>
>>>> Are there limitations (for heuristic-based, cost-based, *-aware optimizers) in Calcite, or in frameworks built on top of Calcite, in the context of big data / TPC-H benchmarks?
>>>>
>>>> I was unable to dig up anything concrete from the user group / Jira. I'd appreciate it if any Catalyst veteran here could give me pointers. Trying to defend Spark/Catalyst.
>>>
>>> --
>>> Thanks,
>>> Jason
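As a concrete illustration of Matei's point about extensibility, a custom optimizer rule can be injected into Catalyst in a few lines, with no fork of the optimizer. A minimal sketch (the rule itself is a deliberate no-op placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Real rules pattern-match on the logical plan tree and return a
    // rewritten plan; this placeholder returns the plan unchanged.
    object PassThroughRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder().appName("catalyst-extension").getOrCreate()

    // Catalyst runs the extra rule alongside its built-in batches.
    spark.experimental.extraOptimizations ++= Seq(PassThroughRule)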
Spark Dataset transformations for time-based events
Hope everyone is enjoying their holidays. If anyone here has run into these time-based event transformation patterns, or has a strong opinion about the approach, please let me know / reply on SO:

1. Enrich using an as-of time: https://stackoverflow.com/questions/53928880/how-to-do-a-time-based-as-of-join-of-two-datasets-in-apache-spark
2. Turn snapshots of state at time t into states with effective start and end times (see the sketch below): https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400

Thanks in advance! Roy
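For pattern 2, one common approach is a lead() window over the snapshots per entity: each state is effective from its snapshot time until the next snapshot of the same entity, with the last interval left open-ended. A minimal sketch with made-up data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lead}

    val spark = SparkSession.builder().appName("snapshots-to-intervals").getOrCreate()
    import spark.implicits._

    // Hypothetical snapshots: (entity id, state, snapshot time).
    val snapshots = Seq(
      ("a", "created", "2020-01-01 00:00:00"),
      ("a", "active",  "2020-01-02 00:00:00"),
      ("a", "closed",  "2020-01-05 00:00:00")
    ).toDF("id", "state", "ts")

    // A state is effective until the next snapshot of the same entity;
    // the latest state gets a null (open-ended) effective_end.
    val w = Window.partitionBy("id").orderBy("ts")
    val intervals = snapshots
      .withColumn("effective_start", col("ts"))
      .withColumn("effective_end", lead(col("ts"), 1).over(w))
      .drop("ts")

    intervals.show(false)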
Given events with start and end times, how to count the number of simultaneous events using Spark?
The problem statement and an approach that solves it using windows are described here: https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e

Looking for more elegant/performant solutions, if they exist. TIA!
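One approach that avoids a self-join is a sweep line: turn each event into a +1 at its start and a -1 at its end, then take a running sum in time order. A minimal sketch with made-up data (note the single unpartitioned window pulls everything to one partition, which is fine for modest data sizes):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lit, sum}

    val spark = SparkSession.builder().appName("concurrent-events").getOrCreate()
    import spark.implicits._

    // Hypothetical events: (id, start time, end time).
    val events = Seq(
      (1, "2020-01-01 10:00:00", "2020-01-01 11:00:00"),
      (2, "2020-01-01 10:30:00", "2020-01-01 12:00:00"),
      (3, "2020-01-01 11:30:00", "2020-01-01 12:30:00")
    ).toDF("id", "start", "end")

    // +1 when an event opens, -1 when it closes.
    val deltas = events.select(col("start").as("ts"), lit(1).as("delta"))
      .union(events.select(col("end").as("ts"), lit(-1).as("delta")))

    // Ordering by (ts, delta) closes events before opening new ones when
    // timestamps tie; the running sum is the number of open events.
    val w = Window.orderBy(col("ts"), col("delta"))
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val concurrency = deltas.withColumn("open_events", sum("delta").over(w))

    concurrency.show(false)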