External hive metastore (remote) managed tables

2020-05-28 Thread Debajyoti Roy
Hi, does anyone know the behavior of dropping managed tables when using an
external Hive metastore?

Does deletion of the data (e.g., from the object store) happen from Spark
SQL or from the external Hive metastore?

I'm confused by the local-mode and remote-mode code paths.
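
A minimal way to observe the behavior yourself, as a sketch: assume `spark`
is a SparkSession built with Hive support and hive.metastore.uris pointing
at the external metastore; the table name is illustrative.

    // Sketch: assumes `spark` has Hive support enabled and talks to the
    // external metastore; `demo_managed` is an illustrative name.
    spark.sql("CREATE TABLE demo_managed (id INT) USING parquet")  // no LOCATION => managed table
    spark.sql("DESCRIBE EXTENDED demo_managed").show(100, truncate = false)  // the Type row shows MANAGED vs EXTERNAL
    spark.sql("DROP TABLE demo_managed")  // for a managed table, the data is deleted along with the metadata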


Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks Xiao, a more up-to-date publication in a conference like VLDB would
certainly turn the tide for many of us trying to defend Spark's optimizer.

On Wed, Jan 15, 2020 at 9:39 AM Xiao Li  wrote:

> In the upcoming Spark 3.0, we introduced a new framework for Adaptive
> Query Execution in Catalyst. It can adjust plans based on runtime
> statistics. This is missing in Calcite, based on my understanding.
>
> Catalyst is also very easy to enhance. We also use a dynamic programming
> approach in our cost-based join reordering. If needed, we can also improve
> the existing CBO in the future and make it more general. The Spark SQL
> paper was published five years ago, and a lot of great contributions have
> been made in those five years.
>
> Cheers,
>
> Xiao
>
> Debajyoti Roy  wrote on Wednesday, January 15, 2020 at 9:23 AM:
>
>> Thanks all, and Matei.
>>
>> TL;DR of the conclusion for my particular case:
>> Qualitatively, while Catalyst[1] tries to mitigate the learning curve and
>> maintenance burden, it lacks the dynamic programming approach used by
>> Calcite[2] and risks falling into local minima.
>> Quantitatively, there is no reproducible benchmark that fairly compares
>> optimizer frameworks apples to apples (excluding execution).
>>
>> References:
>> [1] -
>> https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
>> [2] - https://arxiv.org/pdf/1802.10233.pdf
>>
>> On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia  wrote:
>>
>>> I’m pretty sure that Catalyst was built before Calcite, or at least in
>>> parallel. Calcite 1.0 was only released in 2015. From a technical
>>> standpoint, building Catalyst in Scala also made it more concise and easier
>>> to extend than an optimizer written in Java (you can find various
>>> presentations about how Catalyst works).
>>>
>>> Matei
>>>
>>> > On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
>>> >
>>> > It's fairly common for adapters (Calcite's abstraction of a data
>>> > source) to push down predicates. However, the API certainly looks a
>>> > lot different from Catalyst's.
>>> > --
>>> > Michael Mior
>>> > mm...@apache.org
>>> >
>>> > On Mon, Jan 13, 2020 at 9:45 AM, Jason Nerothin  wrote:
>>> >>
>>> >> The implementation they chose supports push-down predicates, Datasets,
>>> >> and other features that are not available in Calcite:
>>> >>
>>> >> https://databricks.com/glossary/catalyst-optimizer
>>> >>
>>> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
>>> >>>
>>> >>> Was there a qualitative or quantitative benchmark done before a
>>> >>> design decision was made not to use Calcite?
>>> >>>
>>> >>> Are there limitations (for heuristic-based, cost-based, *-aware
>>> >>> optimizers) in Calcite, and in frameworks built on top of Calcite?
>>> >>> In the context of big data / TPC-H benchmarks.
>>> >>>
>>> >>> I was unable to dig up anything concrete from the user group / Jira.
>>> >>> I'd appreciate it if any Catalyst veteran here can give me pointers.
>>> >>> Trying to defend Spark/Catalyst.
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Thanks,
>>> >> Jason
>>> >
>>>
>>>
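
For reference, both Catalyst features Xiao mentions above are driven by
configuration. A minimal sketch, assuming Spark 3.0+ (the config keys are
real Spark SQL settings; the app name and the particular combination shown
are illustrative):

    import org.apache.spark.sql.SparkSession

    // Sketch (Spark 3.0+): enable Adaptive Query Execution and the
    // cost-based optimizer with its DP-based join reordering.
    val spark = SparkSession.builder()
      .appName("catalyst-features-demo")
      .config("spark.sql.adaptive.enabled", "true")        // AQE: re-optimize plans using runtime statistics
      .config("spark.sql.cbo.enabled", "true")             // CBO: requires table/column stats from ANALYZE TABLE
      .config("spark.sql.cbo.joinReorder.enabled", "true") // dynamic-programming join reordering
      .getOrCreate()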


Spark Dataset transformations for time-based events

2018-12-25 Thread Debajyoti Roy
Hope everyone is enjoying their holidays.

If anyone here has run into these time-based event transformation patterns
or has a strong opinion about the approach, please let me know or reply on
SO (minimal sketches follow the links below):

   1. Enrich using as-of-time:
   
https://stackoverflow.com/questions/53928880/how-to-do-a-time-based-as-of-join-of-two-datasets-in-apache-spark
   2. Snapshot of state with time to state with effective start and end
   time:
   
https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400


Thanks in advance!
Roy
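
Minimal sketches of window-based approaches to both questions (assuming an
existing SparkSession `spark`; column names and sample data are
illustrative, not taken from the SO answers):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._ // assumes an existing SparkSession `spark`

    // 2. Snapshots (id, ts, state) -> effective start/end times: each state
    //    is bounded by the next snapshot; a null end_ts means still current.
    val snapshots = Seq((1, 10L, "A"), (1, 20L, "B")).toDF("id", "ts", "state")
    val byId = Window.partitionBy("id").orderBy("ts")
    val intervals = snapshots
      .withColumn("end_ts", lead("ts", 1).over(byId))
      .withColumnRenamed("ts", "start_ts")

    // 1. As-of join: attach the latest dims value with dims.ts <= facts.ts
    //    by unioning both sides and carrying the last non-null value forward.
    val dims  = Seq(("k", 5L, "v1"), ("k", 25L, "v2")).toDF("key", "ts", "value")
    val facts = Seq(("k", 12L), ("k", 30L)).toDF("key", "ts")
    val asOf = facts
      .withColumn("value", lit(null).cast("string"))
      .withColumn("is_fact", lit(true))
      .unionByName(dims.withColumn("is_fact", lit(false)))
      .withColumn("value", last("value", ignoreNulls = true).over(
        Window.partitionBy("key").orderBy("ts", "is_fact") // dims sort before facts on tied timestamps
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .filter($"is_fact").drop("is_fact")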


Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread Debajyoti Roy
The problem statement and an approach to solve it using windows are
described here:

https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e

Looking for more elegant or performant solutions, if they exist. TIA!
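
One common window-based formulation, as a sketch (not necessarily the exact
approach in the SO post; column names and data are illustrative): turn each
event into a +1 marker at its start and a -1 marker at its end, then take a
running sum in time order.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._ // assumes an existing SparkSession `spark`

    val events = Seq((1, 0L, 10L), (2, 5L, 15L), (3, 7L, 8L)).toDF("id", "start_ts", "end_ts")
    val deltas = events.select($"start_ts".as("ts"), lit(1).as("delta"))
      .unionByName(events.select($"end_ts".as("ts"), lit(-1).as("delta")))
    // Running sum = number of simultaneous events; ordering -1 before +1 on
    // tied timestamps treats intervals as half-open [start, end).
    val concurrency = deltas.withColumn("concurrent",
      sum("delta").over(Window.orderBy($"ts", $"delta")))
    concurrency.agg(max("concurrent")).show() // peak concurrency
    // NB: an un-partitioned window pulls all rows into a single partition,
    // which is exactly the performance concern the question raises.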