Re: Calcite Adapter Question

2020-01-22 Thread Michael Mior
A common approach to using the adapters is via a JDBC connection
string which specifies the required parameters for the adapter
(example below in Scala since I had the code handy).

  import java.sql.DriverManager
  import java.util.Properties

  val connString =
    "jdbc:calcite:schemaFactory=" +
      "org.apache.calcite.adapter.cassandra.CassandraSchemaFactory" +
      "; schema.host=" + host +
      "; schema.port=" + port +
      "; schema.keyspace=" + keyspaceName
  val connectionProps = new Properties()
  connectionProps.put("user", "admin")
  connectionProps.put("password", "admin")
  val conn = DriverManager.getConnection(connString, connectionProps)

However, I'm guessing this isn't quite what you want. Alternatively,
you can use the adapter's schema factory (e.g. CassandraSchemaFactory)
to construct a schema instance (CassandraSchema). From there you could
use Calcite's RelBuilder (or parse from SQL) to build a query. The
rules for pushdown should automatically be registered in Calcite's
planner if you're using Calcite. I'm not too familiar with Drill's
query planning, but I'm sure something similar would work.
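
To make that concrete, here is a rough, untested Java sketch of the
schema-factory approach; the operand keys mirror the schema.* parameters in
the connection string above, and the keyspace, table, and column names are
just placeholders:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.calcite.adapter.cassandra.CassandraSchemaFactory;
  import org.apache.calcite.plan.RelOptUtil;
  import org.apache.calcite.rel.RelNode;
  import org.apache.calcite.schema.Schema;
  import org.apache.calcite.schema.SchemaPlus;
  import org.apache.calcite.tools.FrameworkConfig;
  import org.apache.calcite.tools.Frameworks;
  import org.apache.calcite.tools.RelBuilder;

  public class CassandraRelBuilderDemo {
    public static void main(String[] args) {
      SchemaPlus rootSchema = Frameworks.createRootSchema(true);

      // Same parameters as the "schema.*" keys in the connection string.
      Map<String, Object> operand = new HashMap<>();
      operand.put("host", "localhost");
      operand.put("port", "9042");
      operand.put("keyspace", "twissandra");  // placeholder keyspace

      Schema schema =
          new CassandraSchemaFactory().create(rootSchema, "cass", operand);
      rootSchema.add("cass", schema);

      // Build a query plan against the schema with RelBuilder; the
      // pushdown rules fire later, during planning.
      FrameworkConfig config =
          Frameworks.newConfigBuilder().defaultSchema(rootSchema).build();
      RelBuilder builder = RelBuilder.create(config);
      RelNode rel = builder
          .scan("cass", "users")                // placeholder table
          .project(builder.field("username"))   // placeholder column
          .build();
      System.out.println(RelOptUtil.toString(rel));
    }
  }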

--
Michael Mior
mm...@apache.org

On Fri, Jan 17, 2020 at 1:29 PM Charles Givre  wrote:
>
> Hello Calcite Devs!
> My name is Charles Givre and I'm the PMC Chair for Apache Drill, which uses 
> Calcite for query planning among other things.  I'm working on extending the 
> number of systems that Drill can connect to and I saw that Calcite has a 
> number of adapters for various systems like Cassandra and Elasticsearch.
>
> Could anyone point me to some resources as to how these adapters can be used 
> (or extended) so that Drill could use them?
> Thank you very much!
> -- C


Re: [DISCUSS] [CALCITE-3271] EMIT syntax proposal for event-timestamp semantic windowing

2020-01-22 Thread Rui Wang
That makes sense. I will remove the aggregation constraint (i.e. that EMIT
requires a GROUP BY). Let's say that EMIT now works on any query, for general
use.

Because the discussion above contains a lot of information, let me summarize
the critical points here and see if we can reach a consensus:

1. Do we agree that EMIT should execute after FROM, but before any other
clause, assuming EMIT works on any query?
My opinion is that EMIT should execute after FROM. This matches what Julian
has said: "Think of it as executing the query at T, then executing it at
T+delta". EMIT just controls how large the delta is, and all the other
clauses (WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, etc.) then apply as usual.
It also matches the classic DB case, where EMIT produces a single delta,
exactly once, spanning -inf to +inf on the timeline.
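
To illustrate the T / T+delta semantics, here is a toy Java sketch (purely
illustrative, not tied to any engine): diffing the result multisets at T and
T+delta yields exactly the added and retracted rows that would be emitted.

  import java.util.HashMap;
  import java.util.Map;

  public class EmitDiffSketch {
    // Results as multisets: row -> multiplicity.
    static void emitDelta(Map<String, Integer> atT,
                          Map<String, Integer> atTPlusDelta) {
      Map<String, Integer> delta = new HashMap<>(atTPlusDelta);
      atT.forEach((row, n) -> delta.merge(row, -n, Integer::sum));
      delta.forEach((row, n) -> {
        if (n > 0) System.out.println("add     " + row);
        if (n < 0) System.out.println("retract " + row);
      });
    }

    public static void main(String[] args) {
      // Julian's example: a new order (product 100, amount 2) arrives.
      emitDelta(Map.of("(100, 6)", 1),    // result at T
                Map.of("(100, 8)", 1));   // result at T+delta
      // Prints "retract (100, 6)" and "add (100, 8)" (order unspecified).
    }
  }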


2. Can we support EMIT predicates rather than a fixed list of emit strategies?
To recap, the pros and cons of EMIT predicates (a rough sketch of the
predicate idea follows below):
pros: 1) extensible via a few predefined functions; if a new need arises, it
is more likely to be met by defining a new function than by defining new
keywords/syntax. 2) easy to understand (think of the predicates as being
applied to tables to decide when to emit rows).
cons: 1) users gain a lot of power to write arbitrary expressions.
The pros and cons of a dedicated EMIT strategy syntax:
pros: 1) users will not have much power to write expressions, since the
syntax is fixed (though they can tune a few parameters).
cons: 1) it is hard to explain to SQL people (it sounds like a hack); 2)
there are five or more known strategies, so we would need a list longer than
the one proposed in the paper; 3) potential backward-compatibility issues if
the emit strategies change.
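
For the EMIT-predicate option, a rough, engine-agnostic Java sketch (the Row
fields, the watermark, and all names here are hypothetical, just to show the
idea of a predicate deciding per row/group whether to emit now):

  import java.util.List;
  import java.util.function.Predicate;

  public class EmitPredicateSketch {
    // Hypothetical shape of one windowed, grouped result row.
    record Row(String key, long windowEnd, long count) {}

    // An "emit predicate", e.g. emit once the watermark passes window end.
    static Predicate<Row> afterWatermark(long watermark) {
      return row -> watermark >= row.windowEnd;
    }

    // Rows passing the predicate are emitted now; the rest are held back
    // and re-checked later (or discarded when their window closes).
    static void maybeEmit(List<Row> rows, Predicate<Row> emit) {
      for (Row row : rows) {
        if (emit.test(row)) {
          System.out.println("emit " + row);
        }
      }
    }
  }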

Lastly, regarding the table-evolution problem that Julian mentioned (e.g.
seeing (100, 6) retracted and (100, 8) added), I totally agree, because of
the nature of streaming: you probably never know whether the data is complete
when a result is emitted, so the result could be updated later.


-Rui


On Wed, Jan 22, 2020 at 11:29 AM Julian Hyde  wrote:

> In the SIGMOD paper, EMIT could be applied to any query. Think of it as
> executing the query at T, then executing it at T+delta, comparing the
> results, and emitting any added or removed records.
>
> Tying it to aggregate queries seems wrong. (If we have a special semantics
> for aggregate queries, how are we possibly going to make the semantics
> well-defined for, say, user-defined table functions?)
>
> Yes, I know aggregate queries have weird behavior. If you have computed
> ’select product, sum(amount) from orders group by product’, and a new order
> arrives with (product 100, amount 2), then you are going to see (100, 6)
> retracted and (100, 8) added. But I think we have to live with that.
> Otherwise the EMIT semantics get a lot more complicated.
>
> Julian
>
>
> > On Jan 21, 2020, at 1:24 PM, Rui Wang  wrote:
> >
> > I think there was a big omission in my summary about the position of EMIT
> > and the execution order (and I forgot about ORDER BY). Let me try to
> > address them here:
> >
> > SELECT
> >
> > [FROM TVF windowing] // windowing happens here
> >
> > [WHERE clause]
> >
> > [GROUP BY clause]
> >
> > [HAVING clause]
> >
> > [ORDER BY clause]
> >
> > [LIMIT clause]
> > [EMIT clause] // materialization latency
> >
> > The position of EMIT is indeed a bit confusing, as the right execution
> > order should be: FROM -> EMIT -> the other clauses as in a normal query.
> > FROM continuously generates data, EMIT decides when to emit a part of that
> > data, and then the other clauses are applied to the emitted data and
> > update the downstream.
> >
> > So there are at least two open questions:
> > 1. What should we use for EMIT? EMIT HAVING (can use aggregation columns
> > like COUNT(*)), EMIT WHERE (can only use a single column alias, like
> > ORDER BY), or EMIT AFTER (not yet defined whether we want to support
> > expressions; I hope we do).
> > 2. Where does the EMIT clause go? Maybe the clearest position is right
> > after FROM.
> >
> >
> > -Rui
> >
> >
> > On Tue, Jan 21, 2020 at 1:09 PM Rui Wang  wrote:
> >
> >>
> >>
> >> On Tue, Jan 21, 2020 at 12:34 PM Julian Hyde  wrote:
> >>
> >>> Does EMIT HAVING have anything to do with aggregate queries (GROUP BY
> >>> and HAVING), or is it just a coincidence that you use the same word,
> >>> HAVING?
> >>>
> >>
> >> EMIT HAVING is independent of HAVING, but EMIT HAVING does have a
> >> relationship to GROUP BY: EMIT HAVING requires a GROUP BY. You group by
> >> a key, then apply the EMIT HAVING expressions to the sets specified by
> >> those keys. However, we could loosen the constraint to allow EMIT HAVING
> >> to appear even without GROUP BY, which would just mean applying emit
> >> control to the whole data set rather than per group.
> >>
> >> In my opinion, the execution order is: grouping (GROUP BY) -> EMIT
> >> control (EMIT HAVING decides which part of the data can be emitted) ->
> >> aggregation (the normal HAVING and other aggregations). For batch/classic
> >> DB workloads, the EMIT step will always emit all data, so this idea is
> >> compatible with existing DB users.

Re: [DISCUSS] [CALCITE-3271] EMIT syntax proposal for event-timestamp semantic windowing

2020-01-22 Thread Julian Hyde
In the SIGMOD paper, EMIT could be applied to any query. Think of it as
executing the query at T, then executing it at T+delta, comparing the
results, and emitting any added or removed records.

Tying it to aggregate queries seems wrong. (If we have a special semantics for 
aggregate queries, how are we possibly going to make the semantics well-defined 
for, say, user-defined table functions?)

Yes, I know aggregate queries have weird behavior. If you have computed
’select product, sum(amount) from orders group by product’, and a new order
arrives with (product 100, amount 2), then you are going to see (100, 6)
retracted and (100, 8) added. But I think we have to live with that.
Otherwise the EMIT semantics get a lot more complicated.

Julian


> On Jan 21, 2020, at 1:24 PM, Rui Wang  wrote:
> 
> I think there was a big omission in my summary about the position of EMIT
> and the execution order (and I forgot about ORDER BY). Let me try to
> address them here:
> 
> SELECT
> 
> [FROM TVF windowing] // windowing happens here
> 
> [WHERE clause]
> 
> [GROUP BY clause]
> 
> [HAVING clause]
> 
> [ORDER BY clause]
> 
> [LIMIT clause]
> [EMIT clause] // materialization latency
> 
> The position of EMIT is indeed a bit confusing, as the right execution
> order should be: FROM -> EMIT -> the other clauses as in a normal query.
> FROM continuously generates data, EMIT decides when to emit a part of that
> data, and then the other clauses are applied to the emitted data and update
> the downstream.
> 
> So there are at least two open questions:
> 1. What should we use for EMIT? EMIT HAVING (can use aggregation columns
> like COUNT(*)), EMIT WHERE (can only use a single column alias, like ORDER
> BY), or EMIT AFTER (not yet defined whether we want to support expressions;
> I hope we do).
> 2. Where does the EMIT clause go? Maybe the clearest position is right
> after FROM.
> 
> 
> -Rui
> 
> 
> On Tue, Jan 21, 2020 at 1:09 PM Rui Wang  wrote:
> 
>> 
>> 
>> On Tue, Jan 21, 2020 at 12:34 PM Julian Hyde  wrote:
>> 
>>> Does EMIT HAVING have anything to do with aggregate queries (GROUP BY
>>> and HAVING), or is it just a coincidence that you use the same word,
>>> HAVING?
>>> 
>> 
>> EMIT HAVING is independent of HAVING, but EMIT HAVING does have a
>> relationship to GROUP BY: EMIT HAVING requires a GROUP BY. You group by a
>> key, then apply the EMIT HAVING expressions to the sets specified by those
>> keys. However, we could loosen the constraint to allow EMIT HAVING to
>> appear even without GROUP BY, which would just mean applying emit control
>> to the whole data set rather than per group.
>> 
>> In my opinion, the execution order is: grouping (GROUP BY) -> EMIT control
>> (EMIT HAVING decides which part of the data can be emitted) -> aggregation
>> (the normal HAVING and other aggregations). For batch/classic DB workloads,
>> the EMIT step will always emit all data, so this idea is compatible with
>> existing DB users.
>> 
>> I happened to choose EMIT HAVING because emit control is very similar to
>> HAVING (and some bits of WHERE); the only difference is that HAVING is a
>> filter while EMIT HAVING controls the emit. E.g. applying HAVING
>> expressions to data decides whether to pass the data downstream at all,
>> while applying EMIT HAVING expressions decides whether to pass the data
>> downstream now or later (or to discard it if the window closes).
>> 
>> If you think the idea actually causes confusion rather than making it
>> easier to onboard people to streaming SQL, we can replace EMIT HAVING with
>> EMIT AFTER, per the original design.
>> 
>> I support the idea of latency controls, but I am nervous about
>>> allowing full expressions in the EMIT clause if we don't have to.
>>> 
>>> 
>> Yep. It's a design choice between allowing expressions and keeping a set
>> of dedicated SQL syntaxes for particular emit strategies. If we don't use
>> extensible EMIT expressions, we will likely need to provide a long list of
>> syntaxes for the different emit strategies. For example:
>> EMIT AFTER WATERMARK
>> EMIT AFTER DELAY
>> EMIT AFTER WATERMARK BUT LATE
>> EMIT AFTER COUNT
>> EMIT AFTER BEFORE WATERMARK
>> etc.
>> 
>> Again it's a design choice so I am open to both ideas.
>> 
>> However, I personally prefer the EMIT expressions idea because I found it
>> very easy to explain EMIT expressions to SQL people who don't have much
>> streaming background. Basically, you can say that EMIT expressions are just
>> applied to the rows of the table produced by the table-valued function. If
>> there is a GROUP BY, the expression is applied to each group accordingly,
>> and the result of the expression indicates whether it's ready to emit. This
>> ease comes mainly from the fact that we have introduced window start and
>> window end into the table, so we have all the data we need in the table to
>> write expressions against (except for processing-time triggers).
>> 
>> The downside of expressions, though, is that people will be allowed to
>> write any expression they want, and engines will take responsibility for
>> validating those.
>> 
>> 
>> 
>>> Aggregate queries have a 


Re: Calcite Adapter Question

2020-01-22 Thread Julian Hyde
I know that Drill uses Calcite for query planning, and therefore I expect that 
Drill’s adapter model is probably fairly similar to Calcite’s. In fact, I have 
long wished that we used the same adapter model, so that we could share 
adapters.

Drill already uses Calcite as a runtime library, so that shouldn’t be a problem.

There is a page that documents adapters[1]. The heart of each adapter is a 
schema factory[2], which provides the metadata necessary to validate and plan a 
query. You should probably write a meta-adapter that takes a Calcite schema 
factory and converts it into a Drill adapter.
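
For reference, the interface in [2] is small; a meta-adapter would
essentially wrap objects produced by this one method (the signature below is
copied from the Calcite javadoc, with comments added):

  public interface SchemaFactory {
    // Creates a Schema. The "operand" map carries the adapter-specific
    // parameters (e.g. host, keyspace) from the model file or the
    // connection string.
    Schema create(SchemaPlus parentSchema, String name,
        Map<String, Object> operand);
  }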

Each adapter has a test, e.g. CassandraAdapterTest. It is a good idea to 
copy-paste that test into Drill’s test suite so that you know you are 
inheriting the same functionality.

Julian

[1] https://calcite.apache.org/docs/adapter.html
[2] https://calcite.apache.org/apidocs/org/apache/calcite/schema/SchemaFactory.html

> On Jan 17, 2020, at 10:52 AM, Andrei Sereda  wrote:
> 
> Hi Charles,
> 
> There is some documentation here:
> https://calcite.apache.org/docs/adapter.html. It describes how to set up
> and use an adapter.
> 
> Note that you can only use those adapters within the Calcite runtime and
> not as a standalone library.
> 
> A typical adapter would expose a collection (for Mongo), an index (for ES),
> a region (for Geode), etc. as a Calcite schema / table which you can query
> using relational algebra.
> 
> How exactly do you want to extend the existing adapters?
> 
> Hope that helps.
> 
> Regards,
> Andrei.
> 
> 
> 
> 
> On Fri, Jan 17, 2020 at 1:29 PM Charles Givre  wrote:
> 
>> Hello Calcite Devs!
>> My name is Charles Givre and I'm the PMC Chair for Apache Drill, which
>> uses Calcite for query planning among other things.  I'm working on
>> extending the number of systems that Drill can connect to and I saw that
>> Calcite has a number of adapters for various systems like Cassandra and
>> Elasticsearch.
>> 
>> Could anyone point me to some resources as to how these adapters can be
>> used (or extended) so that Drill could use them?
>> Thank you very much!
>> -- C



Re: [DISCUSS] Towards Calcite 1.22

2020-01-22 Thread Andrei Sereda
Hello,

I would like to ask the community if it is OK for the 1.22 release to be
delayed by 2-3 weeks.

I have tried to book a week off work but, unfortunately, right now that is
not easy (things should be better in February).

Another option is to swap with somebody who is doing the next (1.2[345]) release.

Please let me know what you think and apologies for the inconvenience.

Andrei.


On Wed, Dec 4, 2019 at 4:55 PM Julian Hyde  wrote:

> I don’t mind whether the release happens in December or January, but
> either way, let’s start burning down the backlog of PRs now.
>
>
> > On Dec 2, 2019, at 11:43 PM, Enrico Olivelli  wrote:
> >
> > Andrei
> >
> > On Tue, Dec 3, 2019 at 8:21 AM Rui Wang  wrote:
> >
> >> Thank you for this notification.
> >>
> >> Please try to get CALCITE-3272[1] into 1.22 (I have already changed the
> >> fix version to 1.22).
> >>
> >>
> >> [1]: https://github.com/apache/calcite/pull/1587
> >>
> >> -Rui
> >>
> >> On Mon, Dec 2, 2019 at 6:35 PM Chunwei Lei  wrote:
> >>
> >>> Thank you for your work, Andrei.
> >>>
> >>> Let's get CALCITE-1581[1] into 1.22.
> >>>
> >>> +1 for a release in early-mid January '20 (to have more time to review
> >>> PRs).
> >>>
> >>>
> >>> [1] https://github.com/apache/calcite/pull/1138
> >>>
> >>>
> >>> Best,
> >>> Chunwei
> >>>
> >>>
> >>> On Tue, Dec 3, 2019 at 8:02 AM Andrei Sereda  wrote:
> >>>
>  Hello,
> 
>  Calcite 1.21 was released about 3 months ago (2019-09) and it is time to
>  start preparation for 1.22.
> 
>  Current open issues and pull requests can be seen in [1] and [2]. There
>  are many PRs left from previous releases and it would be nice to review as
>  many as possible. Please change the "fix version" in JIRA to 1.22 if you
>  would like the contribution to be considered for this release. It is also
>  helpful to mark PRs with the "LGTM-will-merge-soon" label so that other
>  contributors are aware of your review.
> 
>  Committers, please go over the existing PRs and try to prioritize /
>  finalize them. Also let me know which changes (in your opinion) are ready
>  or should be considered for this release. Don't forget that the current
>  policy of frequent releases allows better work scheduling without blocking
>  the existing release plan.
> 
>  In terms of dates, let's agree on a release time frame: late December '19
>  or early-mid January '20?
> >
> > I (from the HerdDB community) would prefer late December if possible, as
> > we are stuck on an older 1.19 version of Calcite.
> > The current Calcite master is in great shape from our point of view.
> >
> > Thanks for driving this
> > Enrico
> >
>  Let me know if I have missed anything or if the current plan is
>  inconvenient.
> 
>  Thanks,
>  Andrei.
> 
>  [1] https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12333950
>  [2] https://github.com/apache/calcite/pulls
>


[jira] [Created] (CALCITE-3754) Redis tests fail on ppc64le

2020-01-22 Thread AK97 (Jira)
AK97 created CALCITE-3754:
-

 Summary: Redis tests fail on ppc64le
 Key: CALCITE-3754
 URL: https://issues.apache.org/jira/browse/CALCITE-3754
 Project: Calcite
  Issue Type: Task
 Environment: arch: ppc64le
Reporter: AK97


I have been trying to build Apache Calcite on ubuntu:16.04/ppc64le; however,
the test cases in the Redis module are failing with the following error:

{code:java}
 Task :redis:test FAILED
Gradle Test Executor 15 STANDARD_OUT
/tmp/1579669916467-0/redis-server-2.8.19: 1: 
/tmp/1579669916467-0/redis-server-2.8.19: ELF: not found
/tmp/1579669916467-0/redis-server-2.8.19: 2: 
/tmp/1579669916467-0/redis-server-2.8.19: Syntax error: word unexpected 
(expecting ")")
FAILURE   0.1sec, org.apache.calcite.adapter.redis.RedisAdapterCaseBase > 
testSqlWithJoin()
RedisAdapterCaseBase > testSqlWithJoin() FAILED
java.lang.RuntimeException: Can't start redis server. Check logs for 
details.
at 
redis.embedded.AbstractRedisInstance.awaitRedisServerReady(AbstractRedisInstance.java:61)
at 
redis.embedded.AbstractRedisInstance.start(AbstractRedisInstance.java:39)
at redis.embedded.RedisServer.start(RedisServer.java:9)
at 
org.apache.calcite.adapter.redis.RedisCaseBase.createRedisServer(RedisCaseBase.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:675)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:125)
{code}
I would like some help understanding the cause of this failure. I am running
it on a high-end VM with good connectivity.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)