Sessionizing raw events / does Calcite support ARRAY_AGG?

2018-12-14 Thread Kenneth Knowles
Hello!

My use case is sessionizing raw events without an aggregation function.
Approximate code that I tried out:

SELECT ARRAY_AGG(ROW(...))
FROM ...
GROUP BY SESSION(...)

(followed by UNNEST to get the raw events, tagged with session info, back
out into a stream)

I get a parser error on the paren after ARRAY_AGG, presumably because it is
an identifier treated as a column name?

So I was digging through Calcite's code and my conclusion is that there is
no implementation of ARRAY_AGG. Is there an alternative? Is there another
way to use Calcite's streaming extensions to do sessionization of raw
events?

Kenn


Problems about subsets clause order for MATCH_RECOGNIZE

2018-12-14 Thread bupt_ljy
Hi all,
  It’s my first time to send emails to Calcite developers. It’s a really good 
project and many projects benefit from it.
  Now I’ve encountered a problem about the subsets for MATCH_RECOGNIZE in 
thetestMatchRecognizeSubset1() testing. From the results, I can tell 
that"subset stdn = (strt, down)"will be explained to "SUBSET \"STDN\" = 
(\"DOWN\", \"STRT\”)”, which confuses me. IMO, It’ll affect the result of 
functions like“FIRST” and“LAST”, which may not be what I want, although it 
works fine with“AVG” function.
  I’m not sure if this is a bug, or anyone can tell me how we arrive here?




Best,
Jiayi Liao

Re: Review Request (CALCITE-2689). ElasticSearch Adapter. Grouping on date / number fields.

2018-12-14 Thread Andrei Sereda
> Is the query still going to be inefficient if the user adds the IS NOT
NULL predicate?
It will be efficient. For explicit null check aggregates can be pushed to
ES since sentinel will never be used.

> Otherwise, I don't find it very problematic to expect the users to right
their queries correclty.
I think users will quickly forget to add IS NOT NULL and then complain
about slow queries. What to do when there are several grouping columns
(GROUP BY a, b, c) ?

I'll start working on Composite aggregations

to remedy this problem (requires ES 6.1+)


On Fri, Dec 14, 2018 at 5:23 AM Stamatis Zampetakis 
wrote:

> I have no strong opinion since I am not using the adapter.
>
> However, I have a small question regarding option 2.
>
> Is the query still going to be inefficient if the user adds the IS NOT NULL
> predicate?
> If yes, then I probably option 1 would be the best option for the moment.
> Otherwise, I don't find it
> very problematic to expect the users to right their queries correclty. The
> same happens with other database
> (e.g., Oracle) where sometimes you need to introduce NULL checks in the
> queries in order to use the
> existing indexes and have efficient queries.
>
> Best,
> Stamatis
>
>
> Στις Πέμ, 13 Δεκ 2018 στις 4:40 μ.μ., ο/η Andrei Sereda 
> έγραψε:
>
> > Thanks, Stamatis.
> >
> > I agree that sentinels are not the best approach but without pushing
> > aggregations to ES they're not very usable to elastic users.
> >
> > So there are two bad choices:
> > 1) (Low?) Probability of collisions triggering invalid results
> (surprises)
> > but efficient query.
> > 2)  Inefficient query (fetching whole index in memory)
> >
> > I hope we can drop support for ES 2.x soon and enforce ES >= 6.1. If so,
> > one can use  Composite aggregations
> > <
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> > >
> > which don't have these shortcomings.
> > What we can do now is have separate aggregation query for ES2 and ES6.
> >
> > What do you think?
> >
> > Regards,
> > Andrei.
> >
> >
> >
> > On Thu, Dec 13, 2018 at 6:16 AM Stamatis Zampetakis 
> > wrote:
> >
> > > Hi Andrei,
> > >
> > > I haven't gone entirely over the new PR, but I think there are cases
> > where
> > > the result of the queries are going to be wrong (when values collide
> with
> > > sentinels).
> > > Another approach would be to introduce a rule in order to push the
> > > aggregation in ElasticSearch *only* if the field in questions is *not*
> > > nullable. This can make
> > > queries a bit more verbose but it guarantees that there will be no
> > suprises
> > > to the end-user.
> > >
> > > select max(amount), date from orders group by amount, date ->
> Aggregation
> > > not pushed on ES
> > > select max(amount), date from orders where amount is not null and date
> is
> > > not null group by amount, date -> Aggregation pushed on ES
> > >
> > > Sorry for the late reply!
> > >
> > > Best,
> > > Stamatis
> > >
> > >
> > > Στις Κυρ, 2 Δεκ 2018 στις 2:01 π.μ., ο/η Julian Hyde  >
> > > έγραψε:
> > >
> > > > Interesting.
> > > >
> > > > One of the benefits that a SQL layer such as Calcite can bring is
> that
> > it
> > > > hides the details necessary to make operations like this work.
> > > >
> > > > Julian
> > > >
> > > >
> > > > > On Dec 1, 2018, at 5:22 AM, Kevin Risden 
> wrote:
> > > > >
> > > > > I haven't had a chance to review but saw that Elastic has the same
> > > issue
> > > > > with aggregations.
> > > > >
> > > > > https://github.com/elastic/elasticsearch/issues/35745
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > > On Wed, Nov 28, 2018, 20:46 Andrei Sereda  > > > >
> > > > >> Greetings ,
> > > > >> We have discovered an issue with ES aggregations when grouping on
> > > > >> non-textual fields (date, long). Currently the following sql fails
> > > > because
> > > > >> for missing value
> > > > >> <
> > > > >>
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_missing_value_13
> > > > >>>
> > > > >> we inject __MISSING__ sentinel which is not date / number
> parseable
> > > (it
> > > > >> can’t be null either) :
> > > > >>
> > > > >> select max(amount), date from orders group by date -- special ES
> > type
> > > > >>
> > > > >> The solution is to make sentinel type specific :
> > > > >>
> > > > >>   1. Integer.MIN_VALUE for integers
> > > > >>   2. -12-31 for dates etc.
> > > > >>
> > > > >> For low cardinality types like boolean, byte, short I’m not sure
> > what
> > > > to do
> > > > >> since there is high probability of collision between missing field
> > and
> > > > >> actual value (eg. what value to choose for missing boolean?) :
> > > > >>
> > > > >> select max(amount), isActive from orders group by isActive --
> > boolean
> > > > type
> > > > >>

Re: Review Request (CALCITE-2689). ElasticSearch Adapter. Grouping on date / number fields.

2018-12-14 Thread Stamatis Zampetakis
I have no strong opinion since I am not using the adapter.

However, I have a small question regarding option 2.

Is the query still going to be inefficient if the user adds the IS NOT NULL
predicate?
If yes, then I probably option 1 would be the best option for the moment.
Otherwise, I don't find it
very problematic to expect the users to right their queries correclty. The
same happens with other database
(e.g., Oracle) where sometimes you need to introduce NULL checks in the
queries in order to use the
existing indexes and have efficient queries.

Best,
Stamatis


Στις Πέμ, 13 Δεκ 2018 στις 4:40 μ.μ., ο/η Andrei Sereda 
έγραψε:

> Thanks, Stamatis.
>
> I agree that sentinels are not the best approach but without pushing
> aggregations to ES they're not very usable to elastic users.
>
> So there are two bad choices:
> 1) (Low?) Probability of collisions triggering invalid results (surprises)
> but efficient query.
> 2)  Inefficient query (fetching whole index in memory)
>
> I hope we can drop support for ES 2.x soon and enforce ES >= 6.1. If so,
> one can use  Composite aggregations
> <
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> >
> which don't have these shortcomings.
> What we can do now is have separate aggregation query for ES2 and ES6.
>
> What do you think?
>
> Regards,
> Andrei.
>
>
>
> On Thu, Dec 13, 2018 at 6:16 AM Stamatis Zampetakis 
> wrote:
>
> > Hi Andrei,
> >
> > I haven't gone entirely over the new PR, but I think there are cases
> where
> > the result of the queries are going to be wrong (when values collide with
> > sentinels).
> > Another approach would be to introduce a rule in order to push the
> > aggregation in ElasticSearch *only* if the field in questions is *not*
> > nullable. This can make
> > queries a bit more verbose but it guarantees that there will be no
> suprises
> > to the end-user.
> >
> > select max(amount), date from orders group by amount, date -> Aggregation
> > not pushed on ES
> > select max(amount), date from orders where amount is not null and date is
> > not null group by amount, date -> Aggregation pushed on ES
> >
> > Sorry for the late reply!
> >
> > Best,
> > Stamatis
> >
> >
> > Στις Κυρ, 2 Δεκ 2018 στις 2:01 π.μ., ο/η Julian Hyde 
> > έγραψε:
> >
> > > Interesting.
> > >
> > > One of the benefits that a SQL layer such as Calcite can bring is that
> it
> > > hides the details necessary to make operations like this work.
> > >
> > > Julian
> > >
> > >
> > > > On Dec 1, 2018, at 5:22 AM, Kevin Risden  wrote:
> > > >
> > > > I haven't had a chance to review but saw that Elastic has the same
> > issue
> > > > with aggregations.
> > > >
> > > > https://github.com/elastic/elasticsearch/issues/35745
> > > >
> > > > Kevin Risden
> > > >
> > > > On Wed, Nov 28, 2018, 20:46 Andrei Sereda  > > >
> > > >> Greetings ,
> > > >> We have discovered an issue with ES aggregations when grouping on
> > > >> non-textual fields (date, long). Currently the following sql fails
> > > because
> > > >> for missing value
> > > >> <
> > > >>
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_missing_value_13
> > > >>>
> > > >> we inject __MISSING__ sentinel which is not date / number parseable
> > (it
> > > >> can’t be null either) :
> > > >>
> > > >> select max(amount), date from orders group by date -- special ES
> type
> > > >>
> > > >> The solution is to make sentinel type specific :
> > > >>
> > > >>   1. Integer.MIN_VALUE for integers
> > > >>   2. -12-31 for dates etc.
> > > >>
> > > >> For low cardinality types like boolean, byte, short I’m not sure
> what
> > > to do
> > > >> since there is high probability of collision between missing field
> and
> > > >> actual value (eg. what value to choose for missing boolean?) :
> > > >>
> > > >> select max(amount), isActive from orders group by isActive --
> boolean
> > > type
> > > >>
> > > >> Let me know if you solved this problem differently before. Composite
> > > >> aggregations
> > > >> <
> > > >>
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> > > >>>
> > > >> (available since 6.1) should help in future.
> > > >>
> > > >> PR: https://github.com/apache/calcite/pull/946
> > > >> JIRA: https://issues.apache.org/jira/browse/CALCITE-2689
> > > >>
> > > >> Many Thanks in Advance,
> > > >> Andrei.
> > > >>
> > >
> > >
> >
>


[jira] [Created] (CALCITE-2741) extend the native enummerable implemention of calcite to support Hive SQL

2018-12-14 Thread Lai Zhou (JIRA)
Lai Zhou created CALCITE-2741:
-

 Summary: extend the native  enummerable implemention of calcite to 
support Hive SQL 
 Key: CALCITE-2741
 URL: https://issues.apache.org/jira/browse/CALCITE-2741
 Project: Calcite
  Issue Type: New Feature
  Components: core
Reporter: Lai Zhou
Assignee: Julian Hyde


[~julianhyde],

I extended the native enummerable implemention of calcite to support Hive sql 
,include UDF、UDAF and all the SqlSpecialOperator,which inspired by apache 
Drills.

I modified the parser,type systems,and bridge the hive operator .

How do you think of supporting a direct implemention of hive sql like this?

I think it will be valueable when someone want to migrate his hive etl jobs to 
real-time scene.I opened a new issure to disccuss it, please see



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)