Re: question about implement a scalarFunction whose argument type is date or timestamp

2015-12-09 Thread Droopy Hoo
Currently, null values in Date or Timestamp columns cannot be queried.

The runtime exception is "cannot convert null to long".

Would using Integer or Long to represent Date and Timestamp be a better
choice?
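The two-overload workaround discussed in this thread can be sketched as follows; the class and method names are illustrative, not from the original post. Calcite represents a DATE internally as an int counting days since 1970-01-01:

```java
// Hypothetical UDF sketch: the int overload matches Calcite's internal
// DATE representation (days since epoch), while the java.sql.Date
// overload declares the SQL-level parameter type.
import java.time.LocalDate;

public class MyFunc {
  // Called by generated code when Calcite passes the raw int.
  public String eval(int daysSinceEpoch) {
    return LocalDate.ofEpochDay(daysSinceEpoch).toString();
  }

  // Declares the SQL DATE signature.
  public String eval(java.sql.Date date) {
    return date.toLocalDate().toString();
  }
}
```

Registering both overloads lets the validator resolve the SQL DATE signature while the generated code calls the int version.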

2015-11-20 4:41 GMT+08:00 Julian Hyde :

> You are definitely seeing a bug. You should be able to define a UDF with a
> java.sql.Date argument and have Calcite call it.
>
> I think the cause is that Calcite uses int internally to represent dates,
> and it is not correctly inserting code to translate int to Date. You have
> discovered a clever workaround — Calcite sees the Date version of your
> function, and knows that the parameter type is a SQL DATE, then generates
> code that uses the int version of the function.
>
> Can you please log a bug for this? We will try to fix for 1.6.
>
> Julian
>
>
> > On Nov 19, 2015, at 12:29 AM, DroopyHoo  wrote:
> >
> > Hi,
> >
> > I have met a problem when using calcite 1.4 to implement a
> scalarFunction for my udf.
> >
> > the field type is date, and I want to implement MY_FUNC(date):
> >
> > SQL: select MY_FUNC(date) from table;
> >
> > if I set my function's argument type to java.sql.Date, calcite will
> report Exception:
> >
> > Caused by: org.codehaus.commons.compiler.CompileException: Line 124,
> Column 45: No applicable constructor/method found for actual parameters
> "int";
> >
> >
> > if I set the function's argument type to  int, calcite will report
> Exception:
> >
> > Caused by: org.apache.calcite.sql.validate.SqlValidatorException: No
> match found for function signature MY_FUNCTION()
> >
> >
> > For now, I have implemented both functions (one taking int, the other
> taking Date), and it works, but I wonder whether there is a better
> approach. By the way, must the date type be transformed to integer when
> processing the query?
> >
> > --
> > ---
> > Wei Hu
> >
>
>


Re: Changes for Grouping Sets in Calcite

2015-12-09 Thread Jesus Camacho Rodriguez
Thanks for your quick reply Julian.

Although I do not feel too strongly about one choice or another, I think we 
could use a single type, as it will simplify the GROUPING_ID and GROUPING 
functions implementation. However, I agree with you that we cannot limit the 
number of grouping columns. Thus, using BINARY as you suggested seems a logical 
choice.

Wrt Aggregate output, I was thinking of implementing it as [k0, k1, k2, g, 
agg1, agg2]; the only reason is that Hive implements it that way. However, I 
would like to understand why you think [g, k0, k1, k2, agg1, agg2] could be a 
better choice; I did not fully understand it. What do you mean by "fail 
fast"? And by discovering rules that do not handle GROUPING_ID properly?

--
Jesús
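For reference, the relationship between a single GROUPING_ID value and the per-column GROUPING(i) bits can be sketched as follows (the helper names are hypothetical, not Calcite API):

```java
// Hypothetical helpers: GROUPING_ID packs one indicator bit per group
// key into a single number; GROUPING(i) extracts the i'th bit. As in
// the thread: 1 = aggregated (key absent from the grouping set),
// 0 = key present.
public class GroupingBits {
  // Pack indicators [i0, i1, ...] into one value, i0 most significant.
  static long groupingId(boolean... aggregated) {
    long g = 0;
    for (boolean b : aggregated) {
      g = (g << 1) | (b ? 1L : 0L);
    }
    return g;
  }

  // GROUPING(i) for n group keys: the i'th bit of g.
  static int grouping(long g, int i, int n) {
    return (int) ((g >> (n - 1 - i)) & 1L);
  }
}
```

For GROUP BY CUBE (a, b), the four grouping sets {a,b}, {a}, {b}, {} would yield g values 0, 1, 2, and 3 under this encoding.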




On 12/8/15, 8:16 PM, "Julian Hyde"  wrote:

>I am in favor of this change. The information in one GROUPING_ID column is the 
>same as the information in n GROUPING(i) columns, but it is easier to manage 
>one column than several.
>
>I screwed up when implementing 
>https://issues.apache.org/jira/browse/CALCITE-370. It would have been a lot 
>cheaper to fix this if we had realized during the review process that 
>indicator columns would be difficult to handle.
>
>It is important that we do not use a representation that limits the number of 
>grouping columns. BIGINT contains 64 bits, and therefore allows 64 grouping 
>columns, but that is not enough. We could use a representation that allows an 
>unlimited number of bits, say BINARY(ceil(n / 8)). Or we could simply state 
>that the client should not assume any particular data type: for 1 grouping 
>column the type might be TINYINT, for 10,000 grouping columns the type might 
>be BLOB. The GROUPING_ID function will convert the internal type to a number, 
>and the GROUPING(i) functions will extract the i’th bit.
>
>We will need to re-organize the output row-type of Aggregate. For an aggregate 
>with 3 group keys [k0, k1, k2], grouping sets (i.e. indicator = true), and 2 
>aggregate functions [agg1, agg2], the output is currently [k0, k1, k2, i0, i1, 
>i2, agg1, agg2] where [i0, i1, i2] are the indicator columns. Should the new 
>output be [k0, k1, k2, g, agg1, agg2] or should it be [g, k0, k1, k2, agg1, 
>agg2]? (In both cases g is the grouping_id column, which concatenates the 
>indicator columns into one value.) The latter is more different from the 
>current output format, but since the grouping_id column appears first it will 
>“fail fast”. We will discover more quickly that a particular rule is not 
>handling grouping_id properly.
>
>Julian
>
>
>
>> On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez 
>>  wrote:
>> 
>> Hi,
>> 
>> I wanted to open a thread with some conversation about changes in Grouping 
>> Sets implementation in Calcite.
>> 
>> Grouping sets are currently implemented in Calcite using a bit to indicate 
>> each of the grouping columns. For instance, consider the following group by 
>> clause:
>> 
>> ...
>> GROUP BY CUBE (a, b)
>> ...
>> 
>> The generated Aggregate operator in Calcite will have a row schema 
>> consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents 
>> whether x is participating in the group by clause. In particular, GROUPING 
>> returns 1 for aggregated or 0 for not aggregated in the result set.
>> 
>> In contrast, Hive's implementation stores a single number corresponding to 
>> the GROUPING bit vector associated with a row (this is the result of the 
>> GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the 
>> row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
>> 
>> This difference is creating a mismatch between Calcite and Hive. Till now, 
>> we have solved it in the Hive side: we created our own GROUPING_ID function 
>> applied over those columns. However, we have some issues related to 
>> predicate pushdown, constant propagation, etc., that we need to continue 
>> solving as e.g. new rules are added to our optimizer. In short, this is 
>> making the code on the Hive side harder and harder to maintain.
>> 
>> We were wondering if it would make sense to change the implementation on the 
>> Calcite side. Which systems are using Calcite's grouping sets currently? Are 
>> you in favor/against this change? Why?
>> 
>> Thanks,
>> Jesús
>> 
>
>


Using Calcite for Apache Flink's higher-level languages

2015-12-09 Thread Stephan Ewen
Hi Calcite Folks!

The Apache Flink community is currently looking into how to use Calcite for
optimization of both batch and streaming programs.

We are looking to compile two different kinds of higher level APIs via
Calcite to Flink's APIs:
  - Table API (a LINQ-style DSL)
  - SQL

Our current thought is to use largely the same translation paths for both,
with different entry points into Calcite:
  - the Table API creates a Calcite operator tree directly
  - the SQL interface goes through the full stack, including parser, ...


From what I have seen so far in Calcite, it looks pretty promising, with
its configurable and extensible rule set, and the pluggable schema/metadata
providers.

A few questions remain for us, to see how feasible this is:

1) Are DAG programs supported? The table API produces operator DAGs, rather
than pure trees. Do DAGs affect/limit the space explored by the optimizer
engine?

2) For streaming programs, we will probably want to add some custom syntax,
specific to Flink's windows. Is it possible to also customize the SQL
dialect of the parser?

These answers are quite crucial for us to figure out how best to use Calcite
in our designs. Thanks for helping us!

Greetings,
Stephan


Re: Changes for Grouping Sets in Calcite

2015-12-09 Thread Julian Hyde
By “fail fast” I meant discovering rules that did not handle GROUPING_ID 
properly, without explicitly writing extra tests. By putting “g” at the start, 
all key and aggregate columns would be off-by-one, so it maximizes the chance 
of finding the problem (usually an assert due to a datatype mismatch, or data 
correctness error). If you put “g” after the key columns, only the aggregate 
columns will be off.

But since Hive puts it after the key columns, we would also have a greater 
chance of screwing up Hive. So I don’t think “fail fast” is such a good idea.
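The off-by-one effect described above can be illustrated with a toy row layout (illustrative only, not Calcite code): a rule that still assumes keys start at field 0 hits a datatype mismatch immediately when g comes first.

```java
import java.util.List;

// Toy illustration: with the new layout [g, k0, k1, k2, agg1], a stale
// rule that reads field 0 expecting a key (String) finds the
// grouping_id (Long) instead, so the bug surfaces immediately rather
// than silently reading shifted data.
public class RowLayouts {
  static Object firstFieldOldAssumption(List<Object> row) {
    return row.get(0);  // stale assumption: field 0 is the first key
  }
}
```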

Julian
 

> On Dec 9, 2015, at 4:13 AM, Jesus Camacho Rodriguez 
>  wrote:
> 
> Thanks for your quick reply Julian.
> 
> Although I do not feel too strongly about one choice or another, I think we 
> could use a single type, as it will simplify the GROUPING_ID and GROUPING 
> functions implementation. However, I agree with you that we cannot limit the 
> number of grouping columns. Thus, using BINARY as you suggested seems a 
> logical choice.
> 
> Wrt Aggregate output, I was thinking of implementing it as [k0, k1, k2, g, 
> agg1, agg2]; the only reason is that Hive implements it that way. However, I 
> would like to understand why you think [g, k0, k1, k2, agg1, agg2] could be a 
> better choice; I did not fully understand it. What do you mean by "fail 
> fast"? And by discovering rules that do not handle GROUPING_ID properly?
> 
> --
> Jesús
> 
> 
> 
> 
> On 12/8/15, 8:16 PM, "Julian Hyde"  wrote:
> 
>> I am in favor of this change. The information in one GROUPING_ID column is 
>> the same as the information in n GROUPING(i) columns, but it is easier to 
>> manage one column than several.
>> 
>> I screwed up when implementing 
>> https://issues.apache.org/jira/browse/CALCITE-370. It would have been a lot 
>> cheaper to fix this if we had realized during the review process that 
>> indicator columns would be difficult to handle.
>> 
>> It is important that we do not use a representation that limits the number 
>> of grouping columns. BIGINT contains 64 bits, and therefore allows 64 
>> grouping columns, but that is not enough. We could use a representation that 
>> allows an unlimited number of bits, say BINARY(ceil(n / 8)). Or we could 
>> simply state that the client should not assume any particular data type: for 
>> 1 grouping column the type might be TINYINT, for 10,000 grouping columns the 
>> type might be BLOB. The GROUPING_ID function will convert the internal type 
>> to a number, and the GROUPING(i) functions will extract the i’th bit.
>> 
>> We will need to re-organize the output row-type of Aggregate. For an 
>> aggregate with 3 group keys [k0, k1, k2], grouping sets (i.e. indicator = 
>> true), and 2 aggregate functions [agg1, agg2], the output is currently [k0, 
>> k1, k2, i0, i1, i2, agg1, agg2] where [i0, i1, i2] are the indicator 
>> columns. Should the new output be [k0, k1, k2, g, agg1, agg2] or should it 
>> be [g, k0, k1, k2, agg1, agg2]? (In both cases g is the grouping_id column, 
>> which concatenates the indicator columns into one value.) The latter is more 
>> different from the current output format, but since the grouping_id column 
>> appears first it will “fail fast”. We will discover more quickly that a 
>> particular rule is not handling grouping_id properly.
>> 
>> Julian
>> 
>> 
>> 
>>> On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez 
>>>  wrote:
>>> 
>>> Hi,
>>> 
>>> I wanted to open a thread with some conversation about changes in Grouping 
>>> Sets implementation in Calcite.
>>> 
>>> Grouping sets are currently implemented in Calcite using a bit to indicate 
>>> each of the grouping columns. For instance, consider the following group by 
>>> clause:
>>> 
>>> ...
>>> GROUP BY CUBE (a, b)
>>> ...
>>> 
>>> The generated Aggregate operator in Calcite will have a row schema 
>>> consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) 
>>> represents whether x is participating in the group by clause. In 
>>> particular, GROUPING returns 1 for aggregated or 0 for not aggregated in 
>>> the result set.
>>> 
>>> In contrast, Hive's implementation stores a single number corresponding to 
>>> the GROUPING bit vector associated with a row (this is the result of the 
>>> GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the 
>>> row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
>>> 
>>> This difference is creating a mismatch between Calcite and Hive. Till now, 
>>> we have solved it in the Hive side: we created our own GROUPING_ID function 
>>> applied over those columns. However, we have some issues related to 
>>> predicate pushdown, constant propagation, etc., that we need to continue 
>>> solving as e.g. new rules are added to our optimizer. In short, this is 
>>> making the code on the Hive side harder and harder to maintain.
>>> 
>>> We were wondering if it would make sense to change the implementation on 
>>> the Calcite side.

Re: compilation problem

2015-12-09 Thread Pengcheng Xiong
Thanks Josh! It works.

On Tue, Dec 8, 2015 at 9:00 PM, Josh Elser  wrote:

> Hi Pengcheng,
>
> The guidance is to use `mvn package` to build the code. The way the
> dependencies currently are shared across Calcite maven modules does not
> work with a `mvn compile` alone.
>
> If you run `mvn install` first, you would likely be able to run `mvn
> compile` successfully, but this is likely a step backwards. :)
>
> `mvn package` is the recommended entry point.
>
>
> Pengcheng Xiong wrote:
>
>> Hi all,
>>
>>   Recently I could not compile Calcite using "mvn compile" after cloning
>> a new copy of Calcite, but I can still run "mvn package".
>>
>> [ERROR] Failed to execute goal on project calcite-avatica-server: Could
>> not
>> resolve dependencies for project
>> org.apache.calcite:calcite-avatica-server:jar:1.6.0-SNAPSHOT: Failure to
>> find org.apache.calcite:calcite-avatica:jar:tests:1.6.0-SNAPSHOT in
>> http://conjars.org/repo was cached in the local repository, resolution
>> will
>> not be reattempted until the update interval of conjars has elapsed or
>> updates are forced ->  [Help 1]
>>
>>   There is no org/apache/calcite in
>> http://conjars.org/repo/org/apache/
>>
>>   Has anyone had a similar experience, and may I ask for a solution?
>> Thanks.
>>
>> Best
>> Pengcheng
>>
>>


Re: question about implement a scalarFunction whose argument type is date or timestamp

2015-12-09 Thread Julian Hyde
Calcite uses int, int, long to represent DATE, TIME, TIMESTAMP (respectively) 
if they are NOT NULL, and Integer, Integer, Long to represent them if they are 
nullable.

It seems likely that Calcite is deducing that the return type of the UDF is 
DATE NOT NULL. This is incorrect: it should be nullable DATE.
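Under that representation rule, a UDF that needs to accept NULL dates would declare the boxed type; a minimal sketch (names are illustrative, not from the thread):

```java
import java.time.LocalDate;

// Hypothetical UDF: a nullable DATE argument arrives in generated code
// as Integer (days since epoch) rather than int, so null can be checked
// and propagated instead of triggering "cannot convert null to long".
public class MyDateFunc {
  public String eval(Integer daysSinceEpoch) {
    if (daysSinceEpoch == null) {
      return null;  // SQL NULL in, SQL NULL out
    }
    return LocalDate.ofEpochDay(daysSinceEpoch).toString();
  }
}
```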

I have logged a feature request 
https://issues.apache.org/jira/browse/CALCITE-1013 to allow you to say that a 
UDF returns NOT NULL values.

But please also log this bug.

Julian



> On Dec 9, 2015, at 12:35 AM, Droopy Hoo  wrote:
> 
> Currently, null values in Date or Timestamp columns cannot be queried.
> 
> The runtime exception is "cannot convert null to long".
> 
> Would using Integer or Long to represent Date and Timestamp be a better
> choice?
> 
> 2015-11-20 4:41 GMT+08:00 Julian Hyde :
> 
>> You are definitely seeing a bug. You should be able to define a UDF with a
>> java.sql.Date argument and have Calcite call it.
>> 
>> I think the cause is that Calcite uses int internally to represent dates,
>> and it is not correctly inserting code to translate int to Date. You have
>> discovered a clever workaround — Calcite sees the Date version of your
>> function, and knows that the parameter type is a SQL DATE, then generates
>> code that uses the int version of the function.
>> 
>> Can you please log a bug for this? We will try to fix for 1.6.
>> 
>> Julian
>> 
>> 
>>> On Nov 19, 2015, at 12:29 AM, DroopyHoo  wrote:
>>> 
>>> Hi,
>>> 
>>> I have met a problem when using calcite 1.4 to implement a
>> scalarFunction for my udf.
>>> 
>>> the field type is date, and I want to implement MY_FUNC(date):
>>> 
>>> SQL: select MY_FUNC(date) from table;
>>> 
>>> if I set my function's argument type to java.sql.Date, calcite will
>> report Exception:
>>> 
>>> Caused by: org.codehaus.commons.compiler.CompileException: Line 124,
>> Column 45: No applicable constructor/method found for actual parameters
>> "int";
>>> 
>>> 
>>> if I set the function's argument type to  int, calcite will report
>> Exception:
>>> 
>>> Caused by: org.apache.calcite.sql.validate.SqlValidatorException: No
>> match found for function signature MY_FUNCTION()
>>> 
>>> 
>>> For now, I have implemented both functions (one taking int, the other
>> taking Date), and it works, but I wonder whether there is a better
>> approach. By the way, must the date type be transformed to integer when
>> processing the query?
>>> 
>>> --
>>> ---
>>> Wei Hu
>>> 
>> 
>> 



Re: A question regarding the INTERVAL_DAY_TIME

2015-12-09 Thread Julian Hyde
What would you expect the following to yield?

  VALUES CAST(INTERVAL '3' HOUR AS INTEGER)

Like all intervals (except year-month intervals) the interval is
internally represented in milliseconds, in this case 10,800,000, but
when converted to a string or a numeric it is measured in hours, i.e.
3.
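The conversion is just a division by the leading unit's multiplier; a sketch (constants illustrative, not Calcite internals):

```java
// CAST(INTERVAL '3' HOUR AS INTEGER): the internal millisecond value is
// divided by the HOUR multiplier (3,600,000 ms), yielding the count of
// hours, so 10,800,000 ms becomes 3.
public class IntervalCast {
  static final long MILLIS_PER_HOUR = 3_600_000L;

  static long castHourIntervalToInteger(long internalMillis) {
    return internalMillis / MILLIS_PER_HOUR;
  }
}
```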

Julian


On Wed, Dec 9, 2015 at 2:05 PM, Pengcheng Xiong  wrote:
> Hi all,
>
> While I was trying to debug something in Hive, I found some code in
> RexBuilder which makes me confused.
>
> L527 says:
>
> {code}
>
> final long multiplier =
>
>   literal.getType().getIntervalQualifier().getStartUnit()
>
>   .multiplier;
> {code}
>
> I saw it was committed
>
> commit 45b58ab07223cbb94c7cbbb9a1434a2b01a9497f
> Author: julianhyde 
> Date:   Tue Jan 7 14:57:17 2014 -0800
>
> Reformat org.eigenbase code.
>
> Could Julian or someone explain how this multiplier is used? It looks
> to me that it is redundant. I do not need that multiplier and "value"
> itself is enough. Thanks!
>
> Best
> Pengcheng


A question regarding the INTERVAL_DAY_TIME

2015-12-09 Thread Pengcheng Xiong
Hi all,

While I was trying to debug something in Hive, I found some code in
RexBuilder which makes me confused.

L527 says:

{code}

final long multiplier =

  literal.getType().getIntervalQualifier().getStartUnit()

  .multiplier;
{code}

I saw it was committed

commit 45b58ab07223cbb94c7cbbb9a1434a2b01a9497f
Author: julianhyde 
Date:   Tue Jan 7 14:57:17 2014 -0800

Reformat org.eigenbase code.

Could Julian or someone explain how this multiplier is used? It looks
to me that it is redundant. I do not need that multiplier and "value"
itself is enough. Thanks!

Best
Pengcheng


Re: A question regarding the INTERVAL_DAY_TIME

2015-12-09 Thread Julian Hyde
Remember that that piece of code is used only for literals. The value
used at run time might be represented differently than how it is
represented in the literal.

That piece of code optimizes casts. So if you write CAST(3 AS INTERVAL
MINUTE) - i.e. convert from an INTEGER literal to an INTERVAL MINUTE
value then it creates an INTERVAL MINUTE literal. What should the
value be? Well, all interval literals are stored in milliseconds, so
the value should be 180,000. Hence the multiplier.
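The reverse direction can be sketched the same way (constants illustrative): converting a numeric to an interval multiplies by the leading unit's multiplier.

```java
// CAST(3 AS INTERVAL MINUTE): the numeric value is multiplied by the
// MINUTE multiplier (60,000 ms), giving the internal literal value
// 180,000 milliseconds.
public class IntervalLiteral {
  static final long MILLIS_PER_MINUTE = 60_000L;

  static long minutesToInternalMillis(long minutes) {
    return minutes * MILLIS_PER_MINUTE;
  }
}
```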

When Hive is converting RexLiteral(INTERVAL_MINUTE, 180000) back to
Hive format it should convert to however Hive represents interval
literals.

Julian


On Wed, Dec 9, 2015 at 3:43 PM, Pengcheng Xiong  wrote:
> Hi Julian,
>
> Thanks a lot for your response. However, I still have two questions
>
> (1) Will milliseconds be enough? Hive would like to support nanoseconds, and
> Vertica also supports microseconds:
> https://my.vertica.com/docs/5.0/HTML/Master/9255.htm
> Thus, might the internal representation be better decided by the user
> rather than fixed at milliseconds?
>
> (2) How about casting in the other direction, i.e., casting a char to an interval?
>
> select interval_day_time('2 1:2:3') from src limit 1;
>
> For example, Hive will translate this into a RexLiteral of
> 176523000.00, which is the millisecond representation of "2 1:2:3".
> Then when it goes through the ReduceExpressionRule, RexReplacer will call
> RexBuilder to make a cast. And at L527 of RexBuilder, the value of
> 176523000.00 will be divided by the multiplier and become the new
> value. However, Hive does not want any multiplier, as the value is already in
> millisecond format... Or is there any way to specify the multiplier as 1?
>
> Thanks.
>
> Best
> Pengcheng
>
>
>
> On Wed, Dec 9, 2015 at 2:18 PM, Julian Hyde  wrote:
>
>> What would you expect the following to yield?
>>
>>   VALUES CAST(INTERVAL '3' HOUR AS INTEGER)
>>
>> Like all intervals (except year-month intervals) the interval is
>> internally represented in milliseconds, in this case 10,800,000, but
>> when converted to a string or a numeric it is measured in hours, i.e.
>> 3.
>>
>> Julian
>>
>>
>> On Wed, Dec 9, 2015 at 2:05 PM, Pengcheng Xiong  wrote:
>> > Hi all,
>> >
>> > While I was trying to debug something in Hive, I found some code in
>> > RexBuilder which makes me confused.
>> >
>> > L527 says:
>> >
>> > {code}
>> >
>> > final long multiplier =
>> >
>> >   literal.getType().getIntervalQualifier().getStartUnit()
>> >
>> >   .multiplier;
>> > {code}
>> >
>> > I saw it was committed
>> >
>> > commit 45b58ab07223cbb94c7cbbb9a1434a2b01a9497f
>> > Author: julianhyde 
>> > Date:   Tue Jan 7 14:57:17 2014 -0800
>> >
>> > Reformat org.eigenbase code.
>> >
>> > Could Julian or someone explain how this multiplier is used? It looks
>> > to me that it is redundant. I do not need that multiplier and "value"
>> > itself is enough. Thanks!
>> >
>> > Best
>> > Pengcheng
>>