Re: Template SQL

2023-04-12 Thread Soumyadeep Mukhopadhyay
Hi Askar,

I have gone through your email a couple of times and am trying to understand
it bit by bit.

I have a request (please feel free to say no): is there any code base I can
refer to in order to understand how to implement my own filter rule?
I also realise I phrased my earlier question wrong: I want to implement
"(HAVING COUNT(*) > 100)" as a global rule that applies whenever a query
contains an aggregation.

It is a bit difficult for me to understand where I should start.
For example, if I follow https://github.com/zabetak/calcite-tutorial, where
should I start looking?
In the image below it looks like some kind of comparison is going on; should
I structure my code like that (in order to implement something like "(HAVING
COUNT(*) > 100)")?
[image: Screenshot 2023-04-11 at 5.38.43 PM.png]

Also, is there any documentation I can go through on how to traverse the AST?
I am probably not looking in the right places, but so far I have only gone
through Tab9 code examples (and the documentation provided on the Calcite
website), and things did not seem clear.
I realise this is probably a lot to ask, so whatever you can share will be a
great help to me.
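
To make the question concrete, here is a toy sketch of the recursive
Visitor-pattern traversal I have in mind. The Node/Visitor classes below are
invented purely for illustration; they are not Calcite's actual
SqlNode/SqlShuttle or RelNode/RelShuttle API:

```java
// Toy tree and visitor, only to illustrate the recursive traversal idea.
// None of these classes exist in Calcite.
interface Node {
    <R> R accept(Visitor<R> visitor);
}

interface Visitor<R> {
    R visitScan(Scan scan);
    R visitAggregate(Aggregate agg);
    R visitJoin(Join join);
}

class Scan implements Node {
    final String table;
    Scan(String table) { this.table = table; }
    public <R> R accept(Visitor<R> v) { return v.visitScan(this); }
}

class Aggregate implements Node {
    final Node input;
    Aggregate(Node input) { this.input = input; }
    public <R> R accept(Visitor<R> v) { return v.visitAggregate(this); }
}

class Join implements Node {
    final Node left, right;
    Join(Node left, Node right) { this.left = left; this.right = right; }
    public <R> R accept(Visitor<R> v) { return v.visitJoin(this); }
}

// Answers: does the tree contain any aggregation? That is the situation
// where a global "(HAVING COUNT(*) > 100)" rule would need to kick in.
class HasAggregate implements Visitor<Boolean> {
    public Boolean visitScan(Scan s) { return false; }
    public Boolean visitAggregate(Aggregate a) { return true; }
    public Boolean visitJoin(Join j) {
        return j.left.accept(this) || j.right.accept(this);
    }
}

public class VisitorDemo {
    public static void main(String[] args) {
        Node plan = new Join(new Aggregate(new Scan("customer")),
                             new Scan("orders"));
        System.out.println(plan.accept(new HasAggregate())); // prints "true"
    }
}
```

If this shape is roughly right, my next step would be to make the visitor
rewrite the aggregate node instead of merely detecting it.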

Thanks again for your time, patience and help!

With regards,
Soumyadeep Mukhopadhyay.

On Fri, Apr 7, 2023 at 8:33 PM Askar Bozcan  wrote:

> Hey,
> You can use Planner.reset(). Note that it has mutable state inside, so do
> not reuse the same planner instance in any multi-threading environment.
>
> (I am assuming you have access to table metadata so that you will be able
> to convert your SqlNode tree into RelNode tree, relational expression tree)
> - Only return results above a certain threshold when using GROUP BY, for
> example (HAVING COUNT(col1) > 100).
>
> I'm not quite sure I understand this question: is the HAVING clause part of
> the query?
>
> - Restrict the column on which joins can happen, or else throw an error
> (almost like analysisException in Spark)
>
> Do you have access to table/schema metadata?
> If you do:
> 1) Convert your parsed syntax tree (SqlNode) into a logical relational tree
> (RelNode).
> Watch this tutorial by Stamatis:
> https://www.youtube.com/watch?v=p1O3E33FIs8.
> It explains the usage far better than I can in an email. (And it is what
> introduced me to Calcite's basics :))
> 2) Traverse your relational tree by implementing RelShuttle
> <
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/RelShuttle.html
> >.
> (Look up the Visitor pattern to better understand how this recursive
> traversal works.)
> 3) Ignore any RelNodes that are not LogicalJoin.
> 4) When you encounter a LogicalJoin, traverse its children with getLeft and
> getRight.
> 5) When you encounter a LogicalTableScan, you can get its RelOptTable with
> getTable, and then RelOptTable.getRowType() to find the fields.
> (Not 100% sure about step 5, but it should be close.)
>
>
> - Restrict the columns that can be passed in a select statement or else
> throw an error (like in the point above)
> Same logic as above. But instead of TableScan, look at the root
> LogicalProject.
>
> All of the above should be doable with just the parse tree (SqlNode) and
> without access to table metadata, using SqlShuttle; however, it's easier and
> less error-prone with the relational tree, IMO.
>
> - Askar
>
>
> On Thu, 6 Apr 2023 at 23:31, Soumyadeep Mukhopadhyay <
> soumyamy...@gmail.com>
> wrote:
>
> > Hey Askar,
> >
> > It worked exactly how you suggested.
> >
> > ```
> >
> > private val sqlQuery: String = "SELECT list_of_columns FROM table_name
> > WHERE predicate_column = 'predicate_value'"
> > private val frameworkConfig: FrameworkConfig =
> > Frameworks.newConfigBuilder.build()
> > private val planner: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val planner2: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val planner3: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val sqlNode: SqlNode = planner.parse(sqlQuery)
> > println(sqlNode.getKind)
> > private val sqlSelectStmt: SqlSelect = sqlNode.asInstanceOf[SqlSelect]
> >
> > private val setSelectColumnsQuery = "SELECT age"
> > private val selectList =
> > planner2.parse(setSelectColumnsQuery).asInstanceOf[SqlSelect].getSelectList
> > private val setFromTableQuery = "SELECT employee"
> > private val fromTable =
> > planner3.parse(setFromTableQuery).asInstanceOf[SqlSelect].getSelectList
> >
> > sqlSelectStmt.setSelectList(selectList)
> > sqlSelectStmt.setFrom(fromTable)
> > private val finalQuery = sqlSelectStmt.asInstanceOf[SqlNode]
> > println(finalQuery.toSqlString(SnowflakeSqlDialect.DEFAULT))
> >
> > ```
> >
> > The only caveat I see is the need for a new planner for every new query.
> > Should I do something else or is this expected?
> >
> > On a different note, I wanted to ask about how I can write my own rules.
> Is
> > it possible to enforce following rules:
> > - Only return results above a certain threshold when using GROUP BY, for
> > example (HAVING COUNT(col1) > 100).
> > - Restrict the column on 

[jira] [Created] (CALCITE-5644) Implement BigQuery CONTAINS_SUBSTR

2023-04-12 Thread Tanner Clary (Jira)
Tanner Clary created CALCITE-5644:
-

 Summary: Implement BigQuery CONTAINS_SUBSTR
 Key: CALCITE-5644
 URL: https://issues.apache.org/jira/browse/CALCITE-5644
 Project: Calcite
  Issue Type: Task
Reporter: Tanner Clary


BigQuery offers the 
[CONTAINS_SUBSTR|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#contains_substr]
 function that returns {{TRUE}} if a substring is present in an expression and 
{{FALSE}} if it is not. A basic example of this may be seen in [1]. 
The expression can take many forms (more info in the linked doc) which makes 
its implementation more complex than other string functions that only accept 
arguments of type {{STRING}}. For instance, the expression to be searched can 
be a column or table reference. 
The function also has an optional third argument called {{json_scope}} where 
the user can indicate the scope of JSON data (keys, values, or both) to be 
searched for the substring.
 I am curious if anyone has thoughts on how the search of rows or tables could 
be implemented. I have a basic implementation (that supports expressions of 
type {{STRING}} and nothing else) that I will open a draft PR for as a starting 
point. To me, the challenge is implementing the additional features like the 
{{JSON_SCOPE}} argument (seen in [2]) and performing a cross field search as 
seen in [3]. 

[1] {{SELECT CONTAINS_SUBSTR("hello", "he");}} would return {{TRUE}}. 

[2] 
{{SELECT *
FROM Recipes
WHERE CONTAINS_SUBSTR(
  (SELECT AS STRUCT Recipes.* EXCEPT (Lunch, Dinner)),
  'potato'
);}} would return: 
/*-----------------+-------------------------+-----------------*
 | Breakfast       | Lunch                   | Dinner          |
 +-----------------+-------------------------+-----------------+
 | Potato pancakes | Toasted cheese sandwich | Beef stroganoff |
 *-----------------+-------------------------+-----------------*/
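
For the simple case where the expression is a plain {{STRING}}, the intended
semantics could be sketched like this (a hypothetical helper, not the draft
PR's code; per the linked doc the search is case-insensitive, approximated
here with NFKC normalization plus lower-casing):

```java
import java.text.Normalizer;

public class ContainsSubstrDemo {
    // Hypothetical helper sketching CONTAINS_SUBSTR for STRING arguments
    // only; this is NOT the Calcite implementation.
    static boolean containsSubstr(String expr, String substr) {
        return normalize(expr).contains(normalize(substr));
    }

    // Case-insensitive comparison after Unicode NFKC normalization.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase();
    }

    public static void main(String[] args) {
        // Mirrors example [1]: SELECT CONTAINS_SUBSTR("hello", "he") -> TRUE
        System.out.println(containsSubstr("hello", "he")); // prints "true"
    }
}
```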





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Sharing the load of reviewing PRs

2023-04-12 Thread Stamatis Zampetakis
For a long time this has been one of the main issues of the project
and I am happy to see discussions to address this issue.

I would like to mention that as a contributor, I am, and always have
been very grateful to people reviewing my work.
The fact that I became a committer of this project is mainly due to
Julian and Vladimir Sitnikov who reviewed and merged many of my PRs.
I would definitely like to help and make other contributors feel the
same but I cannot really commit to specific volume and deadlines
spanning several months.

I have the feeling that we don't need the PR manager role. The
assignment work can be done by bots (e.g., [1]) if needed and we
already have our quarterly stats for reporting purposes.
If we want to put a human behind this then it makes more sense for
this person to be the release manager; this should be the one nagging
people for advancing the work and moving towards the release.

Regarding reviews coming from non-committers, I am not sure it's
possible to do the assignment in GitHub. It's not a big deal though;
for me a simple comment that I am going to review would be sufficient.
Alternatively, we could consider adopting an equivalent workflow in
JIRA, potentially introducing a new state "IN REVIEW"; I don't think
it is necessary, though.
No matter the choice, we should ensure that we have a trackable way to
recognise "non-committer" reviewers, but I think both GitHub (e.g.,
"is:pr reviewed-by:julianhyde is:closed") and JIRA offer the necessary
filters.
Other projects tend to include such info in the commit message; we
could also opt for this if we deem it necessary.

As an immediate action I would encourage everyone willing to help to
go to the open PRs on GitHub and either assign some PRs to themselves
(in case of committers) or leave a comment about their intention
(non-committers).

In the meantime we can iterate on this till we reach consensus.

Best,
Stamatis

[1] https://github.com/apps/auto-assign


On Wed, Apr 12, 2023 at 10:49 AM Ruben Q L  wrote:
>
> Hello,
>
> I understand Julian's frustration. We all know that reviewing PRs is a
> recurring problem, and it is not the first time we discuss potential
> solutions, see e.g. the discussion a year ago [1] (also started by Julian)
> where several ideas were mentioned: automatic assignment, emulate the RM
> process onto the reviewing process (quite similar to the current proposal),
> ...but in the end no change was implemented, and the problem remains.
>
> I agree that something must be done in order to reverse the situation.
>
> In my opinion, one of the main factors is that the vast majority of PRs
> (even the ones that get merged) are never assigned. This lack of assignee
> can be seen as if the PR is "no-one's responsibility", so we should try
> somehow to assign each PR to someone, and make that person accountable for
> the PR's progression.
> I think we could try Julian's idea of having a pool of reviewers and a PR
> manager (taken from the pool, rotating this position every month or every
> two months). Personally, I would not set hard deadlines (e.g. something
> must be done within 3 days), because we are all volunteers and, even if we
> are all trying to do our best here, it may happen that a certain week we
> are busy with other personal or professional stuff. In the end, I think it
> should be the PR manager's responsibility to ping the assigned reviewer if
> a PR is not progressing after a reasonable period of time, ask them for an
> update, maybe even involve a different reviewer / re-assign the PR as a
> last resort.
>
> Of course, it must remain clear that, even if we implement this approach,
> anyone is still free (and encouraged) to participate in any PR review. Even
> if someone is not the assigned reviewer, they can chime in and contribute
> to the review nevertheless.
> Also, I think another sensible rule would be: if someone from the reviewers
> pool submits a PR, the PR manager will need to assign it to a different
> person.
>
> One last comment, I have the impression that with this initiative we would
> be moving towards a "better done than perfect" approach. Calcite is a vast
> project, with many different modules, and it could happen (it *will*
> happen) that a certain reviewer gets assigned a PR concerning a part of the
> project that they are not familiar with. Of course, one way of becoming
> (progressively) familiar with unknown parts of the project is by reviewing
> this type of PRs, but that takes time. I guess it would be possible for the
> assignee to try to ping and involve other reviewers with more experience in
> that area, but at the end of the day, it would be the assignee's
> responsibility to review and merge some piece of code that might be a bit
> obscure to them. This might lead to suboptimal or even incorrect code being
> inadvertently merged into the main branch. This is not a catastrophe (it
> can already happen with the current approach), and we will detect and
> correct these 

Re: [DISCUSS] Sharing the load of reviewing PRs

2023-04-12 Thread Ruben Q L
Hello,

I understand Julian's frustration. We all know that reviewing PRs is a
recurring problem, and it is not the first time we discuss potential
solutions, see e.g. the discussion a year ago [1] (also started by Julian)
where several ideas were mentioned: automatic assignment, emulate the RM
process onto the reviewing process (quite similar to the current proposal),
...but in the end no change was implemented, and the problem remains.

I agree that something must be done in order to reverse the situation.

In my opinion, one of the main factors is that the vast majority of PRs
(even the ones that get merged) are never assigned. This lack of assignee
can be seen as if the PR is "no-one's responsibility", so we should try
somehow to assign each PR to someone, and make that person accountable for
the PR's progression.
I think we could try Julian's idea of having a pool of reviewers and a PR
manager (taken from the pool, rotating this position every month or every
two months). Personally, I would not set hard deadlines (e.g. something
must be done within 3 days), because we are all volunteers and, even if we
are all trying to do our best here, it may happen that a certain week we
are busy with other personal or professional stuff. In the end, I think it
should be the PR manager's responsibility to ping the assigned reviewer if
a PR is not progressing after a reasonable period of time, ask them for an
update, maybe even involve a different reviewer / re-assign the PR as a
last resort.

Of course, it must remain clear that, even if we implement this approach,
anyone is still free (and encouraged) to participate in any PR review. Even
if someone is not the assigned reviewer, they can chime in and contribute
to the review nevertheless.
Also, I think another sensible rule would be: if someone from the reviewers
pool submits a PR, the PR manager will need to assign it to a different
person.

One last comment, I have the impression that with this initiative we would
be moving towards a "better done than perfect" approach. Calcite is a vast
project, with many different modules, and it could happen (it *will*
happen) that a certain reviewer gets assigned a PR concerning a part of the
project that they are not familiar with. Of course, one way of becoming
(progressively) familiar with unknown parts of the project is by reviewing
this type of PRs, but that takes time. I guess it would be possible for the
assignee to try to ping and involve other reviewers with more experience in
that area, but at the end of the day, it would be the assignee's
responsibility to review and merge some piece of code that might be a bit
obscure to them. This might lead to suboptimal or even incorrect code being
inadvertently merged into the main branch. This is not a catastrophe (it
can already happen with the current approach), and we will detect and
correct these mistakes; I'm just mentioning that they might become a bit
more frequent with the proposed approach (and we should all face them with
a constructive and positive attitude). In any case, I have the impression
that with the new idea the pros outweigh the cons, so we could give it a
try.

Best,
Ruben

[1] https://lists.apache.org/thread/30pf1o0vlcn7y3bhlcht1wdmvmxyvghn



On Wed, Apr 12, 2023 at 3:13 AM Chunwei Lei  wrote:

> Thanks Julian for sharing the proposal. I am +1 for it. I have been busy in
> the past few months, so I have only had a quick look at the new JIRA.
> However, I will have more time in the coming months, and I would be more
> than happy to review any pull requests.
>
>
>
> Best,
> Chunwei
>
>
> On Tue, Apr 11, 2023 at 10:22 PM Jiajun Xie 
> wrote:
>
> > Thanks Julian for your idea.
> >  Your plan helps to motivate new contributors.
> >
> > “If there is no response to my PR,
> > I will be disappointed or even give up on continuing to contribute.”
> >
> > I hope that every contributor will be encouraged,
> > and I also hope that the Calcite community will become stronger and
> > stronger.
> >
> > +1, I am willing to join the pool of reviews.
> >
> > On Tue, 11 Apr 2023 at 13:20, Benchao Li  wrote:
> >
> > > Thanks Julian for starting the discussion!
> > >
> > > I'm spending my spare time contributing to Calcite, usually at weekends
> > > and sometimes in breaks during weekdays, hence my time is limited
> > > because the amount of spare time may vary. Review work is not that
> > > simple for me, because Calcite has many complicated components and has
> > > evolved over many years, which means we need to track a lot of
> > > background. I'm still learning some parts while doing the review work.
> > >
> > > The complexity of PRs varies a lot; simple ones are easier to get in
> > > because they only cost me minutes to hours to review. But for the
> > > complex ones, I usually need to spend more time to understand the
> > > background, the new design, the effect on the whole project, and the
> > > future direction we want to take. These kinds of PRs may be preempted 

[jira] [Created] (CALCITE-5643) Query with CROSS JOIN UNNEST(array) executed by BigQuery adaptor isn't supported

2023-04-12 Thread Jia-Xuan Liu (Jira)
Jia-Xuan Liu created CALCITE-5643:
-

 Summary: Query with CROSS JOIN UNNEST(array) executed by BigQuery 
adaptor isn't supported
 Key: CALCITE-5643
 URL: https://issues.apache.org/jira/browse/CALCITE-5643
 Project: Calcite
  Issue Type: Bug
Affects Versions: 1.34.0
Reporter: Jia-Xuan Liu


Given a query like
{code:sql}
WITH
 tmp1 (custkey, name, orders) AS (
   SELECT
 c_custkey
   , c_name
   , array_agg(o_orderkey ORDER BY o_orderkey ASC) orders
   FROM
 (bq.customer
   LEFT JOIN bq.orders ON (o_custkey = c_custkey))
   GROUP BY c_custkey, c_name
) 
SELECT
  t.custkey
, t.name
, array_agg(o.o_orderstatus ORDER BY o.o_orderkey ASC)
FROM
  ((tmp1 t
CROSS JOIN UNNEST(t.orders) u (uc))
LEFT JOIN bq.orders o ON (u.uc = o.o_orderkey))
GROUP BY t.custkey, t.name
{code}
`bq` is a JdbcSchema connected to BigQuery.

It fails with exception:
{code:java}
Caused by: java.lang.IllegalStateException: Unable to implement 
EnumerableAggregate(group=[{0, 1}], EXPR$2=[ARRAY_AGG($5) WITHIN GROUP ([4])]): 
rowcount = 251.2225, cumulative cost = {7580.5628125 rows, 32182.2752199209 
cpu, 0.0 io}, id = 394
  EnumerableMergeJoin(condition=[=($3, $4)], joinType=[left]): rowcount = 
2512.225, cumulative cost = {7297.9375 rows, 32182.2752199209 cpu, 0.0 io}, id 
= 392
EnumerableSort(sort0=[$3], dir0=[ASC]): rowcount = 158.5, cumulative cost = 
{4247.2125 rows, 23621.968885142334 cpu, 0.0 io}, id = 383
  EnumerableCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{2}]): rowcount = 158.5, cumulative cost = {4088.7125 rows, 
1140.15 cpu, 0.0 io}, id = 381
EnumerableAggregate(group=[{0, 1}], orders=[ARRAY_AGG($2) WITHIN GROUP 
([2])]): rowcount = 158.5, cumulative cost = {2281.8125 rows, 680.5 cpu, 0.0 
io}, id = 372
  JdbcToEnumerableConverter: rowcount = 1585.0, cumulative cost = 
{2103.5 rows, 680.5 cpu, 0.0 io}, id = 370
JdbcJoin(condition=[=($3, $0)], joinType=[left]): rowcount = 
1585.0, cumulative cost = {1945.0 rows, 522.0 cpu, 0.0 io}, id = 368
  JdbcProject(c_custkey=[$0], c_name=[$1]): rowcount = 100.0, 
cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 363
JdbcTableScan(table=[[bq, customer]]): rowcount = 100.0, 
cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 1
  JdbcProject(o_orderkey=[$0], o_custkey=[$1]): rowcount = 100.0, 
cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 366
JdbcTableScan(table=[[bq, orders]]): rowcount = 100.0, 
cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 3
EnumerableUncollect: rowcount = 1.0, cumulative cost = 
{2.9004 rows, 2.9004 cpu, 0.0 io}, id = 379
  JdbcToEnumerableConverter: rowcount = 1.0, cumulative cost = 
{1.9001 rows, 1.9001 cpu, 0.0 io}, id = 377
JdbcProject(orders=[$cor0.orders]): rowcount = 1.0, cumulative cost 
= {1.8 rows, 1.8 cpu, 0.0 io}, id = 375
  JdbcValues(tuples=[[{ 0 }]]): rowcount = 1.0, cumulative cost = 
{1.0 rows, 1.0 cpu, 0.0 io}, id = 263
JdbcToEnumerableConverter: rowcount = 100.0, cumulative cost = {280.0 rows, 
8560.306334778565 cpu, 0.0 io}, id = 390
  JdbcSort(sort0=[$0], dir0=[ASC]): rowcount = 100.0, cumulative cost = 
{270.0 rows, 8550.306334778565 cpu, 0.0 io}, id = 388
JdbcProject(o_orderkey=[$0], o_orderstatus=[$2]): rowcount = 100.0, 
cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 386
  JdbcTableScan(table=[[bq, orders]]): rowcount = 100.0, cumulative 
cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 3

at 
org.apache.calcite.adapter.enumerable.EnumerableRelImplementor.implementRoot(EnumerableRelImplementor.java:117)
at 
org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:112)
at 
org.apache.calcite.prepare.CalcitePrepareImpl$CalcitePreparingStmt.implement(CalcitePrepareImpl.java:1159)
at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:324)
at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:220)
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepare2_(CalcitePrepareImpl.java:665)
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepare_(CalcitePrepareImpl.java:519)
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepareSql(CalcitePrepareImpl.java:487)
at 
org.apache.calcite.jdbc.CalciteConnectionImpl.parseQuery(CalciteConnectionImpl.java:236)
at 
org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:621)
at 
org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:677)
at 
org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:157)
... 2 more
Suppressed: java.lang.UnsupportedOperationException
  

[jira] [Created] (CALCITE-5642) Add SHA256, SHA512 functions

2023-04-12 Thread Dan Zou (Jira)
Dan Zou created CALCITE-5642:


 Summary: Add SHA256, SHA512 functions
 Key: CALCITE-5642
 URL: https://issues.apache.org/jira/browse/CALCITE-5642
 Project: Calcite
  Issue Type: New Feature
Reporter: Dan Zou


Add SHA256, SHA512 functions and enable them for BigQuery.

SHA256: Computes the hash of the input using the [SHA-256 
algorithm|https://en.wikipedia.org/wiki/SHA-2]. This function returns 32 bytes.

SHA512: Computes the hash of the input using the [SHA-512 
algorithm|https://en.wikipedia.org/wiki/SHA-2]. This function returns 64 bytes.

See more details in [BigQuery 
Doc|https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#sha256]
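
For reference, the expected behavior can be sketched with the JDK's
MessageDigest (an assumption about how the implementation might delegate,
not actual Calcite code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShaDemo {
    // SHA-256 digest of the UTF-8 bytes of the input; returns 32 bytes.
    static byte[] sha256(String input) {
        return digest("SHA-256", input);
    }

    // SHA-512 digest of the UTF-8 bytes of the input; returns 64 bytes.
    static byte[] sha512(String input) {
        return digest("SHA-512", input);
    }

    static byte[] digest(String algorithm, String input) {
        try {
            return MessageDigest.getInstance(algorithm)
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Both algorithms are required to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256("hello").length); // prints 32
        System.out.println(sha512("hello").length); // prints 64
    }
}
```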


