Re: Template SQL
Hi Askar, I have gone through your email a couple of times and am trying to understand it bit by bit. I have a request (please feel free to say no): is there any code base I can refer to in order to understand how to implement my own filter rule? I realise I phrased my earlier question poorly: what I want is to implement "(HAVING COUNT(*) > 100)" as a global rule that applies whenever there is an aggregation query. It is a bit difficult for me to understand where I should start. For example, if I follow https://github.com/zabetak/calcite-tutorial, where shall I start looking? In the image below it looks like some kind of comparison is going on; should I structure my code like this (in order to implement something like "(HAVING COUNT(*) > 100)")? [image: Screenshot 2023-04-11 at 5.38.43 PM.png] Also, is there any documentation I can go through regarding how to traverse the AST? Probably I am not looking in the right places, but so far I could only go through Tab9 code examples (or the documentation provided by the Calcite website) and things did not seem clear. I realise it is probably a lot to ask, so whatever you share will be a great help. Thanks again for your time, patience and help! With regards, Soumyadeep Mukhopadhyay. On Fri, Apr 7, 2023 at 8:33 PM Askar Bozcan wrote: > Hey, > You can use Planner.reset(). Note that it has mutable state inside, so do > not reuse the same planner instance in any multi-threaded environment. > > (I am assuming you have access to table metadata so that you will be able > to convert your SqlNode tree into a RelNode tree, i.e. a relational expression tree.) > - Only return results above a certain threshold when using GROUP BY, for > example (HAVING COUNT(col1) > 100). > > I'm not quite sure I understand this question. Is HAVING part of the query? > > - Restrict the column on which joins can happen, or else throw an error > (almost like analysisException in Spark) > > Do you have access to table/schema metadata? 
> If you do: > 1) Convert your parsed syntax tree (SqlNode) into a logical relational tree > (RelNode). > Watch this tutorial by Stamatis: > https://www.youtube.com/watch?v=p1O3E33FIs8. > It will explain the usage far better than I can in an email. (And it's what > introduced me to Calcite's basics :)) > 2) Traverse your relational tree by implementing RelShuttle > <https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/RelShuttle.html>. > (Look up the Visitor pattern to better understand how this recursive traversal > works.) > 3) Ignore any RelNodes which are not LogicalJoin. > 4) When you encounter a LogicalJoin, traverse its children with getLeft and > getRight. > 5) When you encounter a LogicalTableScan, you can get its RelOptTable with > getTable, and then RelOptTable.getRowType() to find the fields. > (Not 100% sure about step 5, but it should be close.) > > > - Restrict the columns that can be passed in a select statement or else > throw an error (like in the point above) > > Same logic as above, but instead of TableScan, look at the root > LogicalProject. > > All of the above should be doable with just the parse tree (SqlNode) and > without access to table metadata using SqlShuttle; however, it's easier and > less error-prone with the relational tree IMO. > > - Askar > > > On Thu, 6 Apr 2023 at 23:31, Soumyadeep Mukhopadhyay < > soumyamy...@gmail.com> > wrote: > > > Hey Askar, > > > > It worked exactly how you suggested. 
> >
> > ```
> > private val sqlQuery: String = "SELECT list_of_columns FROM table_name WHERE predicate_column = 'predicate_value'"
> > private val frameworkConfig: FrameworkConfig = Frameworks.newConfigBuilder.build()
> > private val planner: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val planner2: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val planner3: Planner = Frameworks.getPlanner(frameworkConfig)
> > private val sqlNode: SqlNode = planner.parse(sqlQuery)
> > println(sqlNode.getKind)
> > private val sqlSelectStmt: SqlSelect = sqlNode.asInstanceOf[SqlSelect]
> >
> > private val setSelectColumnsQuery = "SELECT age"
> > private val selectList =
> >   planner2.parse(setSelectColumnsQuery).asInstanceOf[SqlSelect].getSelectList
> > private val setFromTableQuery = "SELECT employee"
> > private val fromTable =
> >   planner3.parse(setFromTableQuery).asInstanceOf[SqlSelect].getSelectList
> >
> > sqlSelectStmt.setSelectList(selectList)
> > sqlSelectStmt.setFrom(fromTable)
> > private val finalQuery = sqlSelectStmt.asInstanceOf[SqlNode]
> > println(finalQuery.toSqlString(SnowflakeSqlDialect.DEFAULT))
> > ```
> >
> > The only caveat I see is the need for a new planner for every new query.
> > Should I do something else or is this expected?
> >
> > On a different note, I wanted to ask about how I can write my own rules. Is
> > it possible to enforce the following rules:
> > - Only return results above a certain threshold when using GROUP BY, for
> > example (HAVING COUNT(col1) > 100).
> > - Restrict the column on
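[Editor's note: the RelShuttle traversal described in the reply above (recurse through a join's getLeft/getRight and inspect the table scans at the leaves) is an instance of the Visitor pattern. The following self-contained sketch illustrates the recursion using simplified stand-in classes — these are NOT Calcite's real RelNode/RelShuttle types, just an analogy:]

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the Visitor-pattern traversal that Calcite's
// RelShuttle performs. Node, Join and TableScan below are simplified
// stand-ins, NOT Calcite's real classes.
class ShuttleSketch {
    interface Visitor {
        void visit(TableScan scan);
        void visit(Join join);
    }

    interface Node { void accept(Visitor v); }

    static final class TableScan implements Node {
        final String table;
        TableScan(String table) { this.table = table; }
        public void accept(Visitor v) { v.visit(this); }
    }

    static final class Join implements Node {
        final Node left, right;  // analogous to LogicalJoin.getLeft()/getRight()
        Join(Node left, Node right) { this.left = left; this.right = right; }
        public void accept(Visitor v) { v.visit(this); }
    }

    // Collects the tables reachable under joins, mirroring steps 3-5 of the
    // email: recurse into both join inputs, read table info at the leaves.
    static final class JoinTableCollector implements Visitor {
        final List<String> tables = new ArrayList<>();
        public void visit(TableScan scan) { tables.add(scan.table); }
        public void visit(Join join) {
            join.left.accept(this);
            join.right.accept(this);
        }
    }

    static List<String> tablesUnderJoins(Node root) {
        JoinTableCollector collector = new JoinTableCollector();
        root.accept(collector);
        return collector.tables;
    }
}
```

[With the real API one would extend RelShuttleImpl, override visit(LogicalJoin) and visit(TableScan), and check the fields found via RelOptTable.getRowType() against an allow-list.]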
[jira] [Created] (CALCITE-5644) Implement BigQuery CONTAINS_SUBSTR
Tanner Clary created CALCITE-5644: - Summary: Implement BigQuery CONTAINS_SUBSTR Key: CALCITE-5644 URL: https://issues.apache.org/jira/browse/CALCITE-5644 Project: Calcite Issue Type: Task Reporter: Tanner Clary BigQuery offers the [CONTAINS_SUBSTR|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#contains_substr] function that returns {{TRUE}} if a substring is present in an expression and {{FALSE}} if it is not. A basic example may be seen in [1]. The expression can take many forms (more info in the linked doc), which makes its implementation more complex than other string functions that only accept arguments of type {{STRING}}. For instance, the expression to be searched can be a column or table reference. The function also has an optional third argument called {{json_scope}} where the user can indicate the scope of JSON data (keys, values, or both) to be searched for the substring. I am curious if anyone has thoughts on how the search of rows or tables could be implemented. I have a basic implementation (that supports expressions of type {{STRING}} and nothing else) that I will open a draft PR for as a starting point. To me, the challenge is implementing the additional features like the {{json_scope}} argument (seen in [2]) and performing a cross-field search as seen in [3].

[1] {{SELECT CONTAINS_SUBSTR("hello", "he");}} would return {{TRUE}}.

[2] {{SELECT * FROM Recipes WHERE CONTAINS_SUBSTR( (SELECT AS STRUCT Recipes.* EXCEPT (Lunch, Dinner)), 'potato' );}} would return:

/*-----------------+-------------------------+-----------------*
 | Breakfast       | Lunch                   | Dinner          |
 +-----------------+-------------------------+-----------------+
 | Potato pancakes | Toasted cheese sandwich | Beef stroganoff |
 *-----------------+-------------------------+-----------------*/

-- This message was sent by Atlassian Jira (v8.20.10#820010)
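[Editor's note: for the plain-{{STRING}} case that the draft PR targets, the semantics BigQuery documents — a normalized, case-insensitive substring search — could be sketched roughly as below. This is an illustrative sketch, not Calcite's actual implementation; column/table expressions and {{json_scope}} are not handled:]

```java
import java.text.Normalizer;

class ContainsSubstrSketch {
    // Rough sketch of CONTAINS_SUBSTR for the plain-STRING case only.
    // BigQuery documents it as a normalized, case-insensitive search;
    // NFKC normalization plus lower-casing approximates that here.
    static boolean containsSubstr(String expr, String search) {
        return normalize(expr).contains(normalize(search));
    }

    private static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase();
    }
}
```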
Re: [DISCUSS] Sharing the load of reviewing PRs
For a long time this has been one of the main issues of the project, and I am happy to see discussions to address it. I would like to mention that, as a contributor, I am, and always have been, very grateful to the people reviewing my work. The fact that I became a committer of this project is mainly due to Julian and Vladimir Sitnikov, who reviewed and merged many of my PRs. I would definitely like to help make other contributors feel the same, but I cannot really commit to a specific volume and deadlines spanning several months. I have the feeling that we don't need the PR manager role. The assignment work can be done by bots (e.g., [1]) if needed, and we already have our quarterly stats for reporting purposes. If we want to put a human behind this, then it makes more sense for this person to be the release manager; they should be the one nagging people to advance the work and move towards the release. Regarding reviews coming from non-committers, I am not sure it's possible to do the assignment in GitHub. It's not a big deal though; for me a simple comment saying that someone is going to review would be sufficient. Alternatively, we could consider adopting an equivalent workflow in JIRA and potentially introducing a new state "IN REVIEW"; I don't think it is necessary, though. No matter the choice, we should ensure that we have a trackable way to recognise "non-committer" reviewers, but I think both GitHub (e.g., "is:pr reviewed-by:julianhyde is:closed") and JIRA offer the necessary filters. Other projects tend to include such info in the commit message; we could also opt for this if we deem it necessary. As an immediate action, I would encourage everyone willing to help to go to the open PRs on GitHub and either assign some PRs to themselves (in the case of committers) or leave a comment about their intention (non-committers). In the meantime we can iterate on this till we reach consensus. 
Best, Stamatis [1] https://github.com/apps/auto-assign On Wed, Apr 12, 2023 at 10:49 AM Ruben Q L wrote: > > Hello, > > I understand Julian's frustration. We all know that reviewing PRs is a > recurring problem, and it is not the first time we discuss potential > solutions, see e.g. the discussion a year ago [1] (also started by Julian) > where several ideas were mentioned: automatic assignment, emulate the RM > process onto the reviewing process (quite similar to the current proposal), > ...but in the end no change was implemented, and the problem remains. > > I agree that something must be done in order to revert the situation. > > In my opinion, one of the main factors is that the vast majority of PRs > (even the ones that get merged) are never assigned. This lack of assignee > can be seen as if the PR is "no-one's responsibility", so we should try > somehow to assign each PR to someone, and make that person accountable for > the PR's progression. > I think we could try Julian's idea of having a pool of reviewers and a PR > manager (taken from the pool, rotating this position every month or every > two months). Personally, I would not set hard deadlines (e.g. something > must be done within 3 days), because we are all volunteers and, even if we > are all trying to do our best here, it may happen that a certain week we > are busy with other personal or professional stuff. In the end, I think it > should be the PR manager's responsibility to ping the assigned reviewer if > a PR is not progressing after a reasonable period of time, ask them for an > update, maybe even involve a different reviewer / re-assign the PR as a > last resource. > > Of course, it must remain clear that, even if we implement this approach, > anyone is still free (and encouraged) to participate in any PR review. Even > if someone is not the assigned reviewer, they can chime in and contribute > to the review nevertheless. 
> Also, I think another sensible rule would be: if someone from the reviewers > pool submits a PR, the PR manager will need to assign it to a different > person. > > One last comment, I have the impression that with this initiative we would > be moving towards a "better done than perfect" approach. Calcite is a vast > project, with many different modules, and it could happen (it *will* > happen) that a certain reviewer gets assigned a PR concerning a part of the > project that they are not familiar with. Of course, one way of becoming > (progressively) familiar with unknown parts of the project is by reviewing > this type of PRs, but that takes time. I guess it would be possible for the > assignee to try to ping and involve other reviewers with more experience in > that area, but at the end of the day, it would be the assignee's > responsibility to review and merge some piece of code that might be a bit > obscure to them. This might lead to suboptimal or even incorrect code being > inadvertently merged into the main branch. This is not a catastrophe (it > can already happen with the current approach), and we will detect and > correct these
Re: [DISCUSS] Sharing the load of reviewing PRs
Hello, I understand Julian's frustration. We all know that reviewing PRs is a recurring problem, and it is not the first time we have discussed potential solutions; see e.g. the discussion a year ago [1] (also started by Julian) where several ideas were mentioned: automatic assignment, emulating the RM process for the reviewing process (quite similar to the current proposal), ...but in the end no change was implemented, and the problem remains. I agree that something must be done in order to reverse the situation. In my opinion, one of the main factors is that the vast majority of PRs (even the ones that get merged) are never assigned. This lack of an assignee can be seen as if the PR is "no one's responsibility", so we should try somehow to assign each PR to someone, and make that person accountable for the PR's progression. I think we could try Julian's idea of having a pool of reviewers and a PR manager (taken from the pool, rotating this position every month or every two months). Personally, I would not set hard deadlines (e.g. something must be done within 3 days), because we are all volunteers and, even if we are all trying to do our best here, it may happen that a certain week we are busy with other personal or professional stuff. In the end, I think it should be the PR manager's responsibility to ping the assigned reviewer if a PR is not progressing after a reasonable period of time, ask them for an update, and maybe even involve a different reviewer / re-assign the PR as a last resort. Of course, it must remain clear that, even if we implement this approach, anyone is still free (and encouraged) to participate in any PR review. Even if someone is not the assigned reviewer, they can chime in and contribute to the review nevertheless. Also, I think another sensible rule would be: if someone from the reviewers pool submits a PR, the PR manager will need to assign it to a different person. 
One last comment, I have the impression that with this initiative we would be moving towards a "better done than perfect" approach. Calcite is a vast project, with many different modules, and it could happen (it *will* happen) that a certain reviewer gets assigned a PR concerning a part of the project that they are not familiar with. Of course, one way of becoming (progressively) familiar with unknown parts of the project is by reviewing this type of PRs, but that takes time. I guess it would be possible for the assignee to try to ping and involve other reviewers with more experience in that area, but at the end of the day, it would be the assignee's responsibility to review and merge some piece of code that might be a bit obscure to them. This might lead to suboptimal or even incorrect code being inadvertently merged into the main branch. This is not a catastrophe (it can already happen with the current approach), and we will detect and correct these mistakes; I'm just mentioning that they might become a bit more frequent with the proposed approach (and we should all face them with a constructive and positive attitude). In any case, I have the impression that with the new idea the pros outweigh the cons, so we could give it a try. Best, Ruben [1] https://lists.apache.org/thread/30pf1o0vlcn7y3bhlcht1wdmvmxyvghn On Wed, Apr 12, 2023 at 3:13 AM Chunwei Lei wrote: > Thanks Julian for sharing the proposal. I am +1 for it. I have been busy in > the past few months, so I have only had a quick look at the new JIRA. > However, I will have more time in the coming months, and I would be more > than happy to review any pull requests. > > > > Best, > Chunwei > > > On Tue, Apr 11, 2023 at 10:22 PM Jiajun Xie > wrote: > > > Thank Julian for your idea. > > Your plan helps to motivate new contributors. 
> > > > “If there is no response to my PR, > > I will be disappointed or may even give up on continuing to contribute.” > > > > I hope that every contributor will be encouraged, > > and I also hope that the Calcite community will become stronger and > > stronger. > > > > +1, I am willing to join the pool of reviewers. > > > > On Tue, 11 Apr 2023 at 13:20, Benchao Li wrote: > > > Thanks Julian for starting the discussion! > > > > > > I spend my spare time contributing to Calcite, usually at > weekends > > > and sometimes in breaks on weekdays, hence my time is limited > > > because the amount of spare time may vary. Review work is not that simple for > me > > > because Calcite has many complicated components and has evolved over many years, > > > which means we need to track a lot of background. I'm still learning some > parts > > > while doing the review work. > > > > > > The complexity of PRs varies a lot; simple ones are easier to get > in > > > because they only cost me minutes to hours to review. But for the complex > ones, > > > I usually need to spend more time to understand the background, the new > > design, > > > the effect on the whole project, and the future direction we want to > > take. > > > These kinds of PRs may be preempted
[jira] [Created] (CALCITE-5643) Query with CROSS JOIN UNNEST(array) executed by BigQuery adaptor isn't supported
Jia-Xuan Liu created CALCITE-5643: - Summary: Query with CROSS JOIN UNNEST(array) executed by BigQuery adaptor isn't supported Key: CALCITE-5643 URL: https://issues.apache.org/jira/browse/CALCITE-5643 Project: Calcite Issue Type: Bug Affects Versions: 1.34.0 Reporter: Jia-Xuan Liu Given a query like

{code:sql}
WITH tmp1 (custkey, name, orders) AS (
  SELECT c_custkey, c_name,
         array_agg(o_orderkey ORDER BY o_orderkey ASC) orders
  FROM (bq.customer LEFT JOIN bq.orders ON (o_custkey = c_custkey))
  GROUP BY c_custkey, c_name
)
SELECT t.custkey, t.name,
       array_agg(o.o_orderstatus ORDER BY o.o_orderkey ASC)
FROM ((tmp1 t CROSS JOIN UNNEST(t.orders) u (uc))
      LEFT JOIN bq.orders o ON (u.uc = o.o_orderkey))
GROUP BY t.custkey, t.name
{code}

`bq` is a JdbcSchema connected with BigQuery. It fails with exception:

{code:java}
Caused by: java.lang.IllegalStateException: Unable to implement
  EnumerableAggregate(group=[{0, 1}], EXPR$2=[ARRAY_AGG($5) WITHIN GROUP ([4])]): rowcount = 251.2225, cumulative cost = {7580.5628125 rows, 32182.2752199209 cpu, 0.0 io}, id = 394
    EnumerableMergeJoin(condition=[=($3, $4)], joinType=[left]): rowcount = 2512.225, cumulative cost = {7297.9375 rows, 32182.2752199209 cpu, 0.0 io}, id = 392
      EnumerableSort(sort0=[$3], dir0=[ASC]): rowcount = 158.5, cumulative cost = {4247.2125 rows, 23621.968885142334 cpu, 0.0 io}, id = 383
        EnumerableCorrelate(correlation=[$cor0], joinType=[inner], requiredColumns=[{2}]): rowcount = 158.5, cumulative cost = {4088.7125 rows, 1140.15 cpu, 0.0 io}, id = 381
          EnumerableAggregate(group=[{0, 1}], orders=[ARRAY_AGG($2) WITHIN GROUP ([2])]): rowcount = 158.5, cumulative cost = {2281.8125 rows, 680.5 cpu, 0.0 io}, id = 372
            JdbcToEnumerableConverter: rowcount = 1585.0, cumulative cost = {2103.5 rows, 680.5 cpu, 0.0 io}, id = 370
              JdbcJoin(condition=[=($3, $0)], joinType=[left]): rowcount = 1585.0, cumulative cost = {1945.0 rows, 522.0 cpu, 0.0 io}, id = 368
                JdbcProject(c_custkey=[$0], c_name=[$1]): rowcount = 100.0, cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 363
                  JdbcTableScan(table=[[bq, customer]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 1
                JdbcProject(o_orderkey=[$0], o_custkey=[$1]): rowcount = 100.0, cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 366
                  JdbcTableScan(table=[[bq, orders]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 3
          EnumerableUncollect: rowcount = 1.0, cumulative cost = {2.9004 rows, 2.9004 cpu, 0.0 io}, id = 379
            JdbcToEnumerableConverter: rowcount = 1.0, cumulative cost = {1.9001 rows, 1.9001 cpu, 0.0 io}, id = 377
              JdbcProject(orders=[$cor0.orders]): rowcount = 1.0, cumulative cost = {1.8 rows, 1.8 cpu, 0.0 io}, id = 375
                JdbcValues(tuples=[[{ 0 }]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 263
      JdbcToEnumerableConverter: rowcount = 100.0, cumulative cost = {280.0 rows, 8560.306334778565 cpu, 0.0 io}, id = 390
        JdbcSort(sort0=[$0], dir0=[ASC]): rowcount = 100.0, cumulative cost = {270.0 rows, 8550.306334778565 cpu, 0.0 io}, id = 388
          JdbcProject(o_orderkey=[$0], o_orderstatus=[$2]): rowcount = 100.0, cumulative cost = {180.0 rows, 261.0 cpu, 0.0 io}, id = 386
            JdbcTableScan(table=[[bq, orders]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 3
  at org.apache.calcite.adapter.enumerable.EnumerableRelImplementor.implementRoot(EnumerableRelImplementor.java:117)
  at org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:112)
  at org.apache.calcite.prepare.CalcitePrepareImpl$CalcitePreparingStmt.implement(CalcitePrepareImpl.java:1159)
  at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:324)
  at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:220)
  at org.apache.calcite.prepare.CalcitePrepareImpl.prepare2_(CalcitePrepareImpl.java:665)
  at org.apache.calcite.prepare.CalcitePrepareImpl.prepare_(CalcitePrepareImpl.java:519)
  at org.apache.calcite.prepare.CalcitePrepareImpl.prepareSql(CalcitePrepareImpl.java:487)
  at org.apache.calcite.jdbc.CalciteConnectionImpl.parseQuery(CalciteConnectionImpl.java:236)
  at org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:621)
  at org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:677)
  at org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:157)
  ... 2 more
Suppressed: java.lang.UnsupportedOperationException
{code}
[jira] [Created] (CALCITE-5642) Add SHA256, SHA512 functions
Dan Zou created CALCITE-5642: Summary: Add SHA256, SHA512 functions Key: CALCITE-5642 URL: https://issues.apache.org/jira/browse/CALCITE-5642 Project: Calcite Issue Type: New Feature Reporter: Dan Zou Add SHA256, SHA512 functions and enable them for BigQuery. SHA256: Computes the hash of the input using the [SHA-256 algorithm|https://en.wikipedia.org/wiki/SHA-2]. This function returns 32 bytes. SHA512: Computes the hash of the input using the [SHA-512 algorithm|https://en.wikipedia.org/wiki/SHA-2]. This function returns 64 bytes. See more details in [BigQuery Doc|https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#sha256] -- This message was sent by Atlassian Jira (v8.20.10#820010)
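[Editor's note: the documented return sizes (32 bytes for SHA-256, 64 bytes for SHA-512) can be reproduced with the JDK's built-in MessageDigest, to which a runtime implementation could plausibly delegate. This is a sketch, not Calcite's actual implementation:]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShaSketch {
    // Computes a hash with the JDK's MessageDigest.
    // "SHA-256" yields 32 bytes; "SHA-512" yields 64 bytes.
    static byte[] hash(String algorithm, String input) {
        try {
            return MessageDigest.getInstance(algorithm)
                .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Both algorithms are mandatory in every JDK, so this is unreachable.
            throw new IllegalStateException(e);
        }
    }
}
```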