[jira] [Comment Edited] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

Kyle McCarthy (Jira) Sun, 19 Jan 2020 14:59:28 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019047#comment-17019047
 ]


Kyle McCarthy edited comment on ARROW-7480 at 1/19/20 10:58 PM:
----------------------------------------------------------------

@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification wasn't sure the right way to ask.

In the SQL planner, I am a little confused how the column numbers are derived 
for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Edit: I am thinking that it should actually be using the column position in the 
table... At least it looks like the projection push down optimizer is expecting 
for them to be relative to the table. 

With the aggregate_test_100 data the query

{{SELECT c1, AVG(c12) FROM aggregate_test_100 GROUP BY c1, c2}}

has a table scan with the projection _TableScan: aggregate_test_100 
projection=Some([0, 1, 2, 11]_, which generates the wrong table schema.

 

Thanks!


was (Author: kylemccarthy):
@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification wasn't sure the right way to ask.

In the SQL planner, I am a little confused how the column numbers are derived 
for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Thanks!

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7480
>                 URL: https://issues.apache.org/jira/browse/ARROW-7480
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Kyle McCarthy
>            Priority: Major
>
> There are two scenarios that cause problems but are related to the queries 
> with aggregate expressions and the SQL planner. The aggregate_test_100 
> dataset is used for both of the queries. 
> At a high level, the issue is basically that queries containing aggregate 
> expressions may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the 
> wrong schema is generated. In the src/sql/planner.rs file the impl for 
> SqlToRel is defined. In the sql_to_rel method, it checks if the query 
> contains aggregate expressions, and if it does it generates the schema from 
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c1",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "c13",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "MIN",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan, 
> but isn't projecting the data correctly thus resulting in the wrong query 
> result schema? 
>  
> *Senario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query actually will run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c0",
>             data_type: Utf8,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, 
> Float64).
>  
> ----
> Schema 2 is questionable since some DBMS will run the query (ex MySQL) but 
> others (Postgres) will require that all the columns must be in the GROUP BY 
> to be used in an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

Reply via email to