[jira] [Assigned] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2020-02-20 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy reassigned ARROW-6666:


Assignee: (was: Kyle McCarthy)

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executes directly from the logical plan, 
> so it should largely be a copy-and-paste exercise.
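For illustration, a minimal sketch of what such a physical expression might 
look like (the names and signatures here are assumptions for illustration, not 
the actual DataFusion API):

{code:java}
// Hypothetical sketch only, not the DataFusion implementation: a physical
// expression that evaluates to the same string value for every input row.
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::record_batch::RecordBatch;

struct StringLiteral {
    value: String,
}

impl StringLiteral {
    // Produce a StringArray containing one copy of the literal per row of
    // the input batch.
    fn evaluate(&self, batch: &RecordBatch) -> ArrayRef {
        let values = vec![self.value.as_str(); batch.num_rows()];
        Arc::new(StringArray::from(values))
    }
}
{code}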



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2020-01-19 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019047#comment-17019047
 ] 

Kyle McCarthy edited comment on ARROW-7480 at 1/19/20 10:58 PM:


@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification and wasn't sure of the right way to ask.

In the SQL planner, I am a little confused about how the column numbers are 
derived for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Edit: I am thinking that it should actually be using the column position in the 
table... At least it looks like the projection push-down optimizer expects 
them to be relative to the table. 

With the aggregate_test_100 data the query

{{SELECT c1, AVG(c12) FROM aggregate_test_100 GROUP BY c1, c2}}

has a table scan with the projection _TableScan: aggregate_test_100 
projection=Some([0, 1, 2, 11])_, which generates the wrong table schema.

 

Thanks!
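For illustration, a small hedged sketch of the two conventions in question (a 
hypothetical helper, not DataFusion code): if first_name is column #1 in the 
table but column #0 of the projection's output, a projection-relative sort 
index must be rebased before an optimizer that expects table-relative indices 
can use it.

{code:java}
// Hypothetical helper, not DataFusion code: map a projection-relative
// column index back to a table-relative one.
fn rebase_to_table(sort_col: usize, projection: &[usize]) -> usize {
    projection[sort_col]
}

fn main() {
    // SELECT first_name FROM person ORDER BY first_name, where first_name
    // is table column #1: the plan shows Projection: #1 but Sort: #0.
    let projection = vec![1];
    assert_eq!(rebase_to_table(0, &projection), 1);
}
{code}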


was (Author: kylemccarthy):
@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification and wasn't sure of the right way to ask.

In the SQL planner, I am a little confused about how the column numbers are 
derived for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Thanks!

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner. The aggregate_test_100 dataset is 
> used for both of the queries. 
> At a high level, the issue is that queries containing aggregate expressions 
> may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is caused by the 
> wrong schema being generated. The src/sql/planner.rs file defines the impl 
> for SqlToRel. Its sql_to_rel method checks whether the query contains 
> aggregate expressions, and if it does, it generates the schema from the 
> columns included in the group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c1",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "c13",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "MIN",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan 
> and that the data just isn't being projected correctly, resulting in the 
> wrong query result schema? 
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query will actually run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> 

[jira] [Commented] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2020-01-19 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019047#comment-17019047
 ] 

Kyle McCarthy commented on ARROW-7480:
--

@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification and wasn't sure of the right way to ask.

In the SQL planner, I am a little confused about how the column numbers are 
derived for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Thanks!

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner. The aggregate_test_100 dataset is 
> used for both of the queries. 
> At a high level, the issue is that queries containing aggregate expressions 
> may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is caused by the 
> wrong schema being generated. The src/sql/planner.rs file defines the impl 
> for SqlToRel. Its sql_to_rel method checks whether the query contains 
> aggregate expressions, and if it does, it generates the schema from the 
> columns included in the group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c1",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "c13",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "MIN",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan 
> and that the data just isn't being projected correctly, resulting in the 
> wrong query result schema? 
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query will actually run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
> others (e.g. Postgres) require every selected column to appear in the GROUP 
> BY clause or inside an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2020-01-01 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006516#comment-17006516
 ] 

Kyle McCarthy commented on ARROW-7480:
--

I can work on a fix for this, I would like some feedback from some of the other 
contributors on what to do with the second scenario since different DBs handle 
the case differently. 

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner. The aggregate_test_100 dataset is 
> used for both of the queries. 
> At a high level, the issue is that queries containing aggregate expressions 
> may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is caused by the 
> wrong schema being generated. The src/sql/planner.rs file defines the impl 
> for SqlToRel. Its sql_to_rel method checks whether the query contains 
> aggregate expressions, and if it does, it generates the schema from the 
> columns included in the group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c1",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "c13",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "MIN",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan 
> and that the data just isn't being projected correctly, resulting in the 
> wrong query result schema? 
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query will actually run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
> others (e.g. Postgres) require every selected column to appear in the GROUP 
> BY clause or inside an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2020-01-01 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy reassigned ARROW-6666:


Assignee: Kyle McCarthy

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executes directly from the logical plan, 
> so it should largely be a copy-and-paste exercise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2019-12-30 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-7480:
-
Description: 
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner. The aggregate_test_100 dataset is 
used for both of the queries. 

At a high level, the issue is that queries containing aggregate expressions 
may generate the wrong schema.

 

*Scenario 1*

Columns are grouped by but not selected.

Query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
Error:
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
While the error is an ArrowError, it actually looks like it is caused by the 
wrong schema being generated. The src/sql/planner.rs file defines the impl for 
SqlToRel. Its sql_to_rel method checks whether the query contains aggregate 
expressions, and if it does, it generates the schema from the columns included 
in the group expressions and aggregate expressions.

This in turn generates the following schema:
{code:java}
Schema {
fields: [
Field {
name: "c1",
data_type: Utf8,
nullable: false,
},
Field {
name: "c13",
data_type: Utf8,
nullable: false,
},
Field {
name: "MIN",
data_type: Float64,
nullable: true,
},
],
metadata: {},
}{code}
I am not super familiar with how DataFusion works under the hood, but I would 
assume that this schema is actually correct for the Aggregate logical plan and 
that the data just isn't being projected correctly, resulting in the wrong 
query result schema? 
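For concreteness, a hedged sketch of the schema-building behavior described 
above, using the arrow crate's schema types (illustrative only, not the 
planner's actual code):

{code:java}
// Sketch of the described behavior: the output schema is built from the
// GROUP BY columns followed by the aggregate expressions, ignoring the
// SELECT list, which yields the three fields shown above.
use arrow::datatypes::{DataType, Field, Schema};

fn aggregate_schema() -> Schema {
    let group_fields = vec![
        Field::new("c1", DataType::Utf8, false),
        Field::new("c13", DataType::Utf8, false),
    ];
    let aggr_fields = vec![Field::new("MIN", DataType::Float64, true)];
    Schema::new([group_fields, aggr_fields].concat())
}
{code}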

 

*Scenario 2*

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}
This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, Float64).

 

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others (e.g. Postgres) require every selected column to appear in the GROUP BY 
clause or inside an aggregate function.

  was:
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}
This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, Float64).

 

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.


> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner. The aggregate_test_100 dataset is 
> used for both of the queries. 
> At a high level, the issue is that queries containing aggregate expressions 
> may generate the wrong schema.
>  

[jira] [Updated] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2019-12-30 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-7480:
-
Description: 
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}
This should actually be Field(c0, Utf8), Field(c13, Utf8), Field(MIN, Float64).

 

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.

  was:
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

*Scenario 1*

Columns are grouped by but not selected.

 
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
 

You can reproduce with the aggregate_test_100 dataset with the query:

 
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

 

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 

 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:

 
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.


> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> You can reproduce with the aggregate_test_100 dataset with the query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query will actually run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c0, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
> others require every selected column to appear in the GROUP BY clause or 
> inside an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2019-12-30 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-7480:
-
Description: 
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}
This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, Float64).

 

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.

  was:
There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}
This should actually be Field(c0, Utf8), Field(c13, Utf8), Field(MIN, Float64).

 

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.


> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> 
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Kyle McCarthy
>Priority: Major
>
> There are two scenarios that cause problems, both related to queries with 
> aggregate expressions and the SQL planner.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> You can reproduce with the aggregate_test_100 dataset with the query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
>  
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query will actually run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN, 
> Float64).
>  
> 
> Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
> others require every selected column to appear in the GROUP BY clause or 
> inside an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

2019-12-30 Thread Kyle McCarthy (Jira)
Kyle McCarthy created ARROW-7480:


 Summary: [Rust] [DataFusion] Query fails/incorrect when aggregated 
+ grouped columns don't match the selected columns
 Key: ARROW-7480
 URL: https://issues.apache.org/jira/browse/ARROW-7480
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Kyle McCarthy


There are two scenarios that cause problems, both related to queries with 
aggregate expressions and the SQL planner.

*Scenario 1*

Columns are grouped by but not selected.

 
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields 
in schema")){noformat}
 

You can reproduce with the aggregate_test_100 dataset with the query:

 
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Scenario 2*

 

Columns are selected, but not grouped or part of an aggregate function. This 
query will actually run, but the wrong schema is produced.

Query: 

 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:

 
{code:java}
Schema {
fields: [
Field {
name: "c0",
data_type: Utf8,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
Field {
name: "c1",
data_type: Float64,
nullable: true,
},
],
metadata: {},
} {code}

Schema 2 is questionable since some DBMSs will run the query (e.g. MySQL) but 
others require every selected column to appear in the GROUP BY clause or 
inside an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7478) [Rust] [DataFusion] Group by expression ignored unless paired with aggregate expression

2019-12-30 Thread Kyle McCarthy (Jira)
Kyle McCarthy created ARROW-7478:


 Summary: [Rust] [DataFusion] Group by expression ignored unless 
paired with aggregate expression
 Key: ARROW-7478
 URL: https://issues.apache.org/jira/browse/ARROW-7478
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Kyle McCarthy


GROUP BY expressions are ignored unless the query also contains an aggregate 
expression.

To reproduce, you can execute the following query on the aggregate_test_100 
dataset.
{code:java}
SELECT c2 FROM aggregate_test_100 GROUP BY c2{code}
*Expected:*

{{1}}
 {{2}}
 {{3}}
 {{4}}
 {{5}}

*Actual:*

{{2}}
 {{5}}
 {{...98 more rows...}}

 

The order of the expected output isn't necessarily correct since the query 
doesn't contain an ORDER BY expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-11-10 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971263#comment-16971263
 ] 

Kyle McCarthy edited comment on ARROW-6947 at 11/11/19 12:25 AM:
-

I would like to provide some assistance with implementing UDFs. Is the idea to 
implement them similarly to Gandiva and generate LLVM IR or WebAssembly? Or to 
take a different approach and provide a trait for users to implement for their 
UDF and then register?


was (Author: kylemccarthy):
I would like to provide some assistance with implementing UDFs. Is the idea to 
implement them similarly to Gandiva and generate LLVM IR or WebAssembly? Or to 
provide a trait for users to implement for their UDF and then register?

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-11-10 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971263#comment-16971263
 ] 

Kyle McCarthy commented on ARROW-6947:
--

I would like to provide some assistance with implementing UDFs. Is the idea to 
implement them similarly to Gandiva and generate LLVM IR or WebAssembly? Or to 
provide a trait for users to implement for their UDF and then register?

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-11-10 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-6947:
-
Comment: was deleted

(was: I am curious to see if you have any ideas about how this would work. I 
have been working on a PoC, but will probably need to make some design 
decisions and would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type: 
{code:java}
Box) -> Result{code}
so if a user defines a function such as
{code:java}
fn length(s: String) -> usize{code}
we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. )

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-10-23 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958321#comment-16958321
 ] 

Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:59 PM:
-

I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type:

 
{code:java}
Box) -> Result{code}
 

so if a user defines a function such as
{code:java}
fn length(s: String) -> usize{code}
we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 


was (Author: kylemccarthy):
I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type:

```Box) -> Result```

so if a user defines a function such as

```fn length(s: String) -> usize```

we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-10-23 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958321#comment-16958321
 ] 

Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:59 PM:
-

I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type: 
{code:java}
Box) -> Result{code}
so if a user defines a function such as
{code:java}
fn length(s: String) -> usize{code}
we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 


was (Author: kylemccarthy):
I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type:

 
{code:java}
Box) -> Result{code}
 

so if a user defines a function such as
{code:java}
fn length(s: String) -> usize{code}
we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-10-23 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958321#comment-16958321
 ] 

Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:58 PM:
-

I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type:

```Box) -> Result```

so if a user defines a function such as

```fn length(s: String) -> usize```

we would wrap that and return our ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 


was (Author: kylemccarthy):
I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type 
```Box) -> Result```, so if a user defines a function such as 
```fn length(s: String) -> usize``` we would wrap that and return our 
ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-10-23 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958321#comment-16958321
 ] 

Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:58 PM:
-

I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type 
```Box) -> Result```, so if a user defines a function such as 
```fn length(s: String) -> usize``` we would wrap that and return our 
ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 


was (Author: kylemccarthy):
I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type 
`Box) -> Result`, so if a user defines a function such as 
`fn length(s: String) -> usize` we would wrap that and return our 
ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2019-10-23 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958321#comment-16958321
 ] 

Kyle McCarthy commented on ARROW-6947:
--

I am curious to see if you have any ideas about how this would work. I have 
been working on a PoC, but will probably need to make some design decisions and 
would like to see if they align with yours.

At a high level, I see this working by composing a UDF with some general 
ScalarFunction type. Right now I have the ScalarFunction with type 
`Box) -> Result`, so if a user defines a function such as 
`fn length(s: String) -> usize` we would wrap that and return our 
ScalarFunction.

I think that the composed functions need to be associated with some "static" 
metadata, similar to the FunctionMeta in the logical plan. I think we would 
want to know the DataType of the arguments that the function expects and if 
they are optional, as well as the return type and if it is fallible/infallible.

If the UDF accepts and returns primitive Rust types, generating that metadata 
should be pretty straightforward. However, if the UDF takes/returns 
ScalarValues, the user would have to provide the metadata explicitly.

We would be able to generate most of the data for the logical plan's 
FunctionMeta but would still need the function name and the field names for the 
args.

As of right now, I haven't done anything related to Aggregate UDFs or actually 
registering them with the ExecutionContext. 
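A minimal sketch of the composition idea described above (the shapes of 
ScalarFunction and FunctionMeta here are assumptions based on this discussion, 
not the actual DataFusion API):

{code:java}
// Hedged sketch only: wrap a plain Rust function plus static metadata
// into a ScalarFunction value; names and fields are illustrative.
use arrow::datatypes::DataType;

struct FunctionMeta {
    name: String,
    arg_types: Vec<DataType>,
    return_type: DataType,
}

struct ScalarFunction {
    meta: FunctionMeta,
    func: Box<dyn Fn(&str) -> usize>,
}

// Wrap a `length`-style function, deriving the metadata from its
// primitive Rust argument and return types as suggested above.
fn length_udf() -> ScalarFunction {
    ScalarFunction {
        meta: FunctionMeta {
            name: "length".to_string(),
            arg_types: vec![DataType::Utf8],
            return_type: DataType::UInt64,
        },
        func: Box::new(|s| s.len()),
    }
}
{code}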

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2019-10-14 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951347#comment-16951347
 ] 

Kyle McCarthy commented on ARROW-6666:
--

Does this require Rust's Arrow to implement a 
[StringType|https://arrow.apache.org/docs/cpp/api/datatype.html#classarrow_1_1_string_type]
 similar to the one in the C++ implementation? 

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executes directly from the logical plan, 
> so it should largely be a copy-and-paste exercise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-14 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy updated ARROW-6659:
-
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>  Labels: pull-request-available
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-11 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779
 ] 

Kyle McCarthy edited comment on ARROW-6659 at 10/11/19 8:32 PM:


I am happy to help - and I would prefer to do it how you wanted/the right way! 
I am fairly unfamiliar with the codebase so I am really just learning it by 
working through the open tasks, so this may be a dumb question.

How do the LogicalPlan and partition count actually work together? From the 
tests it looks like the partition count is related to the batch size? If so, 
that would mean that every LogicalPlan would have the same partition count, 
right?

Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL 
Planner is creating a Logical Aggregate it would create:
{code:java}
Aggregate { Merge { Aggregate ( aggregate_input ) } }{code}
? If so, that definitely makes sense to me, but I am still not totally sure 
how the partition count factors into this.

Thank you for your patience!


was (Author: kylemccarthy):
I am happy to help - and I would prefer to do it how you wanted/the right way! 
I am fairly unfamiliar with the codebase so I am really just learning it by 
working through the open tasks, so this may be a dumb question.

How do the LogicalPlan and partition count actually work together? From the 
tests it looks like the partition count is related to the batch size? If so, 
that would mean that every LogicalPlan would have the same partition count, 
right?

Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL 
Planner is creating a Logical Aggregate it would create: `Aggregate { Merge { 
Aggregate ( aggregate_input ) } }`? If so, that definitely makes sense to me, 
but I am still not totally sure how the partition count factors into this.

Thank you for your patience!

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-11 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779
 ] 

Kyle McCarthy commented on ARROW-6659:
--

I am happy to help - and I would prefer to do it how you wanted/the right way! 
I am fairly unfamiliar with the codebase so I am really just learning it by 
working through the open tasks, so this may be a dumb question.

How do the LogicalPlan and partition count actually work together? From the 
tests it looks like the partition count is related to the batch size? If so, 
that would mean that every LogicalPlan would have the same partition count, 
right?

Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL 
Planner is creating a Logical Aggregate it would create: `Aggregate { Merge { 
Aggregate ( aggregate_input ) } }`? If so that definitely makes sense to me, 
but I am still not totally sure how the partition count would work into this.

Thank you for your patience!

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-10 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948961#comment-16948961
 ] 

Kyle McCarthy commented on ARROW-6659:
--

Does the LogicalPlan need to know about partitioning? I was able to just move 
the logic into the physical query planner and it seems to work... I am not sure 
if this is what you had in mind, though. [You can see what I mean 
here|https://github.com/kyle-mccarthy/arrow/blob/3596cf417b8420b51f265c980aace7292c6134d8/rust/datafusion/src/execution/context.rs#L275].

Originally I was thinking about adding a `LogicalPlan::Merge` variant and a 
boolean on `LogicalPlan::Aggregate` to indicate whether the input was 
partitioned, but that seemed like it could overcomplicate the process.
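
The relevant part of that change boils down to something like this (a 
simplified sketch with placeholder types; the real HashAggregateExec and 
MergeExec constructors also take schemas and expressions):
{code:java}
use std::sync::Arc;

// Placeholder trait/structs; the real ones carry schemas and expressions.
trait ExecutionPlan {}
struct HashAggregateExec { input: Arc<dyn ExecutionPlan> }
struct MergeExec { input: Arc<dyn ExecutionPlan> }
impl ExecutionPlan for HashAggregateExec {}
impl ExecutionPlan for MergeExec {}

// The planner, not the operator, now builds the three explicit steps:
// final HashAggregate over a MergeExec over the per-partition HashAggregate.
fn plan_aggregate(input: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
    let partial = Arc::new(HashAggregateExec { input });
    let merged = Arc::new(MergeExec { input: partial });
    Arc::new(HashAggregateExec { input: merged })
}{code}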

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-10 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy reassigned ARROW-6659:


Assignee: Kyle McCarthy

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-09 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947901#comment-16947901
 ] 

Kyle McCarthy edited comment on ARROW-6659 at 10/9/19 6:03 PM:
---

Do you have a specific solution in mind? I was thinking that this could be 
done by moving some of the logic out of the partitions method in 
HashAggregateExec into create_physical_plan on the ExecutionContext. I was 
also thinking that it could probably work with generics.
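
By generics I mean making the merge step pluggable, along these lines (a loose 
sketch; MergeStrategy is a hypothetical trait, not something in the codebase):
{code:java}
use std::sync::Arc;

trait ExecutionPlan {}
struct MergeExec { input: Arc<dyn ExecutionPlan> }
impl ExecutionPlan for MergeExec {}

// Hypothetical: another project (e.g. a distributed engine) could supply its
// own merge implementation without changing HashAggregateExec itself.
trait MergeStrategy {
    fn merge(&self, partial: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan>;
}

// Default in-process behaviour: wrap the partial aggregate in a MergeExec.
struct LocalMerge;
impl MergeStrategy for LocalMerge {
    fn merge(&self, partial: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
        Arc::new(MergeExec { input: partial })
    }
}{code}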


was (Author: kylemccarthy):
Do you have a specific solution in mind? I was thinking that this could be 
done by pulling some of the logic out of the partitions method in 
HashAggregateExec, but it could probably also work with generics.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-09 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947901#comment-16947901
 ] 

Kyle McCarthy commented on ARROW-6659:
--

Do you have a specific solution in mind? I was thinking that this could be 
done by pulling some of the logic out of the partitions method in 
HashAggregateExec, but it could probably also work with generics.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6696) [Rust] [DataFusion] Implement simple math operations in physical query plan

2019-10-07 Thread Kyle McCarthy (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle McCarthy reassigned ARROW-6696:


Assignee: Kyle McCarthy

> [Rust] [DataFusion] Implement simple math operations in physical query plan
> ---
>
> Key: ARROW-6696
> URL: https://issues.apache.org/jira/browse/ARROW-6696
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Update BinaryExpr to support simple math operations such as +, -, *, / using 
> compute kernels where possible.
> See the original implementation when executing directly from the logical plan 
> for inspiration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6696) [Rust] [DataFusion] Implement simple math operations in physical query plan

2019-10-07 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946282#comment-16946282
 ] 

Kyle McCarthy commented on ARROW-6696:
--

Do you want to include the modulo operation, or stick to the basics (plus, 
minus, multiply, divide)?
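
Either way, the basic four map onto existing compute kernels; here is a 
minimal sketch of the dispatch (the kernel module path is my assumption about 
the crate layout, and Int64Array stands in for the generic numeric case):
{code:java}
use arrow::array::Int64Array;
use arrow::compute::kernels::arithmetic::{add, divide, multiply, subtract};
use arrow::error::Result;

enum Operator { Plus, Minus, Multiply, Divide }

// Dispatch a binary math expression to the matching Arrow compute kernel.
// Note there is no modulo kernel, so % would need a hand-rolled loop.
fn evaluate(op: &Operator, l: &Int64Array, r: &Int64Array) -> Result<Int64Array> {
    match op {
        Operator::Plus => add(l, r),
        Operator::Minus => subtract(l, r),
        Operator::Multiply => multiply(l, r),
        Operator::Divide => divide(l, r),
    }
}{code}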

> [Rust] [DataFusion] Implement simple math operations in physical query plan
> ---
>
> Key: ARROW-6696
> URL: https://issues.apache.org/jira/browse/ARROW-6696
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Update BinaryExpr to support simple math operations such as +, -, *, / using 
> compute kernels where possible.
> See the original implementation when executing directly from the logical plan 
> for inspiration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6744) Export JsonEqual trait in the array module

2019-09-30 Thread Kyle McCarthy (Jira)
Kyle McCarthy created ARROW-6744:


 Summary: Export JsonEqual trait in the array module
 Key: ARROW-6744
 URL: https://issues.apache.org/jira/browse/ARROW-6744
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Kyle McCarthy


ARROW-5901 added checking for array equality against JSON arrays. This added 
JsonEqual as a trait bound on the Array trait, but it isn't exported, making it 
effectively private.

JsonEqual is a public trait, but the equal module is private, so JsonEqual is 
not re-exported the way the ArrayEqual trait is.

AFAIK this makes it impossible to implement your own arrays that are bound by 
the Array trait.

I suggest that JsonEqual be re-exported with pub use from the array module, 
like the ArrayEqual trait.
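
Concretely, something like the following in the array module (a sketch, 
assuming the trait lives in the private equal module alongside ArrayEqual):
{code:java}
// Re-export JsonEqual the same way ArrayEqual is exported today.
pub use self::equal::{ArrayEqual, JsonEqual};{code}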



--
This message was sent by Atlassian Jira
(v8.3.4#803005)