[jira] [Commented] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish

2021-04-26 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332301#comment-17332301
 ] 

Andrew Lamb commented on ARROW-10275:
-

Migrated to github: https://github.com/apache/arrow-datafusion/issues/107

> [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
> ---
>
> Key: ARROW-10275
> URL: https://issues.apache.org/jira/browse/ARROW-10275
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust - DataFusion
> Affects Versions: 2.0.0
> Environment: Ubuntu 20.04
> Reporter: Josh Taylor
> Priority: Minor
>
> GROUP BY queries on high-cardinality columns (columns with many unique 
> values) don't seem to finish.
> I've tried with both datafusion-cli and this:
> [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]
> When grouping by O_ORDERKEY there are ~15,000,000 unique values, and the 
> query seems to stall. Adding a LIMIT doesn't help either.
> My parquet file: 
> [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]
> datafusion-cli:
> {code:sql}
> CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
> SELECT O_ORDERKEY FROM something GROUP BY O_ORDERKEY;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish

2020-10-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212682#comment-17212682
 ] 

Andy Grove commented on ARROW-10275:


I have seen the same behavior. So far we have mostly tested hash aggregates 
with queries that produce low-cardinality results. We will need to spend time 
testing high-cardinality results and see how we can optimize this.
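
To illustrate the difference between the two cases, here is a minimal sketch of 
a hash-based GROUP BY using only the Rust standard library. It is not 
DataFusion's implementation; the function name group_by_keys and the row counts 
are assumptions chosen for illustration. It only shows that with unique keys 
the hash table grows to one entry per input row, whereas a low-cardinality 
input keeps it tiny.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts rows per distinct key with a hash map,
/// roughly the work needed for `SELECT k FROM t GROUP BY k`.
fn group_by_keys(keys: &[i64]) -> HashMap<i64, u64> {
    // Pre-sizing avoids repeated rehashing when nearly every row is a new group.
    let mut groups: HashMap<i64, u64> = HashMap::with_capacity(keys.len());
    for &k in keys {
        // One hash lookup per row; a new entry is created for each unseen key.
        *groups.entry(k).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Low cardinality: 1,000,000 rows that fall into 10 groups.
    let low: Vec<i64> = (0..1_000_000).map(|i| i % 10).collect();
    // High cardinality: 1,000,000 rows, every key unique.
    let high: Vec<i64> = (0..1_000_000).collect();

    println!("low cardinality:  {} groups", group_by_keys(&low).len());
    println!("high cardinality: {} groups", group_by_keys(&high).len());
}
```

Timing the two calls separately is one way to check whether per-group overhead, 
rather than row count, dominates the high-cardinality case.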