[jira] [Resolved] (ARROW-9899) [Rust] [DataFusion] Switch from Box<Schema> --> SchemaRef (Arc<Schema>) to be consistent with the rest of Arrow

2020-09-03 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9899.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8086
[https://github.com/apache/arrow/pull/8086]

> [Rust] [DataFusion] Switch from Box<Schema> --> SchemaRef (Arc<Schema>) to be 
> consistent with the rest of Arrow
> ---
>
> Key: ARROW-9899
> URL: https://issues.apache.org/jira/browse/ARROW-9899
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The idea is to use SchemaRef (which is an Arc<Schema>) instead of 
> Box<Schema> inside DataFusion to be consistent with the rest of the Arrow 
> implementation, avoid so many copies, and make the code simpler.
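The difference in copying cost can be sketched with the standard library alone. This is a minimal illustration, not DataFusion code: the `Schema` struct below is a hypothetical stand-in that models only field names, and `shares_allocation` is a helper invented for the example.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's Schema; only field names are modelled.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

// Mirrors the alias named in the issue: a reference-counted schema.
type SchemaRef = Arc<Schema>;

/// Returns true when both handles point at the same allocation.
fn shares_allocation(a: &SchemaRef, b: &SchemaRef) -> bool {
    Arc::ptr_eq(a, b)
}

fn main() {
    let schema = Schema {
        fields: vec!["c1".to_string(), "c2".to_string()],
    };

    // Cloning a Box<Schema> allocates a new Schema and copies every field.
    let boxed: Box<Schema> = Box::new(schema.clone());
    let boxed_copy = boxed.clone();
    assert_eq!(boxed, boxed_copy);
    assert!(!std::ptr::eq(&*boxed, &*boxed_copy));

    // Cloning a SchemaRef only increments a reference count.
    let shared: SchemaRef = Arc::new(schema);
    let shared_copy = Arc::clone(&shared);
    assert!(shares_allocation(&shared, &shared_copy));
    assert_eq!(Arc::strong_count(&shared), 2);
    assert_eq!(shared_copy.fields.len(), 2);
}
```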



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places

2020-09-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9916.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8113
[https://github.com/apache/arrow/pull/8113]

> [RUST] Avoid cloning ArrayData in several places
> 
>
> Key: ARROW-9916
> URL: https://issues.apache.org/jira/browse/ARROW-9916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed this while benchmarking improvements in ARROW-9895. A flamegraph 
> showed a significant amount of time spent in Arc::clone/atomic_add followed 
> by Arc::drop/atomic_sub.
> The Array trait has two methods for accessing ArrayData: `.data()`, which 
> clones an `Arc<ArrayData>`, and `.data_ref()`, which only borrows the data. In 
> many places a borrow can be used instead of a clone.
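The cost difference between the two accessors can be sketched without the arrow crate. The types below (`ArrayData` with only a length, `ArraySketch`) are hypothetical stand-ins; only the method names `data` and `data_ref` are taken from the issue text.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's ArrayData; only a length is modelled.
struct ArrayData {
    len: usize,
}

// Hypothetical array type that owns its data through an Arc.
struct ArraySketch {
    data: Arc<ArrayData>,
}

impl ArraySketch {
    fn new(len: usize) -> Self {
        ArraySketch {
            data: Arc::new(ArrayData { len }),
        }
    }

    /// Clones the Arc: one atomic increment now, one atomic decrement on drop.
    fn data(&self) -> Arc<ArrayData> {
        Arc::clone(&self.data)
    }

    /// Borrows the Arc: no reference-count traffic.
    fn data_ref(&self) -> &Arc<ArrayData> {
        &self.data
    }
}

fn main() {
    let array = ArraySketch::new(5);
    {
        let owned = array.data();
        assert_eq!(Arc::strong_count(&owned), 2);
        assert_eq!(owned.len, 5);
    } // `owned` is dropped here and the count returns to 1

    let borrowed = array.data_ref();
    assert_eq!(Arc::strong_count(borrowed), 1);
    assert_eq!(borrowed.len, 5);
}
```

In a hot loop the borrowed form avoids the paired atomic operations that the flamegraph in the issue points at.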





[jira] [Assigned] (ARROW-9916) [RUST] Avoid cloning ArrayData in several places

2020-09-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-9916:
-

Assignee: Jörn Horstmann

> [RUST] Avoid cloning ArrayData in several places
> 
>
> Key: ARROW-9916
> URL: https://issues.apache.org/jira/browse/ARROW-9916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed this while benchmarking improvements in ARROW-9895. A flamegraph 
> showed a significant amount of time spent in Arc::clone/atomic_add followed 
> by Arc::drop/atomic_sub.
> The Array trait has two methods for accessing ArrayData: `.data()`, which 
> clones an `Arc<ArrayData>`, and `.data_ref()`, which only borrows the data. In 
> many places a borrow can be used instead of a clone.





[jira] [Resolved] (ARROW-9887) [Rust] [DataFusion] Add support for complex return types of built-in functions

2020-08-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9887.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8080
[https://github.com/apache/arrow/pull/8080]

> [Rust] [DataFusion] Add support for complex return types of built-in functions
> --
>
> Key: ARROW-9887
> URL: https://issues.apache.org/jira/browse/ARROW-9887
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-9886) [Rust] [DataFusion] Simplify code to test cast

2020-08-31 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9886.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8077
[https://github.com/apache/arrow/pull/8077]

> [Rust] [DataFusion] Simplify code to test cast
> --
>
> Key: ARROW-9886
> URL: https://issues.apache.org/jira/browse/ARROW-9886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We have 3 tests with similar functionality that vary only in the types 
> they test. Let's create a macro to apply to all of them, so that the tests 
> are equivalent and DRY.
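A macro of the kind proposed here might look like the following sketch. Everything in it is hypothetical: `cast_to_f64` stands in for the cast under test, and `generate_cast_check` is an invented macro name. A real test suite would emit `#[test]` functions rather than the plain functions used here to keep the example runnable on its own.

```rust
// Hypothetical helper standing in for the cast under test.
fn cast_to_f64<T: Into<f64>>(value: T) -> f64 {
    value.into()
}

// Generates one check per input type, so every type runs the same logic.
macro_rules! generate_cast_check {
    ($name:ident, $ty:ty, $input:expr, $expected:expr) => {
        fn $name() -> bool {
            let input: $ty = $input;
            cast_to_f64(input) == $expected
        }
    };
}

generate_cast_check!(check_cast_i32, i32, 7, 7.0);
generate_cast_check!(check_cast_u8, u8, 7, 7.0);
generate_cast_check!(check_cast_f32, f32, 7.0, 7.0);

fn main() {
    assert!(check_cast_i32());
    assert!(check_cast_u8());
    assert!(check_cast_f32());
}
```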





[jira] [Resolved] (ARROW-9845) [Rust] [Parquet] serde_json is only used in tests but isn't in dev-dependencies

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9845.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8087
[https://github.com/apache/arrow/pull/8087]

> [Rust] [Parquet] serde_json is only used in tests but isn't in 
> dev-dependencies
> ---
>
> Key: ARROW-9845
> URL: https://issues.apache.org/jira/browse/ARROW-9845
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Benjamin Kimock
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.0.1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is resolved by moving the dependency out of dependencies and into 
> dev-dependencies.





[jira] [Resolved] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9891.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8089
[https://github.com/apache/arrow/pull/8089]

> [Rust] [DataFusion] Make math functions support f32
> ---
>
> Key: ARROW-9891
> URL: https://issues.apache.org/jira/browse/ARROW-9891
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).
> The goal of this issue is to make the operation be cast(g(f32) AS f64) 
> instead.
> Since computations on f32 are faster than on f64, this is a simple 
> optimization.
>  
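The two evaluation orders described above can be compared with plain Rust floats. This is a sketch using `sqrt` as an arbitrary example of a math function `g`; the function names are invented for illustration and are not DataFusion APIs.

```rust
/// Behaviour described as current in the issue: widen first, compute in f64.
fn sqrt_via_f64(x: f32) -> f64 {
    (x as f64).sqrt()
}

/// Behaviour proposed in the issue: compute in f32, then widen the result.
fn sqrt_in_f32(x: f32) -> f64 {
    x.sqrt() as f64
}

fn main() {
    let x = 2.0_f32;
    let wide = sqrt_via_f64(x);
    let narrow = sqrt_in_f32(x);

    // The two paths agree to f32 precision, not necessarily bit for bit.
    assert!((wide - narrow).abs() < 1e-6);

    // When the result is exactly representable, both paths coincide.
    assert_eq!(sqrt_via_f64(4.0), sqrt_in_f32(4.0));
}
```

The f32 path trades some precision in the result for cheaper arithmetic, which is the trade-off the issue accepts.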





[jira] [Resolved] (ARROW-9908) [Rust] Support temporal data types in JSON reader

2020-09-07 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9908.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8124
[https://github.com/apache/arrow/pull/8124]

> [Rust] Support temporal data types in JSON reader
> -
>
> Key: ARROW-9908
> URL: https://issues.apache.org/jira/browse/ARROW-9908
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Christoph Schulze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently the JSON reader does not support any temporal data types. Columns 
> with *numerical* data should be interpretable as temporal type when defined 
> accordingly in the schema. Currently this would throw an error with a 
> misleading message ("struct types are not yet supported").
> related issue:
> https://issues.apache.org/jira/browse/ARROW-4803 focuses on parsing temporal 
> data based on string inputs.





[jira] [Resolved] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function

2020-09-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9944.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8142
[https://github.com/apache/arrow/pull/8142]

> [Rust] Implement TO_TIMESTAMP function
> --
>
> Key: ARROW-9944
> URL: https://issues.apache.org/jira/browse/ARROW-9944
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement the TO_TIMESTAMP function, as described in 
> https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit





[jira] [Resolved] (ARROW-9837) [Rust] Add provider for variable

2020-09-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9837.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8135
[https://github.com/apache/arrow/pull/8135]

> [Rust] Add provider for variable
> 
>
> Key: ARROW-9837
> URL: https://issues.apache.org/jira/browse/ARROW-9837
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: qingcheng wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> SELECT @@version;
> @@version is a variable; to get its value, we have to obtain it from 
> outside the system.





[jira] [Resolved] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9821.
---
Resolution: Fixed

Issue resolved by pull request 8097
[https://github.com/apache/arrow/pull/8097]

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The basic goal is to  allow users to implement their own PlanNodes. I will 
> provide a google doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Resolved] (ARROW-9988) [Rust] [DataFusion] Added std::ops to logical expressions

2020-09-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9988.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8182
[https://github.com/apache/arrow/pull/8182]

> [Rust] [DataFusion] Added std::ops to logical expressions
> -
>
> Key: ARROW-9988
> URL: https://issues.apache.org/jira/browse/ARROW-9988
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> So that we can write {{col("a") + col("b")}}
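Operator overloading of this kind is done by implementing the `std::ops` traits for the expression type. The sketch below uses a hypothetical two-variant `Expr` enum and a `col` helper modelled on the snippet in the issue; DataFusion's real `Expr` has many more variants and may represent binary expressions differently.

```rust
use std::ops::Add;

// Hypothetical, heavily reduced expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Plus(Box<Expr>, Box<Expr>),
}

/// Builds a column reference, mirroring the `col("a")` call in the issue.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

// Implementing Add lets `lhs + rhs` build an expression tree node.
impl Add for Expr {
    type Output = Expr;

    fn add(self, rhs: Expr) -> Expr {
        Expr::Plus(Box::new(self), Box::new(rhs))
    }
}

fn main() {
    let expr = col("a") + col("b");
    assert_eq!(
        expr,
        Expr::Plus(Box::new(col("a")), Box::new(col("b")))
    );
}
```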





[jira] [Resolved] (ARROW-9986) [Rust][DataFusion] TO_TIMESTAMP function erroneously requires fractional seconds when no timezone is present

2020-09-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9986.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8179
[https://github.com/apache/arrow/pull/8179]

> [Rust][DataFusion] TO_TIMESTAMP function erroneously requires fractional 
> seconds when no timezone is present
> 
>
> Key: ARROW-9986
> URL: https://issues.apache.org/jira/browse/ARROW-9986
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Reported by [~jhorstmann] here: 
> https://github.com/apache/arrow/pull/8161#issuecomment-691468844
> >One (not directly related) issue I noticed while trying this out, is that 
> >the local patterns seem to require the millisecond part, while for utc 
> >timestamps with "Z" they are optional:
> Both of the following timestamps should be supported, but only the one with 
> an explicit timezone is:
> {code}
> > select to_timestamp('2020-09-12T10:30:00') from test limit 1;
> ArrowError(ExternalError(General("Error parsing \'2020-09-12T10:30:00\' as 
> timestamp")))
> > select to_timestamp('2020-09-12T10:30:00Z') from test limit 1;
> +---+
> | totimestamp(Utf8("2020-09-12T10:30:00Z")) |
> +---+
> | 15999066000   |
> +---+
> {code}





[jira] [Resolved] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment

2020-09-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9848.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8174
[https://github.com/apache/arrow/pull/8174]

> [Rust] Implement changes to ensure flatbuffer alignment
> ---
>
> Key: ARROW-9848
> URL: https://issues.apache.org/jira/browse/ARROW-9848
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> See ARROW-6313, changes were made to all IPC implementations except for Rust





[jira] [Resolved] (ARROW-9961) [Rust][DataFusion] to_timestamp function parses timestamp without timezone offset as UTC rather than local

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9961.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8161
[https://github.com/apache/arrow/pull/8161]

> [Rust][DataFusion] to_timestamp function parses timestamp without timezone 
> offset as UTC rather than local
> --
>
> Key: ARROW-9961
> URL: https://issues.apache.org/jira/browse/ARROW-9961
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> ARROW-9944 added a TO_TIMESTAMP function that supports parsing timestamps 
> without a specified timezone, such as {{2020-09-08T13:42:29.190855}}
> Such timestamps are supposed to be interpreted as in the local timezone, but 
> instead are interpreted as UTC. 





[jira] [Resolved] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9979.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8168
[https://github.com/apache/arrow/pull/8168]

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on the remaining ones in future PRs.





[jira] [Resolved] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9950.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8144
[https://github.com/apache/arrow/pull/8144]

> [Rust] [DataFusion] Allow UDF usage without registry
> 
>
> Key: ARROW-9950
> URL: https://issues.apache.org/jira/browse/ARROW-9950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This functionality is relevant only for the DataFrame API.
> Sometimes a UDF is declared during planning, and the API is more 
> expressive when the user does not have to access the registry at all to plan 
> the UDF.





[jira] [Resolved] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9954.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8155
[https://github.com/apache/arrow/pull/8155]

> [Rust] [DataFusion] Simplify code of aggregate planning
> ---
>
> Key: ARROW-9954
> URL: https://issues.apache.org/jira/browse/ARROW-9954
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195638#comment-17195638
 ] 

Andy Grove commented on ARROW-10002:


Thanks [~batmanaod] this looks really interesting.

[~paddyhoran] [~nevime]  [~sunchao]  [~alamb] [~jorgecarleitao]  [~jhorstmann] 
will likely be interested in this

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of `default fn` in the codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], 
> primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/]. (Note: 
> there has been further discussion and ideas for resolving the soundness 
> issue, but to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.
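One pattern that compiles on stable Rust, and may cover some of the `default fn` uses listed above, is a trait with a provided method that individual types override in explicit impls. This is a sketch only and not necessarily what the project would adopt; the trait `Encode` and the strings it returns are invented for illustration.

```rust
/// Trait with a provided method; concrete types may override it.
trait Encode {
    fn encoding_name(&self) -> &'static str {
        "plain" // provided behaviour, used unless an impl overrides it
    }
}

// Types that accept the provided behaviour need only an empty impl.
impl Encode for i32 {}
impl Encode for f64 {}

// A type that needs different behaviour overrides the method explicitly.
impl Encode for bool {
    fn encoding_name(&self) -> &'static str {
        "bit-packed"
    }
}

fn describe<T: Encode>(value: T) -> &'static str {
    value.encoding_name()
}

fn main() {
    assert_eq!(describe(1_i32), "plain");
    assert_eq!(describe(1.5_f64), "plain");
    assert_eq!(describe(true), "bit-packed");
}
```

Unlike specialization, this approach has no blanket `impl<T>`: every supported type must be listed, by hand or through a macro.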





[jira] [Resolved] (ARROW-9888) [Rust] [DataFusion] ExecutionContext can not be shared between threads

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9888.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8082
[https://github.com/apache/arrow/pull/8082]

> [Rust] [DataFusion] ExecutionContext can not be shared between threads
> --
>
> Key: ARROW-9888
> URL: https://issues.apache.org/jira/browse/ARROW-9888
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As suggested by Jorge on  https://github.com/apache/arrow/pull/8079
> The high level idea is to allow ExecutionContext on multi-threaded 
> environments such as Python.
> The two use-cases:
> 1. when a project is building a complex set of plans that depend on a 
> common set of sources and UDFs, it would be nice to be able to multi-thread 
> the planning. This is particularly important when planning requires reading 
> remote metadata (e.g. when the source is in S3 with 
> many partitions). Metadata reading is often slow and network-bound, which 
> makes threads suitable for these workloads. If multi-threading is not 
> possible, either each plan needs to read the metadata independently (one 
> context per plan) or planning must be sequential (with lots of network 
> waiting).
> 2. when creating bindings to programming languages that support 
> multi-threading, it would be nice for the ExecutionContext to be thread safe, 
> so that we can more easily integrate with those languages.
> The code might look like:
> {code}
> alamb@MacBook-Pro rust % git diff
> diff --git a/rust/datafusion/src/execution/context.rs 
> b/rust/datafusion/src/execution/context.rs
> index 5f8aa342e..7374b0a78 100644
> --- a/rust/datafusion/src/execution/context.rs
> +++ b/rust/datafusion/src/execution/context.rs
> @@ -460,7 +460,7 @@ mod tests {
>  use arrow::array::{ArrayRef, Int32Array};
>  use arrow::compute::add;
>  use std::fs::File;
> -use std::io::prelude::*;
> +use std::{sync::Mutex, io::prelude::*};
>  use tempdir::TempDir;
>  use test::*;
>  
> @@ -928,6 +928,28 @@ mod tests {
>  Ok(())
>  }
>  
> +#[test]
> +fn send_context_to_threads() -> Result<()> {
> +// ensure that ExecutionContext's can be read by multiple threads 
> concurrently
> +let tmp_dir = TempDir::new("send_context_to_threads")?;
> +let partition_count = 4;
> +let mut ctx = Arc::new(Mutex::new(create_ctx(&tmp_dir, 
> partition_count)?));
> +
> +let threads: Vec<JoinHandle<Result<_>>> = (0..2)
> +.map(|_| { ctx.clone() })
> +.map(|ctx_clone| thread::spawn(move || {
> +let ctx = ctx_clone.lock().expect("Locked context");
> +// Ensure we can create logical plan code on a separate 
> thread.
> +ctx.create_logical_plan("SELECT c1, c2 FROM test WHERE c1 > 
> 0 AND c1 < 3")
> +}))
> +.collect();
> +
> +for thread in threads {
> +thread.join().expect("Failed to join thread")?;
> +}
> +Ok(())
> +}
> +
>  #[test]
>  fn scalar_udf() -> Result<()> {
>  let schema = Schema::new(vec![
> {code}
> At the moment, Rust refuses to compile this example (and also refuses to 
> share ExecutionContexts between threads) because there are several `dyn` 
> objects that are not marked as Send + Sync:
> {code}
>Compiling datafusion v2.0.0-SNAPSHOT 
> (/Users/alamb/Software/arrow/rust/datafusion)
> error[E0277]: `(dyn execution::physical_plan::PhysicalPlanner + 'static)` 
> cannot be sent between threads safely
>--> datafusion/src/execution/context.rs:940:30
> |
> 940 | .map(|ctx_clone| thread::spawn(move || {
> |  ^ `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)` cannot be sent between 
> threads safely
> | 
>::: 
> /Users/alamb/.rustup/toolchains/nightly-2020-04-22-x86_64-apple-darwin/lib/rustlib/src/rust/src/libstd/thread/mod.rs:616:8
> |
> 616 | F: Send + 'static,
> | required by this bound in `std::thread::spawn`
> |
> = help: the trait `std::marker::Send` is not implemented for `(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)`
> = note: required because of the requirements on the impl of 
> `std::marker::Send` for `std::sync::Arc<(dyn 
> execution::physical_plan::PhysicalPlanner + 'static)>`
> = note: required because it appears within the type 
> `std::option::Option execution::physical_plan::PhysicalPlanner + 

[jira] [Resolved] (ARROW-9900) [Rust][DataFusion] Use Arc<> instead of Box<> in LogicalPlan

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9900.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8098
[https://github.com/apache/arrow/pull/8098]

> [Rust][DataFusion] Use Arc<> instead of Box<> in LogicalPlan
> 
>
> Key: ARROW-9900
> URL: https://issues.apache.org/jira/browse/ARROW-9900
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The idea is to continue to simplify the code and improve performance: the 
> inputs to nodes are often copied and using Box requires unnecessary deep 
> copies
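The effect on a plan tree can be sketched with a hypothetical two-variant `Plan` enum. DataFusion's `LogicalPlan` is much richer, and the helpers `limit`, `input_of` and `render` are invented for this example.

```rust
use std::sync::Arc;

// Hypothetical reduced plan type with a shared (Arc) input.
#[derive(Debug, Clone)]
enum Plan {
    Scan { table: String },
    Limit { n: usize, input: Arc<Plan> },
}

/// Wraps an existing plan in a Limit node without copying the input subtree.
fn limit(input: &Arc<Plan>, n: usize) -> Plan {
    Plan::Limit {
        n,
        input: Arc::clone(input),
    }
}

/// Returns the input of a Limit node, if there is one.
fn input_of(plan: &Plan) -> Option<&Arc<Plan>> {
    match plan {
        Plan::Limit { input, .. } => Some(input),
        Plan::Scan { .. } => None,
    }
}

/// Renders the plan as a single line, outermost node first.
fn render(plan: &Plan) -> String {
    match plan {
        Plan::Scan { table } => format!("Scan({})", table),
        Plan::Limit { n, input } => format!("Limit({}) <- {}", n, render(input)),
    }
}

fn main() {
    let scan = Arc::new(Plan::Scan {
        table: "test".to_string(),
    });
    let a = limit(&scan, 10);
    // The derived Clone copies the Arc pointer, not the Scan subtree.
    let b = a.clone();

    assert!(Arc::ptr_eq(input_of(&a).unwrap(), input_of(&b).unwrap()));
    assert_eq!(Arc::strong_count(&scan), 3);
    assert_eq!(render(&b), "Limit(10) <- Scan(test)");
}
```

With `Box<Plan>` as the input type, the same `a.clone()` would have to duplicate the whole subtree.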





[jira] [Resolved] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9892.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8090
[https://github.com/apache/arrow/pull/8090]

> [Rust] [DataFusion] Add support for concat
> --
>
> Key: ARROW-9892
> URL: https://issues.apache.org/jira/browse/ARROW-9892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> So that we can concatenate strings together.
> {{pub fn concat(args: Vec<Expr>) -> Expr}}
>  





[jira] [Resolved] (ARROW-9583) [Rust] Offset is mishandled in arithmetic and boolean compute kernels

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9583.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7854
[https://github.com/apache/arrow/pull/7854]

> [Rust] Offset is mishandled in arithmetic and boolean compute kernels
> -
>
> Key: ARROW-9583
> URL: https://issues.apache.org/jira/browse/ARROW-9583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Several compute kernels create the resulting ArrayData with the same offset 
> as one of the operands. Instead this offset should be 0, since the buffer is 
> freshly constructed with the correct length.
> Example of one failing test:
>  
> {code:java}
> #[test]
> fn test_primitive_array_add_sliced() {
>     let a = Int32Array::from(vec![0, 0, 0, 5, 6, 7, 8, 9, 0]);
>     let b = Int32Array::from(vec![0, 0, 0, 6, 7, 8, 9, 8, 0]);
>     let a = a.slice(3, 5);
>     let b = b.slice(3, 5);
>     // downcast the sliced arrays back to Int32Array
>     let a = a.as_any().downcast_ref::<Int32Array>().unwrap();
>     let b = b.as_any().downcast_ref::<Int32Array>().unwrap();
>     assert_eq!(5, a.value(0));
>     assert_eq!(6, b.value(0));
>     let c = add(&a, &b).unwrap();
>     assert_eq!(5, c.len());
>     assert_eq!(11, c.value(0));
>     assert_eq!(13, c.value(1));
>     assert_eq!(15, c.value(2));
>     assert_eq!(17, c.value(3));
>     assert_eq!(17, c.value(4));
> }
>  {code}
> Additionally, the boolean kernels seem to require that both operands have the 
> same offset. This shouldn't be needed, but it seems that the simd 
> implementation requires that the offset is a multiple of 8 (bits) so that the 
> operation works correctly on whole bytes. The scalar implementation should be 
> fine with any offset.
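The offset rule for arithmetic kernels can be modelled without the arrow crate. In the sketch below, `SlicedI32` is a hypothetical array type consisting of a shared buffer plus an offset and a length, and `add` is a simplified stand-in for the kernel; neither is the real arrow API.

```rust
use std::sync::Arc;

// Hypothetical model of a sliced array: shared buffer, offset and length.
struct SlicedI32 {
    buffer: Arc<Vec<i32>>,
    offset: usize,
    len: usize,
}

impl SlicedI32 {
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        SlicedI32 { buffer: Arc::new(values), offset: 0, len }
    }

    /// Zero-copy slice: same buffer, shifted offset.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        SlicedI32 {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// Reads element `i` relative to this array's own offset.
    fn value(&self, i: usize) -> i32 {
        assert!(i < self.len, "index out of bounds");
        self.buffer[self.offset + i]
    }
}

/// Element-wise add. The output buffer is freshly built and holds exactly
/// `len` values, so the result's offset is 0 whatever the operands' offsets.
fn add(a: &SlicedI32, b: &SlicedI32) -> Option<SlicedI32> {
    if a.len != b.len {
        return None;
    }
    let values: Vec<i32> = (0..a.len).map(|i| a.value(i) + b.value(i)).collect();
    Some(SlicedI32::new(values))
}

fn main() {
    let a = SlicedI32::new(vec![0, 0, 0, 5, 6, 7, 8, 9, 0]).slice(3, 5);
    let b = SlicedI32::new(vec![0, 0, 0, 6, 7, 8, 9, 8, 0]).slice(3, 5);
    let c = add(&a, &b).unwrap();

    assert_eq!(c.offset, 0);
    assert_eq!(c.len, 5);
    assert_eq!(c.value(0), 11);
    assert_eq!(c.value(4), 17);
}
```

Copying an operand's offset of 3 into the result would make `c.value(0)` read index 3 of the new five-element buffer, which is the kind of error the failing test exposes.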





[jira] [Updated] (ARROW-9900) [Rust][DataFusion] Use Arc<> instead of Box<> in LogicalPlan

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9900:
--
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] Use Arc<> instead of Box<> in LogicalPlan
> 
>
> Key: ARROW-9900
> URL: https://issues.apache.org/jira/browse/ARROW-9900
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The idea is to continue to simplify the code and improve performance: the 
> inputs to nodes are often copied and using Box requires unnecessary deep 
> copies





[jira] [Resolved] (ARROW-9885) [Rust] [DataFusion] Simplify code of type coercion for binary types

2020-09-02 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9885.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8076
[https://github.com/apache/arrow/pull/8076]

> [Rust] [DataFusion] Simplify code of type coercion for binary types
> ---
>
> Key: ARROW-9885
> URL: https://issues.apache.org/jira/browse/ARROW-9885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The function `numerical_coercion` only uses the operator `op` for its error 
> formatting. But the function's intent can be simply generalized to "coerce 
> two types to numerically equivalent types".
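The separation suggested here, a coercion function that knows nothing about the operator plus a caller that formats the error, can be sketched as follows. The `DataType` subset, the "widest type wins" rule and both function names are assumptions made for illustration; DataFusion's actual coercion rules may differ.

```rust
// Hypothetical subset of data types; numeric variants are declared from
// narrowest to widest so the derived ordering can pick the wider one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

fn is_numeric(t: DataType) -> bool {
    !matches!(t, DataType::Utf8)
}

/// Coerces two types to a common numeric type, with no knowledge of the
/// operator. Returns None when either side is not numeric.
fn numerical_coercion_sketch(lhs: DataType, rhs: DataType) -> Option<DataType> {
    if is_numeric(lhs) && is_numeric(rhs) {
        Some(lhs.max(rhs)) // assumption: the wider type wins
    } else {
        None
    }
}

/// The caller owns the operator and uses it only to build the error message.
fn coerce_for_op(op: &str, lhs: DataType, rhs: DataType) -> Result<DataType, String> {
    numerical_coercion_sketch(lhs, rhs)
        .ok_or_else(|| format!("'{:?} {} {:?}' cannot be evaluated", lhs, op, rhs))
}

fn main() {
    assert_eq!(
        numerical_coercion_sketch(DataType::Int32, DataType::Float64),
        Some(DataType::Float64)
    );
    assert_eq!(
        coerce_for_op("+", DataType::Int64, DataType::Float32),
        Ok(DataType::Float32)
    );
    assert!(coerce_for_op("+", DataType::Utf8, DataType::Int32).is_err());
}
```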





[jira] [Resolved] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9821.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8085
[https://github.com/apache/arrow/pull/8085]

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The basic goal is to  allow users to implement their own PlanNodes. I will 
> provide a google doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Assigned] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-9821:
-

Assignee: Andrew Lamb

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The basic goal is to  allow users to implement their own PlanNodes. I will 
> provide a google doc opened for comments shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E





[jira] [Updated] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9889:
--
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Resolved] (ARROW-9889) [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with "Unsupported logical plan variant"

2020-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9889.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8083
[https://github.com/apache/arrow/pull/8083]

> [Rust][DataFusion] Datafusion CLI: CREATE EXTERNAL TABLE errors with 
> "Unsupported logical plan variant"
> ---
>
> Key: ARROW-9889
> URL: https://issues.apache.org/jira/browse/ARROW-9889
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I try to make an external table using the DataFusion CLI I get an error:
> h3. Reproducer:
> # Check out master
> # Build via {{cd arrow/rust; cargo run --bin datafusion-cli}}
> # run this query: {{create external table test(c1 boolean) stored as CSV 
> location '/tmp/foo'}}
> *Expected Result*: An external table is created successfully
> *Actual Result*: An error is reported:
> {code}
> >create external table test(c1 boolean) stored as CSV location '/tmp/foo';
> General("Unsupported logical plan variant")
> >
> {code}





[jira] [Resolved] (ARROW-9751) [Rust] [DataFusion] Extend UDFs to accept more than one type per argument

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9751.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7967
[https://github.com/apache/arrow/pull/7967]

> [Rust] [DataFusion] Extend UDFs to accept more than one type per argument
> -
>
> Key: ARROW-9751
> URL: https://issues.apache.org/jira/browse/ARROW-9751
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> Most math functions accept float32 and float64, `length` will accept Utf8 and 
> lists soon, etc.
> The goal of this story is to allow UDFs to accept more than one datatype.
> Design: the accepted datatypes should be a vector ordered by "faster/smaller" 
> to "slower/larger" (cpu/memory). When the plan reaches a UDF, we try to cast 
> the input expression like before, from "faster/smaller" to "slower/larger".
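
The design above could look roughly like the following sketch. The DataType enum, the can_cast rules, and the function name resolve_argument_type are hypothetical and chosen for illustration; the real planner and cast rules may differ.

```rust
// Hypothetical, reduced set of types; not arrow's DataType.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Int32,
    Float32,
    Float64,
    Utf8,
}

// Assumed cast rules for this sketch: identity casts plus widening to Float64.
fn can_cast(from: DataType, to: DataType) -> bool {
    use DataType::*;
    from == to || matches!((from, to), (Int32, Float64) | (Float32, Float64))
}

/// `accepted` is ordered from "faster/smaller" to "slower/larger"; the first
/// type the input can be cast to wins.
fn resolve_argument_type(input: DataType, accepted: &[DataType]) -> Option<DataType> {
    accepted
        .iter()
        .copied()
        .find(|&candidate| can_cast(input, candidate))
}
```

With accepted types [Float32, Float64], a Float32 input stays Float32, an Int32 input is cast to Float64, and a Utf8 input is rejected.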





[jira] [Resolved] (ARROW-9895) [RUST] Improve sort kernels

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9895.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8092
[https://github.com/apache/arrow/pull/8092]

> [RUST] Improve sort kernels
> ---
>
> Key: ARROW-9895
> URL: https://issues.apache.org/jira/browse/ARROW-9895
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Followup from my mailing list post:
> {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treat NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly two
> options
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
> {quote}
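
The second option in the quote (sorting several times with a stable sort instead of using one multi-column comparator) can be sketched in plain Rust. This is an illustration on i32 slices under my own naming, not the arrow kernels; it relies on the standard library's slice sort being stable.

```rust
/// Lexicographic sort of row indices by (a, b) using a single comparator.
fn lexsort_indices(a: &[i32], b: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..a.len()).collect();
    idx.sort_by(|&i, &j| a[i].cmp(&a[j]).then(b[i].cmp(&b[j])));
    idx
}

/// The same ordering from two stable passes: first by the secondary key b,
/// then by the primary key a. Ties on a keep the order produced by the b pass.
fn multipass_indices(a: &[i32], b: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..a.len()).collect();
    idx.sort_by_key(|&i| b[i]);
    idx.sort_by_key(|&i| a[i]);
    idx
}
```

For `ORDER BY a, b` both functions return the same index vector; the multi-pass version avoids a combined comparator at the cost of walking the index vector once per column.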
> My plan is to open a draft PR with the following changes:
>  - {{sort_to_indices}} further splits up float64/float32 inputs into 
> nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices 
> according to the sort options. NaNs are distinct from nulls and sort greater 
> than any other valid value
> - implement a sort method for dictionary arrays with string values. this 
> kernel checks the {{is_ordered}} flag and sorts just by the keys if it is 
> set, it will look up the string values otherwise
> - for the lexical sort use case the above kernels are not used; instead the 
> {{OrdArray}} trait is used. To make that more flexible and allow wrapping 
> arrays with different ordering behavior I will make it no longer extend 
> {{Array}} and instead only contain the {{cmp_value}} method
> - string dictionary sorting can then be implemented with a wrapper struct 
> {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
> - NaN aware sorting of floats can also be implemented with a wrapper struct 
> and trait implementation
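
The first item of the plan (splitting float input into nulls, non-NaN values and NaNs) can be sketched as below. This is a simplified illustration over a slice of Option<f64>, where None stands in for a null; the function signature and the two boolean options are assumptions and do not match the actual arrow sort kernel API.

```rust
use std::cmp::Ordering;

/// Returns the row indices in sorted order. NaN sorts greater than every
/// other valid value and is kept distinct from null.
fn sort_to_indices(values: &[Option<f64>], descending: bool, nulls_first: bool) -> Vec<usize> {
    let mut nulls = Vec::new();
    let mut nans = Vec::new();
    let mut valids = Vec::new();
    for (i, v) in values.iter().enumerate() {
        match v {
            None => nulls.push(i),
            Some(x) if x.is_nan() => nans.push(i),
            Some(_) => valids.push(i),
        }
    }

    // Only non-NaN values are left, so partial_cmp always yields an ordering.
    valids.sort_by(|&i, &j| {
        let (x, y) = (values[i].unwrap(), values[j].unwrap());
        x.partial_cmp(&y).unwrap_or(Ordering::Equal)
    });

    // Ascending order is non-NaN values followed by NaNs; descending reverses it.
    let mut ordered = valids;
    ordered.extend(nans);
    if descending {
        ordered.reverse();
    }

    // Finally place the nulls according to the sort options.
    let mut out = Vec::with_capacity(values.len());
    if nulls_first {
        out.extend(nulls);
        out.extend(ordered);
    } else {
        out.extend(ordered);
        out.extend(nulls);
    }
    out
}
```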





[jira] [Created] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-10215:
--

 Summary: [Rust] [DataFusion] Rename "Source" typedef
 Key: ARROW-10215
 URL: https://issues.apache.org/jira/browse/ARROW-10215
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 2.0.0


The name "Source" for this type doesn't make sense to me. I would like to 
discuss alternate names for it.
{code:java}
type Source = Box; {code}
My first thoughts are:
 * RecordBatchIterator
 * RecordBatchStream
 * SendableRecordBatchReader





[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210358#comment-17210358
 ] 

Andy Grove commented on ARROW-10226:


Here is a test case to reproduce the issue. I uploaded the parquet file to 
dropbox. It is ~100MB.

[https://www.dropbox.com/s/6cpz1h9juxl4c7t/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet?dl=0]

[~jorgecarleitao] Thanks for the offer of help. I don't know how much time we 
should spend on this, but if you have the time to take a look, at least to 
confirm the test also fails for you, that would be an extra data point. 
{code:java}
#[test]
fn foo() {
    use arrow::array::Array;
    use crate::arrow::arrow_reader::ArrowReader;

    let file = std::fs::File::open(
        "/mnt/tpch/debug/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet",
    )
    .unwrap();
    let file_reader = Rc::new(SerializedFileReader::new(file).unwrap());
    let metadata = file_reader
        .metadata()
        .file_metadata()
        .key_value_metadata()
        .as_ref()
        .unwrap();

    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
    let schema = arrow_reader.get_schema().unwrap();
    // Read only a subset of the columns, in batches of 40960 rows.
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let mut batch_reader =
        arrow_reader.get_record_reader_by_columns(projection, 40960).unwrap();

    while let Some(batch) = batch_reader.next() {
        let batch = batch.unwrap();

        // Count values longer than one character; l_returnflag should be a
        // single character.
        let mut n = 0;
        match batch.column(4).as_any().downcast_ref::() {
            Some(l_returnflag) => {
                for i in 0..batch.num_rows() {
                    if l_returnflag.is_valid(i) {
                        if l_returnflag.value(i).len() > 1 {
                            n += 1;
                        }
                    }
                }
            }
            None => println!("l_returnflag is not a string"),
        }
        println!("{} bad values in batch", n);
        assert_eq!(n, 0);
    }
}
 {code}

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210341#comment-17210341
 ] 

Andy Grove commented on ARROW-10226:


{code:java}
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 
bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49880 
bad values in batch

part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 
bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49979 
bad values in batch

part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 374998 
bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50031 
bad values in batch

part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad 
values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375002 
bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50110 
bad values in batch {code}

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210328#comment-17210328
 ] 

Andy Grove commented on ARROW-10226:


Just tracking progress with debugging this. The issue is that the projection is 
behaving differently PER BATCH within these Parquet files. We expect 
l_returnflag to be a single char but sometimes the parquet reader is returning 
the contents of the l_comment field instead.
{code:java}
 
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: A
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: s among the fluffily r
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: eposits a
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet]
 first non-null value for l_returnflag in this batch: y ironic foxes above t
{code}
 

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Summary: [Rust] [Parquet] Parquet reader reading wrong columns in some 
batches within a parquet file  (was: [Rust] [DataFusion] TPC-H query 1 no 
longer completes for 100GB dataset)

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Closed] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-10226.
--
Resolution: Fixed

Although Spark produces the correct result when I run an aggregate query 
against this parquet file, it too shows bad values when I just query the 
l_returnflag column so it appears that the files are corrupt and Spark skips 
the bad rows when building the aggregate? I will keep looking into this but I 
no longer think this is a bug that we need to spend time on.

 

fyi [~jorgecarleitao]

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Updated] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Priority: Major  (was: Blocker)

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210314#comment-17210314
 ] 

Andy Grove commented on ARROW-10226:


[~npr] Sure, I changed to major, but my plan was to resolve the issue before we 
release tomorrow.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210516#comment-17210516
 ] 

Andy Grove commented on ARROW-10240:


Great idea, [~jhorstmann]. Do you want me to take care of this or are you 
planning on working on it? I could do this tonight.

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile 
> and find bottlenecks.





[jira] [Resolved] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10015.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8370
[https://github.com/apache/arrow/pull/8370]

> [Rust] Implement SIMD for aggregate kernel sum
> --
>
> Key: ARROW-10015
> URL: https://issues.apache.org/jira/browse/ARROW-10015
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, our aggregations are made in a simple loop. However, as described 
> [here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html], 
> horizontal operations can also use SIMD, with reported speedups of 2.7x.
> The goal of this improvement is to support SIMD for the "sum", for primitive 
> types.
> The code to modify is in 
> [here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs].
>  A good indication that this issue is completed is when the script
> {{cargo bench --bench aggregate_kernels && cargo bench --bench 
> aggregate_kernels --features simd}}
> yields a speed-up.
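
The idea of combining vertical accumulation with a final horizontal reduction can be sketched in plain stable Rust. This is an assumption-laden illustration, not the kernel from the pull request and not the packed_simd API: it keeps LANES independent accumulators so that the compiler is free to vectorize the inner loop, and the lane count of 8 is an arbitrary choice.

```rust
// Number of independent accumulators; 8 is an arbitrary choice for this sketch.
const LANES: usize = 8;

/// Baseline: a simple loop over all values.
fn sum_scalar(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Lane-wise sum: accumulate vertically into LANES partial sums, then reduce
/// the partial sums horizontally and add the leftover tail.
fn sum_lanes(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    for chunk in chunks {
        for lane in 0..LANES {
            acc[lane] += chunk[lane];
        }
    }
    let horizontal: f32 = acc.iter().sum();
    horizontal + remainder.iter().sum::<f32>()
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general input; they agree exactly when every partial sum is exactly representable, as with small integers.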
>  





[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210531#comment-17210531
 ] 

Andy Grove commented on ARROW-10240:


On second thoughts, I might not be able to get to this right away.

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile 
> and find bottlenecks.





[jira] [Updated] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10164:
---
Component/s: Rust

> [Rust] Add support for DictionaryArray types to cast kernels
> 
>
> Key: ARROW-10164
> URL: https://issues.apache.org/jira/browse/ARROW-10164
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my 
> use case is DictionaryArrays with a Utf8 dictionary). 
> There is prototype work on 
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary





[jira] [Resolved] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10164.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8346
[https://github.com/apache/arrow/pull/8346]

> [Rust] Add support for DictionaryArray types to cast kernels
> 
>
> Key: ARROW-10164
> URL: https://issues.apache.org/jira/browse/ARROW-10164
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my 
> use case is DictionaryArrays with a Utf8 dictionary). 
> There is prototype work on 
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary





[jira] [Resolved] (ARROW-10043) [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing COUNT(DISTINCT)

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10043.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8222
[https://github.com/apache/arrow/pull/8222]

> [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing 
> COUNT(DISTINCT)
> 
>
> Key: ARROW-10043
> URL: https://issues.apache.org/jira/browse/ARROW-10043
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust, Rust - DataFusion
>Reporter: Daniel Russo
>Assignee: Daniel Russo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> I am unsure where support for {{DISTINCT}} may be on the DataFusion roadmap, 
> so I've filed this with the "Wish" type and "Minor" priority to reflect that 
> this is a proposal:
> Introduce {{DISTINCT}} into DataFusion by partially implementing 
> {{COUNT(DISTINCT)}}. The ultimate goal is to fully support the {{DISTINCT}} 
> keyword, but to get implementation started, limit the scope of this work to:
>  * the {{COUNT()}} aggregate function
>  * a single expression in {{COUNT()}}, i.e., {{COUNT(DISTINCT c1)}}, but not 
> {{COUNT(DISTINCT c1, c2)}}
>  * only queries with a {{GROUP BY}} clause
>  * integer types
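
The limited scope listed above (one integer expression, always under a GROUP BY) can be sketched in plain Rust as follows. This is only an illustration of the semantics, not DataFusion's implementation; the function name and the row representation are assumptions made for the example.

```rust
use std::collections::{BTreeMap, HashSet};

/// Count the distinct non-null values of `c1` per group key, i.e. the result of
/// `SELECT g, COUNT(DISTINCT c1) FROM t GROUP BY g` for rows given as `(g, c1)`.
pub fn count_distinct_by_group(rows: &[(i64, Option<i64>)]) -> BTreeMap<i64, u64> {
    let mut seen: BTreeMap<i64, HashSet<i64>> = BTreeMap::new();
    for (group, value) in rows {
        // Every group appears in the output, even if all its values are NULL.
        let set = seen.entry(*group).or_default();
        // SQL COUNT ignores NULLs, so only non-null values are recorded.
        if let Some(v) = value {
            set.insert(*v);
        }
    }
    seen.into_iter().map(|(g, s)| (g, s.len() as u64)).collect()
}
```

Keeping one set per group is the simplest approach; it trades memory for simplicity, which is one reason a first implementation might restrict itself to integer types.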





[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210600#comment-17210600
 ] 

Andy Grove commented on ARROW-10242:


Hi [~joshx]  and thanks for the bug report. I was unable to reproduce the issue 
on any of the parquet data sets that I usually test with, but they are simple 
data sets containing primitive types. My first guess here is that there is 
something in the files that DataFusion doesn't support and the error message is 
being suppressed, but this is just a guess. Do your files contain nested types?

 

Do you see any other errors before the disconnected channel error?
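
For context, a "disconnected channel" style error generally means the receiving end went away before the sender finished. The following minimal sketch reproduces that situation with `std::sync::mpsc`, which is used here only as a stand-in; whichever channel type DataFusion uses internally, and the exact wording of its error, may differ. The function name is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Run a producer thread that sends one value. When `consumer_alive` is false the
/// receiver is dropped before the producer runs, so the send fails.
pub fn producer_result(consumer_alive: bool) -> Result<(), String> {
    let (tx, rx) = mpsc::channel::<u32>();
    // Either keep the receiver around or drop it up front (the consumer "hangs up").
    let rx = if consumer_alive {
        Some(rx)
    } else {
        drop(rx);
        None
    };
    let handle = thread::spawn(move || tx.send(42).map_err(|e| e.to_string()));
    let result = handle.join().expect("producer thread panicked");
    drop(rx); // the receiver, if any, outlives the send
    result
}
```

If the consumer stops early because of an earlier, unreported error, every remaining producer would report a send failure like this one, which would be consistent with seeing the message many times.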

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of ~210 Parquet files (3.2 GB total, 
> each file around 13-18 MB), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data  {code}
> If I change them to absolute paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
>  
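
A relative path is resolved against whatever the current working directory happens to be when a test runs, which may explain why the absolute form works. The ticket does not state the root cause, so the sketch below only shows one way test code could anchor a relative setting to a known base directory; `absolutize` and `test_data_dir` are hypothetical helpers, not functions from the Arrow code base, and the example assumes Unix-style paths.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Resolve `raw` against `base` unless it is already absolute.
pub fn absolutize(base: &Path, raw: &str) -> PathBuf {
    let p = Path::new(raw);
    if p.is_absolute() {
        p.to_path_buf()
    } else {
        base.join(p)
    }
}

/// Read a test-data directory from the environment variable `var` and anchor it
/// to the current working directory. Returns None if the variable is unset.
pub fn test_data_dir(var: &str) -> Option<PathBuf> {
    let raw = env::var(var).ok()?;
    let cwd = env::current_dir().ok()?;
    Some(absolutize(&cwd, &raw))
}
```

For example, `test_data_dir("ARROW_TEST_DATA")` would turn `../testing/data` into a path rooted at the directory the process was started from.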





[jira] [Updated] (ARROW-9847) [Rust] Inconsistent use of import arrow:: vs crate::arrow::

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9847:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Inconsistent use of import arrow:: vs crate::arrow::
> ---
>
> Key: ARROW-9847
> URL: https://issues.apache.org/jira/browse/ARROW-9847
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Both the DataFusion and Parquet crates have a mix of "import arrow::" and 
> "import crate::arrow::" and we should standardize on one or the other.
>  
> Whichever standard we use should be enforced in build.rs so CI fails on PRs 
> that do not follow the standard.
>  





[jira] [Assigned] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10242:
--

Assignee: Andy Grove

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of ~210 Parquet files (3.2 GB total, 
> each file around 13-18 MB), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}





[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210603#comment-17210603
 ] 

Andy Grove commented on ARROW-9553:
---

Actually, it has two separate dependencies on arrow, in [dependencies] and 
[dev-dependencies], with a different format in each.

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> After rebasing the master the rust builds have started to fail.
> The solution is to bump a version number here 
> https://github.com/apache/arrow/pull/7829





[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210602#comment-17210602
 ] 

Andy Grove commented on ARROW-9553:
---

The release-test script is looking for this pattern:
{code:java}
["-arrow = { path = \"../arrow\", version = \"#{@snapshot_version}\" }",
 "+arrow = { path = \"../arrow\", version = \"#{@release_version}\" }"]
{code}
The parquet Cargo.toml does not match:
{code:java}
arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true } 
{code}
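
To make the mismatch concrete, here is a small sketch assuming the script compares whole lines exactly (an assumption; the actual release script is not shown here). `bump_exact` mimics the exact-line check and `bump_version` is a hypothetical, more tolerant alternative that only rewrites the version string.

```rust
/// Replace the line with `to` only when the whole line (ignoring trailing
/// whitespace) equals `from`. Returns None when nothing matched.
pub fn bump_exact(line: &str, from: &str, to: &str) -> Option<String> {
    if line.trim_end() == from {
        Some(to.to_string())
    } else {
        None
    }
}

/// Tolerant variant: rewrite only the `version = "..."` part of an arrow
/// dependency line, leaving extra keys such as `optional = true` untouched.
pub fn bump_version(line: &str, snapshot: &str, release: &str) -> Option<String> {
    let needle = format!("version = \"{}\"", snapshot);
    if line.starts_with("arrow = ") && line.contains(&needle) {
        let replacement = format!("version = \"{}\"", release);
        Some(line.replace(&needle, &replacement))
    } else {
        None
    }
}
```

With the parquet line quoted above, `bump_exact` finds nothing to replace because of the trailing `optional = true`, while `bump_version` updates the version in place.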

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> After rebasing the master the rust builds have started to fail.
> The solution is to bump a version number here 
> https://github.com/apache/arrow/pull/7829





[jira] [Resolved] (ARROW-10293) [Rust] [DataFusion] Fix benchmarks

2020-10-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10293.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8452
[https://github.com/apache/arrow/pull/8452]

> [Rust] [DataFusion] Fix benchmarks
> --
>
> Key: ARROW-10293
> URL: https://issues.apache.org/jira/browse/ARROW-10293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> They are only benchmarking planning, not execution.





[jira] [Resolved] (ARROW-10295) [Rust] [DataFusion] Simplify accumulators

2020-10-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10295.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8456
[https://github.com/apache/arrow/pull/8456]

> [Rust] [DataFusion] Simplify accumulators
> -
>
> Key: ARROW-10295
> URL: https://issues.apache.org/jira/browse/ARROW-10295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Replace Rc<RefCell<>> by Box<>.





[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9911:
--
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] SELECT  with no FROM clause should produce a 
> single row of output
> 
>
> Key: ARROW-9911
> URL: https://issues.apache.org/jira/browse/ARROW-9911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Minor
>
> This is somewhat of a special case, but it is useful for demonstration / 
> testing expressions. 
> A select expression with no FROM clause, such as "select 1", should produce a 
> single row. Today datafusion accepts the query but produces no rows.
> Actual output:
> {code}
> arrow/rust$ cargo run --release  --bin datafusion-cli 
> Finished release [optimized] target(s) in 0.25s
>  Running `target/release/datafusion-cli`
> > select 1 ;
> 0 rows in set. Query took 0 seconds.
> {code}
> Expected output is a single row, with the value 1. Here is an example using 
> SQLite:
> {code}
> $ sqlite3 
> SQLite version 3.28.0 2019-04-15 14:49:49
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> select 1;
> 1
> sqlite> 
> {code}





[jira] [Updated] (ARROW-4804) [Rust] Read temporal values from CSV

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4804:
--
Labels: beginner  (was: )

> [Rust] Read temporal values from CSV
> 
>
> Key: ARROW-4804
> URL: https://issues.apache.org/jira/browse/ARROW-4804
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: beginner
>
> CSV reader should support reading temporal values.
> Should support timestamp, date and time, with sane defaults provided for 
> schema inference.
> To keep inference performant, the user should provide a Vec of which 
> columns to try to convert to a temporal array.





[jira] [Updated] (ARROW-4803) [Rust] Read temporal values from JSON

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4803:
--
Labels: beginner  (was: )

> [Rust] Read temporal values from JSON
> -
>
> Key: ARROW-4803
> URL: https://issues.apache.org/jira/browse/ARROW-4803
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: beginner
>
> Ability to parse strings that look like timestamps to timestamp type. Need to 
> consider whether only timestamp type should be supported as most JSON 
> libraries stick to ISO8601. It might also be inefficient to use regex for 
> timestamps, so the user should provide a hint of which columns to convert to 
> timestamps





[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9911:
--
Labels: beginner  (was: )

> [Rust][DataFusion] SELECT  with no FROM clause should produce a 
> single row of output
> 
>
> Key: ARROW-9911
> URL: https://issues.apache.org/jira/browse/ARROW-9911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Minor
>  Labels: beginner
>
> This is somewhat of a special case, but it is useful for demonstration / 
> testing expressions. 
> A select expression with no FROM clause, such as "select 1", should produce a 
> single row. Today datafusion accepts the query but produces no rows.
> Actual output:
> {code}
> arrow/rust$ cargo run --release  --bin datafusion-cli 
> Finished release [optimized] target(s) in 0.25s
>  Running `target/release/datafusion-cli`
> > select 1 ;
> 0 rows in set. Query took 0 seconds.
> {code}
> Expected output is a single row, with the value 1. Here is an example using 
> SQLite:
> {code}
> $ sqlite3 
> SQLite version 3.28.0 2019-04-15 14:49:49
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> select 1;
> 1
> sqlite> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10181:
---
Summary: [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)  (was: 
[Rust] Arrow tests fail to compile on Raspberry Pi (ARM))

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
>  
> {code:java}
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:421:25
> |
> 421 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: `#[deny(overflowing_literals)]` on by default
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:422:29
> |
> 422 | assert_eq!(ceil(10, 100), 1);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:423:25
> |
> 423 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
>  {code}





[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208480#comment-17208480
 ] 

Andy Grove commented on ARROW-10187:


I was able to run the DataFusion examples though, despite these test failures.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}
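
The ticket does not state a root cause. One plausible explanation, offered here only as an assumption, is that the failing assertions compare memory sizes against values that were computed on a 64-bit machine; anything that contains pointers or `usize` fields is smaller on a 32-bit target. The sketch below shows how such an expectation can be derived from the target's pointer width instead of being hard-coded. The function names are hypothetical.

```rust
use std::mem::size_of;

/// Bytes occupied by the `Vec<T>` header itself (pointer, capacity, length),
/// excluding the heap allocation it points to.
pub fn vec_header_bytes<T>() -> usize {
    size_of::<Vec<T>>()
}

/// Expected header size derived from the pointer width of the compilation
/// target: 24 bytes on 64-bit targets, 12 bytes on 32-bit targets.
pub fn expected_vec_header_bytes() -> usize {
    3 * size_of::<usize>()
}
```

A test written as `assert_eq!(vec_header_bytes::<u8>(), expected_vec_header_bytes())` holds on both 32-bit and 64-bit targets, whereas comparing against a literal `24` would only hold on the latter.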





[jira] [Assigned] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10181:
--

Assignee: Andy Grove

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
>  
> {code:java}
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:421:25
> |
> 421 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: `#[deny(overflowing_literals)]` on by default
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:422:29
> |
> 422 | assert_eq!(ceil(10, 100), 1);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:423:25
> |
> 423 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
>  {code}





[jira] [Updated] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10181:
---
Description: 
Raspberry Pi still tends to use 32-bit operating systems although there is a 
beta 64 bit version of Raspbian. It would be nice to be able to at least 
disable these tests when running on 32-bit. 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
|
421 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: `#[deny(overflowing_literals)]` on by default
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
|
422 | assert_eq!(ceil(10, 100), 1);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
|
423 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
 {code}

  was:
 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
|
421 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: `#[deny(overflowing_literals)]` on by default
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
|
422 | assert_eq!(ceil(10, 100), 1);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
|
423 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
 {code}


> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Raspberry Pi still tends to use 32-bit operating systems although there is a 
> beta 64 bit version of Raspbian. It would be nice to be able to at least 
> disable these tests when running on 32-bit. 
> {code:java}
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:421:25
> |
> 421 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: `#[deny(overflowing_literals)]` on by default
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:422:29
> |
> 422 | assert_eq!(ceil(10, 100), 1);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:423:25
> |
> 423 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
>  {code}
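
One way to disable checks like these on 32-bit targets is conditional compilation on the pointer width. The sketch below is an illustration only: `ceil` is a stand-in with the same name as the helper in the error output, `large_value_checks` is a hypothetical function, and the literal `10_000_000_000` is chosen for the example because the literals in the archived output appear truncated.

```rust
/// Integer ceiling division; a stand-in for the `ceil` helper named in the
/// compiler output above.
pub fn ceil(value: usize, divisor: usize) -> usize {
    value / divisor + usize::from(value % divisor != 0)
}

/// Checks that use literals larger than `u32::MAX`. This item is only compiled
/// when `usize` is 64 bits wide, so the literal never reaches a 32-bit build.
#[cfg(target_pointer_width = "64")]
pub fn large_value_checks() -> bool {
    ceil(10_000_000_000, 10) == 1_000_000_000
}

/// On narrower targets the large-value checks are skipped.
#[cfg(not(target_pointer_width = "64"))]
pub fn large_value_checks() -> bool {
    true
}
```

The same `#[cfg(target_pointer_width = "64")]` attribute can be placed on an individual `#[test]` function, which removes that test from 32-bit builds without affecting the others.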





[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Description: 
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data  {code}
If I change them to absolute paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
 

  was:
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data  {code}
If I change them to relative paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
 


> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data  {code}
> If I change them to absolute paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
>  





[jira] [Created] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10186:
--

 Summary: [Rust] Tests fail when following instructions in README
 Key: ARROW-10186
 URL: https://issues.apache.org/jira/browse/ARROW-10186
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".

```bash

 export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
 export ARROW_TEST_DATA=../testing/data

```

If I change them to relative paths as follows then the tests pass:

 

```bash

export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data

export ARROW_TEST_DATA=`pwd`/../testing/data

```

 





[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10186:
---
Description: 
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".
{code:java}
export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=../testing/data  {code}
If I change them to relative paths as follows then the tests pass:
{code:java}
export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
 

  was:
If I follow the instructions from the README and set the test paths as follows, 
some of the IPC tests fail with "no such file or directory".

```bash

 export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
 export ARROW_TEST_DATA=../testing/data

```

If I change them to relative paths as follows then the tests pass:

 

```bash

export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data

export ARROW_TEST_DATA=`pwd`/../testing/data

```

 


> [Rust] Tests fail when following instructions in README
> ---
>
> Key: ARROW-10186
> URL: https://issues.apache.org/jira/browse/ARROW-10186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> If I follow the instructions from the README and set the test paths as 
> follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data  {code}
> If I change them to relative paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data  {code}
>  





[jira] [Commented] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208479#comment-17208479
 ] 

Andy Grove commented on ARROW-10188:


Thanks [~jorgecarleitao] .. my mistake, I had set the PARQUET_TEST_DATA path 
relative to the wrong directory in the terminal window where I was running the 
client. The flight example works for me now.

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.





[jira] [Created] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10187:
--

 Summary: [Rust] Test failures on 32 bit ARM (Raspberry Pi)
 Key: ARROW-10187
 URL: https://issues.apache.org/jira/browse/ARROW-10187
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


Perhaps these failures are to be expected and perhaps we can't really support 
32 bit?

 
{code:java}
 array::array::tests::test_primitive_array_from_vec stdout 
thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
'assertion failed: `(left == right)`
  left: `144`,
 right: `104`', arrow/src/array/array.rs:2383:9 
array::array::tests::test_primitive_array_from_vec_option stdout 
thread 'array::array::tests::test_primitive_array_from_vec_option' panicked at 
'assertion failed: `(left == right)`
  left: `224`,
 right: `176`', arrow/src/array/array.rs:2409:9 
array::null::tests::test_null_array stdout 
thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
`(left == right)`
  left: `64`,
 right: `32`', arrow/src/array/null.rs:134:9 
array::union::tests::test_dense_union_i32 stdout 
thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
failed: `(left == right)`
  left: `1024`,
 right: `768`', arrow/src/array/union.rs:704:9 
memory::tests::test_allocate stdout 
thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left == 
right)`
  left: `0`,
 right: `32`', arrow/src/memory.rs:243:13
{code}





[jira] [Created] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (ARM)

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10181:
--

 Summary: [Rust] Arrow tests fail to compile on Raspberry Pi (ARM)
 Key: ARROW-10181
 URL: https://issues.apache.org/jira/browse/ARROW-10181
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
 Fix For: 2.0.0


 
{code:java}
error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:421:25
|
421 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: `#[deny(overflowing_literals)]` on by default
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`

error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:422:29
|
422 | assert_eq!(ceil(10, 100), 1);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`

error: literal out of range for `usize`
   --> arrow/src/util/bit_util.rs:423:25
|
423 | assert_eq!(ceil(100, 10), 10);
| ^^^
|
= note: the literal `100` does not fit into the type `usize` whose 
range is `0..=4294967295`
 {code}
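The error arises because `usize` is only 32 bits wide on this target, so any literal above 4294967295 is rejected at compile time. A hedged sketch of one way to keep such a test case while still compiling on 32-bit targets follows. `ceil_div` and `large_case` are hypothetical names, and the literal `10_000_000_000` is an illustrative value chosen only because it exceeds the 32-bit range; it is not taken from the Arrow source.

```rust
/// Ceiling division: the smallest integer greater than or equal to
/// `value / divisor`. Panics if `divisor` is zero.
fn ceil_div(value: usize, divisor: usize) -> usize {
    let quotient = value / divisor;
    if value % divisor == 0 {
        quotient
    } else {
        quotient + 1
    }
}

/// A case whose literal only fits in a 64-bit `usize`, so it is compiled
/// only on 64-bit targets (illustrative value, not from the Arrow tests).
#[cfg(target_pointer_width = "64")]
fn large_case() -> usize {
    ceil_div(10_000_000_000, 10)
}
```

An alternative under the same assumptions would be to write the large cases against `u64` so they compile everywhere, at the cost of not exercising the `usize` signature.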





[jira] [Created] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-05 Thread Andy Grove (Jira)
Andy Grove created ARROW-10188:
--

 Summary: [Rust] [DataFusion] Some examples are broken
 Key: ARROW-10188
 URL: https://issues.apache.org/jira/browse/ARROW-10188
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 2.0.0


The flight server example fails with "No such file or directory".

The dataframe example produces an empty result set.

The simple_udaf example produces no output at all.





[jira] [Resolved] (ARROW-8735) [Rust] [Parquet] Parquet crate fails to compile on Arm architecture

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8735.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8338
[https://github.com/apache/arrow/pull/8338]

> [Rust] [Parquet] Parquet crate fails to compile on Arm architecture
> ---
>
> Key: ARROW-8735
> URL: https://issues.apache.org/jira/browse/ARROW-8735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm trying to compile the project in Raspbian, on a Raspberry Pi and the 
> build fails:
> {code:java}
> error[E0308]: mismatched types
>   --> /home/pi/git/arrow/rust/parquet/src/util/hash_util.rs:26:37
>|
> 26 | fn hash_(data: &[u8], seed: u32) -> u32 {
>|-^^^ expected `u32`, found `()`
>||
>|implicitly returns `()` as its body has no tail or `return` expression
>  {code}
> This method is only implemented for x86, x86_64 and aarch64.
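The compiler message follows from a function body made up entirely of `cfg`-gated blocks: on an architecture that matches none of them, the body is empty and evaluates to `()`. A minimal sketch of the usual remedy, a catch-all fallback, is shown below. This is not the fix from the pull request; `portable_hash` and `hash_bytes` are hypothetical names, and FNV-1a is swapped in purely as a simple portable hash for illustration.

```rust
/// Portable FNV-1a over `data`, with `seed` mixed into the initial state.
/// (Illustrative choice of hash, not what the parquet crate uses.)
fn portable_hash(data: &[u8], seed: u32) -> u32 {
    const OFFSET_BASIS: u32 = 0x811c_9dc5;
    const PRIME: u32 = 0x0100_0193;
    data.iter().fold(OFFSET_BASIS ^ seed, |hash, &byte| {
        (hash ^ u32::from(byte)).wrapping_mul(PRIME)
    })
}

/// Variant for the architectures named in the report. A specialised
/// implementation would go here; the sketch simply delegates.
#[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
fn hash_bytes(data: &[u8], seed: u32) -> u32 {
    portable_hash(data, seed)
}

/// Catch-all variant, so every other target (32-bit ARM included) still
/// gets a function body that returns a `u32`.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
fn hash_bytes(data: &[u8], seed: u32) -> u32 {
    portable_hash(data, seed)
}
```

Because the two `cfg` predicates are exact complements, exactly one `hash_bytes` definition exists on any target.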





[jira] [Updated] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-05 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10187:
---
Fix Version/s: (was: 2.0.0)

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}





[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-06 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208807#comment-17208807
 ] 

Andy Grove commented on ARROW-10187:


[~nevi_me] [~vertexclique] I'd be interested in your opinions on this one.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}





[jira] [Resolved] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10188.

Resolution: Fixed

Issue resolved by pull request 8355
[https://github.com/apache/arrow/pull/8355]

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.





[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-06 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208805#comment-17208805
 ] 

Andy Grove commented on ARROW-10187:


If these tests really are specific to 64 bit platforms then we could use 
conditional compilation and only compile them when target_pointer_width = "64".

See 
[https://doc.rust-lang.org/reference/conditional-compilation.html#target_pointer_width]
 for more information.
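As a rough illustration of that suggestion, the sketch below gates a size assertion on the pointer width. `Header`, `EXPECTED_HEADER_SIZE` and `header_size` are hypothetical names invented for this example; the real failing tests assert on Arrow's own memory-size calculations, which are not reproduced here.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a structure whose size depends on pointer width.
#[allow(dead_code)]
#[repr(C)]
struct Header {
    ptr: *const u8,
    len: usize,
}

/// Expected size on 64-bit targets: two 8-byte words.
#[cfg(target_pointer_width = "64")]
const EXPECTED_HEADER_SIZE: usize = 16;

/// Expected size on 32-bit targets: two 4-byte words.
#[cfg(target_pointer_width = "32")]
const EXPECTED_HEADER_SIZE: usize = 8;

fn header_size() -> usize {
    size_of::<Header>()
}

/// Compiled only on 64-bit targets, so it neither runs nor fails on a
/// 32-bit Raspberry Pi.
#[cfg(target_pointer_width = "64")]
#[test]
fn header_is_sixteen_bytes_on_64_bit() {
    assert_eq!(header_size(), 16);
}
```

Gating the expected constant rather than the whole test keeps the check running on both widths, which may be preferable if 32-bit support is a goal.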

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}





[jira] [Resolved] (ARROW-10167) [Rust] Support display of DictionaryArrays in sql.rs

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10167.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8333
[https://github.com/apache/arrow/pull/8333]

> [Rust] Support display of DictionaryArrays in sql.rs
> 
>
> Key: ARROW-10167
> URL: https://issues.apache.org/jira/browse/ARROW-10167
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try to display the values of a DictionaryArray, I get a ??? in sql.rs.
> This ticket tracks adding proper support for printing these types.





[jira] [Resolved] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10191.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8330
[https://github.com/apache/arrow/pull/8330]

> [Rust] [Parquet] Add roundtrip tests for single column batches
> --
>
> Key: ARROW-10191
> URL: https://issues.apache.org/jira/browse/ARROW-10191
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> To aid with test coverage and picking up information loss during Parquet and 
> Arrow roundtrips, we can add tests that assert that all supported Arrow 
> datatypes can be written and read correctly.
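The shape of such a roundtrip test can be sketched without the Parquet crate at all. In the example below, `write_values` and `read_values` are hypothetical stand-ins for a writer and a reader (a plain little-endian encoding of `i32` values), used only to show the write, read back, compare pattern.

```rust
/// Hypothetical writer: encodes each value as 4 little-endian bytes.
fn write_values(values: &[i32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Hypothetical reader: decodes consecutive 4-byte groups back into values.
/// Trailing bytes that do not form a full group are ignored.
fn read_values(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| i32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
        .collect()
}

/// Roundtrip check: whatever is written must be read back unchanged.
fn roundtrips(values: &[i32]) -> bool {
    read_values(&write_values(values)) == values
}
```

Running the same check once per supported datatype, including boundary values, is what would surface information loss.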





[jira] [Updated] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10251:
---
Description: 
MemTable::load() should load partitions in parallel using async tasks, rather 
than loading one partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.

  was:
MemTable::load() should load partitions in parallel using async tasks, rather 
than loading onw partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.


> [Rust] [DataFusion] MemTable::load() should load partitions in parallel
> ---
>
> Key: ARROW-10251
> URL: https://issues.apache.org/jira/browse/ARROW-10251
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 3.0.0
>
>
> MemTable::load() should load partitions in parallel using async tasks, rather 
> than loading one partition at a time.
> Also, we should make batch size configurable. It is currently hard-coded to 
> 1024*1024 which can be quite inefficient.





[jira] [Created] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-09 Thread Andy Grove (Jira)
Andy Grove created ARROW-10251:
--

 Summary: [Rust] [DataFusion] MemTable::load() should load 
partitions in parallel
 Key: ARROW-10251
 URL: https://issues.apache.org/jira/browse/ARROW-10251
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


MemTable::load() should load partitions in parallel using async tasks, rather 
than loading one partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.
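To make the idea concrete, here is a small sketch that loads every partition concurrently and takes the batch size as a parameter. It deliberately uses standard-library threads instead of async tasks so that it stays self-contained, and `load_partition` and `load_all` are hypothetical names rather than the DataFusion API.

```rust
use std::thread;

/// Hypothetical loader: splits one partition's rows into batches of at most
/// `batch_size` rows.
fn load_partition(rows: Vec<u64>, batch_size: usize) -> Vec<Vec<u64>> {
    rows.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Load all partitions in parallel, one thread per partition, and return the
/// batches in the original partition order.
fn load_all(partitions: Vec<Vec<u64>>, batch_size: usize) -> Vec<Vec<Vec<u64>>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    // Start every partition before waiting on any of them.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|rows| thread::spawn(move || load_partition(rows, batch_size)))
        .collect();
    // Joining in spawn order preserves partition order in the result.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("loader thread panicked"))
        .collect()
}
```

Collecting the handles before joining is the important detail: joining inside the same iterator chain that spawns would serialise the work again.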





[jira] [Commented] (ARROW-10275) [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish

2020-10-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212682#comment-17212682
 ] 

Andy Grove commented on ARROW-10275:


I have seen the same behavior. We have mostly been testing hash aggregates with 
queries that produce low cardinality results and will need to spend time 
testing for high cardinality results and see how we can optimize this.
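For readers unfamiliar with the term, a hash aggregate keeps one accumulator per distinct group key, so its state grows with the cardinality of the grouping column. The sketch below is a generic illustration of that behaviour, not DataFusion's implementation; `group_count` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Count rows per key, keeping one map entry per distinct key.
fn group_count(keys: &[i64]) -> HashMap<i64, u64> {
    let mut groups = HashMap::new();
    for &key in keys {
        *groups.entry(key).or_insert(0) += 1;
    }
    groups
}
```

With a low cardinality key the map stays tiny however many rows arrive, whereas a key that is unique per row produces as many entries as there are rows.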

> [Rust] [Datafusion] GROUP BY with a high cardinality doesn't seem to finish
> ---
>
> Key: ARROW-10275
> URL: https://issues.apache.org/jira/browse/ARROW-10275
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 2.0.0
> Environment: Ubuntu 20.04
>Reporter: Josh Taylor
>Priority: Minor
>
> GROUP BY queries with high cardinality (columns with lots of unique values) 
> don't seem to finish.
> I've tried with both datafusion-cli and this:
> [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]
> When grouping by O_ORDERKEY there are ~15 000 000 unique values, so it seems 
> to stall. I've tried adding a LIMIT but that doesn't work either.
> My parquet file: 
> [https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]
> datafusion-cli:
> {code:java}
> CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
> select O_ORDERKEY from something group by O_ORDERKEY;
> {code}
>  





[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212683#comment-17212683
 ] 

Andy Grove commented on ARROW-10226:


I did get to the bottom of why this happened for me. When I converted TPC-H CSV 
data to Parquet I accidentally combined all of the tables when I intended to 
just do this for lineitem. As a result, my lineitem Parquet files were a 
combination of all the tables with varying schema.
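A cheap guard against this kind of mistake is to compare the schemas of all files before treating them as one table. The sketch below assumes a simplified, hypothetical `Schema` type (ordered pairs of column name and type name) rather than Arrow's own schema type, and the column names in the usage note are illustrative.

```rust
/// Hypothetical minimal schema: ordered (column name, type name) pairs.
type Schema = Vec<(String, String)>;

/// Returns the index of the first schema that differs from the first one,
/// or `None` if all schemas agree (or the slice is empty).
fn first_schema_mismatch(schemas: &[Schema]) -> Option<usize> {
    let first = schemas.first()?;
    schemas.iter().position(|schema| schema != first)
}
```

Given one schema per file, a `Some(index)` result points at the first file that does not belong with the others, for example an orders file mixed in among lineitem files.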

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Resolved] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-10 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10240.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8409
[https://github.com/apache/arrow/pull/8409]

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile 
> and find bottlenecks.
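The measurement pattern described here, parse once and then time only the repeated query, can be sketched as follows. `parse_csv_column` and `time_query` are hypothetical names, and the "query" is just a sum over one integer column; none of this is the benchmark's actual code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical parse step: one integer per line, invalid lines skipped.
fn parse_csv_column(text: &str) -> Vec<i64> {
    text.lines()
        .filter_map(|line| line.trim().parse().ok())
        .collect()
}

/// Run the "query" `iterations` times over data that is already in memory and
/// return the last result together with the total elapsed time. Parsing is
/// excluded from the timing because it happened before this call.
fn time_query(data: &[i64], iterations: u32) -> (i64, Duration) {
    let start = Instant::now();
    let mut result = 0;
    for _ in 0..iterations {
        result = data.iter().sum();
    }
    (result, start.elapsed())
}
```

Keeping the parse outside the timed region is what lets a profiler attribute the samples to query execution rather than to CSV handling.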





[jira] [Resolved] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10251.

Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8428
[https://github.com/apache/arrow/pull/8428]

> [Rust] [DataFusion] MemTable::load() should load partitions in parallel
> ---
>
> Key: ARROW-10251
> URL: https://issues.apache.org/jira/browse/ARROW-10251
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> MemTable::load() should load partitions in parallel using async tasks, rather 
> than loading one partition at a time.
> Also, we should make batch size configurable. It is currently hard-coded to 
> 1024*1024 which can be quite inefficient.





[jira] [Assigned] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10251:
--

Assignee: Andy Grove

> [Rust] [DataFusion] MemTable::load() should load partitions in parallel
> ---
>
> Key: ARROW-10251
> URL: https://issues.apache.org/jira/browse/ARROW-10251
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> MemTable::load() should load partitions in parallel using async tasks, rather 
> than loading one partition at a time.
> Also, we should make batch size configurable. It is currently hard-coded to 
> 1024*1024 which can be quite inefficient.





[jira] [Resolved] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10271.

Resolution: Fixed

Issue resolved by pull request 8433
[https://github.com/apache/arrow/pull/8433]

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Assignee: Neville Dipale
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Updated] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Description: 
I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and when 
I try and run the TPC-H benchmark, it never completes and eventually uses up 
all 64 GB RAM.

I can run Spark against the data  set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.

I also went back to old commits that were working for me before and they show 
the same issue so I don't think this is related to a recent code change.

  was:
I re-installed my desktop a few days ago and when I try and run the TPC-H 
benchmark, it never completes and eventually uses up all 64 GB RAM.

I can run Spark against the data  set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.


> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209937#comment-17209937
 ] 

Andy Grove commented on ARROW-10226:


The query also returns the wrong results ... grouping by l_comment (high 
cardinality) instead of l_returnflag (low cardinality)

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Created] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-10226:
--

 Summary: [Rust] [DataFusion] TPC-H query 1 no longer completes for 
100GB dataset
 Key: ARROW-10226
 URL: https://issues.apache.org/jira/browse/ARROW-10226
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


I re-installed my desktop a few days ago and when I try and run the TPC-H 
benchmark, it never completes and eventually uses up all 64 GB RAM.

I can run Spark against the data  set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209945#comment-17209945
 ] 

Andy Grove commented on ARROW-10226:


Query works fine against tbl files but not against parquet files (it's reading 
the wrong columns somehow). Spark works fine so the issue is not with the 
Parquet files. Really odd to find this now.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.





[jira] [Assigned] (ARROW-10300) [Rust] Parquet/CSV TPC-H data

2020-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10300:
--

Assignee: Andy Grove

> [Rust] Parquet/CSV TPC-H data
> -
>
> Key: ARROW-10300
> URL: https://issues.apache.org/jira/browse/ARROW-10300
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Remi Dettai
>Assignee: Andy Grove
>Priority: Minor
>
> The TPC-H benchmark for datafusion works with Parquet/CSV data but the data 
> generation routine described in the README generates `.tbl` data.
> Could we describe how the TPC-H Parquet/CSV data can be generated to make the 
> benchmark easier to set up and more reproducible?





[jira] [Updated] (ARROW-10300) [Rust] Improve benchmark documentation for generating/converting TPC-H data

2020-10-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10300:
---
Summary: [Rust] Improve benchmark documentation for generating/converting 
TPC-H data  (was: [Rust] Parquet/CSV TPC-H data)

> [Rust] Improve benchmark documentation for generating/converting TPC-H data
> ---
>
> Key: ARROW-10300
> URL: https://issues.apache.org/jira/browse/ARROW-10300
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Remi Dettai
>Assignee: Andy Grove
>Priority: Minor
>
> The TPC-H benchmark for datafusion works with Parquet/CSV data but the data 
> generation routine described in the README generates `.tbl` data.
> Could we describe how the TPC-H Parquet/CSV data can be generated to make the 
> benchmark easier to set up and more reproducible?





[jira] [Created] (ARROW-10350) [Rust] parquet_derive crate cannot be published to crates.io

2020-10-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10350:
--

 Summary: [Rust] parquet_derive crate cannot be published to 
crates.io
 Key: ARROW-10350
 URL: https://issues.apache.org/jira/browse/ARROW-10350
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
Reporter: Andy Grove
 Fix For: 3.0.0


The new parquet_derive crate is missing some fields in the Cargo manifest, so 
it cannot be published.
{code:java}
   Uploading parquet_derive v2.0.0 
(/home/andygrove/arrow-release/apache-arrow-2.0.0/rust/parquet_derive)
error: api errors (status 200 OK): missing or empty metadata fields: 
description, license. Please see 
https://doc.rust-lang.org/cargo/reference/manifest.html for how to upload 
metadata
 {code}





[jira] [Resolved] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field

2020-08-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9849.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8045
[https://github.com/apache/arrow/pull/8045]

> [Rust] [DataFusion] Make UDFs not need a Field
> --
>
> Key: ARROW-9849
> URL: https://issues.apache.org/jira/browse/ARROW-9849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/7967] shows that it is possible to not 
> require users to pass a `Field` to UDF declarations and instead just pass a 
> `DataType`.
> Let's deprecate Field from them, and instead just use `DataType`.





[jira] [Resolved] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9464.
---
Resolution: Fixed

Issue resolved by pull request 8034
[https://github.com/apache/arrow/pull/8034]

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning() -> Partitioning {
>     Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution() -> Distribution {
>     Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering() -> Option<Vec<SortOrder>> {
>     None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering() -> Option<Vec<Vec<SortOrder>>> {
>     None
> }
> {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  
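The partial/final hash aggregate pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration using only the standard library, not DataFusion code: the function names (partial_sum, final_sum, hash_aggregate) and the use of a plain HashMap keyed by String are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Partial aggregate: computes per-key sums within a single partition.
/// (Hypothetical helper, not a DataFusion API.)
fn partial_sum(partition: &[(String, i64)]) -> HashMap<String, i64> {
    let mut acc = HashMap::new();
    for (key, value) in partition {
        *acc.entry(key.clone()).or_insert(0) += value;
    }
    acc
}

/// Final aggregate: merges the coalesced partial results into one map.
fn final_sum(partials: Vec<HashMap<String, i64>>) -> HashMap<String, i64> {
    let mut acc = HashMap::new();
    for partial in partials {
        for (key, value) in partial {
            *acc.entry(key).or_insert(0) += value;
        }
    }
    acc
}

/// Runs the partial step on one thread per partition, coalesces the
/// partial maps, then applies the final step on the combined input.
fn hash_aggregate(partitions: Vec<Vec<(String, i64)>>) -> HashMap<String, i64> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| std::thread::spawn(move || partial_sum(&p)))
        .collect();
    let partials: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    final_sum(partials)
}
```

The final step only sees one small map per partition rather than every input row, which is the reason for splitting the aggregate in two.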





[jira] [Resolved] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9464.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8029
[https://github.com/apache/arrow/pull/8029]

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
> pub enum PhysicalPlan {
>     /// Projection.
>     Projection(Arc<ProjectionExec>),
>     /// Filter a.k.a predicate.
>     Filter(Arc<FilterExec>),
>     /// Hash aggregate
>     HashAggregate(Arc<HashAggregateExec>),
>     /// Performs a hash join of two child relations by first shuffling the data using the join keys.
>     ShuffledHashJoin(ShuffledHashJoinExec),
>     /// Performs a shuffle that will result in the desired partitioning.
>     ShuffleExchange(Arc<ShuffleExchangeExec>),
>     /// Reads results from a ShuffleExchange
>     ShuffleReader(Arc<ShuffleReaderExec>),
>     /// Scans a partitioned data source
>     ParquetScan(Arc<ParquetScanExec>),
>     /// Scans an in-memory table
>     InMemoryTableScan(Arc<InMemoryTableScanExec>),
> }
> {code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning() -> Partitioning {
>     Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution() -> Distribution {
>     Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering() -> Option<Vec<SortOrder>> {
>     None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering() -> Option<Vec<Vec<SortOrder>>> {
>     None
> }
> {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  
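One way to picture the proposed optimization rule is the sketch below. It is a simplified, hypothetical model, not the DataFusion or Ballista implementation: the Plan, Partitioning and Distribution enums, the satisfies helper, and the insert_shuffles function are all names invented for this example. The rule walks the plan bottom-up and wraps a child in a Shuffle node whenever the child's output partitioning does not meet the parent's required distribution.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Unknown(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Distribution {
    Unspecified,
    HashClustered(Vec<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { partitions: usize },
    HashAggregate { keys: Vec<String>, input: Box<Plan> },
    Shuffle { keys: Vec<String>, partitions: usize, input: Box<Plan> },
}

impl Plan {
    /// How this operator's output is partitioned.
    fn output_partitioning(&self) -> Partitioning {
        match self {
            Plan::Scan { partitions } => Partitioning::Unknown(*partitions),
            Plan::HashAggregate { input, .. } => input.output_partitioning(),
            Plan::Shuffle { keys, partitions, .. } => {
                Partitioning::Hash(keys.clone(), *partitions)
            }
        }
    }

    /// What this operator requires of its child.
    fn required_child_distribution(&self) -> Distribution {
        match self {
            Plan::HashAggregate { keys, .. } => Distribution::HashClustered(keys.clone()),
            _ => Distribution::Unspecified,
        }
    }
}

fn partition_count(p: &Partitioning) -> usize {
    match p {
        Partitioning::Unknown(n) | Partitioning::Hash(_, n) => *n,
    }
}

fn satisfies(have: &Partitioning, want: &Distribution) -> bool {
    match (have, want) {
        (_, Distribution::Unspecified) => true,
        (Partitioning::Hash(have, _), Distribution::HashClustered(want)) => have == want,
        _ => false,
    }
}

/// Bottom-up rule: insert a Shuffle wherever a requirement is unmet.
fn insert_shuffles(plan: Plan) -> Plan {
    let required = plan.required_child_distribution();
    match plan {
        Plan::Scan { partitions } => Plan::Scan { partitions },
        Plan::Shuffle { keys, partitions, input } => Plan::Shuffle {
            keys,
            partitions,
            input: Box::new(insert_shuffles(*input)),
        },
        Plan::HashAggregate { keys, input } => {
            let mut child = insert_shuffles(*input);
            let have = child.output_partitioning();
            if !satisfies(&have, &required) {
                child = Plan::Shuffle {
                    keys: keys.clone(),
                    partitions: partition_count(&have),
                    input: Box::new(child),
                };
            }
            Plan::HashAggregate { keys, input: Box::new(child) }
        }
    }
}
```

Applying the rule to an already-satisfied plan leaves it unchanged, so it can be run repeatedly without stacking redundant shuffles.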





[jira] [Resolved] (ARROW-9833) [Rust] [DataFusion] Refactor TableProvider.scan to return ExecutionPlan

2020-08-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9833.
---
Resolution: Fixed

Issue resolved by pull request 8028
[https://github.com/apache/arrow/pull/8028]

> [Rust] [DataFusion] Refactor TableProvider.scan to return ExecutionPlan
> ---
>
> Key: ARROW-9833
> URL: https://issues.apache.org/jira/browse/ARROW-9833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Refactor TableProvider.scan to return ExecutionPlan instead of Vec 
> in preparation for removing Partition trait.





[jira] [Created] (ARROW-9838) [Rust] [DataFusion] Physical planner should insert explicit MergeExec nodes

2020-08-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-9838:
-

 Summary: [Rust] [DataFusion] Physical planner should insert 
explicit MergeExec nodes
 Key: ARROW-9838
 URL: https://issues.apache.org/jira/browse/ARROW-9838
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


Operators such as GlobalLimitExec, SortExec, and HashAggregateExec (in some 
cases) require a single input partition. Rather than have these operators 
perform their own merging of input partitions, the planner should insert 
explicit MergeExec nodes into the physical plan, when needed.
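
As a rough illustration of the idea (not the actual planner code), the sketch below models a tiny plan tree and a pass that wraps the input of single-partition operators in an explicit Merge node. The Plan enum, single_partition and insert_merges are hypothetical names chosen for this example.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { partitions: usize },
    Merge { input: Box<Plan> },
    Sort { input: Box<Plan> },
    GlobalLimit { n: usize, input: Box<Plan> },
}

impl Plan {
    /// Number of partitions this operator produces.
    fn output_partitions(&self) -> usize {
        match self {
            Plan::Scan { partitions } => *partitions,
            // In this sketch Merge, Sort and GlobalLimit each yield one partition.
            _ => 1,
        }
    }
}

/// Wraps `input` in a Merge node unless it already yields one partition.
fn single_partition(input: Plan) -> Plan {
    if input.output_partitions() == 1 {
        input
    } else {
        Plan::Merge { input: Box::new(input) }
    }
}

/// Planner pass: operators that need a single input partition get an
/// explicit Merge child instead of merging partitions themselves.
fn insert_merges(plan: Plan) -> Plan {
    match plan {
        Plan::Scan { partitions } => Plan::Scan { partitions },
        Plan::Merge { input } => Plan::Merge {
            input: Box::new(insert_merges(*input)),
        },
        Plan::Sort { input } => Plan::Sort {
            input: Box::new(single_partition(insert_merges(*input))),
        },
        Plan::GlobalLimit { n, input } => Plan::GlobalLimit {
            n,
            input: Box::new(single_partition(insert_merges(*input))),
        },
    }
}
```

With the merge made explicit, it shows up when the physical plan is printed, and the operators themselves no longer need partition-merging logic.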

 





[jira] [Resolved] (ARROW-9809) [Rust] [DataFusion] logical schema = physical schema is not true

2020-08-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9809.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8024
[https://github.com/apache/arrow/pull/8024]

> [Rust] [DataFusion] logical schema = physical schema is not true
> 
>
> Key: ARROW-9809
> URL: https://issues.apache.org/jira/browse/ARROW-9809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In tests/sql.rs, we test that the physical and the optimized schema must 
> match. However, this is not necessarily true for all our queries. An example:
> {code:java}
> #[test]
> fn csv_query_sum_cast() {
>     let mut ctx = ExecutionContext::new();
>     register_aggregate_csv_by_sql(&mut ctx);
>     // c8 = i32; c9 = i64
>     let sql = "SELECT c8 + c9 FROM aggregate_test_100";
>     // check that the physical and logical schemas are equal
>     execute(&mut ctx, sql);
> }
> {code}
> The physical expression (and schema) of this operation, after optimization, 
> is {{CAST(c8 as Int64) Plus c9}} (this test fails).
> AFAIK, the invariant of the optimizer is that the output types and 
> nullability are the same.
> Also, note that the reason the optimized logical schema equals the logical 
> schema is that our type coercer does not change the output names of the 
> schema, even though it re-writes logical expressions. I.e. after the 
> optimization, `.to_field()` of an expression may no longer match the field 
> name nor type in the Plan's schema. IMO this is currently by (implicit?) 
> design, as we do not want our logical schema's column names to change during 
> optimizations, or all column references may point to non-existent columns. 
> This is something that was brought up on the mailing list about polymorphism.
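
The effect described above can be reproduced in a small, self-contained model. This is only a sketch under stated assumptions: the Expr and DataType enums, the name() rendering, and the coerce function are invented for illustration and are not DataFusion's types. It shows that coercion rewrites the expression (so its derived name changes) while the output data type stays the same.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, data_type: DataType },
    Cast { expr: Box<Expr>, to: DataType },
    Plus(Box<Expr>, Box<Expr>),
}

/// The wider of two integer types (illustrative rule only).
fn wider(a: DataType, b: DataType) -> DataType {
    if a == DataType::Int64 || b == DataType::Int64 {
        DataType::Int64
    } else {
        DataType::Int32
    }
}

impl Expr {
    fn data_type(&self) -> DataType {
        match self {
            Expr::Column { data_type, .. } => *data_type,
            Expr::Cast { to, .. } => *to,
            Expr::Plus(l, r) => wider(l.data_type(), r.data_type()),
        }
    }

    /// Name derived from the expression itself, as a field name would be.
    fn name(&self) -> String {
        match self {
            Expr::Column { name, .. } => name.clone(),
            Expr::Cast { expr, to } => format!("CAST({} AS {:?})", expr.name(), to),
            Expr::Plus(l, r) => format!("{} Plus {}", l.name(), r.name()),
        }
    }
}

fn cast_to(expr: Expr, to: DataType) -> Expr {
    if expr.data_type() == to {
        expr
    } else {
        Expr::Cast { expr: Box::new(expr), to }
    }
}

/// Type coercion: make both sides of Plus share the wider type.
/// Only Plus is rewritten in this sketch; other nodes are returned as-is.
fn coerce(expr: Expr) -> Expr {
    match expr {
        Expr::Plus(l, r) => {
            let (l, r) = (coerce(*l), coerce(*r));
            let target = wider(l.data_type(), r.data_type());
            Expr::Plus(Box::new(cast_to(l, target)), Box::new(cast_to(r, target)))
        }
        other => other,
    }
}
```

A schema that keeps the pre-coercion name therefore no longer matches a name recomputed from the rewritten expression, even though the output type is preserved.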





[jira] [Reopened] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reopened ARROW-9464:
---

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
> pub enum PhysicalPlan {
>     /// Projection.
>     Projection(Arc<ProjectionExec>),
>     /// Filter a.k.a predicate.
>     Filter(Arc<FilterExec>),
>     /// Hash aggregate
>     HashAggregate(Arc<HashAggregateExec>),
>     /// Performs a hash join of two child relations by first shuffling the data using the join keys.
>     ShuffledHashJoin(ShuffledHashJoinExec),
>     /// Performs a shuffle that will result in the desired partitioning.
>     ShuffleExchange(Arc<ShuffleExchangeExec>),
>     /// Reads results from a ShuffleExchange
>     ShuffleReader(Arc<ShuffleReaderExec>),
>     /// Scans a partitioned data source
>     ParquetScan(Arc<ParquetScanExec>),
>     /// Scans an in-memory table
>     InMemoryTableScan(Arc<InMemoryTableScanExec>),
> }
> {code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning() -> Partitioning {
>     Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution() -> Distribution {
>     Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering() -> Option<Vec<SortOrder>> {
>     None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering() -> Option<Vec<Vec<SortOrder>>> {
>     None
> }
> {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  





[jira] [Commented] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests

2020-08-17 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179356#comment-17179356
 ] 

Andy Grove commented on ARROW-9778:
---

Thanks [~jorgecarleitao]. When we construct the logical plan, we do open the 
source data files and infer the schema (unless a schema is provided) so I would 
consider this a bug in the logical plan.

> [Rust] [DataFusion] Logical and physical schemas' nullability does not match 
> in 8 out of 20 end-to-end tests
> 
>
> Key: ARROW-9778
> URL: https://issues.apache.org/jira/browse/ARROW-9778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>
> In `tests/sql.rs`, if we re-write the ```execute``` function to test the end 
> schemas, as
> ```
> /// Execute query and return result set as tab delimited string
> fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<String> {
>     let plan = ctx.create_logical_plan(sql).unwrap();
>     let plan = ctx.optimize(&plan).unwrap();
>     let physical_plan = ctx.create_physical_plan(&plan).unwrap();
>     let results = ctx.collect(physical_plan.as_ref()).unwrap();
>     if results.len() > 0 {
>         // results must match the logical schema
>         assert_eq!(plan.schema().as_ref(), results[0].schema().as_ref());
>     }
>     result_str(&results)
> }
> ```
> we end up with 8 tests failing, which indicates that our physical and logical 
> plans are not aligned. In all cases, the issue is nullability: our logical 
> plan assumes nullability = true, while our physical plan may change the 
> nullability field.
> If we do not plan to track nullability on the logical level, we could 
> consider replacing Schema by a type that does not track nullability.
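
To make the mismatch concrete, here is a minimal sketch using a stand-in Field struct (hypothetical, not Arrow's Field type). Strict equality fails when only the nullable flag differs, while a comparison that ignores nullability passes; the helper name same_ignoring_nullability is an assumption for this example.

```rust
/// Stand-in for a schema field; Arrow's real Field carries more information.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

/// Compares two schemas by field name and data type only.
fn same_ignoring_nullability(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}
```

A check like this would let the end-to-end tests assert on names and types while the nullability question is settled separately.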



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9783) [Rust] [DataFusion] Logical aggregate expressions require explicit data type

2020-08-18 Thread Andy Grove (Jira)
Andy Grove created ARROW-9783:
-

 Summary: [Rust] [DataFusion] Logical aggregate expressions require 
explicit data type
 Key: ARROW-9783
 URL: https://issues.apache.org/jira/browse/ARROW-9783
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


When constructing a logical plan either directly or via the DataFrame API, it 
is not possible to construct an aggregate expression without providing a data 
type. This makes no sense because the aggregate functions need to determine 
their own data type.
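
A sketch of what "determine their own data type" could look like is below. The enums and the specific rules (for example, SUM of a signed integer widening to Int64, COUNT returning UInt64) are illustrative assumptions for this example and are not taken from the DataFusion source.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
    UInt64,
    Float64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum AggregateFunction {
    Count,
    Sum,
    Avg,
    Min,
    Max,
}

/// Derives the result type of an aggregate from its input type, so the
/// caller never has to pass one explicitly. Rules here are assumptions.
fn return_type(fun: AggregateFunction, input: DataType) -> DataType {
    match fun {
        AggregateFunction::Count => DataType::UInt64,
        AggregateFunction::Avg => DataType::Float64,
        AggregateFunction::Sum => match input {
            // Widen signed integers to reduce the risk of overflow.
            DataType::Int32 | DataType::Int64 => DataType::Int64,
            other => other,
        },
        AggregateFunction::Min | AggregateFunction::Max => input,
    }
}
```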



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9783) [Rust] [DataFusion] Logical aggregate expressions require explicit data type

2020-08-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9783.
---
Resolution: Fixed

Issue resolved by pull request 7988
[https://github.com/apache/arrow/pull/7988]

> [Rust] [DataFusion] Logical aggregate expressions require explicit data type
> 
>
> Key: ARROW-9783
> URL: https://issues.apache.org/jira/browse/ARROW-9783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When constructing a logical plan either directly or via the DataFrame API, it 
> is not possible to construct an aggregate expression without providing a data 
> type. This makes no sense because the aggregate functions need to determine 
> their own data type.




