[jira] [Assigned] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency

2020-09-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9957:
-

Assignee: Neville Dipale

> [Rust] Remove unmaintained tempdir dependency
> -
>
> Key: ARROW-9957
> URL: https://issues.apache.org/jira/browse/ARROW-9957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Replace tempdir with tempfile, also removing older versions of some 
> dependencies like rand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency

2020-09-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9957.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8157
[https://github.com/apache/arrow/pull/8157]

> [Rust] Remove unmaintained tempdir dependency
> -
>
> Key: ARROW-9957
> URL: https://issues.apache.org/jira/browse/ARROW-9957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Replace tempdir with tempfile, also removing older versions of some 
> dependencies like rand.





[jira] [Assigned] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment

2020-09-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9848:
-

Assignee: Neville Dipale

> [Rust] Implement changes to ensure flatbuffer alignment
> ---
>
> Key: ARROW-9848
> URL: https://issues.apache.org/jira/browse/ARROW-9848
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> See ARROW-6313, changes were made to all IPC implementations except for Rust





[jira] [Resolved] (ARROW-9966) [Rust] Speedup aggregate kernels

2020-09-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9966.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8165
[https://github.com/apache/arrow/pull/8165]

> [Rust] Speedup aggregate kernels
> 
>
> Key: ARROW-9966
> URL: https://issues.apache.org/jira/browse/ARROW-9966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9919:
--
Component/s: Rust - DataFusion

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> See main issue.





[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9919:
--
Affects Version/s: 1.0.0

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> See main issue.





[jira] [Resolved] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9919.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8116
[https://github.com/apache/arrow/pull/8116]

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> See main issue.





[jira] [Resolved] (ARROW-9846) [Rust] Master branch broken build

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9846.
---
Resolution: Not A Problem

> [Rust] Master branch broken build
> -
>
> Key: ARROW-9846
> URL: https://issues.apache.org/jira/browse/ARROW-9846
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Master branch is failing to build in CI. It fails to compile 
> "tower-balance-0.3.0". I cannot reproduce locally.
> {code:java}
> error[E0502]: cannot borrow `self` as immutable because it is also borrowed 
> as mutable
>--> 
> /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:381:21
>  {code}





[jira] [Created] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency

2020-09-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9957:
-

 Summary: [Rust] Remove unmaintained tempdir dependency
 Key: ARROW-9957
 URL: https://issues.apache.org/jira/browse/ARROW-9957
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Affects Versions: 1.0.0
Reporter: Neville Dipale


Replace tempdir with tempfile, also removing older versions of some 
dependencies like rand.





[jira] [Resolved] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10010.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8191
[https://github.com/apache/arrow/pull/8191]

> [Rust] Speedup arithmetic
> -
>
> Key: ARROW-10010
> URL: https://issues.apache.org/jira/browse/ARROW-10010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There are some optimizations possible in arithmetic kernels.
>  
> PR to follow





[jira] [Assigned] (ARROW-8883) [Rust] [Integration Testing] Enable passing tests and update spec doc

2020-09-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-8883:
-

Assignee: Neville Dipale

> [Rust] [Integration Testing] Enable passing tests and update spec doc
> -
>
> Key: ARROW-8883
> URL: https://issues.apache.org/jira/browse/ARROW-8883
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> Some of the integration test failures can be avoided by disabling unsupported 
> tests, like large lists and nested types





[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-15 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196619#comment-17196619
 ] 

Neville Dipale commented on ARROW-10002:


Hi [~batmanaod], I've looked at the code but haven't checked it out yet to do 
my own comparisons. I'd be interested in perf implications (I'm presuming 
there's no change for indexing), and how we would remove `default fn` on other 
trait methods, seeing as it's mostly used to specialise between numeric 
primitives and booleans.

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.
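
A minimal sketch of one way specialization between numeric primitives and booleans could be expressed on stable Rust, assuming the behaviour can be moved into an ordinary per-type trait. The names `Element`, `describe`, `numeric_element!` and `describe_all` are hypothetical and are not taken from the Arrow codebase; the macro only stands in for the shared "default" behaviour that a blanket impl with `default fn` would otherwise provide.

```rust
/// Hypothetical per-type trait: every element type supplies its own
/// behaviour instead of overriding a `default fn` from a blanket impl.
trait Element: Copy {
    fn describe(self) -> String;
}

// Stamp out the shared "numeric" behaviour for several primitives.
// This replaces the blanket impl that would have carried `default fn`.
macro_rules! numeric_element {
    ($($t:ty),*) => {
        $(impl Element for $t {
            fn describe(self) -> String {
                format!("num({})", self)
            }
        })*
    };
}
numeric_element!(i32, i64, f64);

// The formerly "specialised" case is just another ordinary impl.
impl Element for bool {
    fn describe(self) -> String {
        format!("bool({})", self)
    }
}

/// Generic code dispatches through the trait and needs no nightly features.
fn describe_all<T: Element>(values: &[T]) -> Vec<String> {
    values.iter().map(|v| v.describe()).collect()
}
```

The trade-off in this sketch is that each supported type must be listed explicitly in the macro invocation, whereas a specialised blanket impl would cover new types automatically.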





[jira] [Resolved] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9980.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8173
[https://github.com/apache/arrow/pull/8173]

> [Rust] Fix parquet crate clippy lints
> -
>
> Key: ARROW-9980
> URL: https://issues.apache.org/jira/browse/ARROW-9980
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This addresses most clippy lints on the parquet crate. Other remaining lints 
> can be addressed as part of future PRs





[jira] [Resolved] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-14 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9984.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8176
[https://github.com/apache/arrow/pull/8176]

> [Rust] [DataFusion] DRY of function to string
> -
>
> Key: ARROW-9984
> URL: https://issues.apache.org/jira/browse/ARROW-9984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9978) [Rust] Umbrella issue for clippy integration

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9978:
-

 Summary: [Rust] Umbrella issue for clippy integration
 Key: ARROW-9978
 URL: https://issues.apache.org/jira/browse/ARROW-9978
 Project: Apache Arrow
  Issue Type: New Feature
  Components: CI, Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This is an umbrella issue to collate outstanding and new tasks to enable clippy 
integration





[jira] [Updated] (ARROW-9296) [CI][Rust] Enable more clippy lint checks

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9296:
--
Parent: ARROW-9978
Issue Type: Sub-task  (was: Improvement)

> [CI][Rust] Enable more clippy lint checks
> -
>
> Key: ARROW-9296
> URL: https://issues.apache.org/jira/browse/ARROW-9296
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration, Rust
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only {{clippy::redundant_field_names}} is allowed, so we should 
> incrementally extend the list of enabled lints.





[jira] [Created] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9979:
-

 Summary: [Rust] Fix arrow crate clippy lints
 Key: ARROW-9979
 URL: https://issues.apache.org/jira/browse/ARROW-9979
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This fixes many clippy lints, but not all. It takes hours to address lints, 
and we can work on remaining ones in future PRs





[jira] [Assigned] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9980:
-

Assignee: Neville Dipale

> [Rust] Fix parquet crate clippy lints
> -
>
> Key: ARROW-9980
> URL: https://issues.apache.org/jira/browse/ARROW-9980
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This addresses most clippy lints on the parquet crate. Other remaining lints 
> can be addressed as part of future PRs





[jira] [Updated] (ARROW-9338) [Rust] Add instructions for running clippy locally

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9338:
--
Parent: ARROW-9978
Issue Type: Sub-task  (was: Improvement)

> [Rust] Add instructions for running clippy locally
> --
>
> Key: ARROW-9338
> URL: https://issues.apache.org/jira/browse/ARROW-9338
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Paddy Horan
>Priority: Minor
>
> Similar to the "Code Formatting" section in the top level README it would be 
> useful to add instructions for running clippy locally to avoid wasted CI time.





[jira] [Assigned] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9979:
-

Assignee: Neville Dipale

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on remaining ones in future PRs





[jira] [Created] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9980:
-

 Summary: [Rust] Fix parquet crate clippy lints
 Key: ARROW-9980
 URL: https://issues.apache.org/jira/browse/ARROW-9980
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This addresses most clippy lints on the parquet crate. Other remaining lints 
can be addressed as part of future PRs





[jira] [Created] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9981:
-

 Summary: [Rust] Allow configuring flight IPC with IpcWriteOptions
 Key: ARROW-9981
 URL: https://issues.apache.org/jira/browse/ARROW-9981
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


We have introduced an IPC write option, but we use the default for the 
arrow-flight crate, which is not ideal. Change this to allow configuring writer 
options.





[jira] [Commented] (ARROW-5123) [Rust] derive RecordWriter from struct definitions

2020-09-14 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195410#comment-17195410
 ] 

Neville Dipale commented on ARROW-5123:
---

I'm unable to assign to Xavier

> [Rust] derive RecordWriter from struct definitions
> --
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a 
> rough transition time in the project): 
> https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values in to a 
> struct which mirrors the schema of your file, this 
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in 
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter {
>   fn write_to_row_group(, row_group_writer:  Box);
> }
> ```
> How does it work?
> ===
> The `parquet_derive` crate adds code generating functionality to the rust 
> compiler. The code generation takes rust syntax and emits additional syntax. 
> This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, 
> loaded by the machinery in cargo. Users don't have to do any special 
> `build.rs` steps or anything like that, it's automatic by including 
> `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a 
> section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to 
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The 
> `syn` crate parses the struct from a string-representation to an AST (a 
> recursive enum value). The AST contains all the values I care about when 
> generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` 
> struct. It has the bits I care about for writing a column: `field_name`, 
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
> The code then does the equivalent of templating to build the `RecordWriter` 
> implementation. The templating functionality is provided by the `quote` 
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_row_group(..) {
>     $({
>   $column_writer_snippet
>     })
>   } 
> }
> ```
> this template is then added under the struct definition, ending up something 
> like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_row_group(..) {
>     {
>    write_col_1();
>     };
>    {
>    write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully 
> expanded and standalone. If a user ever changes their `struct MyValue` 
> definition the `ParquetRecordWriter` will be regenerated. There's no 
> intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, one very useful bit is to 
> install `cargo expand` [more info on 
> gh](https://github.com/dtolnay/cargo-expand), then you can do:
> ```
> $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter for &[DumbRecord] {
>     fn write_to_row_group(
>     ,
>     row_group_writer:  Box,
>     ) {
>     let mut row_group_writer = row_group_writer;
>     {
>     let vals: Vec = self.iter().map(|x| x.a_bool).collect();
>     let mut column_writer = 
> row_group_writer.next_column().unwrap().unwrap();
>     if let 
> parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>     column_writer
>     {
>     typed.write_batch([..], None, None).unwrap();
>     }
>     row_group_writer.close_column(column_writer).unwrap();
>     };
>     {
>     let vals: Vec = self.iter().map(|x| x.a2_bool).collect();
>     let mut 

[jira] [Resolved] (ARROW-5123) [Rust] derive RecordWriter from struct definitions

2020-09-14 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-5123.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 4140
[https://github.com/apache/arrow/pull/4140]

> [Rust] derive RecordWriter from struct definitions
> --
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 14h 10m
>  Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a 
> rough transition time in the project): 
> https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values in to a 
> struct which mirrors the schema of your file, this 
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in 
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter {
>   fn write_to_row_group(, row_group_writer:  Box);
> }
> ```
> How does it work?
> ===
> The `parquet_derive` crate adds code generating functionality to the rust 
> compiler. The code generation takes rust syntax and emits additional syntax. 
> This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, 
> loaded by the machinery in cargo. Users don't have to do any special 
> `build.rs` steps or anything like that, it's automatic by including 
> `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a 
> section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to 
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The 
> `syn` crate parses the struct from a string-representation to an AST (a 
> recursive enum value). The AST contains all the values I care about when 
> generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` 
> struct. It has the bits I care about for writing a column: `field_name`, 
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
> The code then does the equivalent of templating to build the `RecordWriter` 
> implementation. The templating functionality is provided by the `quote` 
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_row_group(..) {
>     $({
>   $column_writer_snippet
>     })
>   } 
> }
> ```
> this template is then added under the struct definition, ending up something 
> like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_row_group(..) {
>     {
>    write_col_1();
>     };
>    {
>    write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully 
> expanded and standalone. If a user ever changes their `struct MyValue` 
> definition the `ParquetRecordWriter` will be regenerated. There's no 
> intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, one very useful bit is to 
> install `cargo expand` [more info on 
> gh](https://github.com/dtolnay/cargo-expand), then you can do:
> ```
> $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter for &[DumbRecord] {
>     fn write_to_row_group(
>     ,
>     row_group_writer:  Box,
>     ) {
>     let mut row_group_writer = row_group_writer;
>     {
>     let vals: Vec = self.iter().map(|x| x.a_bool).collect();
>     let mut column_writer = 
> row_group_writer.next_column().unwrap().unwrap();
>     if let 
> parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>     column_writer
>     {
>     typed.write_batch([..], None, None).unwrap();
>     }
>     row_group_writer.close_column(column_writer).unwrap();
>     };
>     {
>     let vals: Vec = 

[jira] [Updated] (ARROW-8883) [Rust] [Integration Testing] Enable passing tests and update spec doc

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8883:
--
Summary: [Rust] [Integration Testing] Enable passing tests and update spec 
doc  (was: [Rust] [Integration Testing] Disable unsupported tests)

> [Rust] [Integration Testing] Enable passing tests and update spec doc
> -
>
> Key: ARROW-8883
> URL: https://issues.apache.org/jira/browse/ARROW-8883
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Priority: Major
>
> Some of the integration test failures can be avoided by disabling unsupported 
> tests, like large lists and nested types





[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10040:
--

Assignee: Neville Dipale

> [Rust] Create a way to slice unaligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(, offset: usize, len: usize) -> Buffer;
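
The slicing idea above can be sketched roughly as follows, assuming a bitmap stored as plain bytes with bits numbered least-significant-bit first within each byte (the layout Arrow uses for bitmaps). The free function `bit_slice` is hypothetical and works on `&[u8]` rather than on the real `Buffer` type; it copies bit by bit for clarity, not speed.

```rust
/// Copy `len` bits starting at bit `offset` of `src` into a new buffer
/// whose bit 0 corresponds to bit `offset` of the input, so the result
/// is byte-aligned even when `offset` is not a multiple of 8.
/// Bits are numbered least-significant-bit first within each byte.
fn bit_slice(src: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset + len <= src.len() * 8, "bit slice out of bounds");
    // Round up to whole bytes; unused trailing bits stay zero.
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src_bit = offset + i;
        let is_set = (src[src_bit / 8] >> (src_bit % 8)) & 1 == 1;
        if is_set {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

For instance, slicing 8 bits at offset 2 out of the two bytes `0b1010_1100, 0b0000_0011` yields the single byte `0b1110_1011`, which a kernel that requires offsets divisible by 8 could then consume directly.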





[jira] [Resolved] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10199.

Fix Version/s: 2.0.0
   Resolution: Fixed

This has been resolved, and will be fixed in the next release, in about a week or two

> [Rust][Parquet] Release Parquet at crates.io to remove debug prints
> ---
>
> Key: ARROW-10199
> URL: https://issues.apache.org/jira/browse/ARROW-10199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Krzysztof Stanisławek
>Priority: Critical
> Fix For: 2.0.0
>
>
> Version of Parquet released to docs.rs & crates.io has debug prints in 
> [https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60].
>  They were pretty hard to track down, so I suggest considering a logging crate 
> in the future. When is the new version going to be released? Is there some 
> stable schedule I can expect?
> Is it recommended to use the current snapshot straight from github instead of 
> crates.io?





[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10040:
--

Assignee: Jörn Horstmann  (was: Neville Dipale)

> [Rust] Create a way to slice unaligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;
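
To illustrate the idea in the description above, here is a minimal sketch of copying a bit range that starts at an arbitrary bit offset into a fresh buffer that starts at bit 0. It works on plain byte slices; `bit_slice_aligned` is a hypothetical name, not the arrow crate's API, and the sketch assumes the least-significant-bit-first order used by Arrow bitmaps.

```rust
/// Copies `len` bits starting at bit `offset` of `src` into a new buffer
/// whose first bit sits at bit 0 of byte 0 (least-significant bit first).
/// Hypothetical helper for illustration only.
fn bit_slice_aligned(src: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset + len <= src.len() * 8, "bit slice out of bounds");
    // Round the bit length up to whole bytes; unused trailing bits stay 0.
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let bit = offset + i;
        if (src[bit / 8] >> (bit % 8)) & 1 == 1 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}
```

A bit-by-bit loop is the simplest correct form; a real kernel would more likely shift whole words at a time, but the resulting buffer would be the same.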





[jira] [Updated] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10225:
---
Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests  
(was: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests)

> [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we were comparing 
> `None` with a 100% populated bitmap.
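
A minimal sketch of the comparison described above, assuming validity bitmaps are plain byte slices with least-significant-bit-first order. `validity_equal` is a hypothetical helper, not the function used in the Parquet roundtrip tests; it only shows that a missing bitmap and a fully set bitmap should compare as equal.

```rust
/// Returns true if two optional validity bitmaps describe the same nulls
/// over `len` slots. `None` is treated as "every slot is valid".
/// Hypothetical helper for illustration only.
fn validity_equal(left: Option<&[u8]>, right: Option<&[u8]>, len: usize) -> bool {
    let is_valid = |bitmap: Option<&[u8]>, i: usize| match bitmap {
        None => true,
        Some(bytes) => (bytes[i / 8] >> (i % 8)) & 1 == 1,
    };
    (0..len).all(|i| is_valid(left, i) == is_valid(right, i))
}
```

Only the first `len` bits are compared, so padding bits in the last byte do not affect the result.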





[jira] [Resolved] (ARROW-10040) [Rust] Create a way to slice unalligned offset buffers

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10040.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8262
[https://github.com/apache/arrow/pull/8262]

> [Rust] Create a way to slice unalligned offset buffers
> --
>
> Key: ARROW-10040
> URL: https://issues.apache.org/jira/browse/ARROW-10040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> We have limitations on the boolean kernels, where we can't apply the kernels 
> on buffers whose offsets aren't a multiple of 8. This has the potential of 
> preventing users from applying some computations on arrays whose offsets 
> aren't divisible by 8.
> We could create methods on Buffer that allow slicing buffers and copying them 
> into aligned buffers.
> An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;





[jira] [Resolved] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10225.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8388
[https://github.com/apache/arrow/pull/8388]

> [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we were comparing 
> `None` with a 100% populated bitmap.





[jira] [Closed] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale closed ARROW-5352.
-
Resolution: Duplicate

> [Rust] BinaryArray filter replaces nulls with empty strings
> ---
>
> Key: ARROW-5352
> URL: https://issues.apache.org/jira/browse/ARROW-5352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Minor
>
> The filter implementation for BinaryArray discards nullness of data. 
> BinaryArrays that are null (seem to) always return an empty string slice when 
> getting a value, so the way filter works might be a bug depending on what 
> Arrow developers' or users' intentions are.
> I think we should either preserve nulls (and their count) or document this as 
> intended behaviour.
> Below is a test case that reproduces the bug.
> {code:java}
> #[test]
> fn test_filter_binary_array_with_nulls() {
>     let mut a: BinaryBuilder = BinaryBuilder::new(100);
>     a.append_null().unwrap();
>     a.append_string("a string").unwrap();
>     a.append_null().unwrap();
>     a.append_string("with nulls").unwrap();
>     let array = a.finish();
>     let b = BooleanArray::from(vec![true, true, true, true]);
>     let c = filter(&array, &b).unwrap();
>     let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
>     // I didn't expect this behaviour
>     assert_eq!("", d.get_string(0));
>     // fails here
>     assert!(d.is_null(0));
>     assert_eq!(4, d.len());
>     // fails here
>     assert_eq!(2, d.null_count());
>     assert_eq!("a string", d.get_string(1));
>     // fails here
>     assert!(d.is_null(2));
>     assert_eq!("with nulls", d.get_string(3));
> }
> {code}
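
For comparison, a minimal plain-Rust model of the behaviour the reporter expected, where filtering keeps nulls as nulls. It models the array as a slice of `Option<&str>` and is not the arrow `filter` kernel; `filter_preserving_nulls` is a hypothetical name.

```rust
/// Keeps the elements of `values` whose mask entry is true, preserving
/// nulls (`None`) instead of turning them into empty strings.
/// A plain-Rust model for illustration, not the arrow kernel.
fn filter_preserving_nulls<'a>(
    values: &[Option<&'a str>],
    mask: &[bool],
) -> Vec<Option<&'a str>> {
    assert_eq!(values.len(), mask.len(), "mask must match the array length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect()
}
```

With the four-element input from the test case above and an all-true mask, this model returns four elements of which two are null, which is what the failing assertions check for.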





[jira] [Resolved] (ARROW-10204) [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature enabled

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10204.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8378
[https://github.com/apache/arrow/pull/8378]

> [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature 
> enabled
> -
>
> Key: ARROW-10204
> URL: https://issues.apache.org/jira/browse/ARROW-10204
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code}
>  execution::context::tests::aggregate_grouped_empty stdout 
> thread 'execution::context::tests::aggregate_grouped_empty' panicked at 
> 'assertion failed: `(left == right)`
>   left: `["0,0.0"]`,
>  right: `[]`', datafusion/src/execution/context.rs:883:9
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}





[jira] [Closed] (ARROW-5440) [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos

2020-10-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale closed ARROW-5440.
-
Resolution: Cannot Reproduce

From the comments, it sounds like this is no longer an issue.

> [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos
> -
>
> Key: ARROW-5440
> URL: https://issues.apache.org/jira/browse/ARROW-5440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
> Environment: CentOS Linux release 7.6.1810 (Core) 
>Reporter: Tenzin Rigden
>Priority: Major
> Attachments: parquet-test-libstd.tar.gz, serde_json_test.tar.gz
>
>
> Hello,
> In the rust parquet implementation ([https://github.com/sunchao/parquet-rs]) 
> on centos, the binary created has a `libstd-hash.so` shared library 
> dependency that is causing issues since it's a shared library found in the 
> rustup directory. This `libstd-hash.so` dependency isn't there on any other 
> rust binaries I've made before. This dependency means that I can't run this 
> binary anywhere where rustup isn't installed with that exact libstd library.
> This is not an issue on Mac.
> I've attached the rust files and here is the command line output below.
> {code:java|title=cli-output|borderStyle=solid}
> [centos@_ parquet-test]$ cat /etc/centos-release
> CentOS Linux release 7.6.1810 (Core)
> [centos@_ parquet-test]$ rustc --version
> rustc 1.36.0-nightly (e70d5386d 2019-05-27)
> [centos@_ parquet-test]$ ldd target/release/parquet-test
> linux-vdso.so.1 =>  (0x7ffd02fee000)
> libstd-44988553032616b2.so => not found
> librt.so.1 => /lib64/librt.so.1 (0x7f6ecd209000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f6eccfed000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f6eccdd7000)
> libc.so.6 => /lib64/libc.so.6 (0x7f6ecca0a000)
> libm.so.6 => /lib64/libm.so.6 (0x7f6ecc708000)
> /lib64/ld-linux-x86-64.so.2 (0x7f6ecd8b1000)
> [centos@_ parquet-test]$ ls -l 
> ~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
> -rw-r--r--. 1 centos centos 5623568 May 27 21:46 
> /home/centos/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so
> {code}





[jira] [Created] (ARROW-10299) [Rust] Support reading and writing V5 of IPC metadata

2020-10-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10299:
--

 Summary: [Rust] Support reading and writing V5 of IPC metadata
 Key: ARROW-10299
 URL: https://issues.apache.org/jira/browse/ARROW-10299
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


This is mostly alignment issues and tracking when we encounter the v4 legacy 
padding.

I had done this work in another branch, but discarded it without noticing.
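
As a rough illustration of the alignment arithmetic involved, the sketch below rounds a length up to a multiple of a power-of-two alignment. The function name is hypothetical, and the use of an 8-byte boundary in the example is an assumption about what the reader and writer need, not something stated in this ticket.

```rust
/// Rounds `len` up to the next multiple of `alignment`.
/// `alignment` must be a power of two. Hypothetical helper for illustration.
fn pad_to_multiple(len: usize, alignment: usize) -> usize {
    assert!(alignment.is_power_of_two(), "alignment must be a power of two");
    (len + alignment - 1) & !(alignment - 1)
}
```

For example, `pad_to_multiple(13, 8)` gives 16, so a 13-byte block would be followed by 3 bytes of padding under that assumption.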





[jira] [Created] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10191:
--

 Summary: [Rust] [Parquet] Add roundtrip tests for single column 
batches
 Key: ARROW-10191
 URL: https://issues.apache.org/jira/browse/ARROW-10191
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


To aid with test coverage and picking up information loss during Parquet and 
Arrow roundtrips, we can add tests that assert that all supported Arrow 
datatypes can be written and read correctly.





[jira] [Created] (ARROW-10198) [Dev] Python merge script doesn't close PRs if not merged on master

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10198:
--

 Summary: [Dev] Python merge script doesn't close PRs if not merged 
on master
 Key: ARROW-10198
 URL: https://issues.apache.org/jira/browse/ARROW-10198
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Affects Versions: 1.0.1
Reporter: Neville Dipale


When using the merge script to merge PRs against non-master branches, the PR on 
GitHub doesn't get closed.





[jira] [Created] (ARROW-10289) [Rust] Support reading dictionary streams

2020-10-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10289:
--

 Summary: [Rust] Support reading dictionary streams
 Key: ARROW-10289
 URL: https://issues.apache.org/jira/browse/ARROW-10289
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


We support reading dictionaries in the IPC file reader.

We should do the same with the stream reader.





[jira] [Resolved] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10236.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8460
[https://github.com/apache/arrow/pull/8460]

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the cast types that DataFusion thinks are valid are only a subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly, so that when I add support for DictionaryArray casts 
> in Arrow they are also part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao].
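
One way to keep plan-time checks and the cast kernel in agreement is to have both consult a single function that says which casts are supported. The sketch below uses a toy type enum; `Ty`, `can_cast` and `plan_cast` are hypothetical names, and the listed casts are made up for the example rather than taken from the arrow cast kernel.

```rust
/// Toy stand-in for a data type enum (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Single source of truth for which casts are supported.
/// The rules listed here are invented for the example.
fn can_cast(from: Ty, to: Ty) -> bool {
    use Ty::*;
    match (from, to) {
        (a, b) if a == b => true,
        (Int32, Int64) | (Int32, Float64) | (Int64, Float64) => true,
        (Int32, Utf8) | (Int64, Utf8) | (Float64, Utf8) => true,
        _ => false,
    }
}

/// Plan-time check: rejects a cast early by asking the same function the
/// execution side would use, so the two cannot drift apart.
fn plan_cast(from: Ty, to: Ty) -> Result<(Ty, Ty), String> {
    if can_cast(from, to) {
        Ok((from, to))
    } else {
        Err(format!("unsupported cast from {:?} to {:?}", from, to))
    }
}
```

With this shape, adding a new cast to `can_cast` makes it plannable without touching the planner.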





[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10236:
---
Component/s: Rust

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the cast types that DataFusion thinks are valid are only a subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly, so that when I add support for DictionaryArray casts 
> in Arrow they are also part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao].





[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10236:
---
Affects Version/s: 2.0.0

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the cast types that DataFusion thinks are valid are only a subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly, so that when I add support for DictionaryArray casts 
> in Arrow they are also part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao].





[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-15 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215162#comment-17215162
 ] 

Neville Dipale commented on ARROW-10187:


[~andygrove] 64-bit types and offsets would also be a blocker for supporting 
wasm32.

If someone completes ARROW-9453, perhaps we can gauge from that how much effort 
it would take to support 32-bit.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}
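
One plausible reason for size assertions like the ones above to differ between 64-bit and 32-bit targets is that the expected numbers were hard-coded for a 64-bit pointer width. This is an assumption, not something the ticket confirms. The sketch below shows the general idea of deriving an expected size from the target's pointer width; `Header` and `expected_header_size` are hypothetical and unrelated to the arrow crate's own types.

```rust
use std::mem::size_of;

/// Toy stand-in for a structure holding a pointer and two lengths
/// (hypothetical, for illustration only).
#[allow(dead_code)]
struct Header {
    ptr: *const u8,
    len: usize,
    offset: usize,
}

/// Expected size derived from the target's pointer width instead of a
/// constant that is only right on 64-bit targets.
fn expected_header_size() -> usize {
    3 * size_of::<usize>()
}
```

An assertion written as `size_of::<Header>() == expected_header_size()` would hold on both 32-bit and 64-bit targets, whereas a literal such as 24 would only hold on the latter.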





[jira] [Assigned] (ARROW-5350) [Rust] Support filtering on primitive/string lists

2020-10-17 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-5350:
-

Assignee: Neville Dipale

> [Rust] Support filtering on primitive/string lists
> --
>
> Key: ARROW-5350
> URL: https://issues.apache.org/jira/browse/ARROW-5350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We currently only filter on primitive types, but not on lists and structs. 
> Add the ability to filter on nested array types.
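
A minimal sketch of what filtering a list column involves, assuming the usual offsets-plus-flat-values layout for list arrays. It uses plain vectors and ignores nulls; `filter_list` is a hypothetical helper, not the implementation from the pull request.

```rust
/// Filters a list column stored as `offsets` plus flat `values`, keeping
/// list `i` when `mask[i]` is true. List `i` spans
/// `values[offsets[i]..offsets[i + 1]]`. Nulls are ignored in this sketch.
fn filter_list(offsets: &[usize], values: &[i32], mask: &[bool]) -> (Vec<usize>, Vec<i32>) {
    assert_eq!(offsets.len(), mask.len() + 1, "need one more offset than lists");
    let mut new_offsets = vec![0];
    let mut new_values = Vec::new();
    for (i, keep) in mask.iter().enumerate() {
        if *keep {
            // Copy the child values of the kept list and record where it ends.
            new_values.extend_from_slice(&values[offsets[i]..offsets[i + 1]]);
            new_offsets.push(new_values.len());
        }
    }
    (new_offsets, new_values)
}
```

The point of the sketch is that filtering a nested type means rebuilding both the offsets and the child values, not just selecting slots as with primitive arrays.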





[jira] [Resolved] (ARROW-5350) [Rust] Support filtering on primitive/string lists

2020-10-17 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-5350.
---
Resolution: Fixed

Issue resolved by pull request 8364
[https://github.com/apache/arrow/pull/8364]

> [Rust] Support filtering on primitive/string lists
> --
>
> Key: ARROW-5350
> URL: https://issues.apache.org/jira/browse/ARROW-5350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We currently only filter on primitive types, but not on lists and structs. 
> Add the ability to filter on nested array types.





[jira] [Resolved] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray

2020-10-17 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10334.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8484
[https://github.com/apache/arrow/pull/8484]

> [Rust] [Parquet] Support reading and writing Arrow NullArray
> 
>
> Key: ARROW-10334
> URL: https://issues.apache.org/jira/browse/ARROW-10334
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray

2020-10-17 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10334:
--

Assignee: Neville Dipale

> [Rust] [Parquet] Support reading and writing Arrow NullArray
> 
>
> Key: ARROW-10334
> URL: https://issues.apache.org/jira/browse/ARROW-10334
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-7842) [Rust] [Parquet] Implement array reader for list type

2020-10-17 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-7842.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8449
[https://github.com/apache/arrow/pull/8449]

> [Rust] [Parquet] Implement array reader for list type
> -
>
> Key: ARROW-7842
> URL: https://issues.apache.org/jira/browse/ARROW-7842
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Morgan Cassels
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Currently the array reader does not support list or map types. The initial PR 
> implementing the array reader (https://issues.apache.org/jira/browse/ARROW-4218) 
> says that list and map support will come later. Is it known when support for 
> list types might be implemented?





[jira] [Created] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray

2020-10-17 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10334:
--

 Summary: [Rust] [Parquet] Support reading and writing Arrow 
NullArray
 Key: ARROW-10334
 URL: https://issues.apache.org/jira/browse/ARROW-10334
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale








[jira] [Updated] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10163:
---
Component/s: Rust - DataFusion

> [Rust] [DataFusion] Add DictionaryArray coercion support
> 
>
> Key: ARROW-10163
> URL: https://issues.apache.org/jira/browse/ARROW-10163
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> --- 
> There is code in the DataFusion physical planner that coerces arguments to 
> compatible types for some expressions (e.g. for equals: 
> https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153)
> This code needs to be modified to understand dictionary types (so, for 
> example, we can express a predicate like col1 = "foo", where col1 is a 
> DictionaryArray).
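
As an illustration of what coercion could mean here, the sketch below unpacks a dictionary-encoded string column to its value type and then evaluates an equality predicate against a literal. This is one possible approach under stated assumptions, not necessarily what the planner change does; the column is modelled with plain slices, and `unpack_dictionary` and `eq_literal` are hypothetical names.

```rust
/// Unpacks a dictionary-encoded column: `keys[i]` indexes into `values`,
/// and a `None` key is a null. Hypothetical helper for illustration.
fn unpack_dictionary<'a>(keys: &[Option<usize>], values: &[&'a str]) -> Vec<Option<&'a str>> {
    keys.iter().map(|key| key.map(|idx| values[idx])).collect()
}

/// Evaluates `column = literal`, propagating nulls.
fn eq_literal(column: &[Option<&str>], literal: &str) -> Vec<Option<bool>> {
    column.iter().map(|value| value.map(|s| s == literal)).collect()
}
```

Once both sides of the comparison have the same plain string type, an existing string equality kernel can be reused unchanged, at the cost of materialising the repeated strings.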





[jira] [Updated] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10163:
---
Affects Version/s: 2.0.0

> [Rust] [DataFusion] Add DictionaryArray coercion support
> 
>
> Key: ARROW-10163
> URL: https://issues.apache.org/jira/browse/ARROW-10163
> Project: Apache Arrow
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> --- 
> There is code in the DataFusion physical planner that coerces arguments to 
> compatible types for some expressions (e.g. for equals: 
> https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153)
> This code needs to be modified to understand dictionary types (so, for 
> example, we can express a predicate like col1 = "foo", where col1 is a 
> DictionaryArray).





[jira] [Assigned] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10002:
--

Assignee: Jorge Leitão

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.
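
For readers unfamiliar with the trade-off, here is a minimal sketch of one stable-Rust alternative to specialization: implement the trait per concrete type and put shared behaviour in a provided trait method that individual impls may override. The `Encode` trait is a hypothetical example and is not how the pull request restructured the Arrow or Parquet code.

```rust
/// Hypothetical trait for illustration. With specialization one might write a
/// blanket impl containing `default fn` and override it for some types; on
/// stable Rust each type implements the trait itself.
trait Encode {
    fn encode(&self) -> Vec<u8>;

    /// Provided method: shared behaviour that an impl may override without
    /// needing the unstable specialization feature.
    fn encoded_len(&self) -> usize {
        self.encode().len()
    }
}

impl Encode for i32 {
    fn encode(&self) -> Vec<u8> {
        self.to_le_bytes().to_vec()
    }

    // Override the provided method with a cheaper constant.
    fn encoded_len(&self) -> usize {
        4
    }
}

impl Encode for bool {
    fn encode(&self) -> Vec<u8> {
        vec![*self as u8]
    }
}
```

The cost of this approach is one impl per type (often generated with a macro) instead of a single blanket impl, but it compiles on stable Rust.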





[jira] [Resolved] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10002.

Resolution: Fixed

Issue resolved by pull request 8485
[https://github.com/apache/arrow/pull/8485]

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.





[jira] [Commented] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

2020-10-18 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216350#comment-17216350
 ] 

Neville Dipale commented on ARROW-10159:


[~alamb] if there aren't more subtasks, we can mark this as completed. Thanks 
for getting this done.

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> --
>
> Key: ARROW-10159
> URL: https://issues.apache.org/jira/browse/ARROW-10159
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low cardinality string data (aka there 
> are only a few distinct values, but there are many millions of values).
> Using a `StringArray` is very expensive as the same string value is copied 
> over and over again. The `DictionaryArray` was designed to handle exactly 
> this situation: rather than repeating each string, it uses indexes into a 
> dictionary and thus repeats integer values. 
> Sadly, DataFusion does not support processing on `DictionaryArray` types for 
> several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I 
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
>     // ensure that data fusion can operate on dictionary types
>     // Use StringDictionary (32 bit indexes = keys)
>     let field_type = DataType::Dictionary(
>         Box::new(DataType::Int32),
>         Box::new(DataType::Utf8),
>     );
>     let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
>     let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
>     let values_builder = StringBuilder::new(10);
>     let mut builder = StringDictionaryBuilder::new(
>         keys_builder, values_builder
>     );
>     builder.append("one")?;
>     builder.append_null()?;
>     builder.append("three")?;
>     let array = Arc::new(builder.finish());
>     let data = RecordBatch::try_new(
>         schema.clone(),
>         vec![array],
>     )?;
>     let table = MemTable::new(schema, vec![vec![data]])?;
>     let mut ctx = ExecutionContext::new();
>     ctx.register_table("test", Box::new(table));
>     // Basic SELECT
>     let sql = "SELECT * FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\nNULL\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // basic filtering
>     let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // filtering with constant
>     let sql = "SELECT * FROM test WHERE d1 = 'three'";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // Expression evaluation
>     let sql = "SELECT concat(d1, '-foo') FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
>     assert_eq!(expected, actual);
>     // aggregation
>     let sql = "SELECT COUNT(d1) FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "2".to_string();
>     assert_eq!(expected, actual);
>     Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == 
> right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I 
> will break the work down into several smaller subtasks.
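
The dictionary encoding described in the ticket above can be illustrated with a small standalone sketch. `ToyStringDictionary` is a hypothetical type written for this note, not the arrow crate's `DictionaryArray`; it only shows why repeated strings become repeated integer keys.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded string column (hypothetical, for illustration only).
/// Each distinct string is stored once in `values`; every row holds an i32 key
/// into `values`, or None for a null row.
#[derive(Debug, Default)]
pub struct ToyStringDictionary {
    pub values: Vec<String>,
    pub keys: Vec<Option<i32>>,
    lookup: HashMap<String, i32>,
}

impl ToyStringDictionary {
    /// Append a value, reusing the existing key if the string was seen before.
    pub fn append(&mut self, value: &str) {
        let existing = self.lookup.get(value).copied();
        let key = match existing {
            Some(k) => k,
            None => {
                let k = self.values.len() as i32;
                self.values.push(value.to_string());
                self.lookup.insert(value.to_string(), k);
                k
            }
        };
        self.keys.push(Some(key));
    }

    /// Append a null row.
    pub fn append_null(&mut self) {
        self.keys.push(None);
    }

    /// Resolve row `i` back to its string, or None for a null row.
    pub fn get(&self, i: usize) -> Option<&str> {
        self.keys[i].map(|k| self.values[k as usize].as_str())
    }
}
```

Appending "one", null, "three", "one" stores two strings and four small keys; with millions of rows and a handful of distinct values, the saving over one copied string per row is substantial.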





[jira] [Resolved] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10163.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8463
[https://github.com/apache/arrow/pull/8463]

> [Rust] [DataFusion] Add DictionaryArray coercion support
> 
>
> Key: ARROW-10163
> URL: https://issues.apache.org/jira/browse/ARROW-10163
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> --- 
> There is code in the datafusion physical planner that coerces arguments to 
> compatible types for some expressions (e.g. for equals: 
> https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153)
> This code needs to be modified to understand dictionary types (so, for 
> example, we can express a predicate like col1 = "foo", where col1 is a 
> DictionaryArray).
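
One way to picture the coercion requested above is to compare operands by their dictionary value type. The sketch below uses a hypothetical `ToyType` enum and `coerce_for_equality` function invented for this note; it is not DataFusion's planner code and makes no claim about how the linked pull request solved it.

```rust
/// Simplified stand-in for a data type enum (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<ToyType>, Box<ToyType>),
}

/// Strip one level of dictionary encoding, yielding the value type.
fn value_type(t: &ToyType) -> &ToyType {
    match t {
        ToyType::Dictionary(_, value) => value.as_ref(),
        other => other,
    }
}

/// Common type for `lhs = rhs`, or None if the operands are incompatible.
/// Assumption: a dictionary column is comparable with anything its value
/// type is comparable with.
pub fn coerce_for_equality(lhs: &ToyType, rhs: &ToyType) -> Option<ToyType> {
    let (l, r) = (value_type(lhs), value_type(rhs));
    if l == r {
        Some(l.clone())
    } else {
        None
    }
}
```

Under this assumption a `Dictionary(Int32, Utf8)` column compared with a `Utf8` literal coerces to `Utf8`, which is the col1 = "foo" case from the description.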





[jira] [Updated] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

2020-10-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10159:
---
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> --
>
> Key: ARROW-10159
> URL: https://issues.apache.org/jira/browse/ARROW-10159
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low cardinality string data (i.e. there 
> are only a few distinct values, but there are many millions of values).
> Using a `StringArray` is very expensive as the same string value is copied 
> over and over again. The `DictionaryArray` was designed to handle exactly 
> this situation: rather than repeating each string, it uses indexes into a 
> dictionary and thus repeats integer values. 
> Sadly, DataFusion does not support processing on `DictionaryArray` types for 
> several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I 
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
>     // ensure that data fusion can operate on dictionary types
>     // Use StringDictionary (32 bit indexes = keys)
>     let field_type = DataType::Dictionary(
>         Box::new(DataType::Int32),
>         Box::new(DataType::Utf8),
>     );
>     let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
>     let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
>     let values_builder = StringBuilder::new(10);
>     let mut builder = StringDictionaryBuilder::new(
>         keys_builder, values_builder
>     );
>     builder.append("one")?;
>     builder.append_null()?;
>     builder.append("three")?;
>     let array = Arc::new(builder.finish());
>     let data = RecordBatch::try_new(
>         schema.clone(),
>         vec![array],
>     )?;
>     let table = MemTable::new(schema, vec![vec![data]])?;
>     let mut ctx = ExecutionContext::new();
>     ctx.register_table("test", Box::new(table));
>     // Basic SELECT
>     let sql = "SELECT * FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\nNULL\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // basic filtering
>     let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // filtering with constant
>     let sql = "SELECT * FROM test WHERE d1 = 'three'";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // Expression evaluation
>     let sql = "SELECT concat(d1, '-foo') FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
>     assert_eq!(expected, actual);
>     // aggregation
>     let sql = "SELECT COUNT(d1) FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "2".to_string();
>     assert_eq!(expected, actual);
>     Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == 
> right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I 
> will break the work down into several smaller subtasks.





[jira] [Created] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10261:
--

 Summary: [Rust] [BREAKING] Lists should take Field instead of 
DataType
 Key: ARROW-10261
 URL: https://issues.apache.org/jira/browse/ARROW-10261
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


There is currently no way of tracking nested field metadata on lists. For 
example, if a list's children are nullable, there's no way of telling just by 
looking at the Field.

This causes problems with integration testing, and also affects Parquet 
roundtrips.

I propose the breaking change of [Large|FixedSize]List taking a Field instead 
of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
implementation passes integration tests.
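
The difference between the two list shapes can be sketched with simplified stand-in types. `ToyField`, `ToyDataType` and `child_nullable` are hypothetical names for this note and do not match the arrow crate's definitions; the point is only that a list holding a boxed type cannot answer whether its child is nullable, while a list holding a boxed field can.

```rust
/// Simplified stand-in for a schema field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyField {
    pub name: String,
    pub data_type: ToyDataType,
    pub nullable: bool,
}

/// Simplified stand-in for a data type enum (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyDataType {
    Int32,
    /// Current shape: only the child's type survives.
    ListOfType(Box<ToyDataType>),
    /// Proposed shape: the child's name and nullability travel with it.
    ListOfField(Box<ToyField>),
}

/// Returns the child's nullability if the list type records it.
pub fn child_nullable(dt: &ToyDataType) -> Option<bool> {
    match dt {
        ToyDataType::ListOfField(field) => Some(field.nullable),
        _ => None,
    }
}
```

With `ListOfType` the function has nothing to return, which mirrors the information loss described above.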

CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] as 
this addresses some of the roundtrip failures).

I'm leaning towards this landing in 3.0.0, as I'd love for us to have completed 
or made significant traction on the Arrow Parquet writer (and reader), and 
integration testing, by then.





[jira] [Created] (ARROW-10258) [Rust] Support extension arrays

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10258:
--

 Summary: [Rust] Support extension arrays
 Key: ARROW-10258
 URL: https://issues.apache.org/jira/browse/ARROW-10258
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


This should include:
 * supporting the Arrow format
 * supporting field metadata

We can optionally:
 * support recognising known extensions (like UUID)

I'm mainly opening this up for wider visibility, I noticed that I was catching 
strays from metadata integration tests failing because Field doesn't support 
metadata :(





[jira] [Updated] (ARROW-10258) [Rust] Support extension arrays

2020-10-09 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10258:
---
Fix Version/s: 3.0.0

> [Rust] Support extension arrays
> ---
>
> Key: ARROW-10258
> URL: https://issues.apache.org/jira/browse/ARROW-10258
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 3.0.0
>
>
> This should include:
>  * supporting the Arrow format
>  * supporting field metadata
> We can optionally:
>  * support recognising known extensions (like UUID)
> I'm mainly opening this up for wider visibility, I noticed that I was 
> catching strays from metadata integration tests failing because Field doesn't 
> support metadata :(





[jira] [Created] (ARROW-10259) [Rust] Support field metadata

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10259:
--

 Summary: [Rust] Support field metadata
 Key: ARROW-10259
 URL: https://issues.apache.org/jira/browse/ARROW-10259
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


The biggest hurdle to adding field metadata is HashMap and HashSet not 
implementing Hash, Ord and PartialOrd.

I was thinking of implementing the metadata as a Vec<(String, String)> to 
overcome this limitation, and then serializing correctly to JSON.
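
A minimal sketch of the `Vec<(String, String)>` idea, assuming the pairs are kept sorted so that equal metadata compares and hashes equally. `MetaField` and `hash_of` are hypothetical names for this note, not the arrow crate's `Field`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Field with metadata stored as key/value pairs (hypothetical).
/// Unlike a HashMap, a Vec of tuples lets all of these traits be derived.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct MetaField {
    pub name: String,
    pub metadata: Vec<(String, String)>,
}

impl MetaField {
    /// Build a field, sorting the metadata into a canonical order so that
    /// insertion order does not affect equality or hashing.
    pub fn new(name: &str, mut metadata: Vec<(String, String)>) -> Self {
        metadata.sort();
        MetaField { name: name.to_string(), metadata }
    }
}

/// Hash a field with the standard library's default hasher.
pub fn hash_of(field: &MetaField) -> u64 {
    let mut hasher = DefaultHasher::new();
    field.hash(&mut hasher);
    hasher.finish()
}
```

The standard library's `BTreeMap<String, String>` also implements `Hash`, `Ord` and `PartialOrd`, so it could be an alternative that keeps keys unique and ordered without a manual sort.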





[jira] [Created] (ARROW-10269) [Rust] Update nightly: Oct 2020 Edition

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10269:
--

 Summary: [Rust] Update nightly: Oct 2020 Edition
 Key: ARROW-10269
 URL: https://issues.apache.org/jira/browse/ARROW-10269
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


We should update to a more recent nightly after the 2.0.0 release. It carries 
some clippy annoyances, which will mean that I have to revert much of what I 
did around float comparisons.

Might also be preferable to do this sooner, so that we can complete the clippy 
integration and throw away the carrot in favour of the stick.





[jira] [Created] (ARROW-10268) [Rust] Support writing dictionaries to IPC file and stream

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10268:
--

 Summary: [Rust] Support writing dictionaries to IPC file and stream
 Key: ARROW-10268
 URL: https://issues.apache.org/jira/browse/ARROW-10268
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


We currently do not support writing dictionary arrays to the IPC file and 
stream format.

When this is supported, we can test the integration with other implementations.





[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-10 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211654#comment-17211654
 ] 

Neville Dipale commented on ARROW-10261:


[~jhorstmann] nullability should be determined by the overall field for 
consistency; as you could have 1000 batches of 1000 records, but only have say 
5 nulls scattered around.

The main issue is that if I have a non-nullable list, which in turn has a 
nullable struct with various child fields with differing nullability; I won't 
know if the struct is nullable, because I lose that information when only 
taking the field.

Also, in the hypothetical case where the struct has some metadata of its own, 
it gets lost because we would only keep the DataType, and not other attributes 
such as dictionary or metadata (HashMap).

Interestingly, looking at the CPP implementation, it looks like they still use 
List, but I can't see how they preserve the extra details that the 
Rust implementation is failing because of. [~apitrou] any ideas?

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.





[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10271:
---
Priority: Blocker  (was: Major)

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Priority: Blocker
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10271:
---
Affects Version/s: 1.0.1

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Priority: Major
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10271:
---
Component/s: Rust

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Assignee: Neville Dipale
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10271:
---
Fix Version/s: 2.0.0

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Assigned] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10271:
--

Assignee: Neville Dipale

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Assignee: Neville Dipale
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Commented] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-11 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211839#comment-17211839
 ] 

Neville Dipale commented on ARROW-10271:


I was planning on doing a pass to check if there are dependencies that we could 
bump. I'm aware of the packed_simd_2 change, and was planning on addressing it.

While we use an old nightly (call it a six-monthly at this stage), this issue 
will definitely break a lot of code for users.

> [Rust] packed_simd is broken and continued under a new project
> --
>
> Key: ARROW-10271
> URL: https://issues.apache.org/jira/browse/ARROW-10271
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Ritchie
>Assignee: Neville Dipale
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The dependency doesn't compile on newer versions of nightly. This is also 
> known by the (new) project maintainers. Due to complications they continued 
> the project under a new name: `packed_simd_2`.
>  
> packed_simd = { version = "0.3.4", package = "packed_simd_2" }
>  
> See:
> https://github.com/rust-lang/packed_simd





[jira] [Updated] (ARROW-10274) [Rust] arithmetic without SIMD does unnecesary copy

2020-10-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10274:
---
Component/s: Rust

> [Rust] arithmetic without SIMD does unnecesary copy
> ---
>
> Key: ARROW-10274
> URL: https://issues.apache.org/jira/browse/ARROW-10274
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Ritchie
>Priority: Minor
>
> The arithmetic kernels that don't use SIMD create a `vec` in memory and later 
> copy that data into a Buffer. Maybe we could directly write the arithmetic 
> result to a mutable buffer and prevent this redundant copy?
>  
>  
> {code:java}
> let values = (0..left.len())
>     .map(|i| op(left.value(i), right.value(i)))
>     .collect::<Vec<T::Native>>();
>
> let data = ArrayData::new(
>     T::get_data_type(),
>     left.len(),
>     None,
>     null_bit_buffer,
>     0,
>     vec![Buffer::from(values.to_byte_slice())],
>     vec![],
> );{code}
>  
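
The suggestion above, writing results straight into the output buffer instead of collecting into a `Vec` and copying, can be sketched with the standard library alone. `binary_op_to_bytes` is a hypothetical helper that uses a plain `Vec<u8>` as a stand-in for Arrow's mutable buffer; it is not the arrow crate's API.

```rust
/// Apply `op` element-wise and write each result's native-endian bytes
/// directly into one preallocated byte buffer, so no intermediate
/// Vec<i32> is built and copied afterwards.
pub fn binary_op_to_bytes(
    left: &[i32],
    right: &[i32],
    op: impl Fn(i32, i32) -> i32,
) -> Vec<u8> {
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    // Reserve the exact output size up front to avoid reallocation.
    let mut buffer = Vec::with_capacity(left.len() * std::mem::size_of::<i32>());
    for (l, r) in left.iter().zip(right.iter()) {
        buffer.extend_from_slice(&op(*l, *r).to_ne_bytes());
    }
    buffer
}
```

Whether this is actually faster than collect-then-copy would need benchmarking; the sketch only shows that the second allocation and copy are avoidable.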





[jira] [Resolved] (ARROW-10168) [Rust] [Parquet] Extend arrow schema conversion to projected fields

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10168.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8354
[https://github.com/apache/arrow/pull/8354]

> [Rust] [Parquet] Extend arrow schema conversion to projected fields
> ---
>
> Key: ARROW-10168
> URL: https://issues.apache.org/jira/browse/ARROW-10168
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When writing Arrow data to Parquet, we serialise the schema's IPC 
> representation. This schema is then read back by the Parquet reader, and used 
> to preserve the array type information from the original Arrow data.
> We however do not rely on the above mechanism when reading projected columns 
> from a Parquet file; i.e. if we have a file with 3 columns, but we only read 
> 2 columns, we do not yet rely on the serialised arrow schema; and can thus 
> lose type information.
> This behaviour was deliberately left out, as the function 
> *parquet_to_arrow_schema_by_columns* does not check for the existence of 
> arrow schema in the metadata.





[jira] [Created] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests

2020-10-07 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10225:
--

 Summary: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip 
tests
 Key: ARROW-10225
 URL: https://issues.apache.org/jira/browse/ARROW-10225
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


The Arrow spec makes the null bitmap optional if an array has no nulls 
[~carols10cents], so the tests were failing because we're comparing 
`None` with a 100% populated bitmap.
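
A comparison that treats a missing bitmap as "every slot valid" avoids that mismatch. The sketch below is a standalone illustration with hypothetical helper names (`bit_is_set`, `validity_equal`), assuming least-significant-bit numbering as in the Arrow format; it is not the fix that landed in the Rust implementation.

```rust
/// True if bit `i` is set in a bitmap packed least-significant bit first.
fn bit_is_set(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1u8 << (i % 8)) != 0
}

/// Compare two optional validity bitmaps over `len` slots,
/// treating `None` as "all slots valid".
pub fn validity_equal(a: Option<&[u8]>, b: Option<&[u8]>, len: usize) -> bool {
    (0..len).all(|i| {
        let a_valid = a.map_or(true, |bitmap| bit_is_set(bitmap, i));
        let b_valid = b.map_or(true, |bitmap| bit_is_set(bitmap, i));
        a_valid == b_valid
    })
}
```

Only the first `len` bits are inspected, so padding bits in the last byte do not affect the result.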





[jira] [Assigned] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10225:
--

Assignee: Neville Dipale

> [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we're comparing 
> `None` with a 100% populated bitmap.





[jira] [Commented] (ARROW-10350) [Rust] parquet_derive crate cannot be published to crates.io

2020-10-19 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217141#comment-17217141
 ] 

Neville Dipale commented on ARROW-10350:


I added them as part of another commit, but the pre-release tests were failing. 
I couldn't figure out what the problem was, so I reverted the changes.

I think it's fine that we don't have the crate published as part of this 
release. Users can still use it from git for now.

> [Rust] parquet_derive crate cannot be published to crates.io
> 
>
> Key: ARROW-10350
> URL: https://issues.apache.org/jira/browse/ARROW-10350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> The new parquet_derive crate is missing some fields in the Cargo manifest so 
> cannot be published.
> {code:java}
>Uploading parquet_derive v2.0.0 
> (/home/andygrove/arrow-release/apache-arrow-2.0.0/rust/parquet_derive)
> error: api errors (status 200 OK): missing or empty metadata fields: 
> description, license. Please see 
> https://doc.rust-lang.org/cargo/reference/manifest.html for how to upload 
> metadata
>  {code}





[jira] [Commented] (ARROW-9742) [Rust] Create one standard DataFrame API

2020-08-17 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179274#comment-17179274
 ] 

Neville Dipale commented on ARROW-9742:
---

Hi [~jhorstmann], the scalar functions on the rust-dataframe library mainly 
call the Arrow compute functions. As we have implemented compute functions with 
an array being the smallest unit, I iterate the chunked arrays and call scalar 
functions on the arrays, before grouping them again into a chunk.

I explored using Rayon for parallelising those compute functions, but it's not a 
priority (the project is really for me to explore ideas, with the goal being to 
create a lazy dataframe à la Spark).

There's scope to add a lot of compute functions to Arrow so that downstream 
users can reuse them, and so we can optimise performance from one place. I 
haven't yet seen interest in functions like trig, temporal functions (I have a 
Jira open for this as I tend to do a lot of datetime conversions), and other 
functions beyond what we have. I think DF has some of these as UDFs, which 
probably makes sense to keep them there for now.

Regarding performance, we've found some patterns that help with 
autovectorisation when writing compute functions. I think we could at least 
write them up so that downstream users can follow them.

One common mistake I've seen is that we iterate through array values, checking 
if a slot is valid or null, and computing the function only if valid. An approach 
that works is to compute over every slot regardless of nulls, and derive the 
result's nulls from the validity mask.
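
A minimal sketch of that pattern, assuming validity is held as a `Vec<bool>` for readability (Arrow itself uses packed bitmaps). `add_ignoring_nulls` is a hypothetical function for this note, not an Arrow compute kernel.

```rust
/// Add two columns of equal length. Values are computed for every slot,
/// null or not, so the value loop has no per-element branch; the output
/// validity is derived separately by AND-ing the input validity masks.
pub fn add_ignoring_nulls(
    left: &[i32],
    left_valid: &[bool],
    right: &[i32],
    right_valid: &[bool],
) -> (Vec<i32>, Vec<bool>) {
    let values: Vec<i32> = left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| l.wrapping_add(*r))
        .collect();
    let valid: Vec<bool> = left_valid
        .iter()
        .zip(right_valid.iter())
        .map(|(a, b)| *a && *b)
        .collect();
    (values, valid)
}
```

Slots that end up null still carry a computed value; consumers are expected to consult the validity mask and ignore it.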

> [Rust] Create one standard DataFrame API
> 
>
> Key: ARROW-9742
> URL: https://issues.apache.org/jira/browse/ARROW-9742
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There was a discussion in the last Arrow sync call about the fact that there are 
> numerous Rust DataFrame projects and it would be good to have one standard, 
> in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an 
> implementation in DataFusion, and making it possible for other projects to 
> extend/replace the implementation e.g. for distributed compute, or for GPU 
> compute, as two examples. 
> [~jhorstmann] Does this capture what you were suggesting in the call?





[jira] [Created] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format

2020-08-17 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9777:
-

 Summary: [Rust] Implement IPC changes to catch up to 1.0.0 format
 Key: ARROW-9777
 URL: https://issues.apache.org/jira/browse/ARROW-9777
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


There are a number of IPC changes and features which the Rust implementation 
has fallen behind on. It's effectively using the legacy format that was 
released in 0.14.x.

Some that I encountered are:
 * change padding from 4 bytes to 8 bytes (along with the padding algorithm)
 * add an IPC writer option to support the legacy format and updated format
 * add error handling for the different metadata versions; we should support 
v4+, so it's an oversight to not explicitly return errors if unsupported 
versions are read
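
The padding change in the first bullet amounts to rounding lengths up to a multiple of 8 rather than 4. A minimal sketch, assuming power-of-two alignments; `pad_to` is a hypothetical helper, not a function from the arrow crate.

```rust
/// Round `len` up to the next multiple of `alignment`.
/// Assumption: `alignment` is a power of two (4 for the legacy format,
/// 8 for the updated one), which lets the rounding be done with a mask.
pub fn pad_to(len: usize, alignment: usize) -> usize {
    debug_assert!(alignment.is_power_of_two());
    (len + alignment - 1) & !(alignment - 1)
}
```

For a 13-byte payload both alignments happen to give 16, but a 5-byte payload pads to 8 either way while a 9-byte payload pads to 12 under 4-byte alignment and 16 under 8-byte alignment, which is where the two formats diverge.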

Some of the work already has Jiras open (e.g. body compression), I'll find them 
and mark them as related to this.

I'm tight for spare time, but I'll try to work on this before the next release 
(along with the Parquet writer).





[jira] [Resolved] (ARROW-8423) [Rust] [Parquet] Serialize arrow schema into metadata when writing parquet

2020-08-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-8423.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7917
[https://github.com/apache/arrow/pull/7917]

> [Rust] [Parquet] Serialize arrow schema into metadata when writing parquet
> --
>
> Key: ARROW-8423
> URL: https://issues.apache.org/jira/browse/ARROW-8423
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andy Grove
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The C++ implementation uses the "ARROW:schema" metadata key to store the arrow 
> schema. Implement the same for compatibility.
> Having the original Arrow schema is useful for readers as it preserves some 
> properties like dictionary encoding.





[jira] [Assigned] (ARROW-9728) [Rust] [Parquet] Compute nested spacing

2020-08-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9728:
-

Assignee: Neville Dipale

> [Rust] [Parquet] Compute nested spacing
> ---
>
> Key: ARROW-9728
> URL: https://issues.apache.org/jira/browse/ARROW-9728
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> When computing definition levels for deeply nested arrays that include lists, 
> the definition levels are correctly calculated, but they are not translated 
> into correct indexes for the eventual primitive arrays.
> For example, an int32 array could have no null values, but be a child of a 
> list that has null values. If, say, the first 5 values of the int32 array are 
> members of the first list item (i.e. list_array[0] = [1,2,3,4,5]), and that 
> list is itself a child of a struct whose slot at that index is null, all 5 
> values of the int32 array *should* be skipped. Further, the list's definition 
> and repetition levels will be represented by 1 slot instead of the 5.
> The current logic cannot cater for this, and potentially results in slicing 
> the int32 array incorrectly (sometimes including some of those first 5 
> values).
> This Jira is for the work necessary to compute the index into the eventual 
> leaf arrays correctly.
> I started doing it as part of the initial writer PR, but it's complex and is 
> blocking progress.
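
A rough sketch of the index computation described above, under simplifying assumptions: one list level, list offsets given as `usize`, and a single validity flag per list slot taken from the parent struct. The function names are hypothetical and this is not the writer's actual code.

```rust
use std::ops::Range;

/// Ranges of leaf values that should be written.
/// `offsets` are the list offsets (length = number of lists + 1);
/// `parent_valid[i]` says whether the struct slot holding list `i` is valid.
fn leaf_ranges_to_write(offsets: &[usize], parent_valid: &[bool]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    for (i, w) in offsets.windows(2).enumerate() {
        // Children of a null parent are skipped entirely.
        if parent_valid[i] {
            ranges.push(w[0]..w[1]);
        }
    }
    ranges
}

/// Number of definition/repetition level slots the lists occupy.
/// A null parent (or an empty list) takes one slot; a valid list
/// takes one slot per child value.
fn level_slots(offsets: &[usize], parent_valid: &[bool]) -> usize {
    let mut slots = 0;
    for (i, w) in offsets.windows(2).enumerate() {
        let len = w[1] - w[0];
        slots += if parent_valid[i] { len.max(1) } else { 1 };
    }
    slots
}
```

For offsets `[0, 5, 7]` with the first parent null, only leaf values `5..7` are written, and the levels take 3 slots (1 for the null parent, 2 for the second list) rather than 7.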





[jira] [Commented] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files

2020-08-20 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181303#comment-17181303
 ] 

Neville Dipale commented on ARROW-5213:
---

Which reminds me, I made updates to the files a while ago. I had to make manual 
changes because the flatbuffers crate hasn't been updated with some changes.

I'll update the checked-in files when I look at the IPC changes that we haven't 
worked on in Rust.

> [Format] Script for updating various checked-in Flatbuffers files
> -
>
> Key: ARROW-5213
> URL: https://issues.apache.org/jira/browse/ARROW-5213
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Format, Go
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 2.0.0
>
>
> Some subprojects have begun checking in generated Flatbuffers files to source 
> control. This presents a maintainability issue when there are additions or 
> changes made to the .fbs sources. It would be useful to be able to automate 
> the update of these files so it doesn't have to happen on a manual / 
> case-by-case basis





[jira] [Created] (ARROW-9841) [Rust] Update checked-in flatbuffer files

2020-08-24 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9841:
-

 Summary: [Rust] Update checked-in flatbuffer files
 Key: ARROW-9841
 URL: https://issues.apache.org/jira/browse/ARROW-9841
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


We can't automatically generate flatbuffer files in Rust due to a bug with 
required fields. 

The currently checked-in generated files are outdated, and should be updated 
either manually or by building the flatbuffers project from master.





[jira] [Updated] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format

2020-08-24 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9777:
--
Issue Type: Improvement  (was: Bug)

> [Rust] Implement IPC changes to catch up to 1.0.0 format
> 
>
> Key: ARROW-9777
> URL: https://issues.apache.org/jira/browse/ARROW-9777
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Priority: Major
>
> There are a number of IPC changes and features which the Rust implementation 
> has fallen behind on. It's effectively using the legacy format that was 
> released in 0.14.x.
> Some that I encountered are:
>  * change padding from 4 bytes to 8 bytes (along with the padding algorithm)
>  * add an IPC writer option to support the legacy format and updated format
>  * add error handling for the different metadata versions; we should support 
> v4+, so it's an oversight not to explicitly return errors when unsupported 
> versions are read
> Some of the work already has Jiras open (e.g. body compression); I'll find 
> them and mark them as related to this.
> I'm tight for spare time, but I'll try to work on this before the next release 
> (along with the Parquet writer).





[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray

2020-08-24 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183712#comment-17183712
 ] 

Neville Dipale commented on ARROW-9826:
---

Once arrays are built, they're meant to be immutable. Wouldn't this better 
belong in ArrayBuilder?

> [Rust] add set function to PrimitiveArray
> -
>
> Key: ARROW-9826
> URL: https://issues.apache.org/jira/browse/ARROW-9826
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Francesco Gadaleta
>Priority: Major
>
> For in-place value replacement in Array, a `set()` function (maybe unsafe?) 
> would be required.





[jira] [Created] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment

2020-08-24 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9848:
-

 Summary: [Rust] Implement changes to ensure flatbuffer alignment
 Key: ARROW-9848
 URL: https://issues.apache.org/jira/browse/ARROW-9848
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


See ARROW-6313; changes were made to all IPC implementations except Rust.





[jira] [Resolved] (ARROW-10019) [Rust] Add substring kernel

2020-09-27 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10019.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8199
[https://github.com/apache/arrow/pull/8199]

> [Rust] Add substring kernel
> ---
>
> Key: ARROW-10019
> URL: https://issues.apache.org/jira/browse/ARROW-10019
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> substring returns a substring of a StringArray starting at a given index, and 
> with a given optional length.
> {{fn substring(array: , start: i32, length: ) -> 
> Result}}
> This operation is common in strings, and it is useful for string-based 
> transformations
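
A small sketch of the semantics described above, written against plain slices rather than a StringArray. The signature, the character-based indexing and the treatment of a negative start as counting from the end are all assumptions made for illustration, not the kernel's actual behaviour.

```rust
/// Take a substring of every non-null value, starting at `start`
/// (negative values count from the end) with an optional `length`.
/// Indices are in characters, not bytes (an assumption of this sketch).
fn substring(values: &[Option<&str>], start: i32, length: Option<usize>) -> Vec<Option<String>> {
    values
        .iter()
        .map(|&v| {
            v.map(|s| {
                let n = s.chars().count() as i64;
                let start = start as i64;
                // Clamp the starting position into [0, n].
                let begin = if start >= 0 { start.min(n) } else { (n + start).max(0) };
                let tail = s.chars().skip(begin as usize);
                let out: String = match length {
                    Some(len) => tail.take(len).collect(),
                    None => tail.collect(),
                };
                out
            })
        })
        .collect()
}
```

Null entries stay null, so the output has the same length and null positions as the input.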





[jira] [Commented] (ARROW-10108) [Rust] [Parquet] Fix compiler warning about unused return value

2020-09-29 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204366#comment-17204366
 ] 

Neville Dipale commented on ARROW-10108:


There are a lot of clippy issues with the latest nightly too; perhaps we could 
move to the latest nightly after the 2.0.0 release? I'd also like us to 
perform destructive refactors as soon as possible after the release (rearranging 
the arrow::array module by splitting arrays into their own files).

> [Rust] [Parquet] Fix compiler warning about unused return value
> ---
>
> Key: ARROW-10108
> URL: https://issues.apache.org/jira/browse/ARROW-10108
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> When compiling with latest nightly, this warning is produced:
> {code:java}
> warning: unused return value of `std::mem::replace` that must be used
>--> parquet/src/encodings/encoding.rs:391:9
> |
> 391 | mem::replace( self.hash_slots, new_hash_slots);
> | ^^^
> |
> = note: `#[warn(unused_must_use)]` on by default
> = note: if you don't need the old value, you can just assign the new 
> value directly {code}
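
For reference, a minimal sketch of the two ways to silence this warning. The `Table` struct and its `Vec<i32>` field are hypothetical stand-ins; only the names `hash_slots` and `new_hash_slots` come from the warning above.

```rust
use std::mem;

/// Hypothetical stand-in for the encoder that owns `hash_slots`.
struct Table {
    hash_slots: Vec<i32>,
}

impl Table {
    /// The old value is not needed: assign directly, as the compiler note suggests.
    fn resize_assign(&mut self, new_hash_slots: Vec<i32>) {
        self.hash_slots = new_hash_slots;
    }

    /// The old value is needed: keep `mem::replace` and use its return value.
    fn resize_take_old(&mut self, new_hash_slots: Vec<i32>) -> Vec<i32> {
        mem::replace(&mut self.hash_slots, new_hash_slots)
    }
}
```

Plain assignment drops the old value immediately, which is what the discarded `mem::replace` result was doing anyway.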





[jira] [Updated] (ARROW-8421) [Rust] [Parquet] Implement parquet writer

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8421:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] [Parquet] Implement parquet writer
> -
>
> Key: ARROW-8421
> URL: https://issues.apache.org/jira/browse/ARROW-8421
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the parent story. See subtasks for more information.
> Notes from [~wesm] :
> A couple of initial things to keep in mind
>  * Writes of both Nullable (OPTIONAL) and non-nullable (REQUIRED) fields
>  * You can optimize the special case where a nullable field's data has no 
> nulls
>  * A good amount of code is required to handle converting from the Arrow 
> physical form of various logical types to the Parquet equivalent one, see 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc]
>  for details
>  * It would be worth thinking up front about how dictionary-encoded data is 
> handled both on the Arrow write and Arrow read paths. In parquet-cpp we 
> initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary 
> to dense String), and through real world need I was forced to revisit this 
> (quite painfully) to enable Arrow dictionaries to survive roundtrips to 
> Parquet format, and also achieve better performance and memory use in both 
> reads and writes. You can certainly do a dictionary-to-dense conversion like 
> we did, but you may someday find yourselves doing the same painful refactor 
> that I did to make dictionary write and read not only more efficient but also 
> dictionary order preserving.
> Notes from [~sunchao] :
> I roughly skimmed through the C++ implementation and think on the high level 
> we need to do the following:
>  # implement a method similar to {{WriteArrow}} in 
> [column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc].
>  We can further break this up into smaller pieces such as: 
> dictionary/non-dictionary, primitive types, booleans, timestamps, dates, so 
> on and so forth.
>  # implement an arrow writer in the parquet crate 
> [here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow]. 
> This needs to offer similar APIs as 
> [writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h].





[jira] [Updated] (ARROW-8859) [Rust] [Integration Testing] Implement --quiet / verbose correctly

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8859:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] [Integration Testing] Implement --quiet / verbose correctly
> --
>
> Key: ARROW-8859
> URL: https://issues.apache.org/jira/browse/ARROW-8859
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> The Rust tester has verbose=true hard-coded for now.
> When run with {{archery --quiet}}, RustTester should receive a {{quiet: 
> Bool}} via 
> [kwargs|https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L335]
>  somewhere and we should use that to set the verbose mode.





[jira] [Updated] (ARROW-4193) [Rust] Add support for decimal data type

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-4193:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Add support for decimal data type
> 
>
> Key: ARROW-4193
> URL: https://issues.apache.org/jira/browse/ARROW-4193
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Minor
>  Labels: beginner
> Fix For: 3.0.0
>
>
> We should add {{Decimal(usize,usize)}} to DataType and add the corresponding 
> array and builder classes.
>  





[jira] [Updated] (ARROW-3690) [Rust] Add Rust to the format integration testing

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-3690:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] Add Rust to the format integration testing
> -
>
> Key: ARROW-3690
> URL: https://issues.apache.org/jira/browse/ARROW-3690
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration, Rust
>Reporter: Chao Sun
>Priority: Major
> Fix For: 3.0.0
>
>
> We should add Rust into the integration testing. See [here 
> title|https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1166]
>  and [here|https://github.com/apache/arrow/tree/master/integration].





[jira] [Updated] (ARROW-8258) [Rust] [Parquet] ArrowReader fails on some timestamp types

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8258:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] [Parquet] ArrowReader fails on some timestamp types
> --
>
> Key: ARROW-8258
> URL: https://issues.apache.org/jira/browse/ARROW-8258
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Renjie Liu
>Priority: Major
> Fix For: 3.0.0
>
>
> I discovered this bug with this query
> {code:java}
> > SELECT tpep_pickup_datetime FROM taxi LIMIT 1;
> General("InvalidArgumentError(\"column types must match schema types, 
> expected Timestamp(Microsecond, None) but found UInt64 at column index 0\")") 
> {code}
> The parquet reader detects this schema when reading from the file:
> {code:java}
> Schema { 
>   fields: [
> Field { name: "tpep_pickup_datetime", data_type: Timestamp(Microsecond, 
> None), nullable: true, dict_id: 0, dict_is_ordered: false }
>   ], 
>   metadata: {} 
> } {code}
> The struct array read from the file contains:
> {code:java}
> [PrimitiveArray
> [
>   156731800800,
>   156731935700,
>   156732009200,
>   156732115100, {code}
>  When the Parquet arrow reader creates the record batch, the following 
> validation logic fails:
> {code:java}
> for i in 0..columns.len() {
>     if columns[i].len() != len {
>         return Err(ArrowError::InvalidArgumentError(
>             "all columns in a record batch must have the same length".to_string(),
>         ));
>     }
>     if columns[i].data_type() != schema.field(i).data_type() {
>         return Err(ArrowError::InvalidArgumentError(format!(
>             "column types must match schema types, expected {:?} but found {:?} at column index {}",
>             schema.field(i).data_type(),
>             columns[i].data_type(),
>             i)));
>     }
> }
> {code}





[jira] [Updated] (ARROW-8853) [Rust] [Integration Testing] Enable Flight tests

2020-09-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8853:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] [Integration Testing] Enable Flight tests
> 
>
> Key: ARROW-8853
> URL: https://issues.apache.org/jira/browse/ARROW-8853
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Commented] (ARROW-3690) [Rust] Add Rust to the format integration testing

2020-09-29 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204368#comment-17204368
 ] 

Neville Dipale commented on ARROW-3690:
---

Kicking this can down to 3.0.0 as I won't complete all the sub-tasks. I'll try 
to complete what I can for 2.0.0, so we can also update the documentation with 
what's supported in Rust.

> [Rust] Add Rust to the format integration testing
> -
>
> Key: ARROW-3690
> URL: https://issues.apache.org/jira/browse/ARROW-3690
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration, Rust
>Reporter: Chao Sun
>Priority: Major
> Fix For: 3.0.0
>
>
> We should add Rust into the integration testing. See [here 
> title|https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1166]
>  and [here|https://github.com/apache/arrow/tree/master/integration].





[jira] [Resolved] (ARROW-9934) [Rust] Shape and stride check in tensor

2020-09-24 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9934.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8129
[https://github.com/apache/arrow/pull/8129]

> [Rust] Shape and stride check in tensor
> ---
>
> Key: ARROW-9934
> URL: https://issues.apache.org/jira/browse/ARROW-9934
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fernando Herrera
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When creating a tensor there is no check for the supplied shape and stride. 
> There should be a check before creating the tensor object.





[jira] [Resolved] (ARROW-10016) [Rust] [DataFusion] Implement IsNull and IsNotNull

2020-09-23 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10016.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8204
[https://github.com/apache/arrow/pull/8204]

> [Rust] [DataFusion] Implement IsNull and IsNotNull
> --
>
> Key: ARROW-10016
> URL: https://issues.apache.org/jira/browse/ARROW-10016
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, DataFusion has the logical operators `IsNull` and `IsNotNull`, but 
> those operators have no physical implementation. Consequently, they 
> cannot be used.
> The goal of this improvement is to add support for these operators in the 
> physical plan.
> Note that these operators only care about the null bitmap, and thus should be 
> implementable for all types supported by Arrow.
> Both operators should probably return a non-null `BooleanArray`.
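
Since only the null bitmap matters, the idea can be sketched without any array types at all. The sketch below assumes Arrow's validity bitmap convention (least-significant bit first, a set bit means the value is valid, and no bitmap means no nulls); the function names are hypothetical and the output is a plain `Vec<bool>` rather than a `BooleanArray`.

```rust
/// True if bit `i` is set in a least-significant-bit-first bitmap.
fn bit_is_set(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// `IsNull` over `len` slots; a missing bitmap means no value is null.
fn is_null(validity: Option<&[u8]>, len: usize) -> Vec<bool> {
    (0..len)
        .map(|i| match validity {
            Some(bitmap) => !bit_is_set(bitmap, i),
            None => false,
        })
        .collect()
}

/// `IsNotNull` is the element-wise negation of `IsNull`.
fn is_not_null(validity: Option<&[u8]>, len: usize) -> Vec<bool> {
    is_null(validity, len).into_iter().map(|b| !b).collect()
}
```

The result never contains nulls itself, which matches the suggestion that both operators return a non-null boolean array.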



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10044) [Rust] Improve README

2020-09-23 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10044.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8224
[https://github.com/apache/arrow/pull/8224]

> [Rust] Improve README
> -
>
> Key: ARROW-10044
> URL: https://issues.apache.org/jira/browse/ARROW-10044
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-10095) [Rust] [Parquet] Update for IPC changes

2020-09-25 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10095.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8274
[https://github.com/apache/arrow/pull/8274]

> [Rust] [Parquet] Update for IPC changes
> ---
>
> Key: ARROW-10095
> URL: https://issues.apache.org/jira/browse/ARROW-10095
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The IPC changes made to comply with MetadataVersion 4 broke the rust-parquet 
> writer branch.





[jira] [Assigned] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions

2020-09-25 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9981:
-

Assignee: Neville Dipale

> [Rust] Allow configuring flight IPC with IpcWriteOptions
> 
>
> Key: ARROW-9981
> URL: https://issues.apache.org/jira/browse/ARROW-9981
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>
> We have introduced an IPC write option, but we use the default for the 
> arrow-flight crate, which is not ideal. Change this to allow configuring 
> writer options.





[jira] [Updated] (ARROW-9361) [Rust] Move other array types into their own modules

2020-09-25 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9361:
--
Fix Version/s: 3.0.0

> [Rust] Move other array types into their own modules
> 
>
> Key: ARROW-9361
> URL: https://issues.apache.org/jira/browse/ARROW-9361
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
> Fix For: 3.0.0
>
>
> The array module is getting too big to be practical.  We should leave the 
> core types like the Array trait in `array.rs` and move each array type into 
> its own sub-module as we did while implementing the Union array.





[jira] [Updated] (ARROW-9361) [Rust] Move other array types into their own modules

2020-09-25 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9361:
--
Priority: Blocker  (was: Major)

> [Rust] Move other array types into their own modules
> 
>
> Key: ARROW-9361
> URL: https://issues.apache.org/jira/browse/ARROW-9361
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The array module is getting too big to be practical.  We should leave the 
> core types like the Array trait in `array.rs` and move each array type into 
> its own sub-module as we did while implementing the Union array.





[jira] [Commented] (ARROW-7700) [Rust] All array types should have iterators and FromIterator support.

2020-09-25 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202323#comment-17202323
 ] 

Neville Dipale commented on ARROW-7700:
---

[~jorgecarleitao] this might be related to what you're working on.

> [Rust] All array types should have iterators and FromIterator support.
> --
>
> Key: ARROW-7700
> URL: https://issues.apache.org/jira/browse/ARROW-7700
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Reporter: Andy Thomason
>Priority: Major
>  Labels: Usability
>
> Array types should have an Iterable trait that generates plain or nullable 
> iterators.
> {code}
> pub trait Iterable<'a>
>     where Self::IterType: std::iter::Iterator
> {
>     type IterType;
>     fn iter(&'a self) -> Self::IterType;
>     fn iter_nulls(&'a self) -> NullableIterator;
> }
> {code}
> IterType depends on the array type: standard slice iterators for 
> primitive types, string iterators for UTF8 types and composite iterators 
> (generating other iterators) for list, struct and dictionary types.
> The NullableIterator type should bundle a null bitmap pointer with another 
> iterator type to form a composite iterator that returns an option:
> {code}
> /// Convert any iterator to a nullable iterator by using the null bitmap.
> #[derive(Debug, PartialEq, Clone)]
> pub struct NullableIterator {
>     iter: T,
>     i: usize,
>     null_bitmap: *const u8,
> }
> impl NullableIterator {
>     fn from(iter: T, null_bitmap: , offset: usize) -> Self;
> }
> {code}
> For more details, some exploratory work has been done here: 
> https://github.com/andy-thomason/arrow/blob/ARROW-iterators/rust/arrow/src/array/array.rs#L1711
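
A self-contained sketch of the composite-iterator idea, using a borrowed byte slice for the bitmap instead of the raw pointer in the proposal. The names `NullableIter` and `nullable` are hypothetical, and the bitmap is assumed to follow Arrow's convention (least-significant bit first, set bit means valid). Indexing panics if the bitmap is shorter than the wrapped iterator requires.

```rust
/// Wraps any iterator and yields `None` for slots whose validity bit is unset.
struct NullableIter<'a, I> {
    iter: I,
    i: usize,
    validity: &'a [u8],
}

/// Build a nullable iterator whose first item corresponds to bit `offset`.
fn nullable<'a, I: Iterator>(iter: I, validity: &'a [u8], offset: usize) -> NullableIter<'a, I> {
    NullableIter { iter, i: offset, validity }
}

impl<'a, I: Iterator> Iterator for NullableIter<'a, I> {
    type Item = Option<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Always advance the inner iterator so values and bits stay aligned.
        let value = self.iter.next()?;
        let valid = ((self.validity[self.i / 8] >> (self.i % 8)) & 1) == 1;
        self.i += 1;
        Some(if valid { Some(value) } else { None })
    }
}
```

Wrapping `[10, 20, 30]` with the bitmap `0b101` yields `Some(10)`, `None`, `Some(30)`.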





[jira] [Commented] (ARROW-6892) [Rust] [DataFusion] Implement optimizer rule to remove redundant projections

2020-09-25 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202322#comment-17202322
 ] 

Neville Dipale commented on ARROW-6892:
---

[~andygrove]  [~jorgecarleitao] [~alamb]  do you know if this is resolved? 
There have been a lot of improvements to the optimizer, so I'm checking whether 
they included this.

> [Rust] [DataFusion] Implement optimizer rule to remove redundant projections
> 
>
> Key: ARROW-6892
> URL: https://issues.apache.org/jira/browse/ARROW-6892
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
>
> Currently we have code in the SQL query planner that wraps aggregate queries 
> in a projection (if needed) to preserve the order of the final results. This 
> is needed because the aggregate query execution always returns a result with 
> grouping expressions first and then aggregate expressions.
> It would be better (simpler, more readable code) to always wrap aggregates in 
> projections and have an optimizer rule to remove redundant projections. There 
> are likely other use cases where redundant projections might exist too.
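
A toy sketch of such a rule, using a hypothetical two-variant `Plan` enum rather than DataFusion's logical plan. Here a projection counts as redundant when it selects exactly its input's columns in the same order; a real rule would have to compare expressions, not just column names.

```rust
/// Hypothetical, minimal plan representation for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { columns: Vec<String> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Columns produced by a plan node, in output order.
fn output_columns(plan: &Plan) -> &[String] {
    match plan {
        Plan::Scan { columns } | Plan::Projection { columns, .. } => columns.as_slice(),
    }
}

/// Drop projections that leave their input's columns unchanged.
fn remove_redundant_projections(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => {
            // Optimize bottom-up so stacked redundant projections collapse.
            let input = remove_redundant_projections(*input);
            if output_columns(&input) == columns.as_slice() {
                input
            } else {
                Plan::Projection { columns, input: Box::new(input) }
            }
        }
        other => other,
    }
}
```

A projection that reorders columns is kept, so the planner could wrap every aggregate unconditionally and rely on the rule to strip the wrappers that change nothing.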




