[jira] [Assigned] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency
[ https://issues.apache.org/jira/browse/ARROW-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9957: - Assignee: Neville Dipale > [Rust] Remove unmaintained tempdir dependency > - > > Key: ARROW-9957 > URL: https://issues.apache.org/jira/browse/ARROW-9957 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Trivial > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Replace tempdir with tempfile, also removing older versions of some > dependencies like rand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency
[ https://issues.apache.org/jira/browse/ARROW-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9957. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8157 [https://github.com/apache/arrow/pull/8157] > [Rust] Remove unmaintained tempdir dependency > - > > Key: ARROW-9957 > URL: https://issues.apache.org/jira/browse/ARROW-9957 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Priority: Trivial > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Replace tempdir with tempfile, also removing older versions of some > dependencies like rand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment
[ https://issues.apache.org/jira/browse/ARROW-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9848: - Assignee: Neville Dipale > [Rust] Implement changes to ensure flatbuffer alignment > --- > > Key: ARROW-9848 > URL: https://issues.apache.org/jira/browse/ARROW-9848 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > > See ARROW-6313, changes were made to all IPC implementations except for Rust -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9966) [Rust] Speedup aggregate kernels
[ https://issues.apache.org/jira/browse/ARROW-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9966. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8165 [https://github.com/apache/arrow/pull/8165] > [Rust] Speedup aggregate kernels > > > Key: ARROW-9966 > URL: https://issues.apache.org/jira/browse/ARROW-9966 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9919: -- Component/s: Rust - DataFusion > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9919: -- Affects Version/s: 1.0.0 > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9919. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8116 [https://github.com/apache/arrow/pull/8116] > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9846) [Rust] Master branch broken build
[ https://issues.apache.org/jira/browse/ARROW-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9846. --- Resolution: Not A Problem > [Rust] Master branch broken build > - > > Key: ARROW-9846 > URL: https://issues.apache.org/jira/browse/ARROW-9846 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > Master branch is failing to build in CI. It fails to compile > "tower-balance-0.3.0". I cannot reproduce locally. > {code:java} > error[E0502]: cannot borrow `self` as immutable because it is also borrowed > as mutable >--> > /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:381:21 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency
Neville Dipale created ARROW-9957: - Summary: [Rust] Remove unmaintained tempdir dependency Key: ARROW-9957 URL: https://issues.apache.org/jira/browse/ARROW-9957 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Affects Versions: 1.0.0 Reporter: Neville Dipale Replace tempdir with tempfile, also removing older versions of some dependencies like rand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10010) [Rust] Speedup arithmetic
[ https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10010. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8191 [https://github.com/apache/arrow/pull/8191] > [Rust] Speedup arithmetic > - > > Key: ARROW-10010 > URL: https://issues.apache.org/jira/browse/ARROW-10010 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > There are some optimizations possible in arithmetics kernels. > > PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8883) [Rust] [Integration Testing] Enable passing tests and update spec doc
[ https://issues.apache.org/jira/browse/ARROW-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-8883: - Assignee: Neville Dipale > [Rust] [Integration Testing] Enable passing tests and update spec doc > - > > Key: ARROW-8883 > URL: https://issues.apache.org/jira/browse/ARROW-8883 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Affects Versions: 0.17.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > > Some of the integration test failures can be avoided by disabling unsupported > tests, like large lists and nested types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196619#comment-17196619 ]

Neville Dipale commented on ARROW-10002:

Hi [~batmanaod], I've looked at the code but haven't checked it out yet to do my own comparisons. I'd be interested in the perf implications (I'm presuming there's no change for indexing), and in how we would remove `default fn` from other trait methods, seeing that it's mostly used to specialise between numeric primitives and booleans.

> [Rust] Trait-specialization requires nightly
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Rust
> Reporter: Kyle Strand
> Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of {{default fn}} in the codebase:
>
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
> ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
> ../arrow/rust/parquet/src/column/writer.rs:2
> ../arrow/rust/parquet/src/encodings/encoding.rs:16
> ../arrow/rust/parquet/src/arrow/record_reader.rs:1
> ../arrow/rust/parquet/src/encodings/decoding.rs:13
> ../arrow/rust/parquet/src/file/statistics.rs:1
> ../arrow/rust/arrow/src/array/builder.rs:7
> ../arrow/rust/arrow/src/array/array.rs:3
> ../arrow/rust/arrow/src/array/equal.rs:3
> {code}
>
> This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.)
>
> If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
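One way the question above (how to drop `default fn` where it only separates numeric primitives from booleans) might be approached on stable Rust is sketched below. This is an illustrative assumption, not the approach taken in the Arrow codebase: the trait `ValueToString`, the macro `impl_numeric_to_string`, and the helper `join_values` are hypothetical names. The idea is to replace a blanket impl plus a specialised override with one concrete impl per primitive type, generated by a macro, and a hand-written impl for `bool`.

```rust
/// Hypothetical trait standing in for a method that previously relied on
/// `default fn` (nightly-only specialization).
pub trait ValueToString {
    fn value_to_string(&self) -> String;
}

// Instead of a blanket impl with a `default fn`, generate one concrete impl
// per numeric primitive. No specialization is needed because the impls
// never overlap.
macro_rules! impl_numeric_to_string {
    ($($t:ty),*) => {
        $(impl ValueToString for $t {
            fn value_to_string(&self) -> String {
                format!("{}", self)
            }
        })*
    };
}

impl_numeric_to_string!(i8, i16, i32, i64, u8, u16, u32, u64, f32, f64);

// The formerly "specialised" boolean case is just another concrete impl.
impl ValueToString for bool {
    fn value_to_string(&self) -> String {
        if *self { "1".to_string() } else { "0".to_string() }
    }
}

/// Generic code keeps working through the trait bound.
pub fn join_values<T: ValueToString>(values: &[T]) -> String {
    values
        .iter()
        .map(|v| v.value_to_string())
        .collect::<Vec<_>>()
        .join(",")
}
```

The trade-off in this sketch is that every supported type must be listed explicitly, which is usually acceptable when the set of primitive types is closed.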
[jira] [Resolved] (ARROW-9980) [Rust] Fix parquet crate clippy lints
[ https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9980. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8173 [https://github.com/apache/arrow/pull/8173] > [Rust] Fix parquet crate clippy lints > - > > Key: ARROW-9980 > URL: https://issues.apache.org/jira/browse/ARROW-9980 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This addresses most clippy lints on the parquet crate. Other remaining lints > can be addressed as part of future PRs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9984) [Rust] [DataFusion] DRY of function to string
[ https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9984. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8176 [https://github.com/apache/arrow/pull/8176] > [Rust] [DataFusion] DRY of function to string > - > > Key: ARROW-9984 > URL: https://issues.apache.org/jira/browse/ARROW-9984 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9978) [Rust] Umbrella issue for clippy integration
Neville Dipale created ARROW-9978: - Summary: [Rust] Umbrella issue for clippy integration Key: ARROW-9978 URL: https://issues.apache.org/jira/browse/ARROW-9978 Project: Apache Arrow Issue Type: New Feature Components: CI, Rust Affects Versions: 1.0.0 Reporter: Neville Dipale This is an umbrella issue to collate outstanding and new tasks to enable clippy integration -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9296) [CI][Rust] Enable more clippy lint checks
[ https://issues.apache.org/jira/browse/ARROW-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9296: -- Parent: ARROW-9978 Issue Type: Sub-task (was: Improvement) > [CI][Rust] Enable more clippy lint checks > - > > Key: ARROW-9296 > URL: https://issues.apache.org/jira/browse/ARROW-9296 > Project: Apache Arrow > Issue Type: Sub-task > Components: Continuous Integration, Rust >Reporter: Krisztian Szucs >Priority: Major > > Currently only {{clippy::redundant_field_names}} is allowed, so we should > incrementally extend the list of enabled lints. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9979) [Rust] Fix arrow crate clippy lints
Neville Dipale created ARROW-9979: - Summary: [Rust] Fix arrow crate clippy lints Key: ARROW-9979 URL: https://issues.apache.org/jira/browse/ARROW-9979 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale This fixes many clippy lints, but not all. It takes hours to address lints, and we can work on the remaining ones in future PRs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9980) [Rust] Fix parquet crate clippy lints
[ https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9980: - Assignee: Neville Dipale > [Rust] Fix parquet crate clippy lints > - > > Key: ARROW-9980 > URL: https://issues.apache.org/jira/browse/ARROW-9980 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This addresses most clippy lints on the parquet crate. Other remaining lints > can be addressed as part of future PRs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9338) [Rust] Add instructions for running clippy locally
[ https://issues.apache.org/jira/browse/ARROW-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9338: -- Parent: ARROW-9978 Issue Type: Sub-task (was: Improvement) > [Rust] Add instructions for running clippy locally > -- > > Key: ARROW-9338 > URL: https://issues.apache.org/jira/browse/ARROW-9338 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Paddy Horan >Priority: Minor > > Similar to the "Code Formatting" section in the top level README it would be > useful to add instructions for running clippy locally to avoid wasted CI time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9979) [Rust] Fix arrow crate clippy lints
[ https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9979: - Assignee: Neville Dipale > [Rust] Fix arrow crate clippy lints > --- > > Key: ARROW-9979 > URL: https://issues.apache.org/jira/browse/ARROW-9979 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This fixes many clippy lints, but not all. It takes hours to address lints, > and we can work on the remaining ones in future PRs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9980) [Rust] Fix parquet crate clippy lints
Neville Dipale created ARROW-9980: - Summary: [Rust] Fix parquet crate clippy lints Key: ARROW-9980 URL: https://issues.apache.org/jira/browse/ARROW-9980 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale This addresses most clippy lints on the parquet crate. Other remaining lints can be addressed as part of future PRs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions
Neville Dipale created ARROW-9981: - Summary: [Rust] Allow configuring flight IPC with IpcWriteOptions Key: ARROW-9981 URL: https://issues.apache.org/jira/browse/ARROW-9981 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale We have introduced an IPC write option, but we use the default for the arrow-flight crate, which is not ideal. Change this to allow configuring writer options. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5123) [Rust] derive RecordWriter from struct definitions
[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195410#comment-17195410 ]

Neville Dipale commented on ARROW-5123:

I'm unable to assign to Xavier

> [Rust] derive RecordWriter from struct definitions
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: Xavier Lange
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 14h 20m
> Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a rough transition time in the project): https://github.com/sunchao/parquet-rs/pull/197
>
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row_group.
>
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
>
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
>
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
>
> pub trait RecordWriter {
>   fn write_to_row_group(, row_group_writer: Box);
> }
> ```
>
> How does it work?
> ===
> The `parquet_derive` crate adds code generating functionality to the rust compiler. The code generation takes rust syntax and emits additional syntax. This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special `build.rs` steps or anything like that, it's automatic by including `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
>
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string-representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:
> - the name of the struct
> - the lifetime variables of the struct
> - the fields of the struct
>
> The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
>
> The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_row_group(..) {
>     $({
>       $column_writer_snippet
>     })
>   }
> }
> ```
>
> this template is then added under the struct definition, ending up something like:
> ```
> struct MyStruct {
> }
>
> impl RecordWriter for MyStruct {
>   fn write_row_group(..) {
>     {
>       write_col_1();
>     };
>     {
>       write_col_2();
>     }
>   }
> }
> ```
>
> and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.
>
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, one very useful bit is to install `cargo expand` [more info on gh](https://github.com/dtolnay/cargo-expand), then you can do:
> ```
> $WORK_DIR/parquet-rs/parquet_derive_test > cargo expand --lib > ../temp.rs
> ```
>
> then you can dump the contents:
> ```
> struct DumbRecord {
>   pub a_bool: bool,
>   pub a2_bool: bool,
> }
>
> impl RecordWriter for &[DumbRecord] {
>   fn write_to_row_group(
>     ,
>     row_group_writer: Box,
>   ) {
>     let mut row_group_writer = row_group_writer;
>     {
>       let vals: Vec = self.iter().map(|x| x.a_bool).collect();
>       let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>       if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) = column_writer
>       {
>         typed.write_batch([..], None, None).unwrap();
>       }
>       row_group_writer.close_column(column_writer).unwrap();
>     };
>     {
>       let vals: Vec = self.iter().map(|x| x.a2_bool).collect();
>       let mut
[jira] [Resolved] (ARROW-5123) [Rust] derive RecordWriter from struct definitions
[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-5123. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 4140 [https://github.com/apache/arrow/pull/4140] > [Rust] derive RecordWriter from struct definitions > -- > > Key: ARROW-5123 > URL: https://issues.apache.org/jira/browse/ARROW-5123 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Xavier Lange >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 14h 10m > Remaining Estimate: 0h
[jira] [Updated] (ARROW-8883) [Rust] [Integration Testing] Enable passing tests and update spec doc
[ https://issues.apache.org/jira/browse/ARROW-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-8883: -- Summary: [Rust] [Integration Testing] Enable passing tests and update spec doc (was: [Rust] [Integration Testing] Disable unsupported tests) > [Rust] [Integration Testing] Enable passing tests and update spec doc > - > > Key: ARROW-8883 > URL: https://issues.apache.org/jira/browse/ARROW-8883 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Affects Versions: 0.17.0 >Reporter: Neville Dipale >Priority: Major > > Some of the integration test failures can be avoided by disabling unsupported > tests, like large lists and nested types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10040: -- Assignee: Neville Dipale > [Rust] Create a way to slice unaligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This can prevent users from > applying some computations on arrays whose offsets aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
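The idea in the issue above, slicing a bit-packed buffer at an offset that is not a multiple of 8 and copying the result into an aligned buffer, might look roughly like the sketch below. It is an assumption-laden illustration rather than the method that pull request 8262 added: the free function `bit_slice` is a hypothetical name, it works on a plain byte slice instead of Arrow's `Buffer`, and it copies bit by bit for clarity rather than speed. Bits are numbered least-significant-bit first within each byte, as in Arrow bitmaps.

```rust
/// Copies `len` bits starting at bit `offset` of `src` into a new byte
/// vector whose first bit sits at bit 0, so the result is byte-aligned.
/// Hypothetical helper; bit order is least-significant-bit first.
pub fn bit_slice(src: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset + len <= src.len() * 8, "slice out of bounds");
    // Round up to whole bytes; unused trailing bits stay zero.
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let bit = offset + i;
        // Test the source bit, then set the matching bit in the output.
        if src[bit / 8] & (1u8 << (bit % 8)) != 0 {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

For example, `bit_slice(&[0b1010_1100, 0b0000_0011], 3, 8)` yields a single byte, `0b0111_0101`, which a kernel that only accepts byte-aligned input could then consume. A production version would more likely shift whole bytes or words at a time.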
[jira] [Resolved] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints
[ https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10199. Fix Version/s: 2.0.0 Resolution: Fixed This has been resolved, and the fix will ship in the next release, in about a week or two. > [Rust][Parquet] Release Parquet at crates.io to remove debug prints > --- > > Key: ARROW-10199 > URL: https://issues.apache.org/jira/browse/ARROW-10199 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Affects Versions: 1.0.1 >Reporter: Krzysztof Stanisławek >Priority: Critical > Fix For: 2.0.0 > > > The version of Parquet released to docs.rs & crates.io has debug prints in > [https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60]. > They were pretty hard to track down, so I suggest considering a logging crate > in the future. When is the new version going to be released? Is there some > stable schedule I can expect? > Is it recommended to use the current snapshot straight from github instead of > crates.io? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10040: -- Assignee: Jörn Horstmann (was: Neville Dipale) > [Rust] Create a way to slice unaligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This can prevent users from > applying some computations on arrays whose offsets aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
[ https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10225: --- Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests (was: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests) > [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests > --- > > Key: ARROW-10225 > URL: https://issues.apache.org/jira/browse/ARROW-10225 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The Arrow spec makes the null bitmap optional if an array has no nulls > [~carols10cents], so the tests were failing because we were comparing > `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
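The failure mode described above, a missing bitmap compared against a fully populated one, can be avoided by comparing validity logically rather than comparing the `Option` values directly. The sketch below is only an illustration of that idea under stated assumptions: `is_valid` and `validity_equal` are hypothetical helpers over plain byte slices, not functions from the arrow or parquet crates, and bits are read least-significant-bit first.

```rust
/// Returns true if slot `i` is non-null. A missing bitmap (`None`) means
/// every slot is valid, which is what the Arrow spec permits for arrays
/// without nulls. Hypothetical helper for illustration.
pub fn is_valid(bitmap: Option<&[u8]>, i: usize) -> bool {
    match bitmap {
        None => true,
        Some(bits) => bits[i / 8] & (1u8 << (i % 8)) != 0,
    }
}

/// Compares the validity of two arrays of `len` slots slot by slot, so
/// `None` and an all-set bitmap count as equal. Padding bits past `len`
/// are ignored.
pub fn validity_equal(left: Option<&[u8]>, right: Option<&[u8]>, len: usize) -> bool {
    (0..len).all(|i| is_valid(left, i) == is_valid(right, i))
}
```

With this comparison, a roundtrip test that writes an array without a bitmap and reads back one with every bit set would pass, while a genuinely different null pattern would still be caught.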
[jira] [Resolved] (ARROW-10040) [Rust] Create a way to slice unalligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10040. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8262 [https://github.com/apache/arrow/pull/8262] > [Rust] Create a way to slice unalligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This has the potential of > preventing users from applying some computations on arrays whose offsets > aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
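The idea in ARROW-10040 of copying a bit range that starts at an arbitrary offset into a fresh, byte-aligned buffer can be sketched with the standard library alone. This is only an illustration under stated assumptions: `copy_bit_slice` is a hypothetical free function, not the `Buffer` method from the pull request, and it assumes LSB-first bit order within each byte, as Arrow bitmaps use.

```rust
/// Hypothetical helper: copy `len` bits starting at bit `offset` of `bytes`
/// into a new buffer whose first bit sits at bit 0 of byte 0.
/// Bits are numbered LSB-first within each byte.
fn copy_bit_slice(bytes: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset + len <= bytes.len() * 8, "bit slice out of bounds");
    // Round the output up to whole bytes; unused trailing bits stay zero.
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if bytes[src / 8] & (1u8 << (src % 8)) != 0 {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

A bit-by-bit loop keeps the sketch easy to check; a production version would more likely shift whole bytes or words at a time.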
[jira] [Resolved] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
[ https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10225. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8388 [https://github.com/apache/arrow/pull/8388] > [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests > --- > > Key: ARROW-10225 > URL: https://issues.apache.org/jira/browse/ARROW-10225 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The Arrow spec makes the null bitmap optional if an array has no nulls > [~carols10cents], so the tests were failing because we were comparing > `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
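The comparison problem described in ARROW-10225 (an absent bitmap versus a fully populated one) can be illustrated with a small standard-library sketch. The names `bit_is_set` and `validity_equal` are hypothetical and are not taken from the pull request; the sketch simply treats a missing bitmap as "every slot is valid" and compares only the first `len` bits.

```rust
/// Returns true if bit `i` is set, counting LSB-first within each byte.
fn bit_is_set(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1u8 << (i % 8)) != 0
}

/// Hypothetical comparison of two optional validity bitmaps over `len` slots.
/// `None` is interpreted as "no nulls", so it equals a bitmap with all
/// of its first `len` bits set.
fn validity_equal(a: Option<&[u8]>, b: Option<&[u8]>, len: usize) -> bool {
    (0..len).all(|i| {
        let valid_a = a.map_or(true, |bm| bit_is_set(bm, i));
        let valid_b = b.map_or(true, |bm| bit_is_set(bm, i));
        valid_a == valid_b
    })
}
```

Comparing slot by slot also ignores padding bits past `len`, which a raw byte comparison would not.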
[jira] [Closed] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings
[ https://issues.apache.org/jira/browse/ARROW-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale closed ARROW-5352. - Resolution: Duplicate > [Rust] BinaryArray filter replaces nulls with empty strings > --- > > Key: ARROW-5352 > URL: https://issues.apache.org/jira/browse/ARROW-5352 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.13.0 >Reporter: Neville Dipale >Priority: Minor > > The filter implementation for BinaryArray discards nullness of data. > BinaryArrays that are null (seem to) always return an empty string slice when > getting a value, so the way filter works might be a bug depending on what > Arrow developers' or users' intentions are. > I think we should either preserve nulls (and their count) or document this as > intended behaviour. > Below is a test case that reproduces the bug.
> {code:java}
> #[test]
> fn test_filter_binary_array_with_nulls() {
>     let mut a: BinaryBuilder = BinaryBuilder::new(100);
>     a.append_null().unwrap();
>     a.append_string("a string").unwrap();
>     a.append_null().unwrap();
>     a.append_string("with nulls").unwrap();
>     let array = a.finish();
>     let b = BooleanArray::from(vec![true, true, true, true]);
>     let c = filter(&array, &b).unwrap();
>     let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
>     // I didn't expect this behaviour
>     assert_eq!("", d.get_string(0));
>     // fails here
>     assert!(d.is_null(0));
>     assert_eq!(4, d.len());
>     // fails here
>     assert_eq!(2, d.null_count());
>     assert_eq!("a string", d.get_string(1));
>     // fails here
>     assert!(d.is_null(2));
>     assert_eq!("with nulls", d.get_string(3));
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
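The behaviour the reporter asks for in ARROW-5352 (nulls and the null count surviving a filter) can be modelled without the Arrow crate. In this sketch `Option<&str>` stands in for a nullable binary slot, and `filter_preserving_nulls` and `null_count` are hypothetical names, not the Arrow kernel API.

```rust
/// Hypothetical model of a null-preserving filter: keep the slots whose mask
/// entry is true, carrying `None` through unchanged instead of turning it
/// into an empty string.
fn filter_preserving_nulls<'a>(
    values: &[Option<&'a str>],
    mask: &[bool],
) -> Vec<Option<&'a str>> {
    assert_eq!(values.len(), mask.len(), "mask must match the array length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect()
}

/// Number of null slots in the model array.
fn null_count(values: &[Option<&str>]) -> usize {
    values.iter().filter(|v| v.is_none()).count()
}
```

With the four-element input from the ticket and an all-true mask, this model keeps length 4 and reports two nulls, which is what the failing assertions in the test case expect.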
[jira] [Resolved] (ARROW-10204) [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature enabled
[ https://issues.apache.org/jira/browse/ARROW-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10204. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8378 [https://github.com/apache/arrow/pull/8378] > [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature > enabled > - > > Key: ARROW-10204 > URL: https://issues.apache.org/jira/browse/ARROW-10204 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > {code} > execution::context::tests::aggregate_grouped_empty stdout > thread 'execution::context::tests::aggregate_grouped_empty' panicked at > 'assertion failed: `(left == right)` > left: `["0,0.0"]`, > right: `[]`', datafusion/src/execution/context.rs:883:9 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-5440) [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos
[ https://issues.apache.org/jira/browse/ARROW-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale closed ARROW-5440. - Resolution: Cannot Reproduce From the comments, it sounds like this is no longer an issue. > [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos > - > > Key: ARROW-5440 > URL: https://issues.apache.org/jira/browse/ARROW-5440 > Project: Apache Arrow > Issue Type: Bug > Components: Rust > Environment: CentOS Linux release 7.6.1810 (Core) >Reporter: Tenzin Rigden >Priority: Major > Attachments: parquet-test-libstd.tar.gz, serde_json_test.tar.gz > > > Hello, > In the rust parquet implementation ([https://github.com/sunchao/parquet-rs]) > on centos, the binary created has a `libstd-hash.so` shared library > dependency that is causing issues since it's a shared library found in the > rustup directory. This `libstd-hash.so` dependency isn't there on any other > rust binaries I've made before. This dependency means that I can't run this > binary anywhere where rustup isn't installed with that exact libstd library. > This is not an issue on Mac. > I've attached the rust files and here is the command line output below. 
> {code:java|title=cli-output|borderStyle=solid} > [centos@_ parquet-test]$ cat /etc/centos-release > CentOS Linux release 7.6.1810 (Core) > [centos@_ parquet-test]$ rustc --version > rustc 1.36.0-nightly (e70d5386d 2019-05-27) > [centos@_ parquet-test]$ ldd target/release/parquet-test > linux-vdso.so.1 => (0x7ffd02fee000) > libstd-44988553032616b2.so => not found > librt.so.1 => /lib64/librt.so.1 (0x7f6ecd209000) > libpthread.so.0 => /lib64/libpthread.so.0 (0x7f6eccfed000) > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f6eccdd7000) > libc.so.6 => /lib64/libc.so.6 (0x7f6ecca0a000) > libm.so.6 => /lib64/libm.so.6 (0x7f6ecc708000) > /lib64/ld-linux-x86-64.so.2 (0x7f6ecd8b1000) > [centos@_ parquet-test]$ ls -l > ~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so > -rw-r--r--. 1 centos centos 5623568 May 27 21:46 > /home/centos/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10299) [Rust] Support reading and writing V5 of IPC metadata
Neville Dipale created ARROW-10299: -- Summary: [Rust] Support reading and writing V5 of IPC metadata Key: ARROW-10299 URL: https://issues.apache.org/jira/browse/ARROW-10299 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 2.0.0 Reporter: Neville Dipale This is mostly alignment issues and tracking when we encounter the v4 legacy padding. I had done this work in another branch, but discarded it without noticing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches
Neville Dipale created ARROW-10191: -- Summary: [Rust] [Parquet] Add roundtrip tests for single column batches Key: ARROW-10191 URL: https://issues.apache.org/jira/browse/ARROW-10191 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.1 Reporter: Neville Dipale To aid with test coverage and picking up information loss during Parquet and Arrow roundtrips, we can add tests that assert that all supported Arrow datatypes can be written and read correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10198) [Dev] Python merge script doesn't close PRs if not merged on master
Neville Dipale created ARROW-10198: -- Summary: [Dev] Python merge script doesn't close PRs if not merged on master Key: ARROW-10198 URL: https://issues.apache.org/jira/browse/ARROW-10198 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Affects Versions: 1.0.1 Reporter: Neville Dipale When using the merge script to merge PRs against non-master branches, the PR on Github doesn't get closed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10289) [Rust] Support reading dictionary streams
Neville Dipale created ARROW-10289: -- Summary: [Rust] Support reading dictionary streams Key: ARROW-10289 URL: https://issues.apache.org/jira/browse/ARROW-10289 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 2.0.0 Reporter: Neville Dipale We support reading dictionaries in the IPC file reader. We should do the same with the stream reader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10236. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 8460 [https://github.com/apache/arrow/pull/8460] > [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel > - > > Key: ARROW-10236 > URL: https://issues.apache.org/jira/browse/ARROW-10236 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > There are plan time checks for valid type casts in DataFusion that are > designed to catch errors early before plan execution > Sadly the cast types that DataFusion thinks are valid is a significant subset > of what the arrow cast kernel supports. The goal of this ticket is to bring > DataFusion to parity with the type casting supported by arrow and allow > DataFusion to plan all casts that are supported by the arrow cast kernel > (I want this implicitly so when I add support for DictionaryArray casts in > Arrow they also are part of DataFusion) > Previously the notions of coercion and casting were somewhat conflated. I > have tried to clarify them in https://github.com/apache/arrow/pull/8399 as > well > For more detail, see > https://github.com/apache/arrow/pull/8340#discussion_r501257096 from > [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
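The goal stated in ARROW-10236, having the planner accept exactly the casts the kernel supports, amounts to routing the plan-time check through a single source of truth. The sketch below illustrates that shape only: `Ty`, `kernel_can_cast` and `plan_cast` are hypothetical names, and the set of supported casts is made up for the example rather than copied from Arrow.

```rust
/// Hypothetical, tiny stand-in for a data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical stand-in for the kernel's "is this cast supported" rule.
/// The supported pairs here are illustrative only.
fn kernel_can_cast(from: Ty, to: Ty) -> bool {
    use Ty::*;
    match (from, to) {
        (a, b) if a == b => true,
        (Int32, Int64) | (Int32, Float64) | (Int64, Float64) => true,
        (Int32, Utf8) | (Int64, Utf8) | (Float64, Utf8) => true,
        (Utf8, Int32) | (Utf8, Int64) | (Utf8, Float64) => true,
        _ => false,
    }
}

/// Plan-time check that defers to the kernel's rule instead of keeping a
/// second, narrower list of valid casts in the planner.
fn plan_cast(from: Ty, to: Ty) -> Result<(Ty, Ty), String> {
    if kernel_can_cast(from, to) {
        Ok((from, to))
    } else {
        Err(format!("Unsupported cast from {:?} to {:?}", from, to))
    }
}
```

Because the planner holds no list of its own, any cast added to the kernel's rule becomes plannable without a second change, which is the implicit behaviour the ticket asks for.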
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10236: --- Component/s: Rust > [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel > - > > Key: ARROW-10236 > URL: https://issues.apache.org/jira/browse/ARROW-10236 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 4h > Remaining Estimate: 0h > > There are plan time checks for valid type casts in DataFusion that are > designed to catch errors early before plan execution > Sadly the cast types that DataFusion thinks are valid is a significant subset > of what the arrow cast kernel supports. The goal of this ticket is to bring > DataFusion to parity with the type casting supported by arrow and allow > DataFusion to plan all casts that are supported by the arrow cast kernel > (I want this implicitly so when I add support for DictionaryArray casts in > Arrow they also are part of DataFusion) > Previously the notions of coercion and casting were somewhat conflated. I > have tried to clarify them in https://github.com/apache/arrow/pull/8399 as > well > For more detail, see > https://github.com/apache/arrow/pull/8340#discussion_r501257096 from > [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10236: --- Affects Version/s: 2.0.0 > [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel > - > > Key: ARROW-10236 > URL: https://issues.apache.org/jira/browse/ARROW-10236 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 2.0.0 >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 4h > Remaining Estimate: 0h > > There are plan time checks for valid type casts in DataFusion that are > designed to catch errors early before plan execution > Sadly the cast types that DataFusion thinks are valid is a significant subset > of what the arrow cast kernel supports. The goal of this ticket is to bring > DataFusion to parity with the type casting supported by arrow and allow > DataFusion to plan all casts that are supported by the arrow cast kernel > (I want this implicitly so when I add support for DictionaryArray casts in > Arrow they also are part of DataFusion) > Previously the notions of coercion and casting were somewhat conflated. I > have tried to clarify them in https://github.com/apache/arrow/pull/8399 as > well > For more detail, see > https://github.com/apache/arrow/pull/8340#discussion_r501257096 from > [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)
[ https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215162#comment-17215162 ] Neville Dipale commented on ARROW-10187: [~andygrove] 64-bit types and offsets would also be a blocker for supporting wasm32. If someone completes ARROW-9453, perhaps we can gauge from that on what effort it takes to support 32-bit. > [Rust] Test failures on 32 bit ARM (Raspberry Pi) > - > > Key: ARROW-10187 > URL: https://issues.apache.org/jira/browse/ARROW-10187 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > Perhaps these failures are to be expected and perhaps we can't really support > 32 bit? > > {code:java} > array::array::tests::test_primitive_array_from_vec stdout > thread 'array::array::tests::test_primitive_array_from_vec' panicked at > 'assertion failed: `(left == right)` > left: `144`, > right: `104`', arrow/src/array/array.rs:2383:9 > array::array::tests::test_primitive_array_from_vec_option stdout > thread 'array::array::tests::test_primitive_array_from_vec_option' panicked > at 'assertion failed: `(left == right)` > left: `224`, > right: `176`', arrow/src/array/array.rs:2409:9 > array::null::tests::test_null_array stdout > thread 'array::null::tests::test_null_array' panicked at 'assertion failed: > `(left == right)` > left: `64`, > right: `32`', arrow/src/array/null.rs:134:9 > array::union::tests::test_dense_union_i32 stdout > thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion > failed: `(left == right)` > left: `1024`, > right: `768`', arrow/src/array/union.rs:704:9 > memory::tests::test_allocate stdout > thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left > == right)` > left: `0`, > right: `32`', arrow/src/memory.rs:243:13 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
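The failures quoted in ARROW-10187 are all assertions on sizes, which suggests (this is an assumption, the ticket does not say so) that the tests hard-code values measured on a 64-bit target. A minimal sketch of the alternative is to derive the expected size from the target's pointer width. `BufferMeta` and `expected_meta_size` are hypothetical and do not correspond to any Arrow type.

```rust
use std::mem::size_of;

/// Hypothetical metadata struct made only of pointer-sized fields.
/// It occupies 24 bytes on a 64-bit target and 12 bytes on a 32-bit one.
#[allow(dead_code)]
struct BufferMeta {
    ptr: *const u8,
    len: usize,
    capacity: usize,
}

/// Expected size computed from the pointer width of the compilation target
/// instead of a constant that is only correct on 64-bit machines.
fn expected_meta_size() -> usize {
    3 * size_of::<usize>()
}
```

A test written against `expected_meta_size()` passes on both 32-bit ARM and x86_64, whereas one written against the literal 24 would fail on the Raspberry Pi.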
[jira] [Assigned] (ARROW-5350) [Rust] Support filtering on primitive/string lists
[ https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-5350: - Assignee: Neville Dipale > [Rust] Support filtering on primitive/string lists > -- > > Key: ARROW-5350 > URL: https://issues.apache.org/jira/browse/ARROW-5350 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > We currently only filter on primitive types, but not on lists and structs. > Add the ability to filter on nested array types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-5350) [Rust] Support filtering on primitive/string lists
[ https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-5350. --- Resolution: Fixed Issue resolved by pull request 8364 [https://github.com/apache/arrow/pull/8364] > [Rust] Support filtering on primitive/string lists > -- > > Key: ARROW-5350 > URL: https://issues.apache.org/jira/browse/ARROW-5350 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > We currently only filter on primitive types, but not on lists and structs. > Add the ability to filter on nested array types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray
[ https://issues.apache.org/jira/browse/ARROW-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10334. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 8484 [https://github.com/apache/arrow/pull/8484] > [Rust] [Parquet] Support reading and writing Arrow NullArray > > > Key: ARROW-10334 > URL: https://issues.apache.org/jira/browse/ARROW-10334 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 2.0.0 >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray
[ https://issues.apache.org/jira/browse/ARROW-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10334: -- Assignee: Neville Dipale > [Rust] [Parquet] Support reading and writing Arrow NullArray > > > Key: ARROW-10334 > URL: https://issues.apache.org/jira/browse/ARROW-10334 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 2.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7842) [Rust] [Parquet] Implement array reader for list type
[ https://issues.apache.org/jira/browse/ARROW-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-7842. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 8449 [https://github.com/apache/arrow/pull/8449] > [Rust] [Parquet] Implement array reader for list type > - > > Key: ARROW-7842 > URL: https://issues.apache.org/jira/browse/ARROW-7842 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Morgan Cassels >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 8h 50m > Remaining Estimate: 0h > > Currently array reader does not support list or map types. The initial PR > implementing array reader https://issues.apache.org/jira/browse/ARROW-4218 > says that list and map support will come later. Is it known when support for > list types might be implemented? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray
Neville Dipale created ARROW-10334: -- Summary: [Rust] [Parquet] Support reading and writing Arrow NullArray Key: ARROW-10334 URL: https://issues.apache.org/jira/browse/ARROW-10334 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 2.0.0 Reporter: Neville Dipale -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support
[ https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10163: --- Component/s: Rust - DataFusion > [Rust] [DataFusion] Add DictionaryArray coercion support > > > Key: ARROW-10163 > URL: https://issues.apache.org/jira/browse/ARROW-10163 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Affects Versions: 2.0.0 >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > --- > There is code in the datafusion physical planner that coerces arguments to > compatible types for some expressions (e.g. for equals: > https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153) > This code needs to be modified to understand dictionary types (so, for > example we can express a predicate like col1 = "foo", where col1 is a > DictionaryArray. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support
[ https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10163: --- Affects Version/s: 2.0.0 > [Rust] [DataFusion] Add DictionaryArray coercion support > > > Key: ARROW-10163 > URL: https://issues.apache.org/jira/browse/ARROW-10163 > Project: Apache Arrow > Issue Type: Sub-task >Affects Versions: 2.0.0 >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > --- > There is code in the datafusion physical planner that coerces arguments to > compatible types for some expressions (e.g. for equals: > https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153) > This code needs to be modified to understand dictionary types (so, for > example we can express a predicate like col1 = "foo", where col1 is a > DictionaryArray. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10002: -- Assignee: Jorge Leitão > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Trait specialization is widely used in the Rust Arrow implementation. Uses > can be identified by searching for instances of {{default fn}} in the > codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] > , primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there > has been further discussion and ideas for resolving the soundness issue, but > to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10002. Resolution: Fixed Issue resolved by pull request 8485 [https://github.com/apache/arrow/pull/8485] > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Trait specialization is widely used in the Rust Arrow implementation. Uses > can be identified by searching for instances of {{default fn}} in the > codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] > , primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there > has been further discussion and ideas for resolving the soundness issue, but > to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
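One common way to avoid `default fn` on stable Rust is to make the per-type behaviour a required trait method and implement it for each concrete type, so no blanket implementation needs overriding. The sketch below shows that pattern only; the `Encode` trait and `encode_all` function are hypothetical, and this is not necessarily the approach taken in pull request 8485.

```rust
/// Hypothetical trait: every supported type provides its own encoding,
/// so there is no blanket impl with `default fn` to specialize.
trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

impl Encode for i32 {
    fn encode(&self, out: &mut Vec<u8>) {
        // Fixed-width little-endian encoding.
        out.extend_from_slice(&self.to_le_bytes());
    }
}

impl Encode for bool {
    fn encode(&self, out: &mut Vec<u8>) {
        // One byte per value, purely for illustration.
        out.push(*self as u8);
    }
}

/// Generic code dispatches statically through the trait bound.
fn encode_all<T: Encode>(values: &[T]) -> Vec<u8> {
    let mut out = Vec::new();
    for value in values {
        value.encode(&mut out);
    }
    out
}
```

The trade-off is some repetition across impls (often reduced with a macro) in exchange for compiling on stable.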
[jira] [Commented] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion
[ https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216350#comment-17216350 ] Neville Dipale commented on ARROW-10159: [~alamb] if there aren't more subtasks, we can mark this as completed. Thanks for getting this done > [Rust][DataFusion] Add support for Dictionary types in data fusion > -- > > Key: ARROW-10159 > URL: https://issues.apache.org/jira/browse/ARROW-10159 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We have a system that needs to process low cardinality string data (aka there > are only a few distinct values, but there are many millions of values). > Using a `StringArray` is very expensive as the same string value is copied > over and over again. The `DictionaryArray` was exactly designed to handle > this situation: rather than repeating each string, it uses indexes into a > dictionary and thus repeats integer values. > Sadly, DataFusion does not support processing on `DictionaryArray` types for > several reasons. > This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I > would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
>     // ensure that data fusion can operate on dictionary types
>     // Use StringDictionary (32 bit indexes = keys)
>     let field_type = DataType::Dictionary(
>         Box::new(DataType::Int32),
>         Box::new(DataType::Utf8),
>     );
>     let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
>     let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
>     let values_builder = StringBuilder::new(10);
>     let mut builder = StringDictionaryBuilder::new(
>         keys_builder, values_builder
>     );
>     builder.append("one")?;
>     builder.append_null()?;
>     builder.append("three")?;
>     let array = Arc::new(builder.finish());
>     let data = RecordBatch::try_new(
>         schema.clone(),
>         vec![array],
>     )?;
>     let table = MemTable::new(schema, vec![vec![data]])?;
>     let mut ctx = ExecutionContext::new();
>     ctx.register_table("test", Box::new(table));
>     // Basic SELECT
>     let sql = "SELECT * FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\nNULL\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // basic filtering
>     let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // filtering with constant
>     let sql = "SELECT * FROM test WHERE d1 = 'three'";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // Expression evaluation
>     let sql = "SELECT concat(d1, '-foo') FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
>     assert_eq!(expected, actual);
>     // aggregation
>     let sql = "SELECT COUNT(d1) FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "2".to_string();
>     assert_eq!(expected, actual);
>     Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> query_on_string_dictionary stdout
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
> left: `"\"one\"\nNULL\n\"three\""`,
> right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I > will break the work down into several smaller subtasks -- This message was sent by Atlassian Jira (v8.3.4#803005)
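The saving that ARROW-10159 attributes to `DictionaryArray` (storing each distinct string once and repeating small integer keys) can be shown with a standard-library sketch. `dictionary_encode` is a hypothetical function that models the encoding with plain vectors; it is not the `StringDictionaryBuilder` API used in the test above.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary encoder: returns one optional key per input slot
/// plus the list of distinct values, in order of first appearance.
/// Null slots get a `None` key and add nothing to the dictionary.
fn dictionary_encode<'a>(input: &[Option<&'a str>]) -> (Vec<Option<i32>>, Vec<&'a str>) {
    let mut values: Vec<&'a str> = Vec::new();
    let mut index: HashMap<&'a str, i32> = HashMap::new();
    let mut keys = Vec::with_capacity(input.len());
    for item in input {
        let key = match *item {
            Some(s) => {
                // Key the string would receive if it is new.
                let next = values.len() as i32;
                let k = *index.entry(s).or_insert(next);
                if k == next {
                    values.push(s);
                }
                Some(k)
            }
            None => None,
        };
        keys.push(key);
    }
    (keys, values)
}
```

For millions of rows drawn from a handful of distinct strings, the keys vector grows with the row count while the values vector stays at the number of distinct strings.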
[jira] [Resolved] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support
[ https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10163. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 8463 [https://github.com/apache/arrow/pull/8463] > [Rust] [DataFusion] Add DictionaryArray coercion support > > > Key: ARROW-10163 > URL: https://issues.apache.org/jira/browse/ARROW-10163 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > --- > There is code in the datafusion physical planner that coerces arguments to > compatible types for some expressions (e.g. for equals: > https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153) > This code needs to be modified to understand dictionary types (so, for > example we can express a predicate like col1 = "foo", where col1 is a > DictionaryArray. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion
[ https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10159: --- Component/s: Rust - DataFusion Rust > [Rust][DataFusion] Add support for Dictionary types in data fusion > -- > > Key: ARROW-10159 > URL: https://issues.apache.org/jira/browse/ARROW-10159 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We have a system that needs to process low cardinality string data (aka there > are only a few distinct values, but there are many millions of values). > Using a `StringArray` is very expensive as the same string value is copied > over and over again. The `DictionaryArray` was designed to handle exactly > this situation: rather than repeating each string, it uses indexes into a > dictionary and thus repeats integer values. > Sadly, DataFusion does not support processing on `DictionaryArray` types for > several reasons. 
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I > would like to be possible: > {code} > #[tokio::test] > async fn query_on_string_dictionary() -> Result<()> { > // ensure that data fusion can operate on dictionary types > // Use StringDictionary (32 bit indexes = keys) > let field_type = DataType::Dictionary( > Box::new(DataType::Int32), > Box::new(DataType::Utf8), > ); > let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, > true)])); > let keys_builder = PrimitiveBuilder::<Int32Type>::new(10); > let values_builder = StringBuilder::new(10); > let mut builder = StringDictionaryBuilder::new( > keys_builder, values_builder > ); > builder.append("one")?; > builder.append_null()?; > builder.append("three")?; > let array = Arc::new(builder.finish()); > let data = RecordBatch::try_new( > schema.clone(), > vec![array], > )?; > let table = MemTable::new(schema, vec![vec![data]])?; > let mut ctx = ExecutionContext::new(); > ctx.register_table("test", Box::new(table)); > // Basic SELECT > let sql = "SELECT * FROM test"; > let actual = execute(&mut ctx, sql).await.join("\n"); > let expected = "\"one\"\nNULL\n\"three\"".to_string(); > assert_eq!(expected, actual); > // basic filtering > let sql = "SELECT * FROM test WHERE d1 IS NOT NULL"; > let actual = execute(&mut ctx, sql).await.join("\n"); > let expected = "\"one\"\n\"three\"".to_string(); > assert_eq!(expected, actual); > // filtering with constant > let sql = "SELECT * FROM test WHERE d1 = 'three'"; > let actual = execute(&mut ctx, sql).await.join("\n"); > let expected = "\"three\"".to_string(); > assert_eq!(expected, actual); > // Expression evaluation > let sql = "SELECT concat(d1, '-foo') FROM test"; > let actual = execute(&mut ctx, sql).await.join("\n"); > let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string(); > assert_eq!(expected, actual); > // aggregation > let sql = "SELECT COUNT(d1) FROM test"; > let actual = execute(&mut ctx, sql).await.join("\n"); > let expected = "2".to_string(); > 
assert_eq!(expected, actual); > Ok(()) > } > {code} > However, it errors immediately: > {code} > query_on_string_dictionary stdout > thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == > right)` > left: `"\"one\"\nNULL\n\"three\""`, > right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} > This ticket tracks adding proper support for Dictionary types to DataFusion. I > will break the work down into several smaller subtasks -- This message was sent by Atlassian Jira (v8.3.4#803005)
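For readers unfamiliar with the idea behind `DictionaryArray`, the following is a minimal std-only sketch of dictionary encoding as described in the ticket: each distinct string is stored once and rows hold integer keys. `DictColumn`, `dictionary_encode` and `value_at` are hypothetical names made up for illustration; they are not the Arrow or DataFusion API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a dictionary-encoded string column:
/// `values` holds each distinct string once, `keys` holds one index per row
/// (`None` marks a null row).
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<Option<i32>>,
    values: Vec<String>,
}

/// Encode rows by giving each distinct string the next free key.
fn dictionary_encode(rows: &[Option<&str>]) -> DictColumn {
    let mut lookup: HashMap<&str, i32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            None => keys.push(None),
            Some(s) => {
                // Reuse the existing key, or append the string to the dictionary.
                let key = *lookup.entry(*s).or_insert_with(|| {
                    values.push((*s).to_string());
                    (values.len() - 1) as i32
                });
                keys.push(Some(key));
            }
        }
    }
    DictColumn { keys, values }
}

/// Decode a single row back to its string, or `None` if the row is null.
fn value_at(col: &DictColumn, row: usize) -> Option<&str> {
    col.keys[row].map(|k| col.values[k as usize].as_str())
}
```

With millions of rows and only a handful of distinct strings, the `values` vector stays tiny while `keys` costs four bytes per row, which is the saving the ticket is after.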
[jira] [Created] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType
Neville Dipale created ARROW-10261: -- Summary: [Rust] [BREAKING] Lists should take Field instead of DataType Key: ARROW-10261 URL: https://issues.apache.org/jira/browse/ARROW-10261 Project: Apache Arrow Issue Type: Sub-task Components: Integration, Rust Affects Versions: 1.0.1 Reporter: Neville Dipale There is currently no way of tracking nested field metadata on lists. For example, if a list's children are nullable, there's no way of telling just by looking at the Field. This causes problems with integration testing, and also affects Parquet roundtrips. I propose the breaking change of [Large|FixedSize]List taking a Field instead of Box<DataType>, as this will overcome this issue, and ensure that the Rust implementation passes integration tests. CC [~andygrove] [~jorgecarleitao] [~alamb] [~jhorstmann] ([~carols10cents] as this addresses some of the roundtrip failures). I'm leaning towards this landing in 3.0.0, as I'd love for us to have completed or made significant traction on the Arrow Parquet writer (and reader), and integration testing, by then. -- This message was sent by Atlassian Jira (v8.3.4#803005)
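The proposal can be pictured with a small sketch. The `DataType` and `Field` types below are simplified, hypothetical stand-ins (not the real arrow crate definitions), and `list_child_nullable` is an illustrative helper; the point is only that a list whose child is a full `Field` keeps the child's nullability, which a bare child type cannot express.

```rust
/// Hypothetical, simplified schema types for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Struct(Vec<Field>),
    /// Proposed shape: the list carries its child as a full `Field`,
    /// so the child's name and nullability travel with the type.
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Returns the nullability of a list's child, or `None` for non-list fields.
fn list_child_nullable(field: &Field) -> Option<bool> {
    match &field.data_type {
        DataType::List(child) => Some(child.nullable),
        _ => None,
    }
}
```

Under this shape a non-nullable list of nullable structs is representable directly: the outer `Field` has `nullable: false` while the boxed child `Field` has `nullable: true`.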
[jira] [Created] (ARROW-10258) [Rust] Support extension arrays
Neville Dipale created ARROW-10258: -- Summary: [Rust] Support extension arrays Key: ARROW-10258 URL: https://issues.apache.org/jira/browse/ARROW-10258 Project: Apache Arrow Issue Type: New Feature Components: Integration, Rust Affects Versions: 1.0.1 Reporter: Neville Dipale This should include: * supporting the Arrow format * supporting field metadata We can optionally: * support recognising known extensions (like UUID) I'm mainly opening this up for wider visibility, I noticed that I was catching strays from metadata integration tests failing because Field doesn't support metadata :( -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10258) [Rust] Support extension arrays
[ https://issues.apache.org/jira/browse/ARROW-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10258: --- Fix Version/s: 3.0.0 > [Rust] Support extension arrays > --- > > Key: ARROW-10258 > URL: https://issues.apache.org/jira/browse/ARROW-10258 > Project: Apache Arrow > Issue Type: New Feature > Components: Integration, Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > Fix For: 3.0.0 > > > This should include: > * supporting the Arrow format > * supporting field metadata > We can optionally: > * support recognising known extensions (like UUID) > I'm mainly opening this up for wider visibility, I noticed that I was > catching strays from metadata integration tests failing because Field doesn't > support metadata :( -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10259) [Rust] Support field metadata
Neville Dipale created ARROW-10259: -- Summary: [Rust] Support field metadata Key: ARROW-10259 URL: https://issues.apache.org/jira/browse/ARROW-10259 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale The biggest hurdle to adding field metadata is HashMap and HashSet not implementing Hash, Ord and PartialOrd. I was thinking of implementing the metadata as a Vec<(String, String)> to overcome this limitation, and then serializing correctly to JSON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
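A rough sketch of the `Vec<(String, String)>` idea follows. The `Field` struct here is a hypothetical, stripped-down stand-in, and sorting the pairs in the constructor is an assumption added so that insertion order does not affect equality or hashing; the ticket itself does not specify that detail.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical field type: because metadata is a Vec of pairs rather than a
/// HashMap, all of these traits can simply be derived.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Field {
    name: String,
    nullable: bool,
    metadata: Vec<(String, String)>,
}

impl Field {
    /// Sorts the pairs so two fields built from the same metadata in a
    /// different order compare (and hash) as equal. This is an assumption of
    /// the sketch, not something stated in the ticket.
    fn with_metadata(name: &str, nullable: bool, mut metadata: Vec<(String, String)>) -> Self {
        metadata.sort();
        Field { name: name.to_string(), nullable, metadata }
    }
}

/// Small helper to compute a hash value for any `Hash` type.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

`std::collections::BTreeMap<String, String>` would be another option with the same property, since it also implements `Hash` and `Ord` when its keys and values do.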
[jira] [Created] (ARROW-10269) [Rust] Update nightly: Oct 2020 Edition
Neville Dipale created ARROW-10269: -- Summary: [Rust] Update nightly: Oct 2020 Edition Key: ARROW-10269 URL: https://issues.apache.org/jira/browse/ARROW-10269 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Neville Dipale We should update to a more recent nightly after the 2.0.0 release. It carries some clippy annoyances, which will mean that I have to revert much of what I did around float comparisons. Might also be preferable to do this sooner, so that we can complete the clippy integration and throw away the carrot in favour of the stick. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10268) [Rust] Support writing dictionaries to IPC file and stream
Neville Dipale created ARROW-10268: -- Summary: [Rust] Support writing dictionaries to IPC file and stream Key: ARROW-10268 URL: https://issues.apache.org/jira/browse/ARROW-10268 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.1 Reporter: Neville Dipale We currently do not support writing dictionary arrays to the IPC file and stream format. When this is supported, we can test the integration with other implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType
[ https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211654#comment-17211654 ] Neville Dipale commented on ARROW-10261: [~jhorstmann] nullability should be determined by the overall field for consistency; as you could have 1000 batches of 1000 records, but only have say 5 nulls scattered around. The main issue is that if I have a non-nullable list, which in turn has a nullable struct with various child fields with differing nullability; I won't know if the struct is nullable, because I lose that information when only taking the field. Also, in the hypothetical case where the struct has some metadata of its own, it gets lost because we would only keep the DataType, and not other attributes such as dictionary or metadata (HashMap). Interestingly, looking at the CPP implementation, it looks like they still use List, but I can't see how they preserve the extra details that the Rust implementation is failing because of. [~apitrou] any ideas? > [Rust] [BREAKING] Lists should take Field instead of DataType > - > > Key: ARROW-10261 > URL: https://issues.apache.org/jira/browse/ARROW-10261 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > > There is currently no way of tracking nested field metadata on lists. For > example, if a list's children are nullable, there's no way of telling just by > looking at the Field. > This causes problems with integration testing, and also affects Parquet > roundtrips. > I propose the breaking change of [Large|FixedSize]List taking a Field instead > of Box<DataType>, as this will overcome this issue, and ensure that the Rust > implementation passes integration tests. > CC [~andygrove] [~jorgecarleitao] [~alamb] [~jhorstmann] ([~carols10cents] > as this addresses some of the roundtrip failures). 
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have > completed or made significant traction on the Arrow Parquet writer (and > reader), and integration testing, by then. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Priority: Blocker (was: Major) > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Blocker > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Affects Version/s: 1.0.1 > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Major > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Component/s: Rust > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10271: --- Fix Version/s: 2.0.0 > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10271: -- Assignee: Neville Dipale > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project
[ https://issues.apache.org/jira/browse/ARROW-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211839#comment-17211839 ] Neville Dipale commented on ARROW-10271: I was planning on doing a pass to check if there are dependencies that we could bump. I'm aware of the packed_simd_2 change, and was planning on addressing it. While we use an old nightly (call it a six-monthly at this stage), this issue will definitely break a lot of code for users. > [Rust] packed_simd is broken and continued under a new project > -- > > Key: ARROW-10271 > URL: https://issues.apache.org/jira/browse/ARROW-10271 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 1.0.1 >Reporter: Ritchie >Assignee: Neville Dipale >Priority: Blocker > Fix For: 2.0.0 > > > The dependency doesn't compile on newer versions of nightly. This is also > known by the (new) project maintainers. Due to complications they continued > the project under a new name: `packed_simd_2`. > > packed_simd = { version = "0.3.4", package = "packed_simd_2" } > > See: > https://github.com/rust-lang/packed_simd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10274) [Rust] arithmetic without SIMD does unnecesary copy
[ https://issues.apache.org/jira/browse/ARROW-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10274: --- Component/s: Rust > [Rust] arithmetic without SIMD does unnecesary copy > --- > > Key: ARROW-10274 > URL: https://issues.apache.org/jira/browse/ARROW-10274 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Ritchie >Priority: Minor > > The arithmetic kernels that don't use SIMD create a `Vec` in memory and later > copy that data into a Buffer. Maybe we could directly write the arithmetic > result to a mutable buffer and prevent this redundant copy? > > > {code:java} > let values = (0..left.len()) > .map(|i| op(left.value(i), right.value(i))) > .collect::<Vec<T::Native>>(); > > > let data = ArrayData::new( > T::get_data_type(), > left.len(), > None, > null_bit_buffer, > 0, > vec![Buffer::from(values.to_byte_slice())], > vec![], > );{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
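The suggestion can be illustrated without the Arrow types. In this std-only sketch a plain `Vec<u8>` stands in for Arrow's buffer, and `add_with_copy` and `add_direct` are hypothetical names: the first mirrors the collect-then-copy pattern from the ticket, the second writes each result straight into the byte buffer.

```rust
/// Mirrors the pattern in the ticket: collect results into a temporary Vec,
/// then copy its bytes into the output buffer (two allocations, one extra copy).
fn add_with_copy(left: &[i32], right: &[i32]) -> Vec<u8> {
    let values: Vec<i32> = left
        .iter()
        .zip(right)
        .map(|(l, r)| l.wrapping_add(*r))
        .collect();
    let mut bytes = Vec::with_capacity(values.len() * std::mem::size_of::<i32>());
    for v in &values {
        bytes.extend_from_slice(&v.to_ne_bytes());
    }
    bytes
}

/// Writes each result directly into the byte buffer, so no intermediate Vec
/// of values is built. `Vec<u8>` is only a stand-in for a mutable buffer.
fn add_direct(left: &[i32], right: &[i32]) -> Vec<u8> {
    let len = left.len().min(right.len());
    let mut bytes = Vec::with_capacity(len * std::mem::size_of::<i32>());
    for (l, r) in left.iter().zip(right) {
        bytes.extend_from_slice(&l.wrapping_add(*r).to_ne_bytes());
    }
    bytes
}
```

Both functions produce the same bytes; whether the direct version is measurably faster in the real kernels would need benchmarking.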
[jira] [Resolved] (ARROW-10168) [Rust] [Parquet] Extend arrow schema conversion to projected fields
[ https://issues.apache.org/jira/browse/ARROW-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10168. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8354 [https://github.com/apache/arrow/pull/8354] > [Rust] [Parquet] Extend arrow schema conversion to projected fields > --- > > Key: ARROW-10168 > URL: https://issues.apache.org/jira/browse/ARROW-10168 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When writing Arrow data to Parquet, we serialise the schema's IPC > representation. This schema is then read back by the Parquet reader, and used > to preserve the array type information from the original Arrow data. > We however do not rely on the above mechanism when reading projected columns > from a Parquet file; i.e. if we have a file with 3 columns, but we only read > 2 columns, we do not yet rely on the serialised arrow schema; and can thus > lose type information. > This behaviour was deliberately left out, as the function > *parquet_to_arrow_schema_by_columns* does not check for the existence of > arrow schema in the metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
Neville Dipale created ARROW-10225: -- Summary: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests Key: ARROW-10225 URL: https://issues.apache.org/jira/browse/ARROW-10225 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.1 Reporter: Neville Dipale The Arrow spec makes the null bitmap optional if an array has no nulls [~carols10cents], so the tests that were failing were because we're comparing `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
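One way to picture the fix is a comparison that treats a missing bitmap as "all valid". The sketch below is std-only and hypothetical (`get_bit`, `all_valid` and `validity_equal` are illustrative names, and bitmaps are plain byte slices with least-significant-bit-first numbering, as in the Arrow format); it is not the code from the eventual pull request.

```rust
/// Reads bit `i` from a packed bitmap, least-significant bit first.
fn get_bit(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// True if the first `len` bits are all set, i.e. the array has no nulls.
fn all_valid(bitmap: &[u8], len: usize) -> bool {
    (0..len).all(|i| get_bit(bitmap, i))
}

/// Compares two optional validity bitmaps over `len` slots, treating `None`
/// as equivalent to a fully populated bitmap.
fn validity_equal(a: Option<&[u8]>, b: Option<&[u8]>, len: usize) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(bits), None) | (None, Some(bits)) => all_valid(bits, len),
        (Some(x), Some(y)) => (0..len).all(|i| get_bit(x, i) == get_bit(y, i)),
    }
}
```

Only the first `len` bits are compared, so trailing padding bits in the last byte do not cause spurious mismatches.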
[jira] [Assigned] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
[ https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10225: -- Assignee: Neville Dipale > [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests > --- > > Key: ARROW-10225 > URL: https://issues.apache.org/jira/browse/ARROW-10225 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > > The Arrow spec makes the null bitmap optional if an array has no nulls > [~carols10cents], so the tests that were failing were because we're comparing > `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10350) [Rust] parquet_derive crate cannot be published to crates.io
[ https://issues.apache.org/jira/browse/ARROW-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217141#comment-17217141 ] Neville Dipale commented on ARROW-10350: I added them as part of another commit, but the pre-release tests were failing. I couldn't figure out what the problem was, so I reverted the changes. I think it's fine that we don't have the crate published as part of this release. Users can still use it from git for now. > [Rust] parquet_derive crate cannot be published to crates.io > > > Key: ARROW-10350 > URL: https://issues.apache.org/jira/browse/ARROW-10350 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 2.0.0 >Reporter: Andy Grove >Priority: Major > Fix For: 3.0.0 > > > The new parquet_derive crate is missing some fields in the Cargo manifest so > cannot be published. > {code:java} >Uploading parquet_derive v2.0.0 > (/home/andygrove/arrow-release/apache-arrow-2.0.0/rust/parquet_derive) > error: api errors (status 200 OK): missing or empty metadata fields: > description, license. Please see > https://doc.rust-lang.org/cargo/reference/manifest.html for how to upload > metadata > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9742) [Rust] Create one standard DataFrame API
[ https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179274#comment-17179274 ] Neville Dipale commented on ARROW-9742: --- Hi [~jhorstmann], the scalar functions on the rust-dataframe library mainly call the Arrow compute functions. As we have implemented compute functions with an array being the smallest unit, I iterate the chunked arrays and call scalar functions on the arrays, before grouping them again into a chunk. I explored using Rayon for parallelising those compute functions, but it's not a priority (the project is really for me to explore ideas, with the goal being to create a lazy dataframe à la Spark). There's scope to add a lot of compute functions to Arrow so that downstream users can reuse them, and so we can optimise performance from one place. I haven't yet seen interest in functions like trig, temporal functions (I have a Jira open for this as I tend to do a lot of datetime conversions), and other functions beyond what we have. I think DF has some of these as UDFs, which probably makes sense to keep them there for now. Regarding performance, we've found some patterns that help with autovectorisation when writing compute functions, I think at the least we could write them up so that downstream users can at least follow them. One common mistake I've seen is that we iterate through array values, checking if a slot is valid or null, and computing the function if valid. An approach that works is to ignore nulls and calculate them from the validity mask. 
> [Rust] Create one standard DataFrame API > > > Key: ARROW-9742 > URL: https://issues.apache.org/jira/browse/ARROW-9742 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > There was a discussion in last Arrow sync call about the fact that there are > numerous Rust DataFrame projects and it would be good to have one standard, > in the Arrow repo. > I do think it would be good to have a DataFrame trait in Arrow, with an > implementation in DataFusion, and making it possible for other projects to > extend/replace the implementation e.g. for distributed compute, or for GPU > compute, as two examples. > [~jhorstmann] Does this capture what you were suggesting in the call? -- This message was sent by Atlassian Jira (v8.3.4#803005)
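The null-handling pattern mentioned in the comment above can be sketched with plain slices. This is a hypothetical illustration (`add_ignoring_nulls` is a made-up name, and validity is a packed byte bitmap where a set bit means "valid"), not code from Arrow: values are computed for every slot in a branch-free loop, and the output validity is derived separately by AND-ing the input bitmaps.

```rust
/// Adds two columns without inspecting validity per slot.
/// Values in null slots are still computed but are meaningless; the returned
/// bitmap (bitwise AND of the inputs) says which slots are valid.
fn add_ignoring_nulls(
    left: &[i32],
    left_valid: &[u8],
    right: &[i32],
    right_valid: &[u8],
) -> (Vec<i32>, Vec<u8>) {
    // Tight loop with no branches: friendly to autovectorisation.
    let values: Vec<i32> = left
        .iter()
        .zip(right)
        .map(|(l, r)| l.wrapping_add(*r))
        .collect();
    // A slot is valid only if it is valid in both inputs.
    let validity: Vec<u8> = left_valid
        .iter()
        .zip(right_valid)
        .map(|(a, b)| a & b)
        .collect();
    (values, validity)
}
```

The trade-off is that some work is spent on slots that turn out to be null, which is usually cheaper than a branch per element.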
[jira] [Created] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format
Neville Dipale created ARROW-9777: - Summary: [Rust] Implement IPC changes to catch up to 1.0.0 format Key: ARROW-9777 URL: https://issues.apache.org/jira/browse/ARROW-9777 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale There are a number of IPC changes and features which the Rust implementation has fallen behind on. It's effectively using the legacy format that was released in 0.14.x. Some that I encountered are: * change padding from 4 bytes to 8 bytes (along with the padding algorithm) * add an IPC writer option to support the legacy format and updated format * add error handling for the different metadata versions, we should support v4+ so it's an oversight to not explicitly return errors if unsupported versions are read Some of the work already has Jiras open (e.g. body compression), I'll find them and mark them as related to this. I'm tight for spare time, but I'll try work on this before the next release (along with the Parquet writer) -- This message was sent by Atlassian Jira (v8.3.4#803005)
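As a small illustration of the first bullet, rounding a length up to a multiple of the alignment can be written as below. This is a generic round-up sketch with hypothetical helper names (`padded_len`, `padding_needed`); the ticket does not spell out the exact padding algorithm the IPC writer should use, so treat it as an assumption.

```rust
/// Rounds `len` up to the next multiple of `alignment`.
/// `alignment` must be a power of two (4 for the legacy layout mentioned
/// above, 8 for the updated one).
fn padded_len(len: usize, alignment: usize) -> usize {
    debug_assert!(alignment.is_power_of_two());
    (len + alignment - 1) & !(alignment - 1)
}

/// Number of zero bytes to append after `len` bytes of payload.
fn padding_needed(len: usize, alignment: usize) -> usize {
    padded_len(len, alignment) - len
}
```

For example, a 5-byte payload needs 3 bytes of padding under either alignment, while a 9-byte payload needs 3 bytes under 4-byte alignment and 7 bytes under 8-byte alignment.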
[jira] [Resolved] (ARROW-8423) [Rust] [Parquet] Serialize arrow schema into metadata when writing parquet
[ https://issues.apache.org/jira/browse/ARROW-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-8423. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 7917 [https://github.com/apache/arrow/pull/7917] > [Rust] [Parquet] Serialize arrow schema into metadata when writing parquet > -- > > Key: ARROW-8423 > URL: https://issues.apache.org/jira/browse/ARROW-8423 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andy Grove >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The C++ implementation uses "ARROW:schema" as a value to store the arrow > schema as metadata. Implement same for compatibility. > Having the original Arrow schema is useful for readers as it preserves some > properties like dictionary encoding. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9728) [Rust] [Parquet] Compute nested spacing
[ https://issues.apache.org/jira/browse/ARROW-9728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9728: - Assignee: Neville Dipale > [Rust] [Parquet] Compute nested spacing > --- > > Key: ARROW-9728 > URL: https://issues.apache.org/jira/browse/ARROW-9728 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > > When computing definition levels for deeply nested arrays that include lists, > the definition levels are correctly calculated, but they are not translated > into correct indexes for the eventual primitive arrays. > For example, an int32 array could have no null values, but be a child of a > list that has null values. If say the first 5 values of the int32 array are > members of the first list item (i.e. list_array[0] = [1,2,3,4,5], and that > list is itself a child of a struct whose index is null, the whole 5 values of > the int32 array *should* be skipped. Further, the list's definition and > repetition levels will be represented by 1 slot instead of the 5. > The current logic cannot cater for this, and potentially results in slicing > the int32 array incorrectly (sometimes including some of those first 5 > values). > This Jira is for the work necessary to compute the index into the eventual > leaf arrays correctly. > I started doing it as part of the initial writer PR, but it's complex and is > blocking progress. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files
[ https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181303#comment-17181303 ] Neville Dipale commented on ARROW-5213: --- Which reminds me, I made updates to the files a while ago. I had to make manual changes because the flatbuffers crate hasn't been updated with some changes. I'll update the checked-in files when I look at the IPC changes that we haven't worked on in Rust > [Format] Script for updating various checked-in Flatbuffers files > - > > Key: ARROW-5213 > URL: https://issues.apache.org/jira/browse/ARROW-5213 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools, Format, Go >Reporter: Wes McKinney >Priority: Minor > Fix For: 2.0.0 > > > Some subprojects have begun checking in generated Flatbuffers files to source > control. This presents a maintainability issue when there are additions or > changes made to the .fbs sources. It would be useful to be able to automate > the update of these files so it doesn't have to happen on a manual / > case-by-case basis -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9841) [Rust] Update checked-in flatbuffer files
Neville Dipale created ARROW-9841: - Summary: [Rust] Update checked-in flatbuffer files Key: ARROW-9841 URL: https://issues.apache.org/jira/browse/ARROW-9841 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale We can't automatically generate flatbuffer files in Rust due to a bug with required fields. The currently checked-in generated files are outdated, and should either be updated manually or by building the flatbuffers project from master in order to update them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format
[ https://issues.apache.org/jira/browse/ARROW-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9777: -- Issue Type: Improvement (was: Bug) > [Rust] Implement IPC changes to catch up to 1.0.0 format > > > Key: ARROW-9777 > URL: https://issues.apache.org/jira/browse/ARROW-9777 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Priority: Major > > There are a number of IPC changes and features which the Rust implementation > has fallen behind on. It's effectively using the legacy format that was > released in 0.14.x. > Some that I encountered are: > * change padding from 4 bytes to 8 bytes (along with the padding algorithm) > * add an IPC writer option to support the legacy format and updated format > * add error handling for the different metadata versions, we should support > v4+ so it's an oversight to not explicitly return errors if unsupported > versions are read > Some of the work already has Jiras open (e.g. body compression), I'll find > them and mark them as related to this. > I'm tight for spare time, but I'll try work on this before the next release > (along with the Parquet writer) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray
[ https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183712#comment-17183712 ] Neville Dipale commented on ARROW-9826: --- Once arrays are built, they're meant to be immutable. Wouldn't this better belong in ArrayBuilder? > [Rust] add set function to PrimitiveArray > - > > Key: ARROW-9826 > URL: https://issues.apache.org/jira/browse/ARROW-9826 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 1.0.0 >Reporter: Francesco Gadaleta >Priority: Major > > For in-place value replacement in Array, a `set()` function (maybe unsafe?) > would be required. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment
Neville Dipale created ARROW-9848: - Summary: [Rust] Implement changes to ensure flatbuffer alignment Key: ARROW-9848 URL: https://issues.apache.org/jira/browse/ARROW-9848 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 1.0.0 Reporter: Neville Dipale See ARROW-6313, changes were made to all IPC implementations except for Rust -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10019) [Rust] Add substring kernel
[ https://issues.apache.org/jira/browse/ARROW-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10019. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8199 [https://github.com/apache/arrow/pull/8199] > [Rust] Add substring kernel > --- > > Key: ARROW-10019 > URL: https://issues.apache.org/jira/browse/ARROW-10019 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > substring returns a substring of a StringArray starting at a given index, and > with a given optional length. > {{fn substring(array: , start: i32, length: ) -> > Result}} > This operation is common in strings, and it is useful for string-based > transformations -- This message was sent by Atlassian Jira (v8.3.4#803005)
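The substring semantics described in ARROW-10019 above can be sketched without the arrow crate. This is not the kernel merged in the pull request: a slice of `Option<&str>` stands in for a StringArray, indices are counted in characters, and the handling of a negative `start` (counting back from the end) is an assumption made for illustration.

```rust
/// Take a substring of every non-null slot.
/// `values` is a hypothetical stand-in for a string array; `None` is a null slot.
pub fn substring(
    values: &[Option<&str>],
    start: i32,
    length: Option<usize>,
) -> Vec<Option<String>> {
    values
        .iter()
        .map(|slot| {
            slot.map(|s| {
                let char_count = s.chars().count() as i64;
                // Assumption: a negative start counts back from the end of the string.
                let begin_signed = if start >= 0 {
                    (start as i64).min(char_count)
                } else {
                    (char_count + start as i64).max(0)
                };
                let begin = begin_signed as usize;
                let chars = s.chars().skip(begin);
                let out: String = match length {
                    Some(n) => chars.take(n).collect(),
                    None => chars.collect(),
                };
                out
            })
        })
        .collect()
}
```

Null slots stay null, and a start beyond the end of a string yields an empty string rather than an error.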
[jira] [Commented] (ARROW-10108) [Rust] [Parquet] Fix compiler warning about unused return value
[ https://issues.apache.org/jira/browse/ARROW-10108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204366#comment-17204366 ] Neville Dipale commented on ARROW-10108: There are a lot of clippy issues with the latest nightly too, perhaps we could move to the latest nightly after the 2.0.0 release? I'd also like for us to perform destructive refactors as soon as possible after the release (rearranging the arrow::array module by splitting arrays into their own files) > [Rust] [Parquet] Fix compiler warning about unused return value > --- > > Key: ARROW-10108 > URL: https://issues.apache.org/jira/browse/ARROW-10108 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > When compiling with latest nightly, this warning is produced: > {code:java} > warning: unused return value of `std::mem::replace` that must be used >--> parquet/src/encodings/encoding.rs:391:9 > | > 391 | mem::replace( self.hash_slots, new_hash_slots); > | ^^^ > | > = note: `#[warn(unused_must_use)]` on by default > = note: if you don't need the old value, you can just assign the new > value directly {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
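The compiler note quoted in ARROW-10108 above suggests assigning directly when the old value is not needed. A minimal sketch of that fix follows; the `Encoder` struct and its methods are hypothetical and only borrow the `hash_slots` field name from the warning.

```rust
use std::mem;

/// Hypothetical stand-in for the encoder that owns `hash_slots`.
pub struct Encoder {
    hash_slots: Vec<i32>,
}

impl Encoder {
    /// The old slots are not needed, so a plain assignment avoids the
    /// `unused_must_use` warning that `mem::replace` would trigger here.
    pub fn resize(&mut self, new_hash_slots: Vec<i32>) {
        self.hash_slots = new_hash_slots;
    }

    /// `mem::replace` remains the right tool when the old value is used.
    pub fn swap_out(&mut self, new_hash_slots: Vec<i32>) -> Vec<i32> {
        mem::replace(&mut self.hash_slots, new_hash_slots)
    }
}
```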
[jira] [Updated] (ARROW-8421) [Rust] [Parquet] Implement parquet writer
[ https://issues.apache.org/jira/browse/ARROW-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-8421: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] [Parquet] Implement parquet writer > - > > Key: ARROW-8421 > URL: https://issues.apache.org/jira/browse/ARROW-8421 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the parent story. See subtasks for more information. > Notes from [~wesm] : > A couple of initial things to keep in mind > * Writes of both Nullable (OPTIONAL) and non-nullable (REQUIRED) fields > * You can optimize the special case where a nullable field's data has no > nulls > * A good amount of code is required to handle converting from the Arrow > physical form of various logical types to the Parquet equivalent one, see > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc] > for details > * It would be worth thinking up front about how dictionary-encoded data is > handled both on the Arrow write and Arrow read paths. In parquet-cpp we > initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary > to dense String), and through real world need I was forced to revisit this > (quite painfully) to enable Arrow dictionaries to survive roundtrips to > Parquet format, and also achieve better performance and memory use in both > reads and writes. You can certainly do a dictionary-to-dense conversion like > we did, but you may someday find yourselves doing the same painful refactor > that I did to make dictionary write and read not only more efficient but also > dictionary order preserving. 
> Notes from [~sunchao] : > I roughly skimmed through the C++ implementation and think on the high level > we need to do the following: > # implement a method similar to {{WriteArrow}} in > [column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc]. > We can further break this up into smaller pieces such as: > dictionary/non-dictionary, primitive types, booleans, timestamps, dates, so > on and so forth. > # implement an arrow writer in the parquet crate > [here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow]. > This needs to offer similar APIs as > [writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
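The dictionary-to-dense conversion mentioned in the ARROW-8421 notes above can be sketched as follows. This is an illustration only, not code from parquet-cpp or the Rust parquet crate; the function name and the use of plain slices in place of Arrow arrays are assumptions.

```rust
/// Expand dictionary keys into dense values.
/// `keys[i]` indexes into `dictionary`; `None` is a null slot.
/// Panics if a key is out of range for the dictionary.
pub fn dictionary_to_dense(
    keys: &[Option<usize>],
    dictionary: &[&str],
) -> Vec<Option<String>> {
    keys.iter()
        .map(|key| key.map(|index| dictionary[index].to_string()))
        .collect()
}
```

The dense output repeats each dictionary value once per key, which is the memory and performance cost the notes warn about, and the dictionary order is lost once only the dense values are written.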
[jira] [Updated] (ARROW-8859) [Rust] [Integration Testing] Implement --quiet / verbose correctly
[ https://issues.apache.org/jira/browse/ARROW-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-8859: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] [Integration Testing] Implement --quiet / verbose correctly > -- > > Key: ARROW-8859 > URL: https://issues.apache.org/jira/browse/ARROW-8859 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 3.0.0 > > > The Rust tester has verbose=true hard-coded for now. > When run with {{archery --quiet}}, RustTester should receive a {{quiet: > Bool}} via > [kwargs|https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L335] > somewhere, and we should use that to set the verbose mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4193) [Rust] Add support for decimal data type
[ https://issues.apache.org/jira/browse/ARROW-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-4193: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] Add support for decimal data type > > > Key: ARROW-4193 > URL: https://issues.apache.org/jira/browse/ARROW-4193 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Andy Grove >Priority: Minor > Labels: beginner > Fix For: 3.0.0 > > > We should add {{Decimal(usize,usize)}} to DataType and add the corresponding > array and builder classes. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
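A rough sketch of what a {{Decimal(usize,usize)}} variant from ARROW-4193 above might carry. The issue does not say what the two fields mean; reading them as (precision, scale) is an assumption, and the `DataType` enum and `format_decimal` helper here are hypothetical and much smaller than the real arrow types.

```rust
/// Hypothetical, trimmed-down data type enum.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    /// Assumption: the fields are (precision, scale).
    Decimal(usize, usize),
}

/// Render an unscaled integer as a decimal string using the type's scale.
/// Returns `None` for non-decimal types.
pub fn format_decimal(unscaled: i128, data_type: &DataType) -> Option<String> {
    let scale = match data_type {
        DataType::Decimal(_precision, scale) => *scale,
        _ => return None,
    };
    let sign = if unscaled < 0 { "-" } else { "" };
    let digits = unscaled.unsigned_abs().to_string();
    if scale == 0 {
        return Some(format!("{}{}", sign, digits));
    }
    // Left-pad with zeros so there is at least one digit before the point.
    let padded = format!("{:0>width$}", digits, width = scale + 1);
    let split = padded.len() - scale;
    Some(format!("{}{}.{}", sign, &padded[..split], &padded[split..]))
}
```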
[jira] [Updated] (ARROW-3690) [Rust] Add Rust to the format integration testing
[ https://issues.apache.org/jira/browse/ARROW-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-3690: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] Add Rust to the format integration testing > - > > Key: ARROW-3690 > URL: https://issues.apache.org/jira/browse/ARROW-3690 > Project: Apache Arrow > Issue Type: New Feature > Components: Integration, Rust >Reporter: Chao Sun >Priority: Major > Fix For: 3.0.0 > > > We should add Rust into the integration testing. See [here > title|https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1166] > and [here|https://github.com/apache/arrow/tree/master/integration]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8258) [Rust] [Parquet] ArrowReader fails on some timestamp types
[ https://issues.apache.org/jira/browse/ARROW-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-8258: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] [Parquet] ArrowReader fails on some timestamp types > -- > > Key: ARROW-8258 > URL: https://issues.apache.org/jira/browse/ARROW-8258 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Renjie Liu >Priority: Major > Fix For: 3.0.0 > > > I discovered this bug with this query > {code:java} > > SELECT tpep_pickup_datetime FROM taxi LIMIT 1; > General("InvalidArgumentError(\"column types must match schema types, > expected Timestamp(Microsecond, None) but found UInt64 at column index 0\")") > {code} > The parquet reader detects this schema when reading from the file: > {code:java} > Schema { > fields: [ > Field { name: "tpep_pickup_datetime", data_type: Timestamp(Microsecond, > None), nullable: true, dict_id: 0, dict_is_ordered: false } > ], > metadata: {} > } {code} > The struct array read from the file contains: > {code:java} > [PrimitiveArray > [ > 156731800800, > 156731935700, > 156732009200, > 156732115100, {code} > When the Parquet arrow reader creates the record batch, the following > validation logic fails: > {code:java} > for i in 0..columns.len() { > if columns[i].len() != len { > return Err(ArrowError::InvalidArgumentError( > "all columns in a record batch must have the same > length".to_string(), > )); > } > if columns[i].data_type() != schema.field(i).data_type() { > return Err(ArrowError::InvalidArgumentError(format!( > "column types must match schema types, expected {:?} but found > {:?} at column index {}", > schema.field(i).data_type(), > columns[i].data_type(), > i))); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8853) [Rust] [Integration Testing] Enable Flight tests
[ https://issues.apache.org/jira/browse/ARROW-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-8853: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [Rust] [Integration Testing] Enable Flight tests > > > Key: ARROW-8853 > URL: https://issues.apache.org/jira/browse/ARROW-8853 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3690) [Rust] Add Rust to the format integration testing
[ https://issues.apache.org/jira/browse/ARROW-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204368#comment-17204368 ] Neville Dipale commented on ARROW-3690: --- Kicking this can down to 3.0.0 as I won't complete all the sub-tasks. I'll try to complete what I can for 2.0.0, so we can also update the documentation with what's supported in Rust. > [Rust] Add Rust to the format integration testing > - > > Key: ARROW-3690 > URL: https://issues.apache.org/jira/browse/ARROW-3690 > Project: Apache Arrow > Issue Type: New Feature > Components: Integration, Rust >Reporter: Chao Sun >Priority: Major > Fix For: 3.0.0 > > > We should add Rust into the integration testing. See [here > title|https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1166] > and [here|https://github.com/apache/arrow/tree/master/integration]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9934) [Rust] Shape and stride check in tensor
[ https://issues.apache.org/jira/browse/ARROW-9934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9934. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8129 [https://github.com/apache/arrow/pull/8129] > [Rust] Shape and stride check in tensor > --- > > Key: ARROW-9934 > URL: https://issues.apache.org/jira/browse/ARROW-9934 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Fernando Herrera >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > When creating a tensor there is no check for the supplied shape and stride. > There should be a check before creating the tensor object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
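The kind of check requested in ARROW-9934 above might look like the sketch below. It is not the code from the pull request: the function name, the error type, and the specific rules (matching dimension counts, strides expressed in bytes, and the last addressable element fitting in the buffer) are all assumptions.

```rust
/// Validate a tensor's shape and strides before construction.
/// Strides are assumed to be in bytes; `elem_size` is the element width in bytes.
pub fn check_shape_and_strides(
    shape: &[usize],
    strides: &[usize],
    elem_size: usize,
    buffer_len: usize,
) -> Result<(), String> {
    if shape.len() != strides.len() {
        return Err(format!(
            "shape has {} dimensions but strides has {}",
            shape.len(),
            strides.len()
        ));
    }
    // A tensor with a zero-sized dimension addresses no elements.
    if shape.iter().any(|&dim| dim == 0) {
        return Ok(());
    }
    // Byte offset of the last element reachable with these strides.
    let last_offset: usize = shape
        .iter()
        .zip(strides)
        .map(|(&dim, &stride)| (dim - 1) * stride)
        .sum();
    if last_offset + elem_size > buffer_len {
        return Err(format!(
            "last element ends at byte {} but the buffer has {} bytes",
            last_offset + elem_size,
            buffer_len
        ));
    }
    Ok(())
}
```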
[jira] [Resolved] (ARROW-10016) [Rust] [DataFusion] Implement IsNull and IsNotNull
[ https://issues.apache.org/jira/browse/ARROW-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10016. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8204 [https://github.com/apache/arrow/pull/8204] > [Rust] [DataFusion] Implement IsNull and IsNotNull > -- > > Key: ARROW-10016 > URL: https://issues.apache.org/jira/browse/ARROW-10016 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, DataFusion has the logical operators `IsNull` and `IsNotNull`, but > these operators have no physical implementation. Consequently, they > cannot be used. > The goal of this improvement is to add support for these operators in the > physical plan. > Note that these operators only care about the null bitmap, and thus should be > implementable for all types supported by Arrow. > Both operators should probably return a non-null `BooleanArray`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
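Because the operators in ARROW-10016 above depend only on the null bitmap, they can be sketched without reference to the value type. The functions below are hypothetical and return a plain `Vec<bool>` rather than a BooleanArray; they rely on the Arrow convention that a set bit (least-significant bit first) marks a valid slot and that a missing bitmap means no nulls.

```rust
/// True if slot `i` is valid according to an LSB-ordered validity bitmap.
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// One non-null boolean per slot: true where the slot is null.
pub fn is_null(validity: Option<&[u8]>, len: usize) -> Vec<bool> {
    match validity {
        Some(bitmap) => (0..len).map(|i| !is_valid(bitmap, i)).collect(),
        // No bitmap means every slot is valid.
        None => vec![false; len],
    }
}

/// One non-null boolean per slot: true where the slot is valid.
pub fn is_not_null(validity: Option<&[u8]>, len: usize) -> Vec<bool> {
    is_null(validity, len).into_iter().map(|null| !null).collect()
}
```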
[jira] [Resolved] (ARROW-10044) [Rust] Improve README
[ https://issues.apache.org/jira/browse/ARROW-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10044. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8224 [https://github.com/apache/arrow/pull/8224] > [Rust] Improve README > - > > Key: ARROW-10044 > URL: https://issues.apache.org/jira/browse/ARROW-10044 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10095) [Rust] [Parquet] Update for IPC changes
[ https://issues.apache.org/jira/browse/ARROW-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10095. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8274 [https://github.com/apache/arrow/pull/8274] > [Rust] [Parquet] Update for IPC changes > --- > > Key: ARROW-10095 > URL: https://issues.apache.org/jira/browse/ARROW-10095 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The IPC changes made to comply with MetadataVersion 4 broke the rust-parquet > writer branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions
[ https://issues.apache.org/jira/browse/ARROW-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-9981: - Assignee: Neville Dipale > [Rust] Allow configuring flight IPC with IpcWriteOptions > > > Key: ARROW-9981 > URL: https://issues.apache.org/jira/browse/ARROW-9981 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.0 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Minor > > We have introduced an IPC write option, but we use the default for the > arrow-flight crate, which is not ideal. Change this to allow configuring > writer options. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9361) [Rust] Move other array types into their own modules
[ https://issues.apache.org/jira/browse/ARROW-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9361: -- Fix Version/s: 3.0.0 > [Rust] Move other array types into their own modules > > > Key: ARROW-9361 > URL: https://issues.apache.org/jira/browse/ARROW-9361 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Fix For: 3.0.0 > > > The array module is getting too big to be practical. We should leave the > core types like the Array trait in `array.rs` and move each array type into > its own sub-module as we did while implementing the Union array. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9361) [Rust] Move other array types into their own modules
[ https://issues.apache.org/jira/browse/ARROW-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9361: -- Priority: Blocker (was: Major) > [Rust] Move other array types into their own modules > > > Key: ARROW-9361 > URL: https://issues.apache.org/jira/browse/ARROW-9361 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Blocker > Fix For: 3.0.0 > > > The array module is getting too big to be practical. We should leave the > core types like the Array trait in `array.rs` and move each array type into > its own sub-module as we did while implementing the Union array. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7700) [Rust] All array types should have iterators and FromIterator support.
[ https://issues.apache.org/jira/browse/ARROW-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202323#comment-17202323 ] Neville Dipale commented on ARROW-7700: --- [~jorgecarleitao] might be related to what you're working on > [Rust] All array types should have iterators and FromIterator support. > -- > > Key: ARROW-7700 > URL: https://issues.apache.org/jira/browse/ARROW-7700 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Reporter: Andy Thomason >Priority: Major > Labels: Usability > > Array types should have an Iterable trait that generates plain or nullable > iterators. > {code} > pub trait Iterable<'a> > where Self::IterType: std::iter::Iterator > { > type IterType; > fn iter(&'a self) -> Self::IterType; > fn iter_nulls(&'a self) -> NullableIterator; > } > {code} > IterType depends on the array type from standard slice iterators for > primitive types, string iterators for UTF8 types and composite iterators > (generating other iterators) for list, struct and dictionary types. > The NullableIterator type should bundle a null bitmap pointer with another > iterator type to form a composite iterator that returns an option: > {code} > /// Convert any iterator to a nullable iterator by using the null bitmap. > #[derive(Debug, PartialEq, Clone)] > pub struct NullableIterator { > iter: T, > i: usize, > null_bitmap: *const u8, > } > impl NullableIterator { > fn from(iter: T, null_bitmap: , offset: usize) -> Self; > } > {code} > For more details, some exploratory work has been done here: > https://github.com/andy-thomason/arrow/blob/ARROW-iterators/rust/arrow/src/array/array.rs#L1711 -- This message was sent by Atlassian Jira (v8.3.4#803005)
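The composite iterator proposed in ARROW-7700 above can be sketched in safe Rust. This is not the exploratory code linked from the issue: it swaps the raw `*const u8` pointer for an optional borrowed slice, and the constructor name and signature are assumptions. Only the idea is taken from the issue, namely pairing a value iterator with the null bitmap to yield an `Option` per slot.

```rust
/// Wraps any iterator and yields `None` for slots the validity bitmap marks as null.
#[derive(Debug, Clone)]
pub struct NullableIterator<'a, I> {
    iter: I,
    i: usize,
    null_bitmap: Option<&'a [u8]>,
}

impl<'a, I: Iterator> NullableIterator<'a, I> {
    /// `offset` is the bitmap position of the first element produced by `iter`.
    pub fn new(iter: I, null_bitmap: Option<&'a [u8]>, offset: usize) -> Self {
        Self { iter, i: offset, null_bitmap }
    }
}

impl<'a, I: Iterator> Iterator for NullableIterator<'a, I> {
    type Item = Option<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let value = self.iter.next()?;
        // A set bit (LSB first) means the slot is valid; no bitmap means no nulls.
        let valid = match self.null_bitmap {
            Some(bits) => ((bits[self.i / 8] >> (self.i % 8)) & 1) == 1,
            None => true,
        };
        self.i += 1;
        Some(if valid { Some(value) } else { None })
    }
}
```

Wrapping `values.iter().copied()` this way gives callers the usual iterator adapters (`map`, `filter`, `collect`) over nullable data without manual bitmap indexing.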
[jira] [Commented] (ARROW-6892) [Rust] [DataFusion] Implement optimizer rule to remove redundant projections
[ https://issues.apache.org/jira/browse/ARROW-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202322#comment-17202322 ] Neville Dipale commented on ARROW-6892: --- [~andygrove] [~jorgecarleitao] [~alamb] do you know if this is resolved? There have been a lot of improvements to the optimizer, so I'm checking whether they perhaps included this. > [Rust] [DataFusion] Implement optimizer rule to remove redundant projections > > > Key: ARROW-6892 > URL: https://issues.apache.org/jira/browse/ARROW-6892 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Priority: Minor > > Currently we have code in the SQL query planner that wraps aggregate queries > in a projection (if needed) to preserve the order of the final results. This > is needed because the aggregate query execution always returns a result with > grouping expressions first and then aggregate expressions. > It would be better (simpler, more readable code) to always wrap aggregates in > projections and have an optimizer rule to remove redundant projections. There > are likely other use cases where redundant projections might exist too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
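The optimizer rule proposed in ARROW-6892 above could take roughly the following shape. The `Plan` enum is a hypothetical, heavily simplified stand-in for DataFusion's logical plan, and treating a projection as redundant only when it lists exactly its input's columns in the same order is an assumption.

```rust
/// Hypothetical, minimal logical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { columns: Vec<String> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// Columns produced by this node, in output order.
    fn output_columns(&self) -> &[String] {
        match self {
            Plan::Scan { columns } | Plan::Projection { columns, .. } => columns.as_slice(),
        }
    }
}

/// Drop projections that reproduce their input's columns unchanged.
pub fn remove_redundant_projections(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => {
            // Optimize bottom-up so nested redundant projections collapse too.
            let input = remove_redundant_projections(*input);
            if input.output_columns() == columns.as_slice() {
                input
            } else {
                Plan::Projection { columns, input: Box::new(input) }
            }
        }
        other => other,
    }
}
```

A projection that reorders columns is kept, which matches the motivation in the issue: the wrapper exists to fix the output order and is only removable when it changes nothing.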