[jira] [Created] (ARROW-11024) Writing List to parquet sometimes writes wrong data
George Deamont created ARROW-11024:

Summary: Writing List to parquet sometimes writes wrong data
Key: ARROW-11024
URL: https://issues.apache.org/jira/browse/ARROW-11024
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
Reporter: George Deamont

Sometimes when writing tables that contain List columns, the data is written incorrectly. Here is a code sample that produces the error. There are no exceptions raised here, but a simple equality check via equals() yields False for the second test case...

{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# Input records look like this...
# [
#   [{'x':'abc','y':'abc'}],
#   [{'x':'abc','y':'abc'}],
#   [{'x':'abc','y':'abc'}],
#   ...
#   [{'x':'abc','y':'gcb'}],
#   [{'x':'abc','y':'gcb'}],
#   [{'x':'abc','y':'gcb'}],
# ]

# Write small amount of data to parquet file, and read it back.
# In this case, both tables are equal.
data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
array1 = pa.array(data1)
table1 = pa.table([array1],names=['column'])
pq.write_table(table1,'temp1.parquet')
table1_1 = pq.read_table('temp1.parquet')
print(table1_1.equals(table1))

# Write larger amount of data to parquet file, and read it back.
# In this case, the tables are not equal.
data2 = data1*100
array2 = pa.array(data2)
table2 = pa.table([array2],names=['column'])
pq.write_table(table2,'temp2.parquet')
table2_1 = pq.read_table('temp2.parquet')
print(table2_1.equals(table2))
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11023) [C++
Yibo Cai created ARROW-11023:

Summary: [C++
Key: ARROW-11023
URL: https://issues.apache.org/jira/browse/ARROW-11023
Project: Apache Arrow
Issue Type: Bug
Reporter: Yibo Cai
[jira] [Created] (ARROW-11022) [Rust] [DataFusion] Upgrade to tokio 1.0
Andy Grove created ARROW-11022:

Summary: [Rust] [DataFusion] Upgrade to tokio 1.0
Key: ARROW-11022
URL: https://issues.apache.org/jira/browse/ARROW-11022
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Andy Grove
Fix For: 3.0.0

https://tokio.rs/blog/2020-12-tokio-1-0
[jira] [Created] (ARROW-11021) [Rust] Update dependencies in Rust
Daniël Heres created ARROW-11021:

Summary: [Rust] Update dependencies in Rust
Key: ARROW-11021
URL: https://issues.apache.org/jira/browse/ARROW-11021
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres
[jira] [Created] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec
Andy Grove created ARROW-11020:

Summary: [Rust] [DataFusion] Implement better tests for ParquetExec
Key: ARROW-11020
URL: https://issues.apache.org/jira/browse/ARROW-11020
Project: Apache Arrow
Issue Type: Test
Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 3.0.0

Implement better tests for ParquetExec
[jira] [Created] (ARROW-11019) [Rust] [DataFusion] Add support for reading partitioned Parquet files
Andy Grove created ARROW-11019:

Summary: [Rust] [DataFusion] Add support for reading partitioned Parquet files
Key: ARROW-11019
URL: https://issues.apache.org/jira/browse/ARROW-11019
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Andy Grove

Add support for reading Parquet files that are partitioned by key where the files are under a directory structure based on partition keys and values.

/path/to/files/KEY1=value/KEY2=value/files
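The directory layout described in ARROW-11019 encodes partition values in the path itself. As a rough illustration only (this is not DataFusion code, and `parse_partition_keys` is a hypothetical helper name), the KEY=value components could be recovered from a file path like this:

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_partition_keys(path: str) -> Dict[str, str]:
    """Collect KEY=value directory components from a file path.

    Hypothetical helper for illustration; assumes POSIX-style paths and
    that every directory component containing '=' is a partition.
    """
    partitions: Dict[str, str] = {}
    # Only directory components are inspected, not the file name itself.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


example = "/path/to/files/KEY1=a/KEY2=b/part-0.parquet"
print(parse_partition_keys(example))
```

A reader built on this idea would attach the recovered keys to each file's rows as extra columns, so that filters on partition keys can skip whole directories.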
[jira] [Created] (ARROW-11018) [Rust][DataFusion] Add null count column statistics
Daniël Heres created ARROW-11018:

Summary: [Rust][DataFusion] Add null count column statistics
Key: ARROW-11018
URL: https://issues.apache.org/jira/browse/ARROW-11018
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres

Add option to provide column statistics, on null count first. This is one important step to provide more advanced cost based optimization.
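To make the link between a null count and cost-based optimization concrete, here is a minimal sketch. It assumes a statistics record with an optional null count; the names `ColumnStatistics` and `estimate_not_null_rows` are illustrative and are not taken from DataFusion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStatistics:
    # None means the data source did not report a null count.
    null_count: Optional[int] = None


def estimate_not_null_rows(num_rows: int, stats: ColumnStatistics) -> int:
    """Estimate how many rows survive an `IS NOT NULL` filter on the column."""
    if stats.null_count is None:
        # Without statistics, fall back to the conservative upper bound.
        return num_rows
    return num_rows - stats.null_count


values = [1, None, 3, None, 5]
stats = ColumnStatistics(null_count=sum(v is None for v in values))
print(estimate_not_null_rows(len(values), stats))
```

An optimizer could use such an estimate when ordering joins or filters, for example, which is why the statistic stays optional rather than mandatory for every source.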
[jira] [Created] (ARROW-11017) [Rust] [DataFusion] Add support for Parquet schema merging
Andy Grove created ARROW-11017:

Summary: [Rust] [DataFusion] Add support for Parquet schema merging
Key: ARROW-11017
URL: https://issues.apache.org/jira/browse/ARROW-11017
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Andy Grove

Add support for Parquet schema merging so that we can read data sets where some files have additional columns.
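One way to picture schema merging is as an ordered union of fields across files. The sketch below models a schema as a list of (name, type) pairs, which is a simplification chosen for illustration; `merge_schemas` is a hypothetical function, not the DataFusion API, and raising an error on a type conflict is an assumed policy.

```python
from typing import Dict, List, Tuple

# Simplified stand-in for a schema: ordered (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Union the fields of all schemas, keeping first-seen column order."""
    merged: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema:
            existing = merged.setdefault(name, dtype)
            if existing != dtype:
                # Assumed policy: conflicting types are an error.
                raise ValueError(
                    f"column {name!r} has conflicting types: {existing} vs {dtype}"
                )
    return list(merged.items())


file_a = [("id", "int64"), ("name", "utf8")]
file_b = [("id", "int64"), ("name", "utf8"), ("score", "float64")]
print(merge_schemas([file_a, file_b]))
```

Files that lack a column present in the merged schema would then be read with that column filled with nulls.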
[jira] [Created] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups
Andy Grove created ARROW-11016:

Summary: [Rust] Parquet ArrayReader should allow reading a subset of row groups
Key: ARROW-11016
URL: https://issues.apache.org/jira/browse/ARROW-11016
Project: Apache Arrow
Issue Type: New Feature
Components: Rust
Reporter: Andy Grove

Parquet ArrayReader currently only supports reading an entire file from start to finish and does not allow selectively reading a subset of row groups. This prevents us from parallelizing work across threads when processing a single parquet file.
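The parallelism that ARROW-11016 is after can be sketched as assigning disjoint sets of row group indices to worker threads. In the sketch below, `split_row_groups` and `read_row_groups` are hypothetical names; the latter is a stand-in that only counts its indices, where a real reader would decode the selected row groups.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_row_groups(num_row_groups: int, num_workers: int) -> List[List[int]]:
    """Assign row group indices to workers round-robin."""
    return [
        list(range(worker, num_row_groups, num_workers))
        for worker in range(num_workers)
    ]


def read_row_groups(indices: List[int]) -> int:
    # Hypothetical stand-in: a real reader would decode only these row groups.
    return len(indices)


assignments = split_row_groups(num_row_groups=10, num_workers=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(read_row_groups, assignments))

print(assignments)
print(counts)
```

Round-robin assignment is only one choice; contiguous ranges per worker may suit sequential I/O better.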
[jira] [Created] (ARROW-11015) [CI][Gandiva] Move gandiva nightly build from travis to github action
Projjal Chanda created ARROW-11015:

Summary: [CI][Gandiva] Move gandiva nightly build from travis to github action
Key: ARROW-11015
URL: https://issues.apache.org/jira/browse/ARROW-11015
Project: Apache Arrow
Issue Type: Task
Reporter: Projjal Chanda
Assignee: Projjal Chanda