[jira] [Created] (ARROW-11024) Writing List to parquet sometimes writes wrong data

2020-12-23 Thread George Deamont (Jira)
George Deamont created ARROW-11024:
--

 Summary: Writing List to parquet sometimes writes wrong 
data
 Key: ARROW-11024
 URL: https://issues.apache.org/jira/browse/ARROW-11024
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
Reporter: George Deamont


 

When writing tables that contain List columns, the data is sometimes 
written incorrectly. The code sample below reproduces the problem. No 
exceptions are raised, but an equality check via equals() returns False 
for the second test case.

 
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Input records look like this...
# [
# [{'x':'abc','y':'abc'}],
# [{'x':'abc','y':'abc'}],
# [{'x':'abc','y':'abc'}],
# ...
# [{'x':'abc','y':'gcb'}],
# [{'x':'abc','y':'gcb'}],
# [{'x':'abc','y':'gcb'}],
# ]

# Write a small amount of data to a parquet file, and read it back. In this
# case, both tables are equal.
data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
array1 = pa.array(data1)
table1 = pa.table([array1],names=['column'])
pq.write_table(table1,'temp1.parquet')
table1_1 = pq.read_table('temp1.parquet')
print(table1_1.equals(table1))

# Write a larger amount of data to a parquet file, and read it back. In this
# case, the tables are not equal.
data2 = data1*100
array2 = pa.array(data2)
table2 = pa.table([array2],names=['column'])
pq.write_table(table2,'temp2.parquet')
table2_1 = pq.read_table('temp2.parquet')
print(table2_1.equals(table2))

{code}
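
To narrow down where the round-tripped data diverges, the rows can be compared one by one. The following is a minimal sketch using only the standard library; first_mismatch is a hypothetical helper, not part of pyarrow, and the lists below are stand-in data rather than output from the failing case. With pyarrow, the row lists could presumably be obtained via table.column('column').to_pylist().

```python
from itertools import zip_longest

# Sentinel marking a row that exists in only one of the two sequences.
_MISSING = object()


def first_mismatch(expected, actual):
    """Return (index, expected_row, actual_row) for the first differing row.

    Returns None when both sequences are identical. A length difference
    counts as a mismatch at the first index past the shorter sequence.
    """
    pairs = zip_longest(expected, actual, fillvalue=_MISSING)
    for index, (exp_row, act_row) in enumerate(pairs):
        if exp_row != act_row:
            return index, exp_row, act_row
    return None


# Stand-in data shaped like the records in the report (assumption: a
# corrupted read would show up as a row with the wrong 'y' value).
expected = [[{'x': 'abc', 'y': 'abc'}]] * 3 + [[{'x': 'abc', 'y': 'gcb'}]] * 3
actual = list(expected)
actual[4] = [{'x': 'abc', 'y': 'abc'}]

print(first_mismatch(expected, actual))
```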
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11023) [C++

2020-12-23 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-11023:


 Summary: [C++
 Key: ARROW-11023
 URL: https://issues.apache.org/jira/browse/ARROW-11023
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yibo Cai








[jira] [Created] (ARROW-11022) [Rust] [DataFusion] Upgrade to tokio 1.0

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11022:
--

 Summary: [Rust] [DataFusion] Upgrade to tokio 1.0
 Key: ARROW-11022
 URL: https://issues.apache.org/jira/browse/ARROW-11022
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


https://tokio.rs/blog/2020-12-tokio-1-0





[jira] [Created] (ARROW-11021) [Rust] Update dependencies in Rust

2020-12-23 Thread Jira
Daniël Heres created ARROW-11021:


 Summary: [Rust] Update dependencies in Rust
 Key: ARROW-11021
 URL: https://issues.apache.org/jira/browse/ARROW-11021
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11020:
--

 Summary: [Rust] [DataFusion] Implement better tests for ParquetExec
 Key: ARROW-11020
 URL: https://issues.apache.org/jira/browse/ARROW-11020
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Implement better tests for ParquetExec





[jira] [Created] (ARROW-11019) [Rust] [DataFusion] Add support for reading partitioned Parquet files

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11019:
--

 Summary: [Rust] [DataFusion] Add support for reading partitioned 
Parquet files
 Key: ARROW-11019
 URL: https://issues.apache.org/jira/browse/ARROW-11019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Add support for reading Parquet files that are partitioned by key where the 
files are under a directory structure based on partition keys and values.

/path/to/files/KEY1=value/KEY2=value/files
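
As an illustration of the directory layout above, the partition keys and values can be recovered from a file path. This is a minimal sketch in Python rather than the DataFusion implementation; parse_partitions and the example file name are hypothetical, and the last path component is assumed to be a data file.

```python
from pathlib import PurePosixPath


def parse_partitions(file_path):
    """Extract KEY=value directory segments from a partitioned file path.

    The final path component is treated as the data file and skipped.
    Keys are returned in the order they appear in the path.
    """
    partitions = {}
    for segment in PurePosixPath(file_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical path following the layout described in the issue.
example = "/path/to/files/KEY1=a/KEY2=b/part-0.parquet"
print(parse_partitions(example))
```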





[jira] [Created] (ARROW-11018) [Rust][DataFusion] Add null count column statistics

2020-12-23 Thread Jira
Daniël Heres created ARROW-11018:


 Summary: [Rust][DataFusion] Add null count column statistics 
 Key: ARROW-11018
 URL: https://issues.apache.org/jira/browse/ARROW-11018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres


Add an option to provide column statistics, starting with null count.

This is an important step toward more advanced cost-based optimization.
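
One way a null count could feed into cost estimation is as a selectivity estimate for an IS NULL filter. The sketch below is written in Python for illustration only; ColumnStatistics and is_null_selectivity are hypothetical names and do not reflect the DataFusion API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStatistics:
    # None means the data source did not provide this statistic.
    null_count: Optional[int] = None


def is_null_selectivity(stats, num_rows):
    """Estimate the fraction of rows that pass an IS NULL filter.

    Returns None when the statistic is unavailable or the table is
    empty, so a planner can fall back to a default guess.
    """
    if stats.null_count is None or num_rows == 0:
        return None
    return stats.null_count / num_rows


print(is_null_selectivity(ColumnStatistics(null_count=25), 100))
```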





[jira] [Created] (ARROW-11017) [Rust] [DataFusion] Add support for Parquet schema merging

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11017:
--

 Summary: [Rust] [DataFusion] Add support for Parquet schema 
merging 
 Key: ARROW-11017
 URL: https://issues.apache.org/jira/browse/ARROW-11017
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Andy Grove


Add support for Parquet schema merging so that we can read data sets where some 
files have additional columns.
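
The merge described above can be sketched as a union of the per-file column lists. This is an illustrative Python sketch, not the DataFusion design; merge_schemas is a hypothetical helper, schemas are modeled as lists of (name, type) pairs, and rejecting conflicting types is an assumption about the desired behavior.

```python
def merge_schemas(schemas):
    """Union several schemas, each a list of (column_name, type_name) pairs.

    Columns keep the order in which they are first seen. A column that
    appears with two different types raises ValueError.
    """
    merged = {}
    for schema in schemas:
        for name, dtype in schema:
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"column {name!r} has conflicting types: "
                    f"{merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return list(merged.items())


# Hypothetical schemas: the second file has one additional column.
file_a = [("id", "int64"), ("name", "utf8")]
file_b = [("id", "int64"), ("name", "utf8"), ("score", "float64")]
print(merge_schemas([file_a, file_b]))
```

Rows read from a file that lacks a merged column would then need to be filled with nulls for that column.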





[jira] [Created] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups

2020-12-23 Thread Andy Grove (Jira)
Andy Grove created ARROW-11016:
--

 Summary: [Rust] Parquet ArrayReader should allow reading a subset 
of row groups
 Key: ARROW-11016
 URL: https://issues.apache.org/jira/browse/ARROW-11016
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove


Parquet ArrayReader currently only supports reading an entire file from start 
to finish and does not allow selectively reading a subset of row groups. This 
prevents us from parallelizing work across threads when processing a single 
parquet file.
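
To illustrate the kind of parallelism this would enable, the row groups of a file can be split into contiguous ranges, one per worker thread. This is a Python sketch of the idea only; split_row_groups and read_row_groups are hypothetical, and read_row_groups is a placeholder that stands in for an actual reader.

```python
from concurrent.futures import ThreadPoolExecutor


def split_row_groups(num_row_groups, num_workers):
    """Split row group indices into contiguous, near-equal ranges."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(num_row_groups, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional row group.
        size = base + (1 if worker < extra else 0)
        if size:
            ranges.append(range(start, start + size))
        start += size
    return ranges


def read_row_groups(indices):
    # Placeholder: a real reader would decode only these row groups.
    return list(indices)


chunks = split_row_groups(10, 4)
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(read_row_groups, chunks))
print(results)
```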





[jira] [Created] (ARROW-11015) [CI][Gandiva] Move Gandiva nightly build from Travis to GitHub Actions

2020-12-23 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-11015:
--

 Summary: [CI][Gandiva] Move Gandiva nightly build from Travis to 
GitHub Actions
 Key: ARROW-11015
 URL: https://issues.apache.org/jira/browse/ARROW-11015
 Project: Apache Arrow
  Issue Type: Task
Reporter: Projjal Chanda
Assignee: Projjal Chanda





