Marco Neumann created ARROW-5028: ------------------------------------ Summary: Arrow->Parquet store drops and corrupts values Key: ARROW-5028 URL: https://issues.apache.org/jira/browse/ARROW-5028 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1, 0.13.0 Reporter: Marco Neumann Attachments: dct.pickle.gz
I am sorry if this bugs feels rather long and the reproduction data is large, but I was not able to reduce the data even further while still triggering the problem. I was able to trigger this behavior on master and on {{0.11.1}}. {code:python} import io import os.path import pickle import numpy as np import pyarrow as pa import pyarrow.parquet as pq def dct_to_table(index_dct): labeled_array = pa.array(np.array(list(index_dct.keys()))) partition_array = pa.array(np.array(list(index_dct.values()))) return pa.Table.from_arrays( [labeled_array, partition_array], names=['a', 'b'] ) def check_pq_nulls(data): fp = io.BytesIO(data) pfile = pq.ParquetFile(fp) assert pfile.num_row_groups == 1 md = pfile.metadata.row_group(0) col = md.column(1) assert col.path_in_schema == 'b.list.item' assert col.statistics.null_count == 0 # fails def roundtrip(table): buf = pa.BufferOutputStream() pq.write_table(table, buf) data = buf.getvalue().to_pybytes() # this fails: # check_pq_nulls(data) reader = pa.BufferReader(data) return pq.read_table(reader) with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: dct = pickle.load(fp) # this does NOT help: # pa.set_cpu_count(1) # import gc; gc.disable() table = dct_to_table(dct) # this fixes the issue: # table = pa.Table.from_pandas(table.to_pandas()) table2 = roundtrip(table) assert table.column('b').null_count == 0 assert table2.column('b').null_count == 0 # fails # if table2 is converted to pandas, you can also observe that some values at the end of column b are `['']` which clearly is not present in the original data {code} I would also be thankful for any pointers on where the bug comes from or on who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)