Python 3.7.6 (default, Jan 30 2020, 10:29:04)
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
>>> print(pyarrow.__version__)
0.17.1

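DECIMAL values written to Parquet can come back corrupted depending on very
subtle changes to the data set. In the example below, the only difference
between the failing case and the working case is a single value (1.111 vs.
1.11), yet the two largest values are read back incorrectly in the failing
case. Example: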

#!/bin/python3

import pandas as pd
import decimal
import pyarrow.parquet as pq

#$ python3
# Python 3.7.6 (default, Jan 30 2020, 10:29:04)
# [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
# >>> print(pyarrow.__version__)
#  0.17.1

# Results in unexpected output
df = pd.DataFrame({"values": [
    decimal.Decimal('9223372036854775808'),
    decimal.Decimal('18446744073709551616'),
    decimal.Decimal('2147483648'),
    decimal.Decimal('1.111'),
    decimal.Decimal('-2'),
    decimal.Decimal('0'),
]})

df.to_parquet("/tmp/f")
pq_file = pq.ParquetFile("/tmp/f")
print(pq_file.read().to_pandas())

# Values read:
# -221360928884514619.392, -442721857769029238.784, 2147483648.000, 1.111, -2.000, 0.000

# Results in expected output (only difference is 1.11 vs. 1.111)
df = pd.DataFrame({"values": [
    decimal.Decimal('9223372036854775808'),
    decimal.Decimal('18446744073709551616'),
    decimal.Decimal('2147483648'),
    decimal.Decimal('1.11'),
    decimal.Decimal('-2'),
    decimal.Decimal('0'),
]})

df.to_parquet("/tmp/f")
pq_file = pq.ParquetFile("/tmp/f")
print(pq_file.read().to_pandas())

# Values read:
# 9223372036854775808.00, 18446744073709551616.00, 2147483648.00, 1.11, -2.00, 0.00
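
One observation that may help narrow this down. It comes only from arithmetic on the values above and is not a confirmed diagnosis of pyarrow internals. With scale 3, the failing data set needs precision 23, while the working one (scale 2) needs precision 22. The two corrupted values are exactly what you get if the unscaled integers (value * 10^scale) wrap around as signed 72-bit (9-byte) two's-complement numbers. The 9-byte width is an assumption chosen because it reproduces the output; the helper names below are hypothetical and use only the standard library.

```python
import decimal


def required_precision_and_scale(values):
    """Return the smallest (precision, scale) that holds every value exactly."""
    max_int_digits = 0
    max_scale = 0
    for v in values:
        _, digits, exponent = v.as_tuple()
        # Digits after the decimal point.
        max_scale = max(max_scale, max(-exponent, 0))
        # Digits before the decimal point.
        max_int_digits = max(max_int_digits, max(len(digits) + exponent, 0))
    return max_int_digits + max_scale, max_scale


def wrap_signed(unscaled, n_bytes):
    """Wrap an integer into a signed two's-complement field of n_bytes bytes."""
    bits = 8 * n_bytes
    wrapped = unscaled % (1 << bits)
    if wrapped >= 1 << (bits - 1):
        wrapped -= 1 << bits
    return wrapped


def simulate_truncated_storage(value, scale, n_bytes):
    """Store value as an unscaled integer in n_bytes bytes, then read it back."""
    unscaled = int(value.scaleb(scale))
    return decimal.Decimal(wrap_signed(unscaled, n_bytes)).scaleb(-scale)


failing_values = [
    decimal.Decimal('9223372036854775808'),
    decimal.Decimal('18446744073709551616'),
    decimal.Decimal('2147483648'),
    decimal.Decimal('1.111'),
    decimal.Decimal('-2'),
    decimal.Decimal('0'),
]

precision, scale = required_precision_and_scale(failing_values)
print("required precision/scale:", precision, scale)

# ASSUMPTION: 9 bytes is picked only because it matches the observed output.
for v in failing_values:
    print(v, "->", simulate_truncated_storage(v, scale, n_bytes=9))
```

Under that assumption the simulated read-back matches the "Values read" of the failing case for all six values, which suggests that the width or precision chosen for the column was too small for the two largest values once the scale went from 2 to 3.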
