Max Firman created ARROW-7350:
---------------------------------

             Summary: [Python] Parquet file metadata min and max statistics not 
decoded from bytes for Decimal data types
                 Key: ARROW-7350
                 URL: https://issues.apache.org/jira/browse/ARROW-7350
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Max Firman


Parquet file metadata for Decimal type columns contains min and max statistics 
that are returned as raw bytes instead of being decoded into Decimal values. 
This causes issues in dependent libraries such as Dask (see 
[https://github.com/dask/dask/issues/5647]).

 
{code:python|title=Reproducible example|borderStyle=solid}
from decimal import Decimal
import random

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

NUM_DATA_POINTS_PER_PARTITION = 25

random.seed(0)
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for _ in range(NUM_DATA_POINTS_PER_PARTITION)
]

df = pd.DataFrame(data1)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_data.parquet')

parquet_file = pq.ParquetFile('my_data.parquet')

stats = parquet_file.metadata.row_group(0).column(0).statistics

# AssertionError here: min has type bytes rather than Decimal
assert isinstance(stats.min, Decimal)
assert isinstance(stats.max, Decimal)

{code}
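
Until the statistics are decoded by the library, a possible workaround is to 
decode the raw bytes by hand. The sketch below assumes the statistic is stored 
the way the Parquet format describes DECIMAL values in byte arrays: the unscaled 
integer in big-endian two's complement, with the scale taken from the column 
type. The helper name decode_parquet_decimal is hypothetical and not part of 
pyarrow.

```python
from decimal import Decimal


def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a DECIMAL statistic from raw bytes (hypothetical helper).

    Assumes `raw` holds the unscaled value as a big-endian two's
    complement integer and `scale` is the number of fractional digits.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))


# 12345 with scale 2 represents 123.45
example = decode_parquet_decimal((12345).to_bytes(4, "big", signed=True), 2)
```

In the example above, the scale could presumably be read from the Arrow type, 
for instance table.schema.field("col1").type.scale, and the bytes from 
statistics.min and statistics.max.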
 
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
