Greetings,
We have Pandas DataFrames with typically about 6,000 rows and a DatetimeIndex.
They have about 20,000 columns with integer column labels, and the data has a
dtype of float32.
We’d like to store these dataframes in Parquet, making use of the ability to
read a subset of columns and to store metadata with the file.
We’ve found the reading performance lower than expected compared to the
published benchmarks (e.g. Wes’ blog post).
Using a modified version of his script we did reproduce his results (~1 GB/s
for high-entropy data, no dictionary encoding, on a MacBook Pro).
But there seem to be three factors that contribute to the slowdown for our
datasets:
- A DatetimeIndex is much slower than an int index (we see about a factor of 5).
- The number of columns impacts reading speed significantly (a factor of ~2
going from 16 to 16,000 columns).
- Passing ‘use_pandas_metadata=True’ slows down reading significantly (about
40%) and appears unnecessary?
Are there ways we could speed up the reading? Should we use a different layout?
Thanks for your help and insights!
Cheers,
Maarten
ps. the routines we used:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fname, use_dictionary=False, compression=None)

def read_arrow_parquet(fname: str) -> pd.DataFrame:
    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
    df = table.to_pandas()
    return df