Greetings,

We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
They have about 20,000 columns with integer column labels, and data with a 
dtype of float32.

We’d like to store these dataframes with Parquet, using its ability to read a
subset of columns and to store metadata with the file.

We’ve found reading performance lower than expected compared to the
published benchmarks (e.g. Wes’ blog post).

Using a modified version of his script we did reproduce his results (~1 GB/s
for high-entropy data, no dictionary encoding, on a MacBook Pro).
 
But there seem to be three factors that contribute to the slowdown on our
datasets:

- A DateTimeIndex is much slower than an int index (we see about a factor of 5).
- The number of columns impacts reading speed significantly (about a factor
of 2 going from 16 to 16,000 columns).
- Passing use_pandas_metadata=True slows down reading significantly (about
40%), and it appears unnecessary?

Are there ways we could speedup the reading? Should we use a different layout?

Thanks for your help and insights!

Cheers,
Maarten 


ps. the routines we used:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
    # Write uncompressed, without dictionary encoding
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fname, use_dictionary=False, compression=None)

def read_arrow_parquet(fname: str) -> pd.DataFrame:
    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
    return table.to_pandas()
