I want to make sure a few points of my understanding are correct in this
thread. There are two ways to read a Parquet file in C++: through
ParquetFile/read_table, or through ParquetDataset. For the former, the
parallelism is per column, because read_table simply passes all row group
indices to DecodeRowGroups in reader.cc; there is no row-group-level
parallelism. For the latter, the parallelism is both per column and per
row group (i.e. per ColumnChunk), according to RowGroupGenerator in
file_parquet.cc. In Python, these two paths are also what the
use_legacy_dataset flag selects between. If my understanding is correct,
I think this difference would be better explained in the docs to avoid
confusion; I had to dig through the code to understand it.
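
To make sure I am looking at the right code paths, here is a minimal
sketch of the two ways as I understand them (names checked against the
headers I have; exact signatures and option structs may differ across
Arrow versions):

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Path 1: parquet::arrow::FileReader::ReadTable. With use_threads,
// columns are decoded in parallel, but all row group indices end up
// in a single DecodeRowGroups call (reader.cc).
arrow::Result<std::shared_ptr<arrow::Table>> ReadWholeFile(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  reader->set_use_threads(true);
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return table;
}

// Path 2: the Datasets API, where RowGroupGenerator (file_parquet.cc)
// can overlap work across row groups as well as columns.
arrow::Result<std::shared_ptr<arrow::Table>> ReadViaDataset(
    const std::string& path) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, {path}, format, arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->UseThreads(true));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}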

I was also wondering how pre_buffer works. Will coalescing ColumnChunk
ranges hurt parallelism, or can a huge coalesced range still be read in
parallel? To me, coalescing and parallel reads look like a tradeoff on S3.
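
For context, here is roughly how I would enable pre_buffer (a sketch from
my reading of parquet/properties.h and arrow/io/caching.h; the CacheOptions
fields are my assumption about which knobs govern coalescing, and the
values below are only illustrative). Part of what I am asking is whether
range_size_limit is what keeps a coalesced range from turning into one
giant serial read:

#include <memory>
#include <string>

#include <arrow/io/api.h>
#include <arrow/io/caching.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

arrow::Status OpenWithPreBuffer(
    const std::string& path,
    std::unique_ptr<parquet::arrow::FileReader>* out) {
  parquet::ArrowReaderProperties props(/*use_threads=*/true);
  props.set_pre_buffer(true);  // coalesce ColumnChunk byte ranges before reading

  // Assumption: these CacheOptions fields control the coalescing.
  arrow::io::CacheOptions cache = arrow::io::CacheOptions::Defaults();
  cache.hole_size_limit = 8192;               // merge ranges whose gap is at most this
  cache.range_size_limit = 32 * 1024 * 1024;  // cap the size of one coalesced read
  props.set_cache_options(cache);

  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  return builder.properties(props).Build(out);
}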

Thanks in advance
