Yes, that matches my understanding as well.  I think that when
pre-buffering is enabled you might get parallel reads even if you are
using ParquetFile/read_table, but I would have to check.  I agree that
it would be a good idea to add some documentation to all the readers
covering our parallelism at a high level.  I created [1] and will
try to address this when I get a chance.
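
For reference, a minimal pyarrow sketch of where that option lives
(the file path and column names are placeholders, and defaults may
differ between versions):

    import pyarrow.parquet as pq

    # Pre-buffering asks the reader to coalesce the column chunk byte
    # ranges it needs and fetch them up front, potentially in parallel.
    table = pq.read_table("data.parquet", pre_buffer=True)

    # The same option is accepted by ParquetFile.
    pf = pq.ParquetFile("data.parquet", pre_buffer=True)
    table = pf.read(columns=["a", "b"])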

> I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> ranges hurt parallelism? Or can you still read a huge range in
> parallel after coalescing? To me, coalescing and parallel reading seem
> like a tradeoff on S3?

It's possible, but I think there is a rather small range of files/reads
that would be affected by this.  The coalescing will only close holes
smaller than 8KiB and will only coalesce up to 64MiB.  Generally, files
are either larger than 64MiB or there are many files (in which case
the I/O from a single file doesn't really need to be parallel).
Furthermore, if we are not reading all of the columns, then the gaps
between the columns we do read are larger than 8KiB and will not be
coalesced, so parallelism across columns is preserved.
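
To make those thresholds concrete, here is a simplified, illustrative
sketch of the coalescing rule (the real logic lives in Arrow's C++
read-range cache; the function and exact merge condition below are
assumptions for illustration only):

    # Thresholds mentioned above; the real values live in Arrow's C++
    # cache options.
    HOLE_SIZE_LIMIT = 8 * 1024           # close holes smaller than 8KiB
    RANGE_SIZE_LIMIT = 64 * 1024 * 1024  # stop growing a read at 64MiB

    def coalesce(ranges):
        """ranges: sorted list of (offset, length) byte ranges to read."""
        out = []
        for off, length in ranges:
            if out:
                cur_off, cur_len = out[-1]
                hole = off - (cur_off + cur_len)
                if (0 <= hole < HOLE_SIZE_LIMIT
                        and off + length - cur_off <= RANGE_SIZE_LIMIT):
                    # Merge this range into the previous read.
                    out[-1] = (cur_off, off + length - cur_off)
                    continue
            out.append((off, length))
        return out

    # A 4KiB gap between two column chunks is closed into one read:
    print(coalesce([(0, 100_000), (104_096, 100_000)]))   # [(0, 204096)]
    # A 1MiB gap (e.g. a skipped column) keeps the reads separate:
    print(coalesce([(0, 100_000), (1_148_576, 100_000)])) # two ranges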

We did benchmark pre-buffering on S3 and, if I remember correctly, the
pre-buffering option had a very beneficial impact when reading from S3.
AWS recommends reads in the 8MB/16MB range, and without pre-buffering I
think our reads are too small to be effective.
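
For anyone who wants to try this with the datasets API on S3, something
along these lines should enable pre-buffering (the bucket, region, and
column names below are placeholders):

    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    s3 = fs.S3FileSystem(region="us-east-1")
    # Attach pre-buffering to the Parquet format used for the scan.
    fmt = ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True))
    dataset = ds.dataset("my-bucket/my-data", format=fmt, filesystem=s3)
    table = dataset.to_table(columns=["a", "b"])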

[1] https://issues.apache.org/jira/browse/ARROW-16194

On Wed, Apr 13, 2022 at 3:16 AM Xinyu Zeng <[email protected]> wrote:
>
> I want to make sure a few points of my understanding of this thread
> are correct. There are two ways to read a Parquet file in C++: either
> through ParquetFile/read_table, or through ParquetDataset. For the
> former, the parallelism is per column, because read_table simply passes
> all row group indices to DecodeRowGroups in reader.cc, and there is
> no row-group-level parallelism. For the latter, the parallelism is per
> column and per row group (i.e., per ColumnChunk), according to
> RowGroupGenerator in file_parquet.cc. The difference between the
> former and the latter is also exposed via use_legacy_dataset in
> Python. If my understanding is correct, I think this difference would
> be better explained in the docs to avoid confusion; I had to dig
> through the code to understand it.
>
> I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> ranges hurt parallelism? Or can you still read a huge range in
> parallel after coalescing? To me, coalescing and parallel reading seem
> like a tradeoff on S3?
>
> Thanks in advance
