Thanks. A follow-up question on pre-buffering. When the caching layer
caches all the ranges, will it issue requests to S3 for all of them
simultaneously to saturate S3 bandwidth? Or is there also a cap on
download parallelism, or some pipelining technique?
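To make the question concrete, my mental model is something like the
sketch below: one request per coalesced range, but with the number of
requests in flight capped by a fixed-size pool. This is only my guess
at the shape of it, not Arrow's actual code; fetch_range and
max_parallelism are placeholders of my own.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(ranges, fetch_range, max_parallelism=8):
    """Guess at the download side: issue one GET per coalesced
    (start, length) range, but bound the number of requests in
    flight with a fixed-size thread pool rather than firing them
    all at once."""
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        futures = [pool.submit(fetch_range, start, length)
                   for start, length in ranges]
        # Collect results in submission order.
        return [f.result() for f in futures]
```

Is that roughly what happens, or are all ranges requested at once?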

On Thu, Apr 14, 2022 at 4:51 AM Weston Pace <[email protected]> wrote:
>
> Yes, that matches my understanding as well.  I think, when
> pre-buffering is enabled, you might get parallel reads even if you are
> using ParquetFile/read_table but I would have to check.  I agree that
> it would be a good idea to add some documentation to all the readers
> going over our parallelism at a high level.  I created [1] and will
> try to update this when I get a chance.
>
> > I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> > ranges hurt parallelism? Or can you still read a huge range in
> > parallel after coalescing? To me, coalescing and parallel reading
> > seem like a tradeoff on S3?
>
> It's possible but I think there is a rather small range of files/reads
> that would be affected by this.  The coalescing will only close holes
> smaller than 8KiB and will only coalesce up to 64MiB.  Generally files
> are either larger than 64MiB or there are many files (in which case
> the I/O from a single file doesn't really need to be parallel).
> Furthermore, if we are not reading all of the columns then the gaps
> between columns are larger than 8KiB.
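[Just to check my understanding of the hole-closing behavior described
above, I'd sketch it roughly like this. This is my own simplification,
not Arrow's actual code; the 8 KiB / 64 MiB defaults are taken from
your description:]

```python
def coalesce_ranges(ranges, hole_size_limit=8 * 1024,
                    range_size_limit=64 * 1024 * 1024):
    """Simplified sketch of range coalescing: merge adjacent
    (start, length) reads when the gap between them is smaller than
    hole_size_limit, without letting a merged range grow past
    range_size_limit."""
    merged = []
    for start, length in sorted(ranges):
        if merged:
            prev_start, prev_len = merged[-1]
            gap = start - (prev_start + prev_len)
            new_len = (start + length) - prev_start
            if gap <= hole_size_limit and new_len <= range_size_limit:
                # Close the small hole by extending the previous range.
                merged[-1] = (prev_start, new_len)
                continue
        merged.append((start, length))
    return merged
```

[So with a gap over the hole limit, e.g. unselected columns, the two
column chunks stay separate reads, which matches what you said.]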
>
> We did benchmark pre buffering on S3 and, if I remember correctly, the
> pre buffering option had a very beneficial impact when running in S3.
> AWS recommends reads in the 8MB/16MB range and without pre-buffering I
> think our reads are too small to be effective.
>
> [1] https://issues.apache.org/jira/browse/ARROW-16194
>
> On Wed, Apr 13, 2022 at 3:16 AM Xinyu Zeng <[email protected]> wrote:
> >
> > I want to make sure my understanding of a few points in this
> > thread is correct. There are two ways to read a parquet file in
> > C++: either through ParquetFile/read_table, or through
> > ParquetDataset. For the former, the parallelism is per column,
> > because read_table simply passes all row group indices to
> > DecodeRowGroups in reader.cc, and there is no row-group-level
> > parallelism. For the latter, the parallelism is per column and per
> > row group, i.e. per ColumnChunk, according to RowGroupGenerator in
> > file_parquet.cc. The former and the latter are also what
> > use_legacy_dataset switches between in Python. If my understanding
> > is correct, I think this difference would be better explained in
> > the docs to avoid confusion; I had to dig through the code to
> > understand it.
> >
> > I was also wondering how pre_buffer works. Will coalescing ColumnChunk
> > ranges hurt parallelism? Or can you still read a huge range in
> > parallel after coalescing? To me, coalescing and parallel reading
> > seem like a tradeoff on S3?
> >
> > Thanks in advance
