Re: Increasing degree of parallelism when reading Parquet files

Wail Alkowaileet Tue, 10 Aug 2021 09:10:38 -0700

Thanks, Ingo, for the detailed explanation and for benchmarking it! It is
great input for us. We will look at the issue and hopefully we can get it
fixed before the end of the week.


On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <[email protected]>
wrote:

> Let me also say that I can still rerun the experiments for the (hopefully
> subsequent) camera-ready version if the problem takes longer to fix.
>
> Cheers,
> Ingo
>
>
> > -----Original Message-----
> > From: Müller Ingo <[email protected]>
> > Sent: Tuesday, August 10, 2021 5:34 PM
> > To: [email protected]
> > Subject: RE: Increasing degree of parallelism when reading Parquet files
> >
> > Hey Mike!
> >
> > Thanks for confirming! I am happy to test any fixes that you may come up
> > with. If the fix happens to be simple and is done before Friday, I can
> > still include it in the revision I am currently working on ;) Otherwise,
> > it'd be great to be able to mention a Jira issue or similar (maybe this
> > mailing list thread is enough?) that I can refer to.
> >
> > Cheers,
> > Ingo
> >
> >
> > > -----Original Message-----
> > > From: Michael Carey <[email protected]>
> > > Sent: Tuesday, August 10, 2021 4:36 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > files
> > >
> > > Ingo,
> > >
> > > Got it!  It sounds like we indeed have a parallelism performance bug
> > > in the area of threading for S3, then.  Weird!  We'll look into it...
> > >
> > >
> > > Cheers,
> > >
> > > Mike
> > >
> > >
> > > On 8/9/21 11:21 PM, Müller Ingo wrote:
> > >
> > >
> > >     Hey Mike,
> > >
> > >     Just to clarify: "partitions" is the same thing as I/O devices,
> > > right? I have configured 48 of those via "[nc]\niodevices=..." and see
> > > the corresponding folders with content show up on the file system.
> > > When I vary the number of these devices, I see that all other storage
> > > formats change the degree of parallelism of my queries. That
> > > mechanism thus seems to work in general. It just doesn't seem to work
> > > for Parquet on S3. (I am not 100% sure if I tried other file formats
> > > on S3.)
> > >
> > >     I have also tried to set compiler.parallelism to 4 for Parquet files
> > > on HDFS with a file:// path and did not see any effect, i.e., it used
> > > 48 threads, which corresponds to the number of I/O devices. However,
> > > with what Dmitry said, I guess that this is expected behavior and the
> > > flag should only influence the degree of parallelism after exchanges
> > > (which I don't have in my queries).
> > >
> > >     Cheers,
> > >     Ingo
> > >
> > >
> > >
> > >             -----Original Message-----
> > >             From: Michael Carey <[email protected]>
> > >             Sent: Monday, August 9, 2021 10:10 PM
> > >             To: [email protected]
> > >             Subject: Re: Increasing degree of parallelism when reading
> > >             Parquet files
> > >
> > >             Ingo,
> > >
> > >             Q: In your Parquet/S3 testing, what does your current cluster
> > >             configuration look like?  (I.e., how many partitions have you
> > >             configured it with - physical storage partitions that is?)
> > >             Even though your S3 data isn't stored inside AsterixDB in this
> > >             case, the system still uses that info to decide how many
> > >             parallel threads to use at the base of its query plans.
> > >             (Obviously there is room for improvement on that behavior for
> > >             use cases involving external storage. :-))
> > >
> > >
> > >             Cheers,
> > >
> > >             Mike
> > >
> > >
> > >             On 8/9/21 12:28 PM, Müller Ingo wrote:
> > >
> > >
> > >                     Hi Dmitry,
> > >
> > >                     Thanks a lot for checking! Indeed, my queries do not have
> > >                     an exchange. However, the number of I/O devices has indeed
> > >                     worked well in many cases: when I tried the various VM
> > >                     instance sizes, I always created as many I/O devices as
> > >                     there were physical cores (i.e., half the number of logical
> > >                     CPUs). For internal storage as well as HDFS (both using the
> > >                     hdfs:// and the file:// protocol), I saw the full system
> > >                     being utilized. However, just for the case of Parquet on
> > >                     S3, I cannot seem to make it use more than 16 cores.
> > >
> > >                     Cheers,
> > >                     Ingo
> > >
> > >
> > >
> > >                             -----Original Message-----
> > >                             From: Dmitry Lychagin <[email protected]>
> > >                             Sent: Monday, August 9, 2021 9:10 PM
> > >                             To: [email protected]
> > >                             Subject: Re: Increasing degree of parallelism when
> > >                             reading Parquet files
> > >
> > >                             Hi Ingo,
> > >
> > >                             I checked the code and it seems that when scanning an
> > >                             external data source we're using the same number of
> > >                             cores as there are configured storage partitions (I/O
> > >                             devices).
> > >                             Therefore, if you want 96 cores to be used when
> > >                             scanning Parquet files then you need to configure 96
> > >                             I/O devices.
> > >
> > >                             The compiler.parallelism setting is supposed to affect
> > >                             how many cores we use after the first EXCHANGE
> > >                             operator. However, if your query doesn't have any
> > >                             EXCHANGEs then it'll use the number of cores assigned
> > >                             for the initial data scan operator (number of I/O
> > >                             devices).
> > >
> > >                             Thanks,
> > >                             -- Dmitry
> > >
> > >
> > >                             On 8/9/21, 11:42 AM, "Müller  Ingo" <[email protected]> wrote:
> > >
> > >
> > >                                 Dear Dmitry,
> > >
> > >                                 Thanks a lot for the quick reply! I had not thought of
> > >                                 this. However, I have tried out both ways just now (per
> > >                                 query and in the cluster configuration) and did not see
> > >                                 any changes. Is there any way I can verify that the
> > >                                 setting was applied successfully? I have also tried
> > >                                 setting compiler.parallelism to 4 and still observed 16
> > >                                 cores being utilized.
> > >
> > >                                 Note that the observed degree of parallelism does not
> > >                                 correspond to anything related to the data set (I tried
> > >                                 with every power of two files between 1 and 128) or the
> > >                                 cluster (I tried with every power of two cores between 2
> > >                                 and 64, as well as 48 and 96) and I always see 16 cores
> > >                                 being used (or fewer, if the system has fewer). To me,
> > >                                 this makes it unlikely that the system really uses the
> > >                                 semantics for p=0 or p<0, but looks more like some
> > >                                 hard-coded value.
> > >
> > >                                 Cheers,
> > >                                 Ingo
> > >
> > >
> > >                                 > -----Original Message-----
> > >                                 > From: Dmitry Lychagin <[email protected]>
> > >                                 > Sent: Monday, August 9, 2021 7:25 PM
> > >                                 > To: [email protected]
> > >                                 > Subject: Re: Increasing degree of parallelism when
> > >                                 > reading Parquet files
> > >                                 >
> > >                                 > Ingo,
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > We have `compiler.parallelism` parameter that controls
> > >                                 > how many cores are used for query execution.
> > >                                 >
> > >                                 > See
> > >                                 > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > >                                 >
> > >                                 > You can either set it per query (e.g. SET
> > >                                 > `compiler.parallelism` "-1";), or globally in the
> > >                                 > cluster configuration:
> > >                                 >
> > >                                 > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > Thanks,
> > >                                 >
> > >                                 > -- Dmitry
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > From: Müller Ingo <[email protected]>
> > >                                 > Reply-To: "[email protected]" <[email protected]>
> > >                                 > Date: Monday, August 9, 2021 at 10:05 AM
> > >                                 > To: "[email protected]" <[email protected]>
> > >                                 > Subject: Increasing degree of parallelism when reading
> > >                                 > Parquet files
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > Dear AsterixDB devs,
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > I am currently trying out the new support for Parquet
> > >                                 > files on S3 (still in the context of my High-energy
> > >                                 > Physics use case [1]). This works great so far and has
> > >                                 > generally decent performance. However, I realized that
> > >                                 > it does not use more than 16 cores, even though 96
> > >                                 > logical cores are available and even though I run
> > >                                 > long-running queries (several minutes) on large data
> > >                                 > sets with a large number of files (I tried 128 files of
> > >                                 > 17GB each). Is this an arbitrary/artificial limitation
> > >                                 > that can be changed somehow (potentially with a small
> > >                                 > patch+recompiling) or is there more serious development
> > >                                 > required to lift it? FYI, I am currently using
> > >                                 > 03fd6d0f, which should include all S3/Parquet commits
> > >                                 > on master.
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > Cheers,
> > >                                 >
> > >                                 > Ingo
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 >
> > >                                 > [1] https://arxiv.org/abs/2104.12615
> > >                                 >
> > >                                 >
> > >
> > >
> > >
> > >
> > >
>
>

-- 

*Regards,*
Wail Alkowaileet
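
The rule Dmitry describes in the quoted thread (the scan uses one partition
per configured I/O device, and compiler.parallelism only takes effect after
the first EXCHANGE) can be sketched as a small Python model. This is only an
illustration of the intended behavior as discussed above, not AsterixDB code:
all function and parameter names are hypothetical, and the handling of p=0
(use the storage parallelism) and p<0 or too-large values (use all cores) is
an assumption based on my reading of the SQL++ manual linked in the thread.
The 16-core ceiling reported for Parquet on S3 is the suspected bug and is
deliberately not modeled.

```python
def scan_partitions(io_devices: int) -> int:
    """Partitions used by the initial data scan.

    Per the thread, this equals the number of configured I/O devices
    (storage partitions), also for external data sources.
    """
    return io_devices


def post_exchange_partitions(setting: int, io_devices: int, total_cores: int) -> int:
    """Partitions used after the first EXCHANGE operator.

    Assumed semantics of compiler.parallelism (not confirmed in the thread):
      0             -> same as the storage parallelism
      negative      -> all cores in the cluster
      > total cores -> falls back to all cores in the cluster
      otherwise     -> the requested value
    """
    if setting == 0:
        return io_devices
    if setting < 0 or setting > total_cores:
        return total_cores
    return setting


def plan_parallelism(setting: int, io_devices: int, total_cores: int,
                     has_exchange: bool) -> tuple[int, int]:
    """Return (scan parallelism, parallelism of the rest of the plan)."""
    scan = scan_partitions(io_devices)
    if not has_exchange:
        # Without an EXCHANGE, the whole plan keeps the scan's parallelism,
        # so compiler.parallelism has no visible effect.
        return scan, scan
    return scan, post_exchange_partitions(setting, io_devices, total_cores)


if __name__ == "__main__":
    # Ingo's HDFS experiment: 48 I/O devices, compiler.parallelism = 4,
    # no EXCHANGE in the query.
    print(plan_parallelism(4, 48, 96, has_exchange=False))  # (48, 48)
```

Under this model, the HDFS run with 48 I/O devices and compiler.parallelism
set to 4 uses 48 threads throughout, which matches what Ingo reported; the
same query on S3 would be expected to behave identically once the issue is
fixed.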
