Ingo,

Thanks for trying the Parquet connector. Your inputs were super valuable!
Sure, you can use the current change if it solves the problem.
Please let us know if you have any questions/concerns.
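
For reference, the two settings discussed in the thread below can be sketched roughly like this (the paths and values are illustrative assumptions, not a tested configuration; see the sample cc2.conf linked further down in the thread for the authoritative layout):

```ini
; cc.conf sketch: external scans (e.g. Parquet on S3) parallelize over the
; configured I/O devices, so list one device path per core you want used.
; The /mnt/io* paths are hypothetical examples.
[nc]
iodevices=/mnt/io0,/mnt/io1,/mnt/io2,/mnt/io3

[common]
; Applies only after the first EXCHANGE operator in a query plan;
; 0 or a negative value means "use all available cores".
compiler.parallelism=-1
```

Per query, the same knob can be set as shown further down in the thread, with SET `compiler.parallelism` "-1"; before the SELECT statement.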

On Wed, Aug 11, 2021 at 1:24 AM Müller Ingo <[email protected]>
wrote:

> Dear all,
>
> I have just tried out Wail's patch set from here:
> https://issues.apache.org/jira/browse/ASTERIXDB-2945. It seems to solve
> my problem fully: in the 96-vCPU instance with 48 I/O devices, I see about
> 4800% CPU utilization during query execution, and the queries run only
> marginally longer than if run against local files. Thanks a lot for the
> quick fix!
>
> Should I use this version for a full benchmark run or wait until the patch
> makes it to master?
>
> Cheers,
> Ingo
>
>
> > -----Original Message-----
> > From: Wail Alkowaileet <[email protected]>
> > Sent: Tuesday, August 10, 2021 6:10 PM
> > To: [email protected]
> > Subject: Re: Increasing degree of parallelism when reading Parquet files
> >
> > Thanks Ingo for the detailed explanation and for benchmarking it! It is
> > a great input for us. We will look at the issue and hopefully we can get
> > it fixed before the end of the week.
> >
> > On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <[email protected]> wrote:
> >
> >
> >       Let me also say that I can still rerun the experiments for the
> > (hopefully subsequent) camera-ready version if the problem takes longer
> > to fix.
> >
> >       Cheers,
> >       Ingo
> >
> >
> >       > -----Original Message-----
> >       > From: Müller Ingo <[email protected]>
> >       > Sent: Tuesday, August 10, 2021 5:34 PM
> >       > To: [email protected]
> >       > Subject: RE: Increasing degree of parallelism when reading
> > Parquet files
> >       >
> >       > Hey Mike!
> >       >
> >       > Thanks for confirming! I am happy to test any fixes that you may
> > come up with.
> >       > If the fix happens to be simple and lands before Friday, I can
> > still include it in the revision I am currently working on ;) Otherwise,
> > it'd be great to be able to mention a Jira issue or similar (maybe this
> > mailing list thread is enough?) that I can refer to.
> >       >
> >       > Cheers,
> >       > Ingo
> >       >
> >       >
> >       > > -----Original Message-----
> >       > > From: Michael Carey <[email protected]>
> >       > > Sent: Tuesday, August 10, 2021 4:36 PM
> >       > > To: [email protected]
> >       > > Subject: Re: Increasing degree of parallelism when reading
> > Parquet files
> >       > >
> >       > > Ingo,
> >       > >
> >       > > Got it!  It sounds like we indeed have a parallelism
> >       > > performance bug in the area of threading for S3, then.  Weird!
> >       > > We'll look into it...
> >       > >
> >       > >
> >       > > Cheers,
> >       > >
> >       > > Mike
> >       > >
> >       > >
> >       > > On 8/9/21 11:21 PM, Müller Ingo wrote:
> >       > >
> >       > >
> >       > >     Hey Mike,
> >       > >
> >       > >     Just to clarify: "partitions" is the same thing as I/O
> >       > > devices, right? I have configured 48 of those via
> >       > > "[nc]\niodevices=..." and see the corresponding folders with
> >       > > content show up on the file system. When I vary the number of
> >       > > these devices, I see that all other storage formats change the
> >       > > degree of parallelism of my queries. That mechanism thus seems
> >       > > to work in general. It just doesn't seem to work for Parquet on
> >       > > S3. (I am not 100% sure if I tried other file formats on S3.)
> >       > >
> >       > >     I have also tried to set compiler.parallelism to 4 for
> >       > > Parquet files on HDFS with a file:// path and did not see any
> >       > > effect, i.e., it used 48 threads, which corresponds to the
> >       > > number of I/O devices. However, with what Dmitry said, I guess
> >       > > that this is expected behavior and the flag should only
> >       > > influence the degree of parallelism after exchanges (which I
> >       > > don't have in my queries).
> >       > >
> >       > >     Cheers,
> >       > >     Ingo
> >       > >
> >       > >
> >       > >
> >       > >             -----Original Message-----
> >       > >             From: Michael Carey <[email protected]>
> >       > >             Sent: Monday, August 9, 2021 10:10 PM
> >       > >             To: [email protected]
> >       > >             Subject: Re: Increasing degree of parallelism when
> >       > > reading Parquet files
> >       > >
> >       > >             Ingo,
> >       > >
> >       > >             Q: In your Parquet/S3 testing, what does your
> >       > > current cluster configuration look like?  (I.e., how many
> >       > > partitions have you configured it with - physical storage
> >       > > partitions, that is?)  Even though your S3 data isn't stored
> >       > > inside AsterixDB in this case, the system still uses that info
> >       > > to decide how many parallel threads to use at the base of its
> >       > > query plans.  (Obviously there is room for improvement on that
> >       > > behavior for use cases involving external storage. :-))
> >       > >
> >       > >
> >       > >             Cheers,
> >       > >
> >       > >             Mike
> >       > >
> >       > >
> >       > >             On 8/9/21 12:28 PM, Müller Ingo wrote:
> >       > >
> >       > >
> >       > >                     Hi Dmitry,
> >       > >
> >       > >                     Thanks a lot for checking! Indeed, my
> >       > > queries do not have an exchange. However, the number of I/O
> >       > > devices has indeed worked well in many cases: when I tried the
> >       > > various VM instance sizes, I always created as many I/O devices
> >       > > as there were physical cores (i.e., half the number of logical
> >       > > CPUs). For internal storage as well as HDFS (both using the
> >       > > hdfs:// and the file:// protocol), I saw the full system being
> >       > > utilized. However, just for the case of Parquet on S3, I cannot
> >       > > seem to make it use more than 16 cores.
> >       > >
> >       > >                     Cheers,
> >       > >                     Ingo
> >       > >
> >       > >
> >       > >
> >       > >                             -----Original Message-----
> >       > >                             From: Dmitry Lychagin
> >       > > <[email protected]>
> >       > >                             Sent: Monday, August 9, 2021 9:10 PM
> >       > >                             To: [email protected]
> >       > >                             Subject: Re: Increasing degree of
> >       > > parallelism when reading Parquet files
> >       > >
> >       > >                             Hi Ingo,
> >       > >
> >       > >                             I checked the code, and it seems
> >       > > that when scanning an external datasource we're using the same
> >       > > number of cores as there are configured storage partitions (I/O
> >       > > devices). Therefore, if you want 96 cores to be used when
> >       > > scanning Parquet files, then you need to configure 96 I/O
> >       > > devices.
> >       > >
> >       > >                             The compiler.parallelism setting is
> >       > > supposed to affect how many cores we use after the first
> >       > > EXCHANGE operator. However, if your query doesn't have any
> >       > > EXCHANGEs, then it'll use the number of cores assigned to the
> >       > > initial data scan operator (the number of I/O devices).
> >       > >
> >       > >                             Thanks,
> >       > >                             -- Dmitry
> >       > >
> >       > >
> >       > >                             On 8/9/21, 11:42 AM, "Müller Ingo"
> >       > >             <[email protected]> wrote:
> >       > >
> >       > >
> >       > >
> >       > >
> >       > >
> >       > >
> >       > >                                 Dear Dmitry,
> >       > >
> >       > >                                 Thanks a lot for the quick
> >       > > reply! I had not thought of this. However, I have tried out
> >       > > both ways just now (per query and in the cluster configuration)
> >       > > and did not see any changes. Is there any way I can verify that
> >       > > the setting was applied successfully? I have also tried setting
> >       > > compiler.parallelism to 4 and still observed 16 cores being
> >       > > utilized.
> >       > >
> >       > >                                 Note that the observed degree
> >       > > of parallelism does not correspond to anything related to the
> >       > > data set (I tried with every power of two files between 1 and
> >       > > 128) or the cluster (I tried with every power of two cores
> >       > > between 2 and 64, as well as 48 and 96), and I always see 16
> >       > > cores being used (or fewer, if the system has fewer). To me,
> >       > > this makes it unlikely that the system really uses the
> >       > > semantics for p=0 or p<0; it looks more like some hard-coded
> >       > > value.
> >       > >
> >       > >                                 Cheers,
> >       > >                                 Ingo
> >       > >
> >       > >
> >       > >                                 > -----Original Message-----
> >       > >                                 > From: Dmitry Lychagin
> >       > > <[email protected]>
> >       > >                                 > Sent: Monday, August 9, 2021 7:25 PM
> >       > >                                 > To: [email protected]
> >       > >                                 > Subject: Re: Increasing degree
> >       > > of parallelism when reading Parquet files
> >       > >                                 >
> >       > >                                 > Ingo,
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > We have the
> >       > > `compiler.parallelism` parameter that controls how many cores
> >       > > are used for query execution.
> >       > >                                 >
> >       > >                                 > See
> > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> >       > >                                 >
> >       > >                                 > You can either set it per
> >       > > query (e.g. SET `compiler.parallelism` "-1";), or globally in
> >       > > the cluster configuration:
> >       > >                                 >
> > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > Thanks,
> >       > >                                 >
> >       > >                                 > -- Dmitry
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > From: Müller Ingo
> >       > > <[email protected]>
> >       > >                                 > Reply-To: "[email protected]"
> >       > > <[email protected]>
> >       > >                                 > Date: Monday, August 9, 2021 at 10:05 AM
> >       > >                                 > To: "[email protected]"
> >       > > <[email protected]>
> >       > >                                 > Subject: Increasing degree of
> >       > > parallelism when reading Parquet files
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > Dear AsterixDB devs,
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > I am currently trying out the
> >       > > new support for Parquet files on S3 (still in the context of my
> >       > > high-energy physics use case [1]). This works great so far and
> >       > > generally has decent performance. However, I realized that it
> >       > > does not use more than 16 cores, even though 96 logical cores
> >       > > are available and even though I run long-running queries
> >       > > (several minutes) on large data sets with a large number of
> >       > > files (I tried 128 files of 17GB each). Is this an
> >       > > arbitrary/artificial limitation that can be changed somehow
> >       > > (potentially with a small patch + recompiling), or is there more
> >       > > serious development required to lift it? FYI, I am currently
> >       > > using 03fd6d0f, which should include all S3/Parquet commits on
> >       > > master.
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > Cheers,
> >       > >                                 >
> >       > >                                 > Ingo
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 >
> >       > >                                 > [1] https://arxiv.org/abs/2104.12615
> >       > >                                 >
> >       > >                                 >
> >       > >
> >       > >
> >       > >
> >       > >
> >       > >
> >
> >
> >
> >
> >
> > --
> >
> >
> > Regards,
> > Wail Alkowaileet
>
>

-- 

Regards,
Wail Alkowaileet
