RE: Increasing degree of parallelism when reading Parquet files

Müller Ingo Tue, 10 Aug 2021 08:42:13 -0700

Let me also say that I can still rerun the experiments for the (hopefully 
subsequent) camera-ready version if the problem takes longer to fix.


Cheers,
Ingo


> -----Original Message-----
> From: Müller Ingo <[email protected]>
> Sent: Tuesday, August 10, 2021 5:34 PM
> To: [email protected]
> Subject: RE: Increasing degree of parallelism when reading Parquet files
> 
> Hey Mike!
> 
> Thanks for confirming! I am happy to test any fixes that you may come up with.
> If the fix happens to be simple and lands before Friday, I can still include
> it in the revision I am currently working on ;) Otherwise, it'd be great to be
> able to mention a Jira issue or similar (maybe this mailing list thread is
> enough?) that I can refer to.
> 
> Cheers,
> Ingo
> 
> 
> > -----Original Message-----
> > From: Michael Carey <[email protected]>
> > Sent: Tuesday, August 10, 2021 4:36 PM
> > To: [email protected]
> > Subject: Re: Increasing degree of parallelism when reading Parquet
> > files
> >
> > Ingo,
> >
> > Got it!  It sounds like we indeed have a parallelism performance bug
> > in the area of threading for S3, then.  Weird!  We'll look into it...
> >
> >
> > Cheers,
> >
> > Mike
> >
> >
> > On 8/9/21 11:21 PM, Müller Ingo wrote:
> >
> >
> >     Hey Mike,
> >
> >     Just to clarify: "partitions" is the same thing as I/O devices,
> > right? I have configured 48 of those via "[nc]\niodevices=..." and see
> > the corresponding folders with content show up on the file system.
> > When I vary the number of these devices, I see that all other storage
> > formats change the degree of parallelism with my queries. That
> > mechanism thus seems to work in general. It just doesn't seem to work
> > for Parquet on S3. (I am not 100% sure if I tried other file formats
> > on S3.)
> >
> >     I have also tried to set compiler.parallelism to 4 for Parquet files
> > on HDFS with a file:// path and did not see any effect, i.e., it used
> > 48 threads, which corresponds to the number of I/O devices. However,
> > with what Dmitry said, I guess that this is expected behavior and the
> > flag should only influence the degree of parallelism after exchanges (which
> > I don't have in my queries).
> >
> >     Cheers,
> >     Ingo
> >
> >
> >
> >             -----Original Message-----
> >             From: Michael Carey <[email protected]>
> >             Sent: Monday, August 9, 2021 10:10 PM
> >             To: [email protected]
> >             Subject: Re: Increasing degree of parallelism when reading Parquet files
> >
> >             Ingo,
> >
> >             Q: In your Parquet/S3 testing, what does your current cluster
> >             configuration look like?  (I.e., how many partitions have you
> >             configured it with - physical storage partitions, that is?)  Even
> >             though your S3 data isn't stored inside AsterixDB in this case, the
> >             system still uses that info to decide how many parallel threads to
> >             use at the base of its query plans.  (Obviously there is room for
> >             improvement on that behavior for use cases involving external
> >             storage. :-))
> >
> >
> >             Cheers,
> >
> >             Mike
> >
> >
> >             On 8/9/21 12:28 PM, Müller Ingo wrote:
> >
> >
> >                     Hi Dmitry,
> >
> >                     Thanks a lot for checking! Indeed, my queries do not have an
> >                     exchange. However, the number of I/O devices has indeed worked
> >                     well in many cases: when I tried the various VM instance sizes,
> >                     I always created as many I/O devices as there were physical
> >                     cores (i.e., half the number of logical CPUs). For internal
> >                     storage as well as HDFS (both using the hdfs:// and the file://
> >                     protocol), I saw the full system being utilized. However, just
> >                     for the case of Parquet on S3, I cannot seem to make it use
> >                     more than 16 cores.
> >
> >                     Cheers,
> >                     Ingo
> >
> >
> >
> >                             -----Original Message-----
> >                             From: Dmitry Lychagin <[email protected]>
> >                             Sent: Monday, August 9, 2021 9:10 PM
> >                             To: [email protected]
> >                             Subject: Re: Increasing degree of parallelism when reading Parquet files
> >
> >                             Hi Ingo,
> >
> >                             I checked the code and it seems that when scanning an
> >                             external data source we're using the same number of cores
> >                             as there are configured storage partitions (I/O devices).
> >                             Therefore, if you want 96 cores to be used when scanning
> >                             Parquet files, then you need to configure 96 I/O devices.
> >
> >                             The compiler.parallelism setting is supposed to affect how
> >                             many cores we use after the first EXCHANGE operator.
> >                             However, if your query doesn't have any EXCHANGEs, then
> >                             it'll use the number of cores assigned for the initial data
> >                             scan operator (the number of I/O devices).
> >
> >                             Thanks,
> >                             -- Dmitry
> >
> >
> >                             On 8/9/21, 11:42 AM, "Müller  Ingo" <[email protected]> wrote:
> >
> >                                 Dear Dmitry,
> >
> >                                 Thanks a lot for the quick reply! I had not thought of
> >                                 this. However, I have tried out both ways just now (per
> >                                 query and in the cluster configuration) and did not see
> >                                 any changes. Is there any way I can verify that the
> >                                 setting was applied successfully? I have also tried
> >                                 setting compiler.parallelism to 4 and still observed 16
> >                                 cores being utilized.
> >
> >                                 Note that the observed degree of parallelism does not
> >                                 correspond to anything related to the data set (I tried
> >                                 with every power-of-two number of files between 1 and
> >                                 128) or the cluster (I tried with every power-of-two
> >                                 number of cores between 2 and 64, as well as 48 and 96),
> >                                 and I always see 16 cores being used (or fewer, if the
> >                                 system has fewer). To me, this makes it unlikely that the
> >                                 system really uses the semantics for p=0 or p<0; it looks
> >                                 more like some hard-coded value.
> >
> >                                 Cheers,
> >                                 Ingo
> >
> >
> >                                 > -----Original Message-----
> >                                 > From: Dmitry Lychagin <[email protected]>
> >                                 > Sent: Monday, August 9, 2021 7:25 PM
> >                                 > To: [email protected]
> >                                 > Subject: Re: Increasing degree of parallelism when reading Parquet files
> >                                 >
> >                                 > Ingo,
> >                                 >
> >                                 >
> >                                 >
> >                                 > We have a `compiler.parallelism` parameter that controls how
> >                                 > many cores are used for query execution.
> >                                 >
> >                                 > See
> >                                 > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> >                                 >
> >                                 > You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> >                                 >
> >                                 > or globally in the cluster configuration:
> >                                 >
> >                                 > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> >                                 >
> >                                 >
> >                                 >
> >                                 > Thanks,
> >                                 >
> >                                 > -- Dmitry
> >                                 >
> >                                 >
> >                                 >
> >                                 >
> >                                 >
> >                                 > From: Müller Ingo <[email protected]>
> >                                 > Reply-To: "[email protected]" <[email protected]>
> >                                 > Date: Monday, August 9, 2021 at 10:05 AM
> >                                 > To: "[email protected]" <[email protected]>
> >                                 > Subject: Increasing degree of parallelism when reading Parquet files
> >                                 >
> >                                 >
> >                                 >
> >                                 > Dear AsterixDB devs,
> >                                 >
> >                                 >
> >                                 >
> >                                 > I am currently trying out the new support for Parquet files on
> >                                 > S3 (still in the context of my High-energy Physics use case
> >                                 > [1]). This works great so far and has generally decent
> >                                 > performance. However, I realized that it does not use more
> >                                 > than 16 cores, even though 96 logical cores are available and
> >                                 > even though I run long-running queries (several minutes) on
> >                                 > large data sets with a large number of files (I tried 128
> >                                 > files of 17GB each). Is this an arbitrary/artificial
> >                                 > limitation that can be changed somehow (potentially with a
> >                                 > small patch+recompiling) or is there more serious development
> >                                 > required to lift it? FYI, I am currently using 03fd6d0f, which
> >                                 > should include all S3/Parquet commits on master.
> >                                 >
> >                                 >
> >                                 >
> >                                 > Cheers,
> >                                 >
> >                                 > Ingo
> >                                 >
> >                                 >
> >                                 >
> >                                 >
> >                                 >
> >                                 > [1] https://arxiv.org/abs/2104.12615
> >                                 >
> >                                 >
> >
> >
> >
> >
> >
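
The rule Dmitry describes in the thread above (scan parallelism follows the
number of I/O devices; compiler.parallelism only matters after the first
EXCHANGE) can be written down as a small model. This is only a sketch of the
behavior as stated in the thread, not AsterixDB code: the function name
plan_parallelism and its parameters are hypothetical, and only positive
compiler.parallelism values are covered (the special semantics for p=0 and
p<0 are left out).

```python
def plan_parallelism(num_io_devices: int, compiler_parallelism: int,
                     has_exchange: bool) -> dict:
    """Model of the rule stated in the thread (hypothetical helper, not AsterixDB code)."""
    if num_io_devices < 1:
        raise ValueError("need at least one I/O device")
    if compiler_parallelism < 1:
        # p=0 and p<0 have special semantics that this sketch does not model.
        raise ValueError("only positive compiler.parallelism values are modeled")
    # The initial scan uses one core per configured I/O device (storage partition).
    scan = num_io_devices
    # compiler.parallelism only takes effect after the first EXCHANGE operator.
    downstream = compiler_parallelism if has_exchange else scan
    return {"scan": scan, "downstream": downstream}


if __name__ == "__main__":
    # Ingo's HDFS experiment: 48 I/O devices, compiler.parallelism=4, no EXCHANGE.
    print(plan_parallelism(48, 4, has_exchange=False))
```

With 48 I/O devices and no EXCHANGE, the model gives 48 for both stages, which
matches what Ingo reports for HDFS. The 16-core ceiling for Parquet on S3 does
not follow from this rule, which is consistent with Mike treating it as a bug.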

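Since the thread suggests configuring one I/O device per core that should take
part in the scan (e.g. 96 devices for 96 cores), writing the "[nc]" /
"iodevices=..." entry by hand gets tedious. The following sketch generates it.
It assumes that iodevices takes a comma-separated list of directories; the
helper name nc_section, the base directory and the "iodeviceN" naming are
hypothetical and only serve as illustration.

```python
def nc_section(base_dir: str, count: int) -> str:
    """Build an "[nc]" section with `count` I/O device directories.

    Assumption: `iodevices` is a comma-separated list of directory paths.
    The directory naming scheme below is made up for this example.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    paths = [f"{base_dir}/iodevice{i}" for i in range(1, count + 1)]
    return "[nc]\niodevices=" + ",".join(paths) + "\n"


if __name__ == "__main__":
    # One I/O device per core that should participate in the scan.
    print(nc_section("/data/asterixdb", 4), end="")
```

The output can be pasted into the cluster configuration file in place of an
existing [nc] iodevices entry; any other keys in that section are untouched
by this sketch and would have to be carried over manually.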