Hey Mike! Thanks for confirming! I am happy to test any fixes that you may come up with. If the fix happens to be simple and lands before Friday, I can still include it in the revision I am currently working on ;) Otherwise, it'd be great to have a Jira issue or similar (maybe this mailing-list thread is enough?) that I can refer to.
Cheers,
Ingo

> -----Original Message-----
> From: Michael Carey <[email protected]>
> Sent: Tuesday, August 10, 2021 4:36 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Got it! It sounds like we indeed have a parallelism performance bug in the
> area of threading for S3, then. Weird! We'll look into it...
>
> Cheers,
> Mike
>
> On 8/9/21 11:21 PM, Müller Ingo wrote:
> > Hey Mike,
> >
> > Just to clarify: "partitions" is the same thing as I/O devices, right? I
> > have configured 48 of those via "[nc]\niodevices=..." and see the
> > corresponding folders with content show up on the file system. When I
> > vary the number of these devices, I see that all other storage formats
> > change the degree of parallelism of my queries. That mechanism thus
> > seems to work in general. It just doesn't seem to work for Parquet on
> > S3. (I am not 100% sure if I tried other file formats on S3.)
> >
> > I have also tried to set compiler.parallelism to 4 for Parquet files on
> > HDFS with a file:// path and did not see any effect, i.e., it used 48
> > threads, which corresponds to the number of I/O devices. However, given
> > what Dmitry said, I guess that this is expected behavior and the flag
> > should only influence the degree of parallelism after exchanges (which I
> > don't have in my queries).
> >
> > Cheers,
> > Ingo
> >
> > > -----Original Message-----
> > > From: Michael Carey <[email protected]>
> > > Sent: Monday, August 9, 2021 10:10 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > files
> > >
> > > Ingo,
> > >
> > > Q: In your Parquet/S3 testing, what does your current cluster
> > > configuration look like? (I.e., how many partitions have you
> > > configured it with - physical storage partitions, that is?)
> > > Even though your S3 data isn't stored inside AsterixDB in this case,
> > > the system still uses that info to decide how many parallel threads to
> > > use at the base of its query plans. (Obviously there is room for
> > > improvement on that behavior for use cases involving external storage.
> > > :-))
> > >
> > > Cheers,
> > > Mike
> > >
> > > On 8/9/21 12:28 PM, Müller Ingo wrote:
> > > > Hi Dmitry,
> > > >
> > > > Thanks a lot for checking! Indeed, my queries do not have an
> > > > exchange. However, the number of I/O devices has indeed worked well
> > > > in many cases: when I tried the various VM instance sizes, I always
> > > > created as many I/O devices as there were physical cores (i.e., half
> > > > the number of logical CPUs). For internal storage as well as HDFS
> > > > (both using the hdfs:// and the file:// protocol), I saw the full
> > > > system being utilized. However, just for the case of Parquet on S3,
> > > > I cannot seem to make it use more than 16 cores.
> > > >
> > > > Cheers,
> > > > Ingo
> > > >
> > > > > -----Original Message-----
> > > > > From: Dmitry Lychagin <[email protected]>
> > > > > Sent: Monday, August 9, 2021 9:10 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > > > files
> > > > >
> > > > > Hi Ingo,
> > > > >
> > > > > I checked the code, and it seems that when scanning an external
> > > > > datasource we're using the same number of cores as there are
> > > > > configured storage partitions (I/O devices). Therefore, if you
> > > > > want 96 cores to be used when scanning Parquet files, then you
> > > > > need to configure 96 I/O devices.
> > > > >
> > > > > The compiler.parallelism setting is supposed to affect how many
> > > > > cores we use after the first EXCHANGE operator.
> > > > > However, if your query doesn't have any EXCHANGEs, then it'll use
> > > > > the number of cores assigned to the initial data scan operator
> > > > > (the number of I/O devices).
> > > > >
> > > > > Thanks,
> > > > > -- Dmitry
> > > > >
> > > > > On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> > > > >
> > > > > > EXTERNAL EMAIL: Use caution when opening attachments or clicking
> > > > > > on links
> > > > > >
> > > > > > Dear Dmitry,
> > > > > >
> > > > > > Thanks a lot for the quick reply! I had not thought of this.
> > > > > > However, I have tried out both ways just now (per query and in
> > > > > > the cluster configuration) and did not see any changes. Is there
> > > > > > any way I can check that the setting was applied successfully? I
> > > > > > have also tried setting compiler.parallelism to 4 and still
> > > > > > observed 16 cores being utilized.
> > > > > >
> > > > > > Note that the observed degree of parallelism does not correspond
> > > > > > to anything related to the data set (I tried with every power of
> > > > > > two files between 1 and 128) or the cluster (I tried with every
> > > > > > power of two cores between 2 and 64, as well as 48 and 96), and
> > > > > > I always see 16 cores being used (or fewer, if the system has
> > > > > > fewer). To me, this makes it unlikely that the system really
> > > > > > uses the semantics for p=0 or p<0; it looks more like some
> > > > > > hard-coded value.
> > > > > >
> > > > > > Cheers,
> > > > > > Ingo
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Dmitry Lychagin <[email protected]>
> > > > > > > Sent: Monday, August 9, 2021 7:25 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Increasing degree of parallelism when reading
> > > > > > > Parquet files
> > > > > > >
> > > > > > > Ingo,
> > > > > > >
> > > > > > > We have a `compiler.parallelism` parameter that controls how
> > > > > > > many cores are used for query execution.
> > > > > > > See
> > > > > > > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > > > > > >
> > > > > > > You can either set it per query (e.g. SET
> > > > > > > `compiler.parallelism` "-1";), or globally in the cluster
> > > > > > > configuration:
> > > > > > > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > > > > > >
> > > > > > > Thanks,
> > > > > > > -- Dmitry
> > > > > > >
> > > > > > > From: Müller Ingo <[email protected]>
> > > > > > > Reply-To: "[email protected]" <[email protected]>
> > > > > > > Date: Monday, August 9, 2021 at 10:05 AM
> > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > Subject: Increasing degree of parallelism when reading Parquet
> > > > > > > files
> > > > > > >
> > > > > > > > Dear AsterixDB devs,
> > > > > > > >
> > > > > > > > I am currently trying out the new support for Parquet files
> > > > > > > > on S3 (still in the context of my High-energy Physics use
> > > > > > > > case [1]). This works great so far and has generally decent
> > > > > > > > performance.
> > > > > > > > However, I realized that it does not use more than 16
> > > > > > > > cores, even though 96 logical cores are available and even
> > > > > > > > though I run long-running queries (several minutes) on large
> > > > > > > > data sets with a large number of files (I tried 128 files of
> > > > > > > > 17 GB each). Is this an arbitrary/artificial limitation that
> > > > > > > > can be changed somehow (potentially with a small
> > > > > > > > patch+recompiling) or is there more serious development
> > > > > > > > required to lift it? FYI, I am currently using 03fd6d0f,
> > > > > > > > which should include all S3/Parquet commits on master.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Ingo
> > > > > > > >
> > > > > > > > [1] https://arxiv.org/abs/2104.12615
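For readers following the thread: the per-query form of the setting Dmitry describes is a plain SET statement at the top of the request. A minimal sketch, where the dataset name is a hypothetical placeholder:

```sql
-- Request a specific degree of parallelism for this query only.
-- (Per the SQL++ manual linked above, 0 and negative values have
-- special semantics; "96" asks for 96-way parallelism.)
SET `compiler.parallelism` "96";

SELECT COUNT(*) AS cnt
FROM ExternalParquetDataset;  -- hypothetical external dataset
```

Note that, per Dmitry's explanation, this only affects operators after the first EXCHANGE; a scan-only query ignores it.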
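Since the thread concludes that external-scan parallelism follows the number of configured I/O devices, the other knob is the iodevices list in the node-controller section of the cluster config that Ingo mentions ("[nc]\niodevices=..."). A sketch with illustrative paths:

```ini
; Node controller section of the cluster config (paths are illustrative).
; External-scan parallelism follows the number of entries listed here,
; so 96-way scans would need 96 comma-separated device paths.
[nc]
iodevices = /mnt/vol0/asterixdb,/mnt/vol1/asterixdb,/mnt/vol2/asterixdb
```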
