RE: Increasing degree of parallelism when reading Parquet files

Müller Ingo Mon, 09 Aug 2021 23:22:06 -0700

Hey Mike,

Just to clarify: "partitions" are the same thing as I/O devices, right? I have 
configured 48 of those via "[nc]\niodevices=..." and see the corresponding 
folders with content show up on the file system. When I vary the number of 
these devices, I see that the degree of parallelism of my queries changes 
accordingly for all other storage formats. That mechanism thus seems to work in 
general. It just doesn't seem to work for Parquet on S3. (I am not 100% sure 
whether I tried other file formats on S3.)
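
For reference, a minimal sketch of how such an "[nc]" section with N I/O 
devices could be generated. The base directory and the "iodeviceN" naming 
scheme are assumptions for illustration only, not my actual setup; only the 
"[nc]" section and the comma-separated "iodevices" key come from the 
configuration mentioned above.

```python
def nc_iodevices_section(base_dir: str, count: int) -> str:
    """Build an [nc] config section listing `count` I/O device directories.

    `base_dir` and the `iodevice<N>` directory names are hypothetical.
    """
    if count < 1:
        raise ValueError("need at least one I/O device")
    # One directory per I/O device, joined into a comma-separated list.
    devices = [f"{base_dir}/iodevice{i}" for i in range(1, count + 1)]
    return "[nc]\niodevices=" + ",".join(devices) + "\n"


if __name__ == "__main__":
    # Example: one I/O device per physical core on a 48-core machine.
    print(nc_iodevices_section("/data/asterix", 48), end="")
```

Varying `count` is then the only change needed between runs on different 
instance sizes.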


I have also tried to set compiler.parallelism to 4 for Parquet files on HDFS 
with a file:// path and did not see any effect, i.e., it used 48 threads, which 
corresponds to the number of I/O devices. However, with what Dmitry said, I 
guess that this is expected behavior and the flag should only influence the 
degree of parallelism after exchanges (which I don't have in my queries).
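
To make sure I understand the rule correctly, here is a small Python model of 
it. This is only a sketch of my reading of Dmitry's explanation and of the 
p<0 / p=0 / p>0 cases in the manual, not code from AsterixDB; all function and 
parameter names are made up.

```python
def post_exchange_parallelism(p: int, num_io_devices: int, total_cores: int) -> int:
    """Parallelism after the first EXCHANGE for compiler.parallelism = p."""
    if p == 0:
        # Default: follow the storage parallelism (number of partitions).
        return num_io_devices
    if p < 0 or p > total_cores:
        # Negative or too large: use all cores in the cluster.
        return total_cores
    return p


def effective_parallelism(has_exchange: bool, p: int,
                          num_io_devices: int, total_cores: int) -> int:
    """Parallelism of the part of the plan that dominates the query."""
    if not has_exchange:
        # Without an EXCHANGE, the whole plan runs at scan parallelism,
        # which equals the number of configured I/O devices.
        return num_io_devices
    return post_exchange_parallelism(p, num_io_devices, total_cores)


if __name__ == "__main__":
    # My HDFS test: 48 I/O devices, compiler.parallelism = 4, no EXCHANGE.
    print(effective_parallelism(False, 4, 48, 96))
```

Under this model, my HDFS observation (48 threads despite 
compiler.parallelism=4) is consistent, whereas the fixed 16 cores for Parquet 
on S3 is not explained by it.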

Cheers,
Ingo


> -----Original Message-----
> From: Michael Carey <[email protected]>
> Sent: Monday, August 9, 2021 10:10 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Ingo,
> 
> Q: In your Parquet/S3 testing, what does your current cluster configuration
> look like?  (I.e., how many partitions have you configured it with - physical
> storage partitions that is?)  Even though your S3 data isn't stored inside
> AsterixDB in this case, the system still uses that info to decide how many
> parallel threads to use at the base of its query plans.  (Obviously there is
> room for improvement on that behavior for use cases involving external
> storage. :-))
> 
> 
> Cheers,
> 
> Mike
> 
> 
> On 8/9/21 12:28 PM, Müller Ingo wrote:
> 
> 
>       Hi Dmitry,
> 
>       Thanks a lot for checking! Indeed, my queries do not have an exchange.
>       However, the number of I/O devices has indeed worked well in many cases:
>       when I tried the various VM instance sizes, I always created as many I/O
>       devices as there were physical cores (i.e., half the number of logical
>       CPUs). For internal storage as well as HDFS (both using the hdfs:// and
>       the file:// protocol), I saw the full system being utilized. However,
>       just for the case of Parquet on S3, I cannot seem to make it use more
>       than 16 cores.
> 
>       Cheers,
>       Ingo
> 
> 
> 
>               -----Original Message-----
>               From: Dmitry Lychagin <[email protected]>
>               Sent: Monday, August 9, 2021 9:10 PM
>               To: [email protected]
>               Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
>               Hi Ingo,
> 
>               I checked the code and it seems that when scanning an external
>               data source we're using the same number of cores as there are
>               configured storage partitions (I/O devices).
>               Therefore, if you want 96 cores to be used when scanning Parquet
>               files then you need to configure 96 I/O devices.
>
>               The compiler.parallelism setting is supposed to affect how many
>               cores we use after the first EXCHANGE operator. However, if your
>               query doesn't have any EXCHANGEs then it'll use the number of
>               cores assigned for the initial data scan operator (number of I/O
>               devices).
> 
>               Thanks,
>               -- Dmitry
> 
> 
>               On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> 
> 
>                   Dear Dmitry,
> 
>                   Thanks a lot for the quick reply! I had not thought of this.
>                   However, I have tried out both ways just now (per query and
>                   in the cluster configuration) and did not see any changes.
>                   Is there any way I can check that the setting was applied
>                   successfully? I have also tried setting compiler.parallelism
>                   to 4 and still observed 16 cores being utilized.
>
>                   Note that the observed degree of parallelism does not
>                   correspond to anything related to the data set (I tried with
>                   every power-of-two number of files between 1 and 128) or the
>                   cluster (I tried with every power-of-two number of cores
>                   between 2 and 64, as well as 48 and 96) and I always see 16
>                   cores being used (or fewer, if the system has fewer). To me,
>                   this makes it unlikely that the system really uses the
>                   semantics for p=0 or p<0, but looks more like some
>                   hard-coded value.
> 
>                   Cheers,
>                   Ingo
> 
> 
>                   > -----Original Message-----
>                   > From: Dmitry Lychagin <[email protected]>
>                   > Sent: Monday, August 9, 2021 7:25 PM
>                   > To: [email protected]
>                   > Subject: Re: Increasing degree of parallelism when reading Parquet files
>                   >
>                   > Ingo,
>                   >
>                   >
>                   >
>                   > We have a `compiler.parallelism` parameter that controls how
>                   > many cores are used for query execution.
>                   >
>                   > See
>                   > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>                   >
>                   > You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
>                   >
>                   > or globally in the cluster configuration:
>                   >
>                   >
>                   > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
>                   >
>                   >
>                   >
>                   > Thanks,
>                   >
>                   > -- Dmitry
>                   >
>                   >
>                   >
>                   >
>                   >
>                   > From: Müller Ingo <[email protected]>
>                   > Reply-To: "[email protected]" <[email protected]>
>                   > Date: Monday, August 9, 2021 at 10:05 AM
>                   > To: "[email protected]" <[email protected]>
>                   > Subject: Increasing degree of parallelism when reading Parquet files
>                   >
>                   >
>                   >
>                   >
>                   >
>                   >
>                   >
>                   >
>                   > Dear AsterixDB devs,
>                   >
>                   >
>                   >
>                   > I am currently trying out the new support for Parquet files on S3 (still in the
>                   > context of my High-energy Physics use case [1]). This works great so far and has
>                   > generally decent performance. However, I realized that it does not use more
>                   > than 16 cores, even though 96 logical cores are available and even though I run
>                   > long-running queries (several minutes) on large data sets with a large number of
>                   > files (I tried 128 files of 17GB each). Is this an arbitrary/artificial limitation that
>                   > can be changed somehow (potentially with a small patch+recompiling) or is
>                   > there more serious development required to lift it? FYI, I am currently using
>                   > 03fd6d0f, which should include all S3/Parquet commits on master.
>                   >
>                   >
>                   >
>                   > Cheers,
>                   >
>                   > Ingo
>                   >
>                   >
>                   >
>                   >
>                   >
>                   > [1] https://arxiv.org/abs/2104.12615
>                   >
>                   >
> 
> 
>
