Hi Dmitry,

Thanks a lot for checking! Indeed, my queries do not have an EXCHANGE. However, 
matching the number of I/O devices to the core count has worked well in many 
cases: for the various VM instance sizes I tried, I always created as many I/O 
devices as there were physical cores (i.e., half the number of logical CPUs). 
For internal storage as well as HDFS (using both the hdfs:// and the file:// 
protocols), I saw the full system being utilized. However, just in the case of 
Parquet on S3, I cannot seem to make it use more than 16 cores.
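
For reference, this is roughly what the I/O device part of my NC configuration 
looks like (node name and paths are placeholders, and only four entries are 
shown for brevity; on a 96-logical-core VM the list has 48 entries):

    [nc/asterix_nc1]
    ; one I/O device per physical core
    iodevices=/mnt/iodevice0,/mnt/iodevice1,/mnt/iodevice2,/mnt/iodevice3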

Cheers,
Ingo


> -----Original Message-----
> From: Dmitry Lychagin <[email protected]>
> Sent: Monday, August 9, 2021 9:10 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Hi Ingo,
> 
> I checked the code, and it seems that when scanning an external datasource
> we're using the same number of cores as there are configured storage
> partitions (I/O devices). Therefore, if you want 96 cores to be used when
> scanning Parquet files, then you need to configure 96 I/O devices.
> 
> The compiler.parallelism setting is supposed to affect how many cores we use
> after the first EXCHANGE operator. However, if your query doesn't have any
> EXCHANGEs, then it'll use the number of cores assigned to the initial data
> scan operator (the number of I/O devices).
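> 
> For illustration, a query shaped like the following (the dataset and field
> names are just placeholders) sets the parameter per request; the GROUP BY
> typically introduces an EXCHANGE, so everything downstream of the scan can
> run at the requested degree of parallelism:
> 
>     SET `compiler.parallelism` "96";
> 
>     SELECT r.category, COUNT(*) AS cnt
>     FROM MyParquetDataset r
>     GROUP BY r.category;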
> 
> Thanks,
> -- Dmitry
> 
> 
> On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> 
>     Dear Dmitry,
> 
>     Thanks a lot for the quick reply! I had not thought of this. However, I
> have now tried out both ways (per query and in the cluster configuration) and
> did not see any changes. Is there any way I can verify that the setting was
> applied successfully? I have also tried setting compiler.parallelism to 4 and
> still observed 16 cores being utilized.
> 
>     Note that the observed degree of parallelism does not correspond to
> anything related to the data set (I tried with every power of two files
> between 1 and 128) or the cluster (I tried with every power of two cores
> between 2 and 64, as well as 48 and 96), and I always see 16 cores being
> used (or fewer, if the system has fewer). To me, this makes it unlikely that
> the system really uses the semantics for p=0 or p<0; it looks more like some
> hard-coded value.
> 
>     Cheers,
>     Ingo
> 
> 
>     > -----Original Message-----
>     > From: Dmitry Lychagin <[email protected]>
>     > Sent: Monday, August 9, 2021 7:25 PM
>     > To: [email protected]
>     > Subject: Re: Increasing degree of parallelism when reading Parquet files
>     >
>     > Ingo,
>     >
>     > We have a `compiler.parallelism` parameter that controls how many
>     > cores are used for query execution.
>     >
>     > See
>     > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>     >
>     > You can either set it per query (e.g., SET `compiler.parallelism` "-1";)
>     > or globally in the cluster configuration:
>     >
>     > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
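>     >
>     > As a sketch (assuming the [common] section, where the compiler.*
>     > options are usually set; the -1 value follows the semantics described
>     > in the manual linked above), the cluster-level equivalent would be:
>     >
>     >     [common]
>     >     compiler.parallelism = -1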
>     >
>     > Thanks,
>     >
>     > -- Dmitry
>     >
>     > From: Müller Ingo <[email protected]>
>     > Reply-To: "[email protected]" <[email protected]>
>     > Date: Monday, August 9, 2021 at 10:05 AM
>     > To: "[email protected]" <[email protected]>
>     > Subject: Increasing degree of parallelism when reading Parquet files
>     >
>     > Dear AsterixDB devs,
>     >
>     > I am currently trying out the new support for Parquet files on S3
>     > (still in the context of my High-energy Physics use case [1]). This
>     > works great so far and generally has decent performance. However, I
>     > realized that it does not use more than 16 cores, even though 96
>     > logical cores are available and even though I run long-running queries
>     > (several minutes) on large data sets with a large number of files (I
>     > tried 128 files of 17 GB each). Is this an arbitrary/artificial
>     > limitation that can be changed somehow (potentially with a small
>     > patch and recompiling), or is there more serious development required
>     > to lift it? FYI, I am currently using 03fd6d0f, which should include
>     > all S3/Parquet commits on master.
>     >
>     > Cheers,
>     >
>     > Ingo
>     >
>     > [1] https://arxiv.org/abs/2104.12615
>     >
> 
