Ingo,

Q: In your Parquet/S3 testing, what does your current cluster configuration look like?  (I.e., how many partitions have you configured it with - physical storage partitions that is?)  Even though your S3 data isn't stored inside AsterixDB in this case, the system still uses that info to decide how many parallel threads to use at the base of its query plans.  (Obviously there is room for improvement on that behavior for use cases involving external storage. :-))

Cheers,

Mike

On 8/9/21 12:28 PM, Müller Ingo wrote:
Hi Dmitry,

Thanks a lot for checking! Indeed, my queries do not have an exchange. Still,
the number of I/O devices has worked well in many cases: when I tried the
various VM instance sizes, I always created as many I/O devices as there were
physical cores (i.e., half the number of logical CPUs). For internal storage
as well as HDFS (using both the hdfs:// and the file:// protocols), I saw the
full system being utilized. However, just for the case of Parquet on S3, I
cannot seem to make it use more than 16 cores.

Cheers,
Ingo


-----Original Message-----
From: Dmitry Lychagin <[email protected]>
Sent: Monday, August 9, 2021 9:10 PM
To: [email protected]
Subject: Re: Increasing degree of parallelism when reading Parquet files

Hi Ingo,

I checked the code, and it seems that when scanning an external datasource we
use the same number of cores as there are configured storage partitions (I/O
devices).
Therefore, if you want 96 cores to be used when scanning Parquet files, then
you need to configure 96 I/O devices.
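
For example (the node name and paths below are just placeholders, not taken
from your setup), the NC section of the cluster configuration lists one
comma-separated path per partition, so you would put 96 paths there:

    [nc/asterix_nc1]
    iodevices=/mnt/io0,/mnt/io1,/mnt/io2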

The compiler.parallelism setting is supposed to affect how many cores we use
after the first EXCHANGE operator. However, if your query doesn't have any
EXCHANGEs, then it'll use the number of cores assigned to the initial data
scan operator (the number of I/O devices).
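
For example, something like the following (the dataset name is just a
placeholder) would let the GROUP BY stage use 96 cores, while the base scan
would still be capped by the number of I/O devices:

    SET `compiler.parallelism` "96";
    SELECT e.category, COUNT(*) AS cnt
    FROM ParquetDataset e
    GROUP BY e.category;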

Thanks,
-- Dmitry


On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:

     Dear Dmitry,

     Thanks a lot for the quick reply! I had not thought of this. However, I
     have tried out both ways just now (per query and in the cluster
     configuration) and did not see any changes. Is there any way I can verify
     that the setting was applied successfully? I have also tried setting
     compiler.parallelism to 4 and still observed 16 cores being utilized.

     Note that the observed degree of parallelism does not correspond to
     anything related to the data set (I tried with every power of two files
     between 1 and 128) or the cluster (I tried with every power of two cores
     between 2 and 64, as well as 48 and 96), and I always see 16 cores being
     used (or fewer, if the system has fewer). To me, this makes it unlikely
     that the system really uses the semantics for p=0 or p<0; it looks more
     like some hard-coded value.

     Cheers,
     Ingo


     > -----Original Message-----
     > From: Dmitry Lychagin <[email protected]>
     > Sent: Monday, August 9, 2021 7:25 PM
     > To: [email protected]
     > Subject: Re: Increasing degree of parallelism when reading Parquet files
     >
     > Ingo,
     >
     >
     >
     > We have `compiler.parallelism` parameter that controls how many cores are
     > used for query execution.
     >
     > See
     >
     > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
     >
     > You can either set it per query (e.g., SET `compiler.parallelism` "-1";),
     >
     > or globally in the cluster configuration:
     >
     > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
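     >
     > A quick per-query example (the dataset name is just a placeholder):
     >
     >     SET `compiler.parallelism` "-1";
     >     SELECT COUNT(*) FROM MyDataset;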
     >
     >
     >
     > Thanks,
     >
     > -- Dmitry
     >
     >
     >
     >
     >
     > From: Müller Ingo <[email protected]>
     > Reply-To: "[email protected]" <[email protected]>
     > Date: Monday, August 9, 2021 at 10:05 AM
     > To: "[email protected]" <[email protected]>
     > Subject: Increasing degree of parallelism when reading Parquet files
     >
     >
     >
     >
     > Dear AsterixDB devs,
     >
     >
     >
     > I am currently trying out the new support for Parquet files on S3
     > (still in the context of my High-energy Physics use case [1]). This
     > works great so far and has generally decent performance. However, I
     > realized that it does not use more than 16 cores, even though 96
     > logical cores are available and even though I run long-running queries
     > (several minutes) on large data sets with a large number of files (I
     > tried 128 files of 17 GB each). Is this an arbitrary/artificial
     > limitation that can be changed somehow (potentially with a small
     > patch+recompiling), or is there more serious development required to
     > lift it? FYI, I am currently using 03fd6d0f, which should include all
     > S3/Parquet commits on master.
     >
     >
     >
     > Cheers,
     >
     > Ingo
     >
     >
     >
     >
     >
     > [1] https://arxiv.org/abs/2104.12615
     >
     >
