-----Original Message-----
From: Michael Carey <[email protected]>
Sent: Monday, August 9, 2021 10:10 PM
To: [email protected]
Subject: Re: Increasing degree of parallelism when reading Parquet files
Ingo,
Q: In your Parquet/S3 testing, what does your current cluster configuration look like? (I.e., how many partitions have you configured it with - physical storage partitions, that is?) Even though your S3 data isn't stored inside AsterixDB in this case, the system still uses that info to decide how many parallel threads to use at the base of its query plans. (Obviously there is room for improvement on that behavior for use cases involving external storage. :-))
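(If it helps to double-check the partition count: if I remember the endpoint correctly, the cluster state API should list each node and its partitions, e.g.

    curl http://<cc-host>:19002/admin/cluster

where <cc-host> is a placeholder for your cluster controller's hostname.)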
Cheers,
Mike
On 8/9/21 12:28 PM, Müller Ingo wrote:
Hi Dmitry,
Thanks a lot for checking! Indeed, my queries do not have an EXCHANGE.
However, the number of I/O devices has worked well in many cases: when I
tried the various VM instance sizes, I always created as many I/O devices as
there were physical cores (i.e., half the number of logical CPUs). For
internal storage as well as HDFS (using both the hdfs:// and the file://
protocols), I saw the full system being utilized. However, just in the case
of Parquet on S3, I cannot seem to make it use more than 16 cores.
Cheers,
Ingo
-----Original Message-----
From: Dmitry Lychagin <[email protected]>
<mailto:[email protected]>
Sent: Monday, August 9, 2021 9:10 PM
To: [email protected]
<mailto:[email protected]>
Subject: Re: Increasing degree of parallelism when reading Parquet files
Hi Ingo,
I checked the code, and it seems that when scanning an external datasource
we're using the same number of cores as there are configured storage
partitions (I/O devices).
Therefore, if you want 96 cores to be used when scanning Parquet files, then
you need to configure 96 I/O devices.
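For example, in each NC section of the cluster config (a sketch only; the
node name and paths are placeholders, and "..." stands for the remaining
entries - you'd list all 96 paths explicitly):

    [nc/asterix_nc1]
    iodevices=/mnt/data/io0,/mnt/data/io1,...,/mnt/data/io95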
The compiler.parallelism setting is supposed to affect how many cores we use
after the first EXCHANGE operator. However, if your query doesn't have any
EXCHANGEs, then it'll use the number of cores assigned to the initial data
scan operator (the number of I/O devices).
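A rough illustration (dataset and field names are made up):

    -- no EXCHANGE below the root: runs with one thread per I/O device,
    -- regardless of compiler.parallelism
    SELECT VALUE e FROM HepEvents e WHERE e.pt > 25.0;

    -- the GROUP BY introduces an EXCHANGE, so the aggregation above the
    -- scan can use the requested parallelism
    SET `compiler.parallelism` "96";
    SELECT e.runNumber, COUNT(*) AS cnt
    FROM HepEvents e
    GROUP BY e.runNumber;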
Thanks,
-- Dmitry
On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
Dear Dmitry,
Thanks a lot for the quick reply! I had not thought of this. However, I have
tried out both ways just now (per query and in the cluster configuration) and
did not see any changes. Is there any way I can verify that the setting was
applied successfully? I have also tried setting compiler.parallelism to 4 and
still observed 16 cores being utilized.
Note that the observed degree of parallelism does not correspond to anything
related to the data set (I tried with every power of two files between 1 and
128) or to the cluster (I tried with every power of two cores between 2 and
64, as well as 48 and 96), and I always see 16 cores being used (or fewer, if
the system has fewer). To me, this makes it unlikely that the system really
applies the semantics of p=0 or p<0; it looks more like some hard-coded
value.
Cheers,
Ingo
> -----Original Message-----
> From: Dmitry Lychagin <[email protected]>
> Sent: Monday, August 9, 2021 7:25 PM
> To: [email protected]
<mailto:[email protected]>
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> We have a `compiler.parallelism` parameter that controls how many cores
> are used for query execution.
>
> See
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>
> You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> or globally in the cluster configuration:
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
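>
> For example, in the [common] section of that file (a sketch; if I read the
> manual page above right, a negative value means "use all available cores"):
>
>     [common]
>     compiler.parallelism=-1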
>
> Thanks,
>
> -- Dmitry
>
> From: Müller Ingo <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, August 9, 2021 at 10:05 AM
> To: "[email protected]" <[email protected]>
> Subject: Increasing degree of parallelism when reading Parquet files
>
> Dear AsterixDB devs,
>
> I am currently trying out the new support for Parquet files on S3 (still in
> the context of my High-energy Physics use case [1]). This works great so
> far and has generally decent performance. However, I realized that it does
> not use more than 16 cores, even though 96 logical cores are available and
> even though I run long-running queries (several minutes) on large data sets
> with a large number of files (I tried 128 files of 17 GB each). Is this an
> arbitrary/artificial limitation that can be changed somehow (potentially
> with a small patch + recompiling), or is there more serious development
> required to lift it? FYI, I am currently using 03fd6d0f, which should
> include all S3/Parquet commits on master.
>
> Cheers,
>
> Ingo
>
> [1] https://arxiv.org/abs/2104.12615