Hi Ingo,

I checked the code, and it seems that when scanning an external datasource we 
use the same number of cores as there are configured storage partitions (I/O 
devices). Therefore, if you want 96 cores to be used when scanning Parquet 
files, you need to configure 96 I/O devices.
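For reference, a minimal sketch of what an NC section with multiple I/O devices 
could look like in the cluster configuration (the NC name and paths below are 
placeholders, not taken from your setup):

    [nc/asterix_nc1]
    # one storage partition per listed device; to scan with 96 cores
    # you would list 96 paths here
    iodevices=/mnt/vol1,/mnt/vol2,/mnt/vol3,/mnt/vol4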

The compiler.parallelism setting is supposed to affect how many cores we use 
after the first EXCHANGE operator. However, if your query doesn't have any 
EXCHANGEs, it'll use the number of cores assigned to the initial data scan 
operator (the number of I/O devices).
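To illustrate (a sketch, using a hypothetical dataset name): a query that 
aggregates typically gets an EXCHANGE between the scan and the aggregation, so 
the part after the exchange honors compiler.parallelism, while the scan itself 
remains bound by the number of I/O devices:

    SET `compiler.parallelism` "32";

    SELECT t.category, COUNT(*) AS cnt
    FROM ParquetDataset AS t   -- hypothetical external dataset
    GROUP BY t.category;       -- the GROUP BY typically introduces an EXCHANGE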

Thanks,
-- Dmitry
 

On 8/9/21, 11:42 AM, "Müller Ingo" <ingo.muel...@inf.ethz.ch> wrote:

    Dear Dmitry,

    Thanks a lot for the quick reply! I had not thought of this. However, I 
    have tried out both ways just now (per query and in the cluster 
    configuration) and did not see any changes. Is there any way I can verify 
    that the setting was applied successfully? I have also tried setting 
    compiler.parallelism to 4 and still observed 16 cores being utilized.

    Note that the observed degree of parallelism does not correspond to 
    anything related to the data set (I tried with every power of two files 
    between 1 and 128) or the cluster (I tried with every power of two cores 
    between 2 and 64, as well as 48 and 96), and I always see 16 cores being 
    used (or fewer, if the system has fewer). To me, this makes it unlikely 
    that the system really uses the semantics for p=0 or p<0; it looks more 
    like some hard-coded value.

    Cheers,
    Ingo


    > -----Original Message-----
    > From: Dmitry Lychagin <dmitry.lycha...@couchbase.com>
    > Sent: Monday, August 9, 2021 7:25 PM
    > To: users@asterixdb.apache.org
    > Subject: Re: Increasing degree of parallelism when reading Parquet files
    >
    > Ingo,
    >
    > We have `compiler.parallelism` parameter that controls how many cores are
    > used for query execution.
    >
    > See
    > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
    >
    > You can either set it per query (e.g., SET `compiler.parallelism` "-1";),
    >
    > or globally in the cluster configuration:
    >
    > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
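    > For example (a sketch; for the exact section placement see the linked
    > cc2.conf):
    >
    >     [common]
    >     compiler.parallelism=-1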
    >
    > Thanks,
    >
    > -- Dmitry
    >
    > From: Müller Ingo <ingo.muel...@inf.ethz.ch>
    > Reply-To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
    > Date: Monday, August 9, 2021 at 10:05 AM
    > To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
    > Subject: Increasing degree of parallelism when reading Parquet files
    >
    > Dear AsterixDB devs,
    >
    > I am currently trying out the new support for Parquet files on S3 (still
    > in the context of my High-energy Physics use case [1]). This works great
    > so far and has generally decent performance. However, I realized that it
    > does not use more than 16 cores, even though 96 logical cores are
    > available and even though I run long-running queries (several minutes) on
    > large data sets with a large number of files (I tried 128 files of 17 GB
    > each). Is this an arbitrary/artificial limitation that can be changed
    > somehow (potentially with a small patch + recompiling), or is there more
    > serious development required to lift it? FYI, I am currently using
    > 03fd6d0f, which should include all S3/Parquet commits on master.
    >
    > Cheers,
    >
    > Ingo
    >
    > [1] https://arxiv.org/abs/2104.12615
    >
    >

