Hi Ingo,

I checked the code, and it seems that when scanning an external datasource we use the same number of cores as there are configured storage partitions (I/O devices). Therefore, if you want 96 cores to be used when scanning Parquet files, then you need to configure 96 I/O devices.
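[Editor's note: based on the explanation above, scan parallelism follows the number of configured I/O devices per node. A minimal sketch of what that might look like in the cluster configuration file, assuming one node section; the section name and device paths are purely illustrative:]

```ini
; Hypothetical NC section in the cluster config (e.g. cc.conf).
; Section name and paths are illustrative, not from the thread.
[nc/asterix_nc1]
; One comma-separated path per desired storage partition;
; listing 96 paths here would allow 96-way parallel scans.
iodevices=/mnt/iodev0,/mnt/iodev1,/mnt/iodev2
```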
The compiler.parallelism setting is supposed to affect how many cores we use after the first EXCHANGE operator. However, if your query doesn't have any EXCHANGEs, then it'll use the number of cores assigned to the initial data scan operator (the number of I/O devices).

Thanks,
-- Dmitry

On 8/9/21, 11:42 AM, "Müller Ingo" <ingo.muel...@inf.ethz.ch> wrote:

Dear Dmitry,

Thanks a lot for the quick reply! I had not thought of this. However, I have tried out both ways just now (per query and in the cluster configuration) and did not see any changes. Is there any way I can verify that the setting was applied successfully? I have also tried setting compiler.parallelism to 4 and still observed 16 cores being utilized.

Note that the observed degree of parallelism does not correspond to anything related to the data set (I tried with every power of two files between 1 and 128) or to the cluster (I tried with every power of two cores between 2 and 64, as well as 48 and 96): I always see 16 cores being used (or fewer, if the system has fewer). To me, this makes it unlikely that the system really uses the semantics for p=0 or p<0; it looks more like some hard-coded value.

Cheers,
Ingo

> -----Original Message-----
> From: Dmitry Lychagin <dmitry.lycha...@couchbase.com>
> Sent: Monday, August 9, 2021 7:25 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> We have a `compiler.parallelism` parameter that controls how many cores are
> used for query execution.
>
> See
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>
> You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> or globally in the cluster configuration:
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
>
> Thanks,
> -- Dmitry
>
> From: Müller Ingo <ingo.muel...@inf.ethz.ch>
> Reply-To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
> Date: Monday, August 9, 2021 at 10:05 AM
> To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
> Subject: Increasing degree of parallelism when reading Parquet files
>
> Dear AsterixDB devs,
>
> I am currently trying out the new support for Parquet files on S3 (still in the
> context of my High-energy Physics use case [1]). This works great so far and has
> generally decent performance. However, I realized that it does not use more
> than 16 cores, even though 96 logical cores are available and even though I run
> long-running queries (several minutes) on large data sets with a large number of
> files (I tried 128 files of 17 GB each). Is this an arbitrary/artificial limitation that
> can be changed somehow (potentially with a small patch + recompiling), or is
> there more serious development required to lift it? FYI, I am currently using
> 03fd6d0f, which should include all S3/Parquet commits on master.
>
> Cheers,
> Ingo
>
> [1] https://arxiv.org/abs/2104.12615
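[Editor's note: for readers of the archive, the per-query form mentioned in the quoted reply is a SET statement prefixed to the query. A short sketch; the dataset name is made up, and the semantics of negative/zero values are described in the Parallelism parameter section of the SQL++ manual linked above:]

```sql
-- Hypothetical dataset name; "-1" is the example value from the reply above.
SET `compiler.parallelism` "-1";

SELECT COUNT(*) AS cnt
FROM ExternalParquetDataset;
```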