Hey Wail,

Thanks a lot for helping! I am reading from AWS S3 on an EC2 m5d.24xlarge
instance. I am pretty sure that S3 is not the problem:

First, others have measured [1] up to 2.7GB/s from S3. When I measure the
network bandwidth of AsterixDB, I see on the order of 600MB/s. (The
60-80MB/s you mention is typical per connection, but one application can
get a much higher bandwidth using multiple connections.) Indeed, with the
other systems in the comparison, I can read from S3 *much* faster than with
AsterixDB; my current understanding is that AsterixDB is compute-bound.

Also, when I scale up the instance size, I get almost perfectly linear
speed-up until exactly the point where I use 16 cores (and no speed-up
after that). This is unlikely to happen if the network is getting
saturated, where you'd see some slow-down before the saturation point.

Finally, I see 16 threads using 100% of one core each and all other threads
being completely idle. This doesn't look like 48 threads waiting for the
network either.
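
As a rough sanity check of the figures above, here is a back-of-envelope
sketch in Python. It only uses the numbers quoted in this message (600MB/s
observed, 16 busy cores, up to 2.7GB/s reported in [1], 48 physical cores).
The function names and the simple min(compute, network) model are
illustrative assumptions, not something measured or taken from AsterixDB;
in particular, it assumes every busy core contributes an equal, constant
share of the bandwidth.

```python
# Back-of-envelope check using the figures quoted in this message.
# Assumption (not measured): each busy core contributes an equal share of
# the observed bandwidth, and that share stays constant as cores are added.

OBSERVED_MB_S = 600.0        # network bandwidth observed for AsterixDB
BUSY_CORES = 16              # threads seen at 100% CPU
S3_REFERENCE_MB_S = 2700.0   # up to 2.7 GB/s reported by s3-benchmark [1]


def per_core_rate(observed_mb_s: float, busy_cores: int) -> float:
    """Average bandwidth consumed per busy core, in MB/s."""
    return observed_mb_s / busy_cores


def projected_throughput(cores: int, rate_mb_s: float, network_cap_mb_s: float) -> float:
    """Throughput under a simple min(compute, network) model, in MB/s."""
    return min(cores * rate_mb_s, network_cap_mb_s)


rate = per_core_rate(OBSERVED_MB_S, BUSY_CORES)
for cores in (8, 16, 32, 48):
    mb_s = projected_throughput(cores, rate, S3_REFERENCE_MB_S)
    print(f"{cores:2d} cores -> {mb_s:7.1f} MB/s")
```

Under these assumptions, even all 48 physical cores would stay below the
2.7GB/s reference figure, which is consistent with the scan being limited
by the number of busy cores rather than by S3.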

I have a few other things to try, then I'll report back.

Cheers,
Ingo

[1] https://github.com/dvassallo/s3-benchmark

> -----Original Message-----
> From: Wail Alkowaileet <[email protected]>
> Sent: Monday, August 9, 2021 10:04 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Hi Ingo,
>
> Were you reading from an actual S3 bucket? or was it a local S3 mock
> server? The reason I ask is because reading from a remote bucket is slow
> (the fastest I have seen was ~60MB/s). If your HDFS server(s) are backed
> by NVMe drives, then the read speed could be in GBs/s. For the remote S3
> bucket case, other cores would be idle as they will be waiting for the
> data to arrive.
>
>
> On Mon, Aug 9, 2021 at 12:28 PM Müller Ingo <[email protected]> wrote:
>
> Hi Dmitry,
>
> Thanks a lot for checking! Indeed, my queries do not have an exchange.
> However, the number of I/O devices has indeed worked well in many cases:
> when I tried the various VM instance sizes, I always created as many I/O
> devices as there were physical cores (i.e., half the number of logical
> CPUs). For internal storage as well as HDFS (both using the hdfs:// and
> the file:// protocol), I saw the full system being utilized. However,
> just for the case of Parquet on S3, I cannot seem to make it use more
> than 16 cores.
>
> Cheers,
> Ingo
>
> > -----Original Message-----
> > From: Dmitry Lychagin <[email protected]>
> > Sent: Monday, August 9, 2021 9:10 PM
> > To: [email protected]
> > Subject: Re: Increasing degree of parallelism when reading Parquet files
> >
> > Hi Ingo,
> >
> > I checked the code and it seems that when scanning external datasource
> > we're using the same number of cores as there are configured storage
> > partitions (I/O devices).
> > Therefore, if you want 96 cores to be used when scanning Parquet files
> > then you need to configure 96 I/O devices.
> >
> > Compiler.parallelism setting is supposed to affect how many cores we
> > use after the first EXCHANGE operator. However, if your query doesn't
> > have any EXCHANGEs then it'll use the number of cores assigned for the
> > initial data scan operator (number of I/O devices).
> >
> > Thanks,
> > -- Dmitry
> >
> >
> > On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> >
> > EXTERNAL EMAIL: Use caution when opening attachments or clicking on
> > links
> >
> > Dear Dmitry,
> >
> > Thanks a lot for the quick reply! I had not thought of this. However, I
> > have tried out both ways just now (per query and in the cluster
> > configuration) and did not see any changes. Is there any way I can
> > check that the setting was applied successfully? I have also tried
> > setting compiler.parallelism to 4 and still observed 16 cores being
> > utilized.
> >
> > Note that the observed degree of parallelism does not correspond to
> > anything related to the data set (I tried with every power of two files
> > between 1 and 128) or the cluster (I tried with every power of two
> > cores between 2 and 64, as well as 48 and 96) and I always see 16 cores
> > being used (or fewer, if the system has fewer). To me, this makes it
> > unlikely that the system really uses the semantics for p=0 or p<0, but
> > looks more like some hard-coded value.
> >
> > Cheers,
> > Ingo
> >
> >
> > > -----Original Message-----
> > > From: Dmitry Lychagin <[email protected]>
> > > Sent: Monday, August 9, 2021 7:25 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > files
> > >
> > > Ingo,
> > >
> > > We have `compiler.parallelism` parameter that controls how many cores
> > > are used for query execution.
> > >
> > > See
> > > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > >
> > > You can either set it per query (e.g. SET `compiler.parallelism`
> > > "-1";), or globally in the cluster configuration:
> > >
> > > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > >
> > > Thanks,
> > > -- Dmitry
> > >
> > >
> > > From: Müller Ingo <[email protected]>
> > > Reply-To: "[email protected]" <[email protected]>
> > > Date: Monday, August 9, 2021 at 10:05 AM
> > > To: "[email protected]" <[email protected]>
> > > Subject: Increasing degree of parallelism when reading Parquet files
> > >
> > > Dear AsterixDB devs,
> > >
> > > I am currently trying out the new support for Parquet files on S3
> > > (still in the context of my High-energy Physics use case [1]). This
> > > works great so far and has generally decent performance.
> > > However, I realized that it does not use more than 16 cores, even
> > > though 96 logical cores are available and even though I run
> > > long-running queries (several minutes) on large data sets with a
> > > large number of files (I tried 128 files of 17GB each). Is this an
> > > arbitrary/artificial limitation that can be changed somehow
> > > (potentially with a small patch+recompiling) or is there more serious
> > > development required to lift it? FYI, I am currently using 03fd6d0f,
> > > which should include all S3/Parquet commits on master.
> > >
> > > Cheers,
> > > Ingo
> > >
> > > [1] https://arxiv.org/abs/2104.12615
>
> --
>
> Regards,
> Wail Alkowaileet
