Dear all, I have just tried out Wail's patch set from here: https://issues.apache.org/jira/browse/ASTERIXDB-2945. It seems to solve my problem fully: in the 96-vCPU instance with 48 I/O devices, I see about 4800% CPU utilization during query execution, and the queries run only marginally longer than if run against local files. Thanks a lot for the quick fix!
Should I use this version for a full benchmark run or wait until the patch makes it to master?

Cheers,
Ingo

> -----Original Message-----
> From: Wail Alkowaileet <[email protected]>
> Sent: Tuesday, August 10, 2021 6:10 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Thanks Ingo for the detailed explanation and for benchmarking it! It is a
> great input for us. We will look at the issue and hopefully we can get it
> fixed before the end of the week.
>
> On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <[email protected]> wrote:
>
> > Let me also say that I can still rerun the experiments for the (hopefully
> > subsequent) camera-ready version if the problem takes longer to fix.
> >
> > Cheers,
> > Ingo
> >
> > -----Original Message-----
> > From: Müller Ingo <[email protected]>
> > Sent: Tuesday, August 10, 2021 5:34 PM
> > To: [email protected]
> > Subject: RE: Increasing degree of parallelism when reading Parquet files
> >
> > Hey Mike!
> >
> > Thanks for confirming! I am happy to test any fixes that you may come up
> > with. If it happens to be simple and is fixed before Friday, I can still
> > include it in the revision I am currently working on ;) Otherwise, it'd be
> > great to be able to mention a Jira issue or similar (maybe this mailing
> > list thread is enough?) that I can refer to.
> >
> > Cheers,
> > Ingo
> >
> > > -----Original Message-----
> > > From: Michael Carey <[email protected]>
> > > Sent: Tuesday, August 10, 2021 4:36 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > >
> > > Ingo,
> > >
> > > Got it! It sounds like we indeed have a parallelism performance bug
> > > in the area of threading for S3, then. Weird! We'll look into it...
> > > Cheers,
> > > Mike
> > >
> > > On 8/9/21 11:21 PM, Müller Ingo wrote:
> > >
> > > Hey Mike,
> > >
> > > Just to clarify: "partitions" is the same thing as I/O devices,
> > > right? I have configured 48 of those via "[nc]\niodevices=..." and see
> > > the corresponding folders with content show up on the file system.
> > > When I vary the number of these devices, I see that all other storage
> > > formats change the degree of parallelism of my queries. That
> > > mechanism thus seems to work in general. It just doesn't seem to work
> > > for Parquet on S3. (I am not 100% sure if I tried other file formats
> > > on S3.)
> > >
> > > I have also tried to set compiler.parallelism to 4 for Parquet files
> > > on HDFS with a file:// path and did not see any effect, i.e., it used
> > > 48 threads, which corresponds to the number of I/O devices. However,
> > > with what Dmitry said, I guess that this is expected behavior and the
> > > flag should only influence the degree of parallelism after exchanges
> > > (which I don't have in my queries).
> > >
> > > Cheers,
> > > Ingo
> > >
> > > -----Original Message-----
> > > From: Michael Carey <[email protected]>
> > > Sent: Monday, August 9, 2021 10:10 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > >
> > > Ingo,
> > >
> > > Q: In your Parquet/S3 testing, what does your current cluster
> > > configuration look like? (I.e., how many partitions have you
> > > configured it with - physical storage partitions, that is?)
> > > Even though your S3 data isn't stored inside AsterixDB in this
> > > case, the system still uses that info to decide how many parallel
> > > threads to use at the base of its query plans. (Obviously there is
> > > room for improvement on that behavior for use cases involving
> > > external storage. :-))
> > >
> > > Cheers,
> > > Mike
> > >
> > > On 8/9/21 12:28 PM, Müller Ingo wrote:
> > >
> > > Hi Dmitry,
> > >
> > > Thanks a lot for checking! Indeed, my queries do not have an
> > > exchange. However, the number of I/O devices has indeed worked well
> > > in many cases: when I tried the various VM instance sizes, I always
> > > created as many I/O devices as there were physical cores (i.e., half
> > > the number of logical CPUs). For internal storage as well as HDFS
> > > (both using the hdfs:// and the file:// protocol), I saw the full
> > > system being utilized. However, just for the case of Parquet on S3,
> > > I cannot seem to make it use more than 16 cores.
> > > Cheers,
> > > Ingo
> > >
> > > -----Original Message-----
> > > From: Dmitry Lychagin <[email protected]>
> > > Sent: Monday, August 9, 2021 9:10 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > >
> > > Hi Ingo,
> > >
> > > I checked the code and it seems that when scanning an external
> > > datasource we're using the same number of cores as there are
> > > configured storage partitions (I/O devices). Therefore, if you want
> > > 96 cores to be used when scanning Parquet files then you need to
> > > configure 96 I/O devices.
> > >
> > > The compiler.parallelism setting is supposed to affect how many
> > > cores we use after the first EXCHANGE operator. However, if your
> > > query doesn't have any EXCHANGEs then it'll use the number of cores
> > > assigned to the initial data scan operator (the number of I/O
> > > devices).
> > >
> > > Thanks,
> > > -- Dmitry
> > >
> > > On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> > >
> > > EXTERNAL EMAIL: Use caution when opening attachments or clicking on links
> > >
> > > Dear Dmitry,
> > >
> > > Thanks a lot for the quick reply! I had not thought of this.
> > > However, I have tried out both ways just now (per query and in the
> > > cluster configuration) and did not see any changes. Is there any way
> > > I can check that the setting was applied successfully? I have also
> > > tried setting compiler.parallelism to 4 and still observed 16 cores
> > > being utilized.
> > >
> > > Note that the observed degree of parallelism does not correspond to
> > > anything related to the data set (I tried with every power of two
> > > files between 1 and 128) or the cluster (I tried with every power of
> > > two cores between 2 and 64, as well as 48 and 96), and I always see
> > > 16 cores being used (or fewer, if the system has fewer). To me, this
> > > makes it unlikely that the system really uses the semantics for p=0
> > > or p<0; it looks more like some hard-coded value.
> > >
> > > Cheers,
> > > Ingo
> > >
> > > -----Original Message-----
> > > From: Dmitry Lychagin <[email protected]>
> > > Sent: Monday, August 9, 2021 7:25 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > >
> > > Ingo,
> > >
> > > We have a `compiler.parallelism` parameter that controls how many
> > > cores are used for query execution.
> > > See
> > > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > >
> > > You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> > > or globally in the cluster configuration:
> > > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > >
> > > Thanks,
> > > -- Dmitry
> > >
> > > From: Müller Ingo <[email protected]>
> > > Reply-To: "[email protected]"
> > > Date: Monday, August 9, 2021 at 10:05 AM
> > > To: "[email protected]"
> > > Subject: Increasing degree of parallelism when reading Parquet files
> > >
> > > Dear AsterixDB devs,
> > >
> > > I am currently trying out the new support for Parquet files on S3
> > > (still in the context of my High-Energy Physics use case [1]). This
> > > works great so far and generally has decent performance. However, I
> > > realized that it does not use more than 16 cores, even though 96
> > > logical cores are available and even though I run long-running
> > > queries (several minutes) on large data sets with a large number of
> > > files (I tried 128 files of 17 GB each). Is this an
> > > arbitrary/artificial limitation that can be changed somehow
> > > (potentially with a small patch+recompiling), or is there more
> > > serious development required to lift it? FYI, I am currently using
> > > 03fd6d0f, which should include all S3/Parquet commits on master.
> > >
> > > Cheers,
> > > Ingo
> > >
> > > [1] https://arxiv.org/abs/2104.12615
>
> --
> Regards,
> Wail Alkowaileet
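[Archive note] The two knobs discussed in this thread can be sketched in an AsterixDB cluster configuration file roughly as follows. This is a minimal illustration, not a verbatim config from the thread: the device paths are placeholders, and placing compiler.parallelism under [common] is an assumption based on the cc2.conf example Dmitry linked. Per the thread, external scans (e.g. Parquet on S3) use one thread per configured I/O device, while compiler.parallelism only governs parallelism after the first EXCHANGE operator.

```ini
; NC section: one iodevice per physical core, since external scans
; run with as many threads as there are configured I/O devices
[nc]
iodevices = /data/iodevice0,/data/iodevice1,/data/iodevice2,/data/iodevice3

; Assumed section placement: affects cores used after the first EXCHANGE;
; "-1" is the value Dmitry's per-query example uses above
[common]
compiler.parallelism = -1
```

Alternatively, the same setting can be applied per query by prefixing it with SET `compiler.parallelism` "-1"; as shown in Dmitry's message.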
