Hey Mike! Thanks for confirming! I am happy to test any fixes that you may come up with. If the fix happens to be simple and lands before Friday, I can still include it in the revision I am currently working on ;) Otherwise, it'd be great to have a Jira issue or similar (maybe this mailing-list thread is enough?) that I can refer to.
Cheers,
Ingo

> -----Original Message-----
> From: Michael Carey <[email protected]>
> Sent: Tuesday, August 10, 2021 4:36 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Got it! It sounds like we indeed have a parallelism performance bug in the
> area of threading for S3, then. Weird! We'll look into it...
>
> Cheers,
> Mike
>
> On 8/9/21 11:21 PM, Müller Ingo wrote:
> > Hey Mike,
> >
> > Just to clarify: "partitions" is the same thing as I/O devices, right? I
> > have configured 48 of those via "[nc]\niodevices=..." and see the
> > corresponding folders with content show up on the file system. When I
> > vary the number of these devices, I see that all other storage formats
> > change the degree of parallelism of my queries. That mechanism thus
> > seems to work in general. It just doesn't seem to work for Parquet on
> > S3. (I am not 100% sure if I tried other file formats on S3.)
> >
> > I have also tried to set compiler.parallelism to 4 for Parquet files on
> > HDFS with a file:// path and did not see any effect, i.e., it used 48
> > threads, which corresponds to the number of I/O devices. However, given
> > what Dmitry said, I guess that this is expected behavior and the flag
> > should only influence the degree of parallelism after exchanges (which I
> > don't have in my queries).
> >
> > Cheers,
> > Ingo
> >
> > > -----Original Message-----
> > > From: Michael Carey <[email protected]>
> > > Sent: Monday, August 9, 2021 10:10 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > files
> > >
> > > Ingo,
> > >
> > > Q: In your Parquet/S3 testing, what does your current cluster
> > > configuration look like? (I.e., how many partitions have you
> > > configured it with - physical storage partitions, that is?)
> > > Even though your S3 data isn't stored inside AsterixDB in this case,
> > > the system still uses that info to decide how many parallel threads to
> > > use at the base of its query plans. (Obviously there is room for
> > > improvement on that behavior for use cases involving external storage.
> > > :-))
> > >
> > > Cheers,
> > > Mike
> > >
> > > On 8/9/21 12:28 PM, Müller Ingo wrote:
> > > > Hi Dmitry,
> > > >
> > > > Thanks a lot for checking! Indeed, my queries do not have an
> > > > exchange. However, the number of I/O devices has indeed worked well
> > > > in many cases: when I tried the various VM instance sizes, I always
> > > > created as many I/O devices as there were physical cores (i.e., half
> > > > the number of logical CPUs). For internal storage as well as HDFS
> > > > (both using the hdfs:// and the file:// protocol), I saw the full
> > > > system being utilized. However, just for the case of Parquet on S3,
> > > > I cannot seem to make it use more than 16 cores.
> > > >
> > > > Cheers,
> > > > Ingo
> > > >
> > > > > -----Original Message-----
> > > > > From: Dmitry Lychagin <[email protected]>
> > > > > Sent: Monday, August 9, 2021 9:10 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Increasing degree of parallelism when reading Parquet
> > > > > files
> > > > >
> > > > > Hi Ingo,
> > > > >
> > > > > I checked the code, and it seems that when scanning an external
> > > > > datasource we're using the same number of cores as there are
> > > > > configured storage partitions (I/O devices). Therefore, if you
> > > > > want 96 cores to be used when scanning Parquet files, then you
> > > > > need to configure 96 I/O devices.
> > > > >
> > > > > The compiler.parallelism setting is supposed to affect how many
> > > > > cores we use after the first EXCHANGE operator.
> > > > > However, if your query doesn't have any EXCHANGEs, then it'll use
> > > > > the number of cores assigned to the initial data scan operator
> > > > > (the number of I/O devices).
> > > > >
> > > > > Thanks,
> > > > > -- Dmitry
> > > > >
> > > > > On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> > > > >
> > > > > > EXTERNAL EMAIL: Use caution when opening attachments or clicking
> > > > > > on links
> > > > > >
> > > > > > Dear Dmitry,
> > > > > >
> > > > > > Thanks a lot for the quick reply! I had not thought of this.
> > > > > > However, I have tried out both ways just now (per query and in
> > > > > > the cluster configuration) and did not see any changes. Is there
> > > > > > any way I can check that the setting was applied successfully? I
> > > > > > have also tried setting compiler.parallelism to 4 and still
> > > > > > observed 16 cores being utilized.
> > > > > >
> > > > > > Note that the observed degree of parallelism does not correspond
> > > > > > to anything related to the data set (I tried with every power of
> > > > > > two files between 1 and 128) or the cluster (I tried with every
> > > > > > power of two cores between 2 and 64, as well as 48 and 96), and
> > > > > > I always see 16 cores being used (or fewer, if the system has
> > > > > > fewer). To me, this makes it unlikely that the system really
> > > > > > uses the semantics for p=0 or p<0; it looks more like some
> > > > > > hard-coded value.
> > > > > >
> > > > > > Cheers,
> > > > > > Ingo
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Dmitry Lychagin <[email protected]>
> > > > > > > Sent: Monday, August 9, 2021 7:25 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Increasing degree of parallelism when reading
> > > > > > > Parquet files
> > > > > > >
> > > > > > > Ingo,
> > > > > > >
> > > > > > > We have a `compiler.parallelism` parameter that controls how
> > > > > > > many cores are used for query execution.
> > > > > > > See
> > > > > > > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > > > > > >
> > > > > > > You can either set it per query (e.g. SET
> > > > > > > `compiler.parallelism` "-1";), or globally in the cluster
> > > > > > > configuration:
> > > > > > > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > > > > > >
> > > > > > > Thanks,
> > > > > > > -- Dmitry
> > > > > > >
> > > > > > > From: Müller Ingo <[email protected]>
> > > > > > > Reply-To: "[email protected]" <[email protected]>
> > > > > > > Date: Monday, August 9, 2021 at 10:05 AM
> > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > Subject: Increasing degree of parallelism when reading Parquet
> > > > > > > files
> > > > > > >
> > > > > > > > Dear AsterixDB devs,
> > > > > > > >
> > > > > > > > I am currently trying out the new support for Parquet files
> > > > > > > > on S3 (still in the context of my High-energy Physics use
> > > > > > > > case [1]). This works great so far and has generally decent
> > > > > > > > performance.
> > > > > > > > However, I realized that it does not use more than 16
> > > > > > > > cores, even though 96 logical cores are available and even
> > > > > > > > though I run long-running queries (several minutes) on large
> > > > > > > > data sets with a large number of files (I tried 128 files of
> > > > > > > > 17 GB each). Is this an arbitrary/artificial limitation that
> > > > > > > > can be changed somehow (potentially with a small
> > > > > > > > patch+recompiling) or is there more serious development
> > > > > > > > required to lift it? FYI, I am currently using 03fd6d0f,
> > > > > > > > which should include all S3/Parquet commits on master.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Ingo
> > > > > > > >
> > > > > > > > [1] https://arxiv.org/abs/2104.12615
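For readers following the thread: the per-query form of the setting Dmitry describes is a plain SET statement at the top of the request. A minimal sketch, where the dataset name is a hypothetical placeholder:

```sql
-- Request a specific degree of parallelism for this query only.
-- (Per the SQL++ manual linked above, 0 and negative values have
-- special semantics; "96" asks for 96-way parallelism.)
SET `compiler.parallelism` "96";

SELECT COUNT(*) AS cnt
FROM ExternalParquetDataset;  -- hypothetical external dataset
```

Note that, per Dmitry's explanation, this only affects operators after the first EXCHANGE; a scan-only query ignores it.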
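Since the thread concludes that external-scan parallelism follows the number of configured I/O devices, the other knob is the iodevices list in the node-controller section of the cluster config that Ingo mentions ("[nc]\niodevices=..."). A sketch with illustrative paths:

```ini
; Node controller section of the cluster config (paths are illustrative).
; External-scan parallelism follows the number of entries listed here,
; so 96-way scans would need 96 comma-separated device paths.
[nc]
iodevices = /mnt/vol0/asterixdb,/mnt/vol1/asterixdb,/mnt/vol2/asterixdb
```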
