Dear all,

Thanks a lot for the help over the last weeks. We have just published the updated version of our study on query languages and systems in the context of high-energy physics (HEP) here: https://arxiv.org/abs/2104.12615. In that version, AsterixDB and SQL++ are part of the study. In short, we concluded that, like JSONiq, SQL++ is a perfect fit for HEP. This isn't completely surprising, since the nested (but fully structured) data model of that domain is a subset of what both languages were originally designed for. In terms of performance, AsterixDB fares significantly better than our implementation of JSONiq (RumbleDB), but both are still too slow to be useful in practice: between one and two orders of magnitude slower than the tools physicists use today.
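To give a flavor of why the fit is so natural: typical HEP cuts are predicates over nested collections of particles, which map directly onto SQL++'s quantified expressions. The following is only an illustrative sketch; the names Events, Muons, and pt are made-up placeholders, not the benchmark's actual schema:

    -- Hypothetical sketch: count events containing at least one muon
    -- with transverse momentum above 25 GeV. "Events", "Muons", and
    -- "pt" are placeholder names, not the benchmark's schema.
    SELECT VALUE COUNT(*)
    FROM Events e
    WHERE SOME m IN e.Muons SATISFIES m.pt > 25;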
We have made the complete set of scripts, query implementations, etc. public here: https://github.com/RumbleDB/hep-iris-benchmark-scripts/. If anybody has any type of feedback on the study, the experiment set-up, or the query implementations, we'd be curious to hear it.

All the best,
Ingo

> -----Original Message-----
> From: Wail Alkowaileet <[email protected]>
> Sent: Wednesday, August 11, 2021 6:41 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Thanks for trying the Parquet connector. Your inputs were super valuable!
> Sure, you can use the current change if it solves the problem.
> Please let us know if you have any questions/concerns.
>
> On Wed, Aug 11, 2021 at 1:24 AM Müller Ingo <[email protected]> wrote:
>
> > Dear all,
> >
> > I have just tried out Wail's patch set from here:
> > https://issues.apache.org/jira/browse/ASTERIXDB-2945. It seems to solve
> > my problem fully: on the 96-vCPU instance with 48 I/O devices, I see
> > about 4800% CPU utilization during query execution, and the queries run
> > only marginally longer than if run against local files. Thanks a lot
> > for the quick fix!
> >
> > Should I use this version for a full benchmark run or wait until the
> > patch makes it to master?
> >
> > Cheers,
> > Ingo
> >
> > > -----Original Message-----
> > > From: Wail Alkowaileet <[email protected]>
> > > Sent: Tuesday, August 10, 2021 6:10 PM
> > > To: [email protected]
> > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > >
> > > Thanks, Ingo, for the detailed explanation and for benchmarking it!
> > > It is a great input for us. We will look at the issue and hopefully
> > > we can get it fixed before the end of the week.
> > >
> > > On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <[email protected]> wrote:
> > >
> > > > Let me also say that I can still rerun the experiments for the
> > > > (hopefully subsequent) camera-ready version if the problem takes
> > > > longer to fix.
> > > >
> > > > Cheers,
> > > > Ingo
> > > >
> > > > > -----Original Message-----
> > > > > From: Müller Ingo <[email protected]>
> > > > > Sent: Tuesday, August 10, 2021 5:34 PM
> > > > > To: [email protected]
> > > > > Subject: RE: Increasing degree of parallelism when reading Parquet files
> > > > >
> > > > > Hey Mike!
> > > > >
> > > > > Thanks for confirming! I am happy to test any fixes that you may
> > > > > come up with. If the fix happens to be simple and lands before
> > > > > Friday, I can still include it in the revision I am currently
> > > > > working on ;) Otherwise, it'd be great to be able to mention a
> > > > > Jira issue or similar (maybe this mailing list thread is enough?)
> > > > > that I can refer to.
> > > > >
> > > > > Cheers,
> > > > > Ingo
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michael Carey <[email protected]>
> > > > > > Sent: Tuesday, August 10, 2021 4:36 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > > > > >
> > > > > > Ingo,
> > > > > >
> > > > > > Got it! It sounds like we indeed have a parallelism performance
> > > > > > bug in the area of threading for S3, then. Weird! We'll look
> > > > > > into it...
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Mike
> > > > > >
> > > > > > On 8/9/21 11:21 PM, Müller Ingo wrote:
> > > > > >
> > > > > > > Hey Mike,
> > > > > > >
> > > > > > > Just to clarify: "partitions" is the same thing as I/O
> > > > > > > devices, right? I have configured 48 of those via
> > > > > > > "[nc]\niodevices=..." and see the corresponding folders with
> > > > > > > content show up on the file system. When I vary the number of
> > > > > > > these devices, I see that all other storage formats change
> > > > > > > the degree of parallelism of my queries. That mechanism thus
> > > > > > > seems to work in general; it just doesn't seem to work for
> > > > > > > Parquet on S3. (I am not 100% sure if I tried other file
> > > > > > > formats on S3.)
> > > > > > >
> > > > > > > I have also tried to set compiler.parallelism to 4 for
> > > > > > > Parquet files on HDFS with a file:// path and did not see any
> > > > > > > effect, i.e., it used 48 threads, which corresponds to the
> > > > > > > number of I/O devices. However, given what Dmitry said, I
> > > > > > > guess that this is expected behavior and the flag should only
> > > > > > > influence the degree of parallelism after exchanges (which I
> > > > > > > don't have in my queries).
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Ingo
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Michael Carey <[email protected]>
> > > > > > > > Sent: Monday, August 9, 2021 10:10 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > > > > > > >
> > > > > > > > Ingo,
> > > > > > > >
> > > > > > > > Q: In your Parquet/S3 testing, what does your current
> > > > > > > > cluster configuration look like? (I.e., how many physical
> > > > > > > > storage partitions have you configured it with?) Even
> > > > > > > > though your S3 data isn't stored inside AsterixDB in this
> > > > > > > > case, the system still uses that info to decide how many
> > > > > > > > parallel threads to use at the base of its query plans.
> > > > > > > > (Obviously there is room for improvement on that behavior
> > > > > > > > for use cases involving external storage. :-))
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Mike
> > > > > > > >
> > > > > > > > On 8/9/21 12:28 PM, Müller Ingo wrote:
> > > > > > > >
> > > > > > > > > Hi Dmitry,
> > > > > > > > >
> > > > > > > > > Thanks a lot for checking! Indeed, my queries do not have
> > > > > > > > > an exchange. However, the number of I/O devices has
> > > > > > > > > indeed worked well in many cases: when I tried the
> > > > > > > > > various VM instance sizes, I always created as many I/O
> > > > > > > > > devices as there were physical cores (i.e., half the
> > > > > > > > > number of logical CPUs). For internal storage as well as
> > > > > > > > > HDFS (using both the hdfs:// and the file:// protocols),
> > > > > > > > > I saw the full system being utilized. However, just for
> > > > > > > > > the case of Parquet on S3, I cannot seem to make it use
> > > > > > > > > more than 16 cores.
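> > > > > > > > >
> > > > > > > > > For reference, this is roughly what the relevant part of
> > > > > > > > > my NC configuration looks like (an abbreviated sketch:
> > > > > > > > > the real config lists one device path per physical core,
> > > > > > > > > and the paths here are made-up placeholders):
> > > > > > > > >
> > > > > > > > >     [nc]
> > > > > > > > >     ; hypothetical example paths - one I/O device per physical core
> > > > > > > > >     iodevices=/mnt/disk00/iodevice,/mnt/disk01/iodevice,/mnt/disk02/iodevice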
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Ingo
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Dmitry Lychagin <[email protected]>
> > > > > > > > > > Sent: Monday, August 9, 2021 9:10 PM
> > > > > > > > > > To: [email protected]
> > > > > > > > > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > > > > > > > > >
> > > > > > > > > > Hi Ingo,
> > > > > > > > > >
> > > > > > > > > > I checked the code, and it seems that when scanning an
> > > > > > > > > > external datasource, we're using the same number of
> > > > > > > > > > cores as there are configured storage partitions (I/O
> > > > > > > > > > devices). Therefore, if you want 96 cores to be used
> > > > > > > > > > when scanning Parquet files, then you need to configure
> > > > > > > > > > 96 I/O devices.
> > > > > > > > > >
> > > > > > > > > > The compiler.parallelism setting is supposed to affect
> > > > > > > > > > how many cores we use after the first EXCHANGE
> > > > > > > > > > operator. However, if your query doesn't have any
> > > > > > > > > > EXCHANGEs, then it'll use the number of cores assigned
> > > > > > > > > > to the initial data scan operator (the number of I/O
> > > > > > > > > > devices).
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > -- Dmitry
> > > > > > > > > >
> > > > > > > > > > On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Dear Dmitry,
> > > > > > > > > > >
> > > > > > > > > > > Thanks a lot for the quick reply! I had not thought
> > > > > > > > > > > of this. However, I have tried out both ways just now
> > > > > > > > > > > (per query and in the cluster configuration) and did
> > > > > > > > > > > not see any changes. Is there any way I can check
> > > > > > > > > > > that the setting was applied successfully? I have
> > > > > > > > > > > also tried setting compiler.parallelism to 4 and
> > > > > > > > > > > still observed 16 cores being utilized.
> > > > > > > > > > >
> > > > > > > > > > > Note that the observed degree of parallelism does not
> > > > > > > > > > > correspond to anything related to the data set (I
> > > > > > > > > > > tried with every power of two files between 1 and
> > > > > > > > > > > 128) or to the cluster (I tried with every power of
> > > > > > > > > > > two cores between 2 and 64, as well as 48 and 96),
> > > > > > > > > > > and I always see 16 cores being used (or fewer, if
> > > > > > > > > > > the system has fewer). To me, this makes it unlikely
> > > > > > > > > > > that the system really uses the semantics for p=0 or
> > > > > > > > > > > p<0; it looks more like some hard-coded value.
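> > > > > > > > > > >
> > > > > > > > > > > Concretely, what I tried looks roughly like this (a
> > > > > > > > > > > minimal sketch; "Events" is a placeholder for my
> > > > > > > > > > > actual external dataset):
> > > > > > > > > > >
> > > > > > > > > > >     -- Sketch of what I tried; "Events" is a placeholder name.
> > > > > > > > > > >     SET `compiler.parallelism` "4";
> > > > > > > > > > >     SELECT VALUE COUNT(*) FROM Events;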
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Ingo
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Dmitry Lychagin <[email protected]>
> > > > > > > > > > > > Sent: Monday, August 9, 2021 7:25 PM
> > > > > > > > > > > > To: [email protected]
> > > > > > > > > > > > Subject: Re: Increasing degree of parallelism when reading Parquet files
> > > > > > > > > > > >
> > > > > > > > > > > > Ingo,
> > > > > > > > > > > >
> > > > > > > > > > > > We have a `compiler.parallelism` parameter that
> > > > > > > > > > > > controls how many cores are used for query
> > > > > > > > > > > > execution. See
> > > > > > > > > > > > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> > > > > > > > > > > >
> > > > > > > > > > > > You can either set it per query (e.g., SET
> > > > > > > > > > > > `compiler.parallelism` "-1";) or globally in the
> > > > > > > > > > > > cluster configuration:
> > > > > > > > > > > > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > -- Dmitry
> > > > > > > > > > > >
> > > > > > > > > > > > From: Müller Ingo <[email protected]>
> > > > > > > > > > > > Reply-To: "[email protected]" <[email protected]>
> > > > > > > > > > > > Date: Monday, August 9, 2021 at 10:05 AM
> > > > > > > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > > > > > > Subject: Increasing degree of parallelism when reading Parquet files
> > > > > > > > > > > >
> > > > > > > > > > > > > Dear AsterixDB devs,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am currently trying out the new support for
> > > > > > > > > > > > > Parquet files on S3 (still in the context of my
> > > > > > > > > > > > > high-energy physics use case [1]). This works
> > > > > > > > > > > > > great so far and generally has decent
> > > > > > > > > > > > > performance. However, I realized that it does not
> > > > > > > > > > > > > use more than 16 cores, even though 96 logical
> > > > > > > > > > > > > cores are available and even though I run
> > > > > > > > > > > > > long-running queries (several minutes) on large
> > > > > > > > > > > > > data sets with a large number of files (I tried
> > > > > > > > > > > > > 128 files of 17 GB each). Is this an
> > > > > > > > > > > > > arbitrary/artificial limitation that can be
> > > > > > > > > > > > > changed somehow (potentially with a small patch
> > > > > > > > > > > > > and recompiling), or is there more serious
> > > > > > > > > > > > > development required to lift it? FYI, I am
> > > > > > > > > > > > > currently using 03fd6d0f, which should include
> > > > > > > > > > > > > all S3/Parquet commits on master.
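> > > > > > > > > > > > >
> > > > > > > > > > > > > For context, I declare the data roughly like this
> > > > > > > > > > > > > (a sketch with placeholder names; the bucket,
> > > > > > > > > > > > > path, and region are made up, credentials are
> > > > > > > > > > > > > omitted, and the exact adapter parameters may
> > > > > > > > > > > > > differ between versions):
> > > > > > > > > > > > >
> > > > > > > > > > > > >     -- Sketch of an external dataset over Parquet files on S3.
> > > > > > > > > > > > >     -- "EventType", "Events", "my-bucket", and "path/to/files"
> > > > > > > > > > > > >     -- are placeholders; credentials are omitted.
> > > > > > > > > > > > >     CREATE TYPE EventType AS { };
> > > > > > > > > > > > >     CREATE EXTERNAL DATASET Events(EventType) USING S3 (
> > > > > > > > > > > > >         ("container"="my-bucket"),
> > > > > > > > > > > > >         ("definition"="path/to/files"),
> > > > > > > > > > > > >         ("region"="us-east-1"),
> > > > > > > > > > > > >         ("format"="parquet")
> > > > > > > > > > > > >     );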
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Ingo
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://arxiv.org/abs/2104.12615
> > >
> > > --
> > > Regards,
> > > Wail Alkowaileet
>
> --
> Regards,
> Wail Alkowaileet
