Ingo,

Got it!  It sounds like we indeed have a parallelism performance bug in the area of threading for S3, then.  Weird!  We'll look into it...

Cheers,

Mike

On 8/9/21 11:21 PM, Müller Ingo wrote:
Hey Mike,

Just to clarify: "partitions" is the same thing as I/O devices, right? I have configured 48 of those via "[nc]\niodevices=..." and see the corresponding folders with content show up on the file system. When I vary the number of these devices, I see that all other storage formats change the degree of parallelism of my queries. That mechanism thus seems to work in general; it just doesn't seem to work for Parquet on S3. (I am not 100% sure whether I tried other file formats on S3.)
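
For concreteness, the relevant section of my NC configuration looks roughly like this (the mount-point paths here are placeholders, not my actual paths):

```ini
[nc]
; One I/O device per physical core; 48 comma-separated paths in total.
iodevices=/mnt/iodevice0,/mnt/iodevice1,...,/mnt/iodevice47
```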

I have also tried to set compiler.parallelism to 4 for Parquet files on HDFS 
with a file:// path and did not see any effect, i.e., it used 48 threads, which 
corresponds to the number of I/O devices. However, with what Dmitry said, I 
guess that this is expected behavior and the flag should only influence the 
degree of parallelism after exchanges (which I don't have in my queries).

Cheers,
Ingo


-----Original Message-----
From: Michael Carey <[email protected]>
Sent: Monday, August 9, 2021 10:10 PM
To: [email protected]
Subject: Re: Increasing degree of parallelism when reading Parquet files

Ingo,

Q: In your Parquet/S3 testing, what does your current cluster configuration look
like?  (I.e., how many partitions have you configured it with - physical storage
partitions, that is?)  Even though your S3 data isn't stored inside AsterixDB in
this case, the system still uses that info to decide how many parallel threads
to use at the base of its query plans.  (Obviously there is room for improvement
on that behavior for use cases involving external storage. :-))


Cheers,

Mike


On 8/9/21 12:28 PM, Müller Ingo wrote:


        Hi Dmitry,

        Thanks a lot for checking! Indeed, my queries do not have an exchange.
        However, the number of I/O devices has indeed worked well in many cases:
        when I tried the various VM instance sizes, I always created as many I/O
        devices as there were physical cores (i.e., half the number of logical
        CPUs). For internal storage as well as HDFS (both using the hdfs:// and
        the file:// protocols), I saw the full system being utilized. However,
        just for the case of Parquet on S3, I cannot seem to make it use more
        than 16 cores.

        Cheers,
        Ingo



                -----Original Message-----
                From: Dmitry Lychagin <[email protected]>
                Sent: Monday, August 9, 2021 9:10 PM
                To: [email protected]
                Subject: Re: Increasing degree of parallelism when reading Parquet files

                Hi Ingo,

                I checked the code, and it seems that when scanning an external
                datasource we use the same number of cores as there are
                configured storage partitions (I/O devices). Therefore, if you
                want 96 cores to be used when scanning Parquet files, you need
                to configure 96 I/O devices.

                The compiler.parallelism setting is supposed to affect how many
                cores we use after the first EXCHANGE operator. However, if your
                query doesn't have any EXCHANGEs, then it'll use the number of
                cores assigned to the initial data scan operator (the number of
                I/O devices).

                Thanks,
                -- Dmitry
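
A minimal sketch of the setting described here, assuming a hypothetical external dataset named ParquetOnS3:

```sql
-- Takes effect only after the first EXCHANGE operator in the plan;
-- a plain scan without EXCHANGEs still uses one thread per I/O device.
SET `compiler.parallelism` "96";

SELECT COUNT(*) AS cnt
FROM ParquetOnS3;
```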


                On 8/9/21, 11:42 AM, "Müller Ingo" <[email protected]> wrote:

                     EXTERNAL EMAIL: Use caution when opening attachments or clicking on links

                    Dear Dmitry,

                    Thanks a lot for the quick reply! I had not thought of this.
                However, I have tried out both ways just now (per query and in
                the cluster configuration) and did not see any changes. Is there
                any way I can verify that the setting was applied successfully?
                I have also tried setting compiler.parallelism to 4 and still
                observed 16 cores being utilized.

                    Note that the observed degree of parallelism does not
                correspond to anything related to the data set (I tried with
                every power of two files between 1 and 128) or the cluster (I
                tried with every power of two cores between 2 and 64, as well as
                48 and 96), and I always see 16 cores being used (or fewer, if
                the system has fewer). To me, this makes it unlikely that the
                system really uses the semantics for p=0 or p<0; it looks more
                like some hard-coded value.

                    Cheers,
                    Ingo


                    > -----Original Message-----
                    > From: Dmitry Lychagin <[email protected]>
                    > Sent: Monday, August 9, 2021 7:25 PM
                    > To: [email protected]
                    > Subject: Re: Increasing degree of parallelism when reading Parquet files
                    >
                    > Ingo,
                    >
                    > We have a `compiler.parallelism` parameter that controls
                    > how many cores are used for query execution.
                    >
                    > See
                    > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
                    >
                    > You can either set it per query (e.g. SET
                    > `compiler.parallelism` "-1";), or globally in the cluster
                    > configuration:
                    > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
                    >
                    > Thanks,
                    >
                    > -- Dmitry
                    >
                    >
                    >
                    >
                    >
                    > From: Müller Ingo <[email protected]>
                    > Reply-To: "[email protected]" <[email protected]>
                    > Date: Monday, August 9, 2021 at 10:05 AM
                    > To: "[email protected]" <[email protected]>
                    > Subject: Increasing degree of parallelism when reading
Parquet files
                    >
                    >
                    >
                    > Dear AsterixDB devs,
                    >
                    > I am currently trying out the new support for Parquet
                    > files on S3 (still in the context of my high-energy
                    > physics use case [1]). This works great so far and
                    > generally has decent performance. However, I realized that
                    > it does not use more than 16 cores, even though 96 logical
                    > cores are available and even though I run long-running
                    > queries (several minutes) on large data sets with a large
                    > number of files (I tried 128 files of 17 GB each). Is this
                    > an arbitrary/artificial limitation that can be changed
                    > somehow (potentially with a small patch + recompiling), or
                    > is there more serious development required to lift it?
                    > FYI, I am currently using 03fd6d0f, which should include
                    > all S3/Parquet commits on master.
                    >
                    > Cheers,
                    >
                    > Ingo
                    >
                    > [1] https://arxiv.org/abs/2104.12615
                    >


