RE: Increasing degree of parallelism when reading Parquet files

Müller Ingo Mon, 09 Aug 2021 22:51:47 -0700

Hey Wail,

Thanks a lot for helping! I am reading from an EC2 m5d.24xlarge instance from 
AWS's S3. I am pretty sure that S3 is not the problem: First, others have 
measured [1] up to 2.7GB/s from S3. When I measure the network bandwidth of 
AsterixDB, I see in the order of 600MB/s. (The 60-80MB/s you mention are 
typical per connection, but one application can get a much higher bandwidth 
using multiple connections.) Indeed, with other systems in the comparison, I 
can read from S3 *much* faster from S3 than AsterixDB; my current understanding 
is that AserixDB is compute-bound. Also, when I scale up the instance size, I 
get almost perfectly linear speed up until exactly the point where I use 16 
cores (and no speed-up after that) -- this is unlikely to happen if the network 
is getting saturated, where you'd see some slow-down before the saturation 
point. Finally, I see 16 threads using 100% of one core each and all other 
threads being completely idle -- this doesn't look like 48 threads waiting for 
the network either.


I have a few other things to try, then I'll report back.

Cheers,
Ingo


[1] https://github.com/dvassallo/s3-benchmark

> -----Original Message-----
> From: Wail Alkowaileet <[email protected]>
> Sent: Monday, August 9, 2021 10:04 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Hi Ingo,
> 
> Were you reading from an actual S3 bucket? or was it a local S3 mock server?
> The reason I ask is because reading from a remote bucket is slow (the fastest 
> I
> have seen was ~60MB/s). If your HDFS server(s) are backed by NVMe drives,
> then the read speed could be in GBs/s. For the remote S3 bucket case, other
> cores would be idle as they will be waiting for the data to arrive.
> 
> 
> On Mon, Aug 9, 2021 at 12:28 PM Müller Ingo <[email protected]
> <mailto:[email protected]> > wrote:
> 
> 
>       Hi Dmitry,
> 
>       Thanks a lot for checking! Indeed, my queries do not have an exchange.
> However, the number of I/O devices has indeed worked well in many cases:
> when I tried the various VM instance sizes, I always created as many I/O 
> devices
> as there were physical cores (i.e., half the number of logical CPUs). For 
> internal
> storage as well as HDFS (both using the hdfs:// and the file:// protocol), I 
> saw
> the full system being utilized. However, just for the case of Parquet on S3, I
> cannot seem to make it use more than 16 cores.
> 
>       Cheers,
>       Ingo
> 
> 
>       > -----Original Message-----
>       > From: Dmitry Lychagin <[email protected]
> <mailto:[email protected]> >
>       > Sent: Monday, August 9, 2021 9:10 PM
>       > To: [email protected]
> <mailto:[email protected]>
>       > Subject: Re: Increasing degree of parallelism when reading Parquet
> files
>       >
>       > Hi Ingo,
>       >
>       > I checked the code and it seems that when scanning external
> datasource we're
>       > using the same number of cores as there are configured storage
> partitions (I/O
>       > devices).
>       > Therefore, if you want 96 cores to be used when scanning Parquet
> files then you
>       > need to configure 96 I/O devices.
>       >
>       > Compiler.parallelism setting is supposed to affect how many cores we
> use after
>       > the first EXCHANGE operator. However, if your query doesn't have any
>       > EXCHANGEs then it'll use the number of cores assigned for the initial
> data scan
>       > operator (number of I/O devices)
>       >
>       > Thanks,
>       > -- Dmitry
>       >
>       >
>       > On 8/9/21, 11:42 AM, "Müller  Ingo" <[email protected]
> <mailto:[email protected]> > wrote:
>       >
>       >      EXTERNAL EMAIL:  Use caution when opening attachments or
> clicking on links
>       >
>       >
>       >
>       >
>       >
>       >     Dear Dmitry,
>       >
>       >     Thanks a lot for the quick reply! I had not though of this. 
> However, I
> have tried
>       > out both ways just now (per query and in the cluster configuration)
> and did not
>       > see any changes. Is there any way I can control that the setting was
> applied
>       > successfully? I have also tried setting compiler.parallelism to 4 and 
> still
> observed
>       > 16 cores being utilized.
>       >
>       >     Note that the observed degree of parallelism does not correspond
> to anything
>       > related to the data set (I tried with every power of two files 
> between 1
> and 128)
>       > or the cluster (I tried with every power of two cores between 2 and
> 64, as well
>       > as 48 and 96) and I always see 16 cores being used (or fewer, if the
> system has
>       > fewer). To me, this makes it unlikely that the system really uses the
> semantics
>       > for p=0 or p<0, but looks more like some hard-coded value.
>       >
>       >     Cheers,
>       >     Ingo
>       >
>       >
>       >     > -----Original Message-----
>       >     > From: Dmitry Lychagin <[email protected]
> <mailto:[email protected]> >
>       >     > Sent: Monday, August 9, 2021 7:25 PM
>       >     > To: [email protected]
> <mailto:[email protected]>
>       >     > Subject: Re: Increasing degree of parallelism when reading
> Parquet files
>       >     >
>       >     > Ingo,
>       >     >
>       >     >
>       >     >
>       >     > We have `compiler.parallelism` parameter that controls how many
> cores are
>       >     > used for query execution.
>       >     >
>       >     > See
>       >     >
>       >
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_param
>       >     > eter
>       >     >
>       >
> <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_para
>       >     > meter>
>       >     >
>       >     > You can either set it per query (e.g. SET 
> `compiler.parallelism` "-
> 1";) ,
>       >     >
>       >     > or globally in the cluster configuration:
>       >     >
>       >     >
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-
>       >     > app/src/main/resources/cc2.conf#L57
>       >     >
>       >     >
>       >     >
>       >     > Thanks,
>       >     >
>       >     > -- Dmitry
>       >     >
>       >     >
>       >     >
>       >     >
>       >     >
>       >     > From: Müller Ingo <[email protected]
> <mailto:[email protected]> >
>       >     > Reply-To: "[email protected]
> <mailto:[email protected]> " <[email protected]
> <mailto:[email protected]> >
>       >     > Date: Monday, August 9, 2021 at 10:05 AM
>       >     > To: "[email protected]
> <mailto:[email protected]> " <[email protected]
> <mailto:[email protected]> >
>       >     > Subject: Increasing degree of parallelism when reading Parquet
> files
>       >     >
>       >     >
>       >     >
>       >     >  EXTERNAL EMAIL:  Use caution when opening attachments or
> clicking on
>       > links
>       >     >
>       >     >
>       >     >
>       >     >
>       >     >
>       >     > Dear AsterixDB devs,
>       >     >
>       >     >
>       >     >
>       >     > I am currently trying out the new support for Parquet files on 
> S3
> (still in the
>       >     > context of my High-energy Physics use case [1]). This works 
> great
> so far and
>       > has
>       >     > generally decent performance. However, I realized that it does 
> not
> use more
>       >     > than 16 cores, even though 96 logical cores are available and 
> even
> though I
>       > run
>       >     > long-running queries (several minutes) on large data sets with a
> large
>       > number of
>       >     > files (I tried 128 files of 17GB each). Is this an 
> arbitrary/artificial
> limitation
>       > that
>       >     > can be changed somehow (potentially with a small
> patch+recompiling) or is
>       >     > there more serious development required to lift it? FYI, I am
> currently using
>       >     > 03fd6d0f, which should include all S3/Parquet commits on master.
>       >     >
>       >     >
>       >     >
>       >     > Cheers,
>       >     >
>       >     > Ingo
>       >     >
>       >     >
>       >     >
>       >     >
>       >     >
>       >     > [1] https://arxiv.org/abs/2104.12615
>       >     >
>       >     >
>       >
> 
> 
> 
> 
> 
> --
> 
> 
> Regards,
> Wail Alkowaileet

RE: Increasing degree of parallelism when reading Parquet files

Reply via email to