Re: Increasing degree of parallelism when reading Parquet files

Wail Alkowaileet Mon, 09 Aug 2021 13:06:26 -0700

Hi Ingo,

Were you reading from an actual S3 bucket, or was it a local S3 mock
server? I ask because reading from a remote bucket is slow (the fastest
I have seen was ~60MB/s). If your HDFS server(s) are backed by NVMe
drives, then the read speed could be in the GB/s range. In the remote S3
bucket case, the other cores would sit idle while they wait for the data
to arrive.
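
To put rough numbers on this, here is a back-of-envelope sketch in Python. It uses the 128 files of 17GB each mentioned further down in the thread and the ~60MB/s figure above. The 2000 MB/s NVMe rate is a hypothetical stand-in, and the sketch assumes that every byte of every file is read at a fixed aggregate rate, which a columnar Parquet scan may well avoid.

```python
# Back-of-envelope scan time for the data set mentioned in this thread.
# Assumptions (not measurements): decimal units (1 GB = 1000 MB), every byte
# of every file is read, and the throughput figure is an aggregate rate.

def scan_seconds(num_files: int, gb_per_file: float, mb_per_s: float) -> float:
    """Return the seconds needed to stream all files at a fixed rate."""
    total_mb = num_files * gb_per_file * 1000
    return total_mb / mb_per_s

NUM_FILES = 128        # from the thread
GB_PER_FILE = 17       # from the thread
S3_MB_PER_S = 60       # fastest remote S3 rate mentioned in this message
NVME_MB_PER_S = 2000   # hypothetical stand-in for "GB/s" local storage

s3 = scan_seconds(NUM_FILES, GB_PER_FILE, S3_MB_PER_S)
nvme = scan_seconds(NUM_FILES, GB_PER_FILE, NVME_MB_PER_S)
print(f"S3:   {s3 / 3600:.1f} h")
print(f"NVMe: {nvme / 60:.1f} min")
```

If the real aggregate S3 rate scales with the number of concurrent connections, the gap would shrink accordingly; the sketch only illustrates why cores can end up waiting on I/O.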



On Mon, Aug 9, 2021 at 12:28 PM Müller Ingo <[email protected]>
wrote:

> Hi Dmitry,
>
> Thanks a lot for checking! Indeed, my queries do not have an exchange.
> However, the number of I/O devices has indeed worked well in many cases:
> when I tried the various VM instance sizes, I always created as many I/O
> devices as there were physical cores (i.e., half the number of logical
> CPUs). For internal storage as well as HDFS (both using the hdfs:// and the
> file:// protocol), I saw the full system being utilized. However, just for
> the case of Parquet on S3, I cannot seem to make it use more than 16 cores.
>
> Cheers,
> Ingo
>
>
> > -----Original Message-----
> > From: Dmitry Lychagin <[email protected]>
> > Sent: Monday, August 9, 2021 9:10 PM
> > To: [email protected]
> > Subject: Re: Increasing degree of parallelism when reading Parquet files
> >
> > Hi Ingo,
> >
> > I checked the code and it seems that when scanning an external data source
> > we're using the same number of cores as there are configured storage
> > partitions (I/O devices).
> > Therefore, if you want 96 cores to be used when scanning Parquet files,
> > then you need to configure 96 I/O devices.
> >
> > The compiler.parallelism setting is supposed to affect how many cores we
> > use after the first EXCHANGE operator. However, if your query doesn't have
> > any EXCHANGEs, then it'll use the number of cores assigned to the initial
> > data scan operator (the number of I/O devices).
> >
> > Thanks,
> > -- Dmitry
> >
> >
> > On 8/9/21, 11:42 AM, "Müller  Ingo" <[email protected]> wrote:
> >
> >     Dear Dmitry,
> >
> >     Thanks a lot for the quick reply! I had not thought of this. However,
> >     I have tried out both ways just now (per query and in the cluster
> >     configuration) and did not see any changes. Is there any way I can
> >     verify that the setting was applied successfully? I have also tried
> >     setting compiler.parallelism to 4 and still observed 16 cores being
> >     utilized.
> >
> >     Note that the observed degree of parallelism does not correspond to
> >     anything related to the data set (I tried with every power-of-two
> >     number of files between 1 and 128) or the cluster (I tried with every
> >     power-of-two number of cores between 2 and 64, as well as 48 and 96),
> >     and I always see 16 cores being used (or fewer, if the system has
> >     fewer). To me, this makes it unlikely that the system really uses the
> >     semantics for p=0 or p<0; it looks more like some hard-coded value.
> >
> >     Cheers,
> >     Ingo
> >
> >
> >     > -----Original Message-----
> >     > From: Dmitry Lychagin <[email protected]>
> >     > Sent: Monday, August 9, 2021 7:25 PM
> >     > To: [email protected]
> >     > Subject: Re: Increasing degree of parallelism when reading Parquet files
> >     >
> >     > Ingo,
> >     >
> >     >
> >     >
> >     > We have a `compiler.parallelism` parameter that controls how many
> >     > cores are used for query execution.
> >     >
> >     > See
> >     > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> >     >
> >     > You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> >     >
> >     > or globally in the cluster configuration:
> >     >
> >     > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
> >     >
> >     >
> >     >
> >     > Thanks,
> >     >
> >     > -- Dmitry
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > From: Müller Ingo <[email protected]>
> >     > Reply-To: "[email protected]" <[email protected]>
> >     > Date: Monday, August 9, 2021 at 10:05 AM
> >     > To: "[email protected]" <[email protected]>
> >     > Subject: Increasing degree of parallelism when reading Parquet files
> >     >
> >     >
> >     >
> >     > Dear AsterixDB devs,
> >     >
> >     >
> >     >
> >     > I am currently trying out the new support for Parquet files on S3
> >     > (still in the context of my High-energy Physics use case [1]). This
> >     > works great so far and has generally decent performance. However, I
> >     > realized that it does not use more than 16 cores, even though 96
> >     > logical cores are available and even though I run long-running
> >     > queries (several minutes) on large data sets with a large number of
> >     > files (I tried 128 files of 17GB each). Is this an
> >     > arbitrary/artificial limitation that can be changed somehow
> >     > (potentially with a small patch+recompiling), or is there more
> >     > serious development required to lift it? FYI, I am currently using
> >     > 03fd6d0f, which should include all S3/Parquet commits on master.
> >     >
> >     >
> >     >
> >     > Cheers,
> >     >
> >     > Ingo
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > [1] https://arxiv.org/abs/2104.12615
> >     >
> >     >
> >
>
>
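
For reference, the rule from Dmitry's message quoted above can be written down as a tiny Python model. This is only a sketch of the behaviour as described in the thread, not AsterixDB code: `cores_for_stage` and `post_exchange_cores` are hypothetical names, and the latter stands in for whatever compiler.parallelism resolves to.

```python
def cores_for_stage(io_devices: int, after_exchange: bool,
                    post_exchange_cores: int) -> int:
    """Cores used by a query stage under the rule described in the thread.

    post_exchange_cores is a hypothetical, already-resolved value standing
    in for whatever compiler.parallelism evaluates to; it is not an
    AsterixDB API.
    """
    if after_exchange:
        return post_exchange_cores
    # Before the first EXCHANGE, only the I/O device count matters.
    return io_devices

# A scan-only query (no EXCHANGE) with a hypothetical 16 I/O devices:
print(cores_for_stage(io_devices=16, after_exchange=False, post_exchange_cores=96))
# The same scan after configuring 96 I/O devices:
print(cores_for_stage(io_devices=96, after_exchange=False, post_exchange_cores=4))
```

Under this model, changing compiler.parallelism has no effect on a query without an EXCHANGE, which matches what Ingo reports for the setting.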

-- 

*Regards,*
Wail Alkowaileet
