Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Müller Ingo
Dear AsterixDB devs, I am currently trying out the new support for Parquet files on S3 (still in the context of my high-energy physics use case [1]). This works great so far and has generally decent performance. However, I realized that it does not use more than 16 cores, even though 96

Re: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Dmitry Lychagin
Ingo, We have a `compiler.parallelism` parameter that controls how many cores are used for query execution. See https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter You can either set it per query (e.g. SET `compiler.parallelism` "-1";), or globally in the cluster
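
A minimal sketch of the per-query form, with a hypothetical dataset name (ParquetDS is made up for illustration; "-1" requests maximum parallelism, per the manual page above):

    -- Per-query: applies only to the statements that follow in this request.
    SET `compiler.parallelism` "-1";

    SELECT COUNT(*) AS cnt
    FROM ParquetDS;

The global alternative, assuming it goes in the [common] section of cc.conf like other compiler.* parameters (a cluster restart would then be needed):

    [common]
    compiler.parallelism = -1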

RE: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Müller Ingo
Dear Dmitry, Thanks a lot for the quick reply! I had not thought of this. However, I have tried out both ways just now (per query and in the cluster configuration) and did not see any changes. Is there any way I can verify that the setting was applied successfully? I have also tried setting
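
One way to check (a sketch, assuming the default query-service port 19002 and that the /admin/cluster endpoint of the HTTP API is available, as in recent releases) is to ask the cluster controller for its active configuration and look for the parameter:

    # Hypothetical host; the endpoint reports cluster state and config as JSON.
    curl -s http://localhost:19002/admin/cluster | grep -i parallelism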

RE: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Müller Ingo
Hi Dmitry, Thanks a lot for checking! Indeed, my queries do not have an exchange. However, scaling with the number of I/O devices has worked well in many cases: when I tried the various VM instance sizes, I always created as many I/O devices as there were physical cores (i.e., half the number of

Re: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Dmitry Lychagin
Hi Ingo, I checked the code and it seems that when scanning an external datasource we're using the same number of cores as there are configured storage partitions (I/O devices). Therefore, if you want 96 cores to be used when scanning Parquet files, then you need to configure 96 I/O devices.
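
For concreteness, a sketch of what that could look like in a per-NC section of the cluster config (node name and paths are hypothetical; iodevices takes a comma-separated list of directories, and several entries may point at the same physical drive):

    [nc/asterix_nc1]
    # Hypothetical paths: one entry per desired scan partition.
    iodevices = /data/io0,/data/io1,/data/io2,/data/io3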

Re: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Michael Carey
Ingo, Q: In your Parquet/S3 testing, what does your current cluster configuration look like? (I.e., how many partitions have you configured it with, physical storage partitions that is?) Even though your S3 data isn't stored inside AsterixDB in this case, the system still uses that info

Re: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Wail Alkowaileet
Hi Ingo, Were you reading from an actual S3 bucket, or was it a local S3 mock server? The reason I ask is that reading from a remote bucket is slow (the fastest I have seen was ~60MB/s). If your HDFS server(s) are backed by NVMe drives, then the read speed could be in GB/s. For the remote S3
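
A quick way to get a rough per-client S3 throughput number (a sketch, assuming the AWS CLI and pv are installed; bucket and key are placeholders, and note the CLI may still issue ranged GETs internally):

    # Stream one object to stdout and let pv report throughput.
    aws s3 cp s3://my-bucket/big-file.parquet - | pv > /dev/null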

RE: Increasing degree of parallelism when reading Parquet files

2021-08-09 Thread Müller Ingo
Hey Wail, Thanks a lot for helping! I am reading from AWS S3 on an EC2 m5d.24xlarge instance. I am pretty sure that S3 is not the problem: First, others have measured [1] up to 2.7GB/s from S3. When I measure the network bandwidth of AsterixDB, I see on the order of 600MB/s. (The 60-80MB/s
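
To reproduce a measurement like this, one option (a sketch, assuming the sysstat package is installed on the node) is to watch per-interface NIC throughput while the query runs:

    # Print RX/TX rates for each network interface once per second.
    sar -n DEV 1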