Dear AsterixDB devs,
I am currently trying out the new support for Parquet files on S3 (still in the
context of my High-energy Physics use case [1]). This works great so far and
has generally decent performance. However, I realized that it does not use more
than 16 cores, even though 96 are available.
Ingo,
We have `compiler.parallelism` parameter that controls how many cores are used
for query execution.
See
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
You can either set it per query (e.g., SET `compiler.parallelism` "-1";)
or globally in the cluster configuration.
Dear Dmitry,
Thanks a lot for the quick reply! I had not thought of this. However, I have
tried out both ways just now (per query and in the cluster configuration) and
did not see any change. Is there a way I can verify that the setting was
applied successfully? I have also tried setting
Hi Dmitry,
Thanks a lot for checking! Indeed, my queries do not have an exchange.
However, tuning the number of I/O devices has indeed worked well in many
cases: when I tried the various VM instance sizes, I always created as many
I/O devices as there were physical cores (i.e., half the number of vCPUs).
Hi Ingo,
I checked the code, and it seems that when scanning an external datasource
we use the same number of cores as there are configured storage partitions
(I/O devices).
Therefore, if you want 96 cores to be used when scanning Parquet files then you
need to configure 96 I/O devices.
Ingo,
Q: In your Parquet/S3 testing, what does your current cluster
configuration look like? (I.e., how many physical storage partitions have
you configured it with?) Even though your S3 data isn't stored inside
AsterixDB in this case, the system still uses that info.
Hi Ingo,
Were you reading from an actual S3 bucket, or was it a local S3 mock
server? The reason I ask is that reading from a remote bucket is slow
(the fastest I have seen was ~60MB/s). If your HDFS server(s) are backed by
NVMe drives, then the read speed could be in GB/s. For the remote S3
Hey Wail,
Thanks a lot for helping! I am reading from AWS's S3 on an EC2 m5d.24xlarge
instance. I am pretty sure that S3 is not the problem: first, others have
measured [1] up to 2.7GB/s from S3. When I measure the network bandwidth of
AsterixDB, I see on the order of 600MB/s. (The 60-80MB/s
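
For reference, this is roughly how I estimate the receive rate while a query
runs (a Linux-only sketch that samples the NIC byte counters; the interface
name is machine-specific):

```python
# Rough sketch: sample /proc/net/dev to estimate NIC receive throughput.
import time

def rx_bytes(iface: str) -> int:
    """Total bytes received on `iface` since boot (Linux /proc/net/dev)."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                return int(rest.split()[0])  # first column: received bytes
    raise ValueError(f"interface {iface!r} not found")

def throughput_mb_per_s(iface: str, interval: float = 1.0) -> float:
    """Average receive rate in MB/s over `interval` seconds."""
    before = rx_bytes(iface)
    time.sleep(interval)
    return (rx_bytes(iface) - before) / interval / 1e6
```

Running this against the instance's NIC while the query executes is what
gives me the ~600MB/s figure.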