Slowness in query execution of Impala parquet tables on S3

Vibhath Ileperuma Mon, 13 Jun 2022 10:17:25 -0700

Hi All,

We are using Apache Kudu and Parquet data on S3 with Impala. We copy the
data from a Kudu table into an S3 table using an INSERT ... SELECT query.
Note that we haven't used the 'SORT BY' clause when creating the Parquet
tables on S3.


When a SQL query is executed on this S3 table, it takes longer than we
expected. Note that this S3 table is partitioned and we are using 256MB as
the row group size.
When a query is executed against a partition with 60 Parquet files, it takes
about 15 seconds even though it fetches only about 2000 rows.
When I checked the query profile, I noticed that NumStatsFilteredRowGroups
is always 0 and TotalBytesRead is 1.97GB even though TotalBytesSent is
449.61KB.
Does Impala behave this way because we haven't used the 'SORT BY' clause?
Further, is there any way to reduce the query execution time by increasing
the parallelism of the S3 scan?
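
To put the profile figures quoted above in perspective, here is a rough
back-of-the-envelope calculation. It is only a sketch: it assumes the profile
reports binary units (1 KB = 1024 bytes, 1 GB = 1024^3 bytes) and that the
read volume is spread evenly over the 60 files, neither of which I have
verified.

```python
# Figures taken from the query profile quoted above.
# Assumption: binary units (1 KB = 1024 B, 1 GB = 1024**3 B).
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

total_bytes_read = 1.97 * GB    # TotalBytesRead
total_bytes_sent = 449.61 * KB  # TotalBytesSent
num_files = 60                  # Parquet files in the scanned partition

# How many bytes are read from S3 for every byte returned to the client.
read_amplification = total_bytes_read / total_bytes_sent

# Average amount read per file, assuming an even spread across files.
avg_read_per_file_mb = total_bytes_read / num_files / MB

print(f"bytes read per byte sent: {read_amplification:,.0f}")
print(f"average read per file:    {avg_read_per_file_mb:.1f} MB")
```

Under those assumptions the scan reads roughly 4,600 times more data than it
returns, or about 34 MB per file on average, which matches the observation
that no row groups are being skipped by statistics.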

Thanks & Regards

*Vibhath Ileperuma*
