Hi All,

We are using Apache Kudu and Parquet data on S3 with Impala. We copy the data from a Kudu table into an S3 table using an INSERT ... SELECT query. Note that we did not use the SORT BY clause when creating the Parquet tables on S3.
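For reference, the setup described above looks roughly like the sketch below. The table names (`kudu_events`, `s3_events`), the columns, and the bucket path are placeholders rather than our real schema, and setting the 256 MB size through the `PARQUET_FILE_SIZE` query option is an assumption about how that value is applied.

```sql
-- Hypothetical names throughout: kudu_events, s3_events, the columns,
-- and the S3 location are placeholders for illustration only.

-- Partitioned Parquet table on S3, created without a SORT BY clause.
CREATE TABLE s3_events (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3a://example-bucket/warehouse/s3_events';

-- Assumption: the 256 MB size is applied via this query option
-- before the insert runs.
SET PARQUET_FILE_SIZE=256m;

-- Copy rows from the Kudu table into the S3 table using dynamic
-- partitioning; the partition column comes last in the select list.
INSERT INTO s3_events PARTITION (event_date)
SELECT id, payload, event_date
FROM kudu_events;
```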
When a SQL query is executed on this S3 table, it takes longer than we expected. The S3 table is partitioned, and we are using 256 MB as the row group size. A query against a partition with 60 Parquet files takes about 15 seconds, even though it fetches only about 2000 rows.

When I checked the query profile, I noticed that NumStatsFilteredRowGroups is always 0 and TotalBytesRead is 1.97GB, even though TotalBytesSent is only 449.61KB.

Does Impala behave this way because we did not use the SORT BY clause? Further, is there any way to reduce the query execution time by increasing the parallelism of S3 scans?

Thanks & Regards
*Vibhath Ileperuma*