Hi team,
I observed the following scenarios when querying data in s3 and hdfs using impala. There are few behaviors that I want to get clarified. *Scenario 1* impala view, test_view was created in : · a local environment taking the union of a table in hdfs & another table (not in hdfs) · a cloud environment taking the union of a table in s3 & another table (not in s3) query In hdfs In s3 Impala scans HDFS partitions scanned as per the query plan Parquet files scanned as per the query plan Impala scans S3 partitions scanned as per the query plan Parquet files scanned as per the query plan Number of GET requests sent to S3 select * from test_view where time>=1 and time<= 3; 4/4 3/3 3 (3 parquet files are available in hdfs and s3) 4/4 3/3 3 9 select * from test_view where time>=2 and time<=3; 3/3 2/3 2 3/3 2/3 2 6 *Notes* · No. of Impala scans are the same in both hdfs and s3. (this seems possible only if the size of each parquet file is small) · No. of GET requests > Total S3 parquet files scanned. *Question 1 – How the number of GET requests sent to S3 is derived. ?* *Scenario 2* table_hdfs (in hdfs) and table_s3(in s3) are partitioned by value column query In hdfs In s3 Impala scans HDFS partitions scanned as per the query plan Parquet files scanned as per the query plan Impala scans S3 partitions scanned as per the query plan Parquet files scanned as per the query plan Number of GET requests sent to S3 select count(*) from table_hdfs where value = 10; 36/36 1/33 (parquet files relevant to 33 different values are stored in hdfs.) 36 (36 parquet files are available in hdfs relevant to value = 10) N/A select count(*) from table_s3 where value = 10; N/A 286/286 1/33 (parquet files relevant to 33 different values are stored in s3.) 36 (36 parquet files are available in s3 relevant to value = 10) 286 *Notes* · No. of Impala scans are different in hdfs and s3. (but number of files that need to be scanned are the same) · No. of GET requests = Total impala scans > Total S3 parquet files scanned *Question 2 – Though we are querying on the same set of parquet files (only the storage is different), the number of impala scans are different, and it is the same as the total no. of GET requests sent to S3. Any explanation on this behavior**, the number of scans and number of GET requests* *?* *Scenario 3* query In hdfs In s3 Impala scans HDFS partitions scanned as per the query plan Parquet files scanned as per the query plan Impala scans S3 partitions scanned as per the query plan Parquet files scanned as per the query plan Number of GET requests sent to S3 select count(*) from table_hdfs where value_range > 15700 and value_range <= 15800 1072/1072 33/33 (parquet files relevant to 33 different values are stored in hdfs.) 1072 (1072 parquet files are available in hdfs) N/A select count(*) from table_s3 where value_range > 15700 and value_range <= 15800 N/A 8440/8440 33/33 (parquet files relevant to 33 different values are stored in s3.) 1072 (1072 parquet files are available in s3) 9837 *Notes* · No. of Impala scans are different in hdfs and s3. (but number of files that need to be scanned are the same) · No. of GET requests > Total impala scans > Total S3 parquet files scanned *Question 3 – As in scenario 2, here also we are querying on the same set of parquet files (only the storage is different), the number of impala scans are different. But here it is not equal to the total no. of GET requests sent to S3. Any explanation on this behavior, the number of scans and number of GET requests ?* *Question 4 – When parquet files in S3 are accessed via an impala query from an EC2 instance, do all files be downloaded to the instance volume or an attached EBS from S3? And if downloaded where those files are stored. ?* It would be grateful if it is possible to make clarifications for above questions. -- Regards, Thisun Dayarathna