https://impala.apache.org/docs/build/html/topics/impala_parquet.html explains how splitting files into scan ranges works on S3.
Parquet is a columnar format and Impala implements a variety of strategies for only reading the relevant columns and parts of columns in files. https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ has some good illustrations that may be useful for understanding the file layout and strategies in play. I think you're under the incorrect impression that a parquet reader needs to read the entire file, which is definitely not true - one of the main design points of Parquet is lay out the file so that only relevant portions for a query are read. I also don't think you can predict the number of S3 get requests without understanding all the details of the particular Parquet reader - there are lots of ways to read parquet depending on the query and the reader. Impala reads the relevant ranges of the file directly from S3 and stores a copy into the remote data cache if that's enabled - https://impala.apache.org/docs/build/html/topics/impala_data_cache.html On Wed, 3 Feb 2021 at 19:33, Thisun Dayarathna <thisundayarat...@gmail.com> wrote: > Hi team, > > > I observed the following scenarios when querying data in s3 and hdfs using > impala. There are few behaviors that I want to get clarified. > > > > *Scenario 1* > > > > impala view, test_view was created in : > > · a local environment taking the union of a table in hdfs & > another table (not in hdfs) > > · a cloud environment taking the union of a table in s3 & another > table (not in s3) > > > > query > > In hdfs > > In s3 > > Impala scans > > HDFS partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Impala scans > > S3 partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Number of GET requests sent to S3 > > select * from test_view where time>=1 and time<= 3; > > 4/4 > > 3/3 > > 3 (3 parquet files are available in hdfs and s3) > > 4/4 > > 3/3 > > 3 > > 9 > > select * from test_view where time>=2 and time<=3; > > 3/3 > > 2/3 > > 2 > > 3/3 > > 2/3 > > 2 > > 6 > > > > *Notes* > > · No. of Impala scans are the same in both hdfs and s3. (this > seems possible only if the size of each parquet file is small) > > · No. of GET requests > Total S3 parquet files scanned. > > > > *Question 1 – How the number of GET requests sent to S3 is derived. ?* > > > > > *Scenario 2* > > > table_hdfs (in hdfs) and table_s3(in s3) are partitioned by value column > > > > query > > In hdfs > > In s3 > > Impala scans > > HDFS partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Impala scans > > S3 partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Number of GET requests sent to S3 > > select count(*) from table_hdfs where value = 10; > > 36/36 > > 1/33 (parquet files relevant to 33 different values are stored in hdfs.) > > 36 (36 parquet files are available in hdfs relevant to value = 10) > > N/A > > select count(*) from table_s3 where value = 10; > > N/A > > 286/286 > > 1/33 (parquet files relevant to 33 different values are stored in s3.) > > 36 (36 parquet files are available in s3 relevant to value = 10) > > 286 > > > *Notes* > > · No. of Impala scans are different in hdfs and s3. (but number of > files that need to be scanned are the same) > > · No. of GET requests = Total impala scans > Total S3 parquet > files scanned > > > > *Question 2 – Though we are querying on the same set of parquet files > (only the storage is different), the number of impala scans are different, > and it is the same as the total no. of GET requests sent to S3. Any > explanation on this behavior**, the number of scans and number of GET > requests* *?* > > > > > *Scenario 3* > > > > query > > In hdfs > > In s3 > > Impala scans > > HDFS partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Impala scans > > S3 partitions scanned as per the query plan > > Parquet files scanned as per the query plan > > Number of GET requests sent to S3 > > select count(*) from table_hdfs > > where value_range > 15700 and value_range <= 15800 > > 1072/1072 > > 33/33 (parquet files relevant to 33 different values are stored in hdfs.) > > 1072 (1072 parquet files are available in hdfs) > > N/A > > select count(*) from table_s3 > > where value_range > 15700 and value_range <= 15800 > > N/A > > 8440/8440 > > 33/33 (parquet files relevant to 33 different values are stored in s3.) > > 1072 (1072 parquet files are available in s3) > > 9837 > > > > *Notes* > > · No. of Impala scans are different in hdfs and s3. (but number of > files that need to be scanned are the same) > > · No. of GET requests > Total impala scans > Total S3 parquet > files scanned > > > > *Question 3 – As in scenario 2, here also we are querying on the same set > of parquet files (only the storage is different), the number of impala > scans are different. But here it is not equal to the total no. of GET > requests sent to S3. Any explanation on this behavior, the number of scans > and number of GET requests ?* > > > > *Question 4 – When parquet files in S3 are accessed via an impala query > from an EC2 instance, do all files be downloaded to the instance volume or > an attached EBS from S3? And if downloaded where those files are stored. ?* > > > > It would be grateful if it is possible to make clarifications for above > questions. > > > -- > Regards, > Thisun Dayarathna >