https://impala.apache.org/docs/build/html/topics/impala_parquet.html
explains how splitting files into scan ranges works on S3.

Parquet is a columnar format and Impala implements a variety of strategies
for only reading the relevant columns and parts of columns in files.
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
has some good illustrations that may be useful for understanding the file
layout and strategies in play. I think you're under the incorrect
impression that a parquet reader needs to read the entire file, which is
definitely not true - one of the main design points of Parquet is lay out
the file so that only relevant portions for a query are read. I also don't
think you can predict the number of S3 get requests without understanding
all the details of the particular Parquet reader - there are lots of ways
to read parquet depending on the query and the reader.

Impala reads the relevant ranges of the file directly from S3 and stores a
copy into the remote data cache if that's enabled -
https://impala.apache.org/docs/build/html/topics/impala_data_cache.html

On Wed, 3 Feb 2021 at 19:33, Thisun Dayarathna <thisundayarat...@gmail.com>
wrote:

> Hi team,
>
>
> I observed the following scenarios when querying data in s3 and hdfs using
> impala. There are few behaviors that I want to get clarified.
>
>
>
> *Scenario 1*
>
>
>
> impala view, test_view was created in :
>
> ·        a local environment taking the union of a table in hdfs &
> another table (not in hdfs)
>
> ·        a cloud environment taking the union of a table in s3 & another
> table (not in s3)
>
>
>
> query
>
> In hdfs
>
> In s3
>
> Impala scans
>
> HDFS partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Impala scans
>
> S3 partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Number of GET requests sent to S3
>
> select * from test_view where time>=1 and time<= 3;
>
> 4/4
>
> 3/3
>
> 3 (3 parquet files are available in hdfs and s3)
>
> 4/4
>
> 3/3
>
> 3
>
> 9
>
> select * from test_view where time>=2 and time<=3;
>
> 3/3
>
> 2/3
>
> 2
>
> 3/3
>
> 2/3
>
> 2
>
> 6
>
>
>
> *Notes*
>
> ·        No. of Impala scans are the same in both hdfs and s3. (this
> seems possible only if the size of each parquet file is small)
>
> ·        No. of GET requests > Total S3 parquet files scanned.
>
>
>
> *Question 1 –  How the number of GET requests sent to S3 is derived. ?*
>
>
>
>
> *Scenario 2*
>
>
> table_hdfs (in hdfs)  and table_s3(in s3) are partitioned by value column
>
>
>
> query
>
> In hdfs
>
> In s3
>
> Impala scans
>
> HDFS partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Impala scans
>
> S3 partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Number of GET requests sent to S3
>
> select count(*) from table_hdfs where value = 10;
>
>  36/36
>
> 1/33 (parquet files relevant to 33 different values are stored in hdfs.)
>
> 36 (36 parquet files are available in hdfs relevant to value = 10)
>
> N/A
>
> select count(*) from table_s3 where value = 10;
>
> N/A
>
>  286/286
>
> 1/33 (parquet files relevant to 33 different values are stored in s3.)
>
> 36 (36 parquet files are available in s3 relevant to value = 10)
>
> 286
>
>
> *Notes*
>
> ·        No. of Impala scans are different in hdfs and s3. (but number of
> files that need to be scanned are the same)
>
> ·        No. of GET requests = Total impala scans > Total S3 parquet
> files scanned
>
>
>
> *Question 2 – Though we are querying on the same set of parquet files
> (only the storage is different), the number of impala scans are different,
> and it is the same as the total no. of GET requests sent to S3. Any
> explanation on this behavior**, the number of scans and number of GET
> requests* *?*
>
>
>
>
> *Scenario 3*
>
>
>
> query
>
> In hdfs
>
> In s3
>
> Impala scans
>
> HDFS partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Impala scans
>
> S3 partitions scanned as per the query plan
>
> Parquet files scanned as per the query plan
>
> Number of GET requests sent to S3
>
> select count(*) from table_hdfs
>
> where value_range > 15700 and value_range  <= 15800
>
>  1072/1072
>
> 33/33 (parquet files relevant to 33 different values are stored in hdfs.)
>
> 1072 (1072 parquet files are available in hdfs)
>
> N/A
>
> select count(*) from table_s3
>
> where value_range > 15700 and value_range  <= 15800
>
> N/A
>
> 8440/8440
>
> 33/33 (parquet files relevant to 33 different values are stored in s3.)
>
> 1072 (1072 parquet files are available in s3)
>
> 9837
>
>
>
> *Notes*
>
> ·        No. of Impala scans are different in hdfs and s3. (but number of
> files that need to be scanned are the same)
>
> ·        No. of GET requests > Total impala scans > Total S3 parquet
> files scanned
>
>
>
> *Question 3 –  As in scenario 2, here also we are querying on the same set
> of parquet files (only the storage is different), the number of impala
> scans are different. But here it is not equal to the total no. of GET
> requests sent to S3. Any explanation on this behavior, the number of scans
> and number of GET requests ?*
>
>
>
> *Question 4 – When parquet files in S3 are accessed via an impala query
> from an EC2 instance, do all files be downloaded to the instance volume or
> an attached EBS from S3? And if downloaded where those files are stored. ?*
>
>
>
> It would be grateful if it is possible to make clarifications for above
> questions.
>
>
> --
> Regards,
> Thisun Dayarathna
>

Reply via email to