Hi team,

I observed the following scenarios when querying data in s3 and hdfs using
impala. There are few behaviors that I want to get clarified.



*Scenario 1*



impala view, test_view was created in :

·        a local environment taking the union of a table in hdfs & another
table (not in hdfs)

·        a cloud environment taking the union of a table in s3 & another
table (not in s3)



query

In hdfs

In s3

Impala scans

HDFS partitions scanned as per the query plan

Parquet files scanned as per the query plan

Impala scans

S3 partitions scanned as per the query plan

Parquet files scanned as per the query plan

Number of GET requests sent to S3

select * from test_view where time>=1 and time<= 3;

4/4

3/3

3 (3 parquet files are available in hdfs and s3)

4/4

3/3

3

9

select * from test_view where time>=2 and time<=3;

3/3

2/3

2

3/3

2/3

2

6



*Notes*

·        No. of Impala scans are the same in both hdfs and s3. (this seems
possible only if the size of each parquet file is small)

·        No. of GET requests > Total S3 parquet files scanned.



*Question 1 –  How the number of GET requests sent to S3 is derived. ?*




*Scenario 2*


table_hdfs (in hdfs)  and table_s3(in s3) are partitioned by value column



query

In hdfs

In s3

Impala scans

HDFS partitions scanned as per the query plan

Parquet files scanned as per the query plan

Impala scans

S3 partitions scanned as per the query plan

Parquet files scanned as per the query plan

Number of GET requests sent to S3

select count(*) from table_hdfs where value = 10;

 36/36

1/33 (parquet files relevant to 33 different values are stored in hdfs.)

36 (36 parquet files are available in hdfs relevant to value = 10)

N/A

select count(*) from table_s3 where value = 10;

N/A

 286/286

1/33 (parquet files relevant to 33 different values are stored in s3.)

36 (36 parquet files are available in s3 relevant to value = 10)

286


*Notes*

·        No. of Impala scans are different in hdfs and s3. (but number of
files that need to be scanned are the same)

·        No. of GET requests = Total impala scans > Total S3 parquet files
scanned



*Question 2 – Though we are querying on the same set of parquet files (only
the storage is different), the number of impala scans are different, and it
is the same as the total no. of GET requests sent to S3. Any explanation on
this behavior**, the number of scans and number of GET requests* *?*




*Scenario 3*



query

In hdfs

In s3

Impala scans

HDFS partitions scanned as per the query plan

Parquet files scanned as per the query plan

Impala scans

S3 partitions scanned as per the query plan

Parquet files scanned as per the query plan

Number of GET requests sent to S3

select count(*) from table_hdfs

where value_range > 15700 and value_range  <= 15800

 1072/1072

33/33 (parquet files relevant to 33 different values are stored in hdfs.)

1072 (1072 parquet files are available in hdfs)

N/A

select count(*) from table_s3

where value_range > 15700 and value_range  <= 15800

N/A

8440/8440

33/33 (parquet files relevant to 33 different values are stored in s3.)

1072 (1072 parquet files are available in s3)

9837



*Notes*

·        No. of Impala scans are different in hdfs and s3. (but number of
files that need to be scanned are the same)

·        No. of GET requests > Total impala scans > Total S3 parquet files
scanned



*Question 3 –  As in scenario 2, here also we are querying on the same set
of parquet files (only the storage is different), the number of impala
scans are different. But here it is not equal to the total no. of GET
requests sent to S3. Any explanation on this behavior, the number of scans
and number of GET requests ?*



*Question 4 – When parquet files in S3 are accessed via an impala query
from an EC2 instance, do all files be downloaded to the instance volume or
an attached EBS from S3? And if downloaded where those files are stored. ?*



It would be grateful if it is possible to make clarifications for above
questions.


-- 
Regards,
Thisun Dayarathna

Reply via email to