Re: Clarification on few behaviors observed when querying data in s3 and hdfs using impala

Tim Armstrong Wed, 17 Feb 2021 20:48:54 -0800

1. It's often going to be the number of row groups times the number of
columns plus one for the footer, but it depends, the parquet format gives a
lot of flexibility to lay out files in different ways and different parquet
readers have a different mix of algorithms that kick in for different files


2. It's read directly across the network into memory of the impalad process
and processed there no need for any intermediate staging.

On Wed, Feb 17, 2021, 18:41 Thisun Dayarathna <thisundayarat...@gmail.com>
wrote:

> Hi Tim,
>
> Thank you for your response. But I still need to clarify more on the
> following cases.
>
> 1. Isn't there a proper way to derive the number of GET requests ? (Like
> number of bytes read per request)
>
> 2. Currently caching is disabled. In that case, is data read from S3
> stored in EC2 for impala to process in a temporary location of the instance
> volume?
>
> Thank you.
>
>
> On Thu, 4 Feb 2021, 09:59 Tim Armstrong, <tim.g.armstr...@gmail.com>
> wrote:
>
>> https://impala.apache.org/docs/build/html/topics/impala_parquet.html
>> explains how splitting files into scan ranges works on S3.
>>
>> Parquet is a columnar format and Impala implements a variety of
>> strategies for only reading the relevant columns and parts of columns in
>> files.
>> https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>> has some good illustrations that may be useful for understanding the file
>> layout and strategies in play. I think you're under the incorrect
>> impression that a parquet reader needs to read the entire file, which is
>> definitely not true - one of the main design points of Parquet is lay out
>> the file so that only relevant portions for a query are read. I also don't
>> think you can predict the number of S3 get requests without understanding
>> all the details of the particular Parquet reader - there are lots of ways
>> to read parquet depending on the query and the reader.
>>
>> Impala reads the relevant ranges of the file directly from S3 and stores
>> a copy into the remote data cache if that's enabled -
>> https://impala.apache.org/docs/build/html/topics/impala_data_cache.html
>>
>> On Wed, 3 Feb 2021 at 19:33, Thisun Dayarathna <
>> thisundayarat...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>>
>>> I observed the following scenarios when querying data in s3 and hdfs
>>> using impala. There are few behaviors that I want to get clarified.
>>>
>>>
>>>
>>> *Scenario 1*
>>>
>>>
>>>
>>> impala view, test_view was created in :
>>>
>>> ·        a local environment taking the union of a table in hdfs &
>>> another table (not in hdfs)
>>>
>>> ·        a cloud environment taking the union of a table in s3 &
>>> another table (not in s3)
>>>
>>>
>>>
>>> query
>>>
>>> In hdfs
>>>
>>> In s3
>>>
>>> Impala scans
>>>
>>> HDFS partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Impala scans
>>>
>>> S3 partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Number of GET requests sent to S3
>>>
>>> select * from test_view where time>=1 and time<= 3;
>>>
>>> 4/4
>>>
>>> 3/3
>>>
>>> 3 (3 parquet files are available in hdfs and s3)
>>>
>>> 4/4
>>>
>>> 3/3
>>>
>>> 3
>>>
>>> 9
>>>
>>> select * from test_view where time>=2 and time<=3;
>>>
>>> 3/3
>>>
>>> 2/3
>>>
>>> 2
>>>
>>> 3/3
>>>
>>> 2/3
>>>
>>> 2
>>>
>>> 6
>>>
>>>
>>>
>>> *Notes*
>>>
>>> ·        No. of Impala scans are the same in both hdfs and s3. (this
>>> seems possible only if the size of each parquet file is small)
>>>
>>> ·        No. of GET requests > Total S3 parquet files scanned.
>>>
>>>
>>>
>>> *Question 1 –  How the number of GET requests sent to S3 is derived. ?*
>>>
>>>
>>>
>>>
>>> *Scenario 2*
>>>
>>>
>>> table_hdfs (in hdfs)  and table_s3(in s3) are partitioned by value column
>>>
>>>
>>>
>>> query
>>>
>>> In hdfs
>>>
>>> In s3
>>>
>>> Impala scans
>>>
>>> HDFS partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Impala scans
>>>
>>> S3 partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Number of GET requests sent to S3
>>>
>>> select count(*) from table_hdfs where value = 10;
>>>
>>>  36/36
>>>
>>> 1/33 (parquet files relevant to 33 different values are stored in hdfs.)
>>>
>>> 36 (36 parquet files are available in hdfs relevant to value = 10)
>>>
>>> N/A
>>>
>>> select count(*) from table_s3 where value = 10;
>>>
>>> N/A
>>>
>>>  286/286
>>>
>>> 1/33 (parquet files relevant to 33 different values are stored in s3.)
>>>
>>> 36 (36 parquet files are available in s3 relevant to value = 10)
>>>
>>> 286
>>>
>>>
>>> *Notes*
>>>
>>> ·        No. of Impala scans are different in hdfs and s3. (but number
>>> of files that need to be scanned are the same)
>>>
>>> ·        No. of GET requests = Total impala scans > Total S3 parquet
>>> files scanned
>>>
>>>
>>>
>>> *Question 2 – Though we are querying on the same set of parquet files
>>> (only the storage is different), the number of impala scans are different,
>>> and it is the same as the total no. of GET requests sent to S3. Any
>>> explanation on this behavior**, the number of scans and number of GET
>>> requests* *?*
>>>
>>>
>>>
>>>
>>> *Scenario 3*
>>>
>>>
>>>
>>> query
>>>
>>> In hdfs
>>>
>>> In s3
>>>
>>> Impala scans
>>>
>>> HDFS partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Impala scans
>>>
>>> S3 partitions scanned as per the query plan
>>>
>>> Parquet files scanned as per the query plan
>>>
>>> Number of GET requests sent to S3
>>>
>>> select count(*) from table_hdfs
>>>
>>> where value_range > 15700 and value_range  <= 15800
>>>
>>>  1072/1072
>>>
>>> 33/33 (parquet files relevant to 33 different values are stored in hdfs.)
>>>
>>> 1072 (1072 parquet files are available in hdfs)
>>>
>>> N/A
>>>
>>> select count(*) from table_s3
>>>
>>> where value_range > 15700 and value_range  <= 15800
>>>
>>> N/A
>>>
>>> 8440/8440
>>>
>>> 33/33 (parquet files relevant to 33 different values are stored in s3.)
>>>
>>> 1072 (1072 parquet files are available in s3)
>>>
>>> 9837
>>>
>>>
>>>
>>> *Notes*
>>>
>>> ·        No. of Impala scans are different in hdfs and s3. (but number
>>> of files that need to be scanned are the same)
>>>
>>> ·        No. of GET requests > Total impala scans > Total S3 parquet
>>> files scanned
>>>
>>>
>>>
>>> *Question 3 –  As in scenario 2, here also we are querying on the same
>>> set of parquet files (only the storage is different), the number of impala
>>> scans are different. But here it is not equal to the total no. of GET
>>> requests sent to S3. Any explanation on this behavior, the number of scans
>>> and number of GET requests ?*
>>>
>>>
>>>
>>> *Question 4 – When parquet files in S3 are accessed via an impala query
>>> from an EC2 instance, do all files be downloaded to the instance volume or
>>> an attached EBS from S3? And if downloaded where those files are stored. ?*
>>>
>>>
>>>
>>> It would be grateful if it is possible to make clarifications for above
>>> questions.
>>>
>>>
>>> --
>>> Regards,
>>> Thisun Dayarathna
>>>
>>

Re: Clarification on few behaviors observed when querying data in s3 and hdfs using impala

Reply via email to