My understanding is that the additional S3 API calls are the price to pay for using the Hadoop library, which only emulates a FileSystem on top of S3. S3 is not a distributed file system like HDFS, so some of the API calls cannot be optimized in an ideal way.
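To see what this emulation costs, here is a minimal Java sketch against the
Hadoop FileSystem API (my illustration, not Hive code). The probe sequence in
the comments is what the S3A client of the Hadoop 2.8 line does for a single
metadata call; other versions behave differently:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AProbeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("s3a://<BUCKET>/Test");  // placeholder bucket, as in the thread
    FileSystem fs = dir.getFileSystem(conf);

    // One FileSystem call, but several S3 requests underneath. S3 has no
    // real directories, so S3A probes up to three times:
    //   HEAD <BUCKET>/Test    -- is the path a plain object (a file)?
    //   HEAD <BUCKET>/Test/   -- is it an empty-directory marker?
    //   LIST prefix=Test/     -- is it a non-empty "directory"?
    FileStatus st = fs.getFileStatus(dir);

    // Hive stacks its own exists()/isDirectory()/listing calls for DDL and
    // split computation on top of probes like this, which is how one simple
    // statement multiplies into dozens of S3 requests.
    System.out.println(st.getPath() + " isDirectory=" + st.isDirectory());
  }
}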
For (i), a more serious problem is the cost of traversing the entire directory, which is totally unnecessary. This was fixed in HIVE-24849:
https://issues.apache.org/jira/browse/HIVE-24849

You can find several JIRAs that try to reduce the overhead of calling S3 at a higher level, e.g.:
https://issues.apache.org/jira/browse/HIVE-25277
https://issues.apache.org/jira/browse/HIVE-24546

We can also remove some of the S3 calls with the technique described in HIVE-24546. However, unless some optimization is implemented at the Hadoop level, I think the overhead cannot be completely eliminated. (If you want to measure where the calls actually go, a small sketch that reads the S3A client's own request counters follows the quoted message below.)

--- Sungwoo

On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi Team,
>
> We are using Hive heavily for our ETL, where our data is stored on S3, and
> we have seen some strange behaviour in the Hive-S3 interaction in terms of
> S3 API calls, i.e. the actual number of API calls for simple select
> statements is much higher than expected. Please let us know why it behaves
> like this, because if we execute the same select statement via Athena, the
> number of API calls is far lower.
>
> *Background :-*
> We are incurring some S3 API cost, and to understand each API call better,
> we decided to do some simple testing.
>
> 1. We have a non-partitioned table containing a lot of objects in Parquet
> format on S3.
>
> 2. We copied one Parquet file object (data) to a separate S3 bucket
> (target), so our target bucket now contains a single Parquet file in the
> following hierarchy on S3 :-
> s3://<BUCKET>/Test/000000_0 (Size of object : 1218 Bytes)
>
> 3. After that, we executed the following 3 commands in Apache Hive 2.1.1,
> managed by us on an EC2 cluster :-
>
> (i) Create an external table on top of the above S3 location :-
>
> CREATE EXTERNAL TABLE `anup.Test`(
>   `id` int,
>   `cname` varchar(45),
>   `mef` decimal(10,3),
>   `mlpr` int,
>   `qperiod` int,
>   `validity` int,
>   `rpmult` decimal(10,3))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://<BUCKET>/Test' ;
>
> (ii) msck repair table Test ; (just to test the behaviour)
> (iii) A simple select statement :- select * from anup.Test ;
>
> *Results :-*
> Ideally, we were *expecting at most 5-10 API calls*, with the following
> breakdown :-
>
> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
> HEAD.OBJECTS (to check whether Test exists or not) and then maybe
> PUT.OBJECTS to create the "Test/" object.
> 2. msck repair : 1-2 API calls ; since we have a single object behind the
> table
> 3. select * : 1-2 API calls ; since we have a single object behind the
> table
>
> But the *actual number of total API calls was 37*, as fetched from the S3
> access logs via Athena. The breakdown of these calls is as follows :-
>
> 1. Create External : 9 API calls
> 2. msck repair : 3 API calls
> 3. select * : 25 API calls
>
> Attaching the actual S3 access log results for the select command, along
> with Hive DEBUG logs for the select statement.
>
> Let us know why so many API calls are happening for the Create External /
> select statements, because if we execute the same select statement *via
> Athena*, the number of API calls is very low, i.e. *2*.
>
> *Tools / S3 library details :-*
> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
> aws-java-sdk-core-1.10.6.jar
>
> Regards,
> Anup Tiwari
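P.S. Here is the sketch mentioned above. If you want to see where calls like
the 37 you counted actually go, without parsing S3 access logs, the S3A client
keeps its own request counters. This is a minimal sketch, assuming a Hadoop
version where FileSystem.getStorageStatistics() is available (2.8+) and where
S3A publishes counters such as object_list_requests and
object_metadata_requests; a 2.8.0 client may expose fewer or differently named
counters:

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageStatistics;
import org.apache.hadoop.fs.StorageStatistics.LongStatistic;

public class S3ARequestCounters {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path table = new Path("s3a://<BUCKET>/Test");  // placeholder bucket, as in the thread
    FileSystem fs = table.getFileSystem(conf);

    // Reproduce roughly what a "select *" does at the storage layer:
    // list the table directory, then open and read each file found.
    for (FileStatus st : fs.listStatus(table)) {
      FSDataInputStream in = fs.open(st.getPath());
      in.read();  // force at least one GET; open() alone may be lazy
      in.close();
    }

    // Dump every long-valued counter the filesystem accumulated; for S3A
    // this includes per-request-type counts of the HEAD/LIST/GET calls.
    StorageStatistics stats = fs.getStorageStatistics();
    Iterator<LongStatistic> it = stats.getLongStatistics();
    while (it.hasNext()) {
      LongStatistic s = it.next();
      System.out.println(s.getName() + " = " + s.getValue());
    }
  }
}

Comparing these counter values before and after a single operation gives the
per-operation S3 request cost directly, which is an easy way to check how much
a fix like HIVE-24849 actually saves.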