My understanding is that the additional S3 API calls are the price you pay
for using the Hadoop library, which only emulates a FileSystem on top of
S3. S3 is not a distributed file system like HDFS, so some of the API
calls cannot be optimized in an ideal way.
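
To make this concrete, here is a minimal sketch of how a single
FileSystem call can fan out into several S3 requests with the Hadoop
2.8-era S3A connector, because S3 has no real directories. The bucket
name is a placeholder for the s3a://<BUCKET>/Test location in your mail
below, and the exact request pattern depends on the Hadoop version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OneCallManyRequests {
  public static void main(String[] args) throws Exception {
    // Placeholder for the table location from the mail below.
    Path path = new Path("s3a://your-bucket/Test");
    FileSystem fs = path.getFileSystem(new Configuration());

    // In Hadoop 2.8, S3AFileSystem.getFileStatus can issue up to three
    // S3 requests for this one logical call:
    //   1. HEAD your-bucket/Test   -- is it a plain object?
    //   2. HEAD your-bucket/Test/  -- is it an empty-directory marker?
    //   3. LIST prefix=Test/       -- is it a non-empty "directory"?
    fs.getFileStatus(path);
  }
}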

For (i) (the CREATE EXTERNAL TABLE command), a more serious problem is the
cost of traversing the entire directory, which is totally unnecessary.
This was fixed in HIVE-24849:

https://issues.apache.org/jira/browse/HIVE-24849

You can find several JIRAs that try to reduce the overhead of calling S3 at
a higher level, e.g.:

https://issues.apache.org/jira/browse/HIVE-25277
https://issues.apache.org/jira/browse/HIVE-24546

We can also remove some of the S3 calls with a technique described in
HIVE-24546. However, I think that unless some optimization is implemented
at the Hadoop level, the overhead cannot be completely eliminated.
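
If you want to see on the Hive side where the calls come from, one option
is to read the request counters that the S3A connector publishes through
Hadoop's StorageStatistics API. A rough sketch, assuming a Hadoop release
recent enough to expose per-request counters (2.8.0 may only report
generic read/write statistics); the loop deliberately mimics the
redundant per-file status checks a query engine performs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aCallCounter {
  public static void main(String[] args) throws Exception {
    // Placeholder for the s3a://<BUCKET>/Test location from the mail below.
    Path table = new Path("s3a://your-bucket/Test");
    FileSystem fs = table.getFileSystem(new Configuration());

    // List the table directory, then (redundantly) ask for each file's
    // status again -- each round trip is a separate S3 request.
    for (FileStatus f : fs.listStatus(table)) {   // at least one LIST
      fs.getFileStatus(f.getPath());              // extra HEAD(s) per file
    }

    // Dump whatever counters this Hadoop version exposes, e.g.
    // object_list_requests / object_metadata_requests in newer releases.
    fs.getStorageStatistics().getLongStatistics().forEachRemaining(
        s -> System.out.println(s.getName() + " = " + s.getValue()));
  }
}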

--- Sungwoo

On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi Team,
>
> We use Hive heavily for our ETL, with our data stored on S3, and we have
> noticed strange behaviour in the interaction between Hive and S3 in terms
> of S3 API calls: the actual number of API calls for simple select
> statements is far higher than expected. Could you let us know why it
> behaves like this? If we execute the same select statement via Athena,
> the number of API calls is much lower.
>
>
>
> *Background :-*
> We are incurring some S3 API costs, and to understand each API call
> better, we decided to do some simple testing.
>
> 1. We have a non-partitioned table containing a large number of objects
> in Parquet format on S3.
>
> 2. We copied one Parquet file object (data) to a separate S3 bucket
> (target), so our target bucket now contains a single Parquet file in the
> following hierarchy on S3 :-
> s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)
>
> 3. After that, we executed the following 3 commands in Apache Hive 2.1.1,
> managed by us on an EC2 cluster :-
>
> (i) Create an external table on top of the above S3 location :-
>
> CREATE EXTERNAL TABLE `anup.Test`(
>   `id` int,
>   `cname` varchar(45),
>   `mef` decimal(10,3),
>   `mlpr` int,
>   `qperiod` int,
>   `validity` int,
>   `rpmult` decimal(10,3))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://<BUCKET>/Test' ;
>
> (ii) msck repair table Test ; (just to test the behaviour)
> (iii) Simple select statement :- select * from anup.Test ;
>
>
>
>
> *Results :-*
> Ideally, we were *expecting at most 5-10 API calls*, with the breakdown
> below:
>
>
> 1. Create External : at most 2-3 API calls ; these could be GET.BUCKET,
> HEAD.OBJECT (to check whether Test exists or not) and then maybe
> PUT.OBJECT to create the "Test/" object.
> 2. msck repair : 1-2 API calls, since we have a single object behind the
> table
> 3. select * : 1-2 API calls, since we have a single object behind the
> table
>
>
> But the *actual total number of API calls was 37*; we obtained this
> figure from the S3 Access Logs via Athena. The breakdown of these calls
> is as follows :-
>
> 1. Create External : 9 API calls
> 2. msck repair : 3 API calls
> 3. select *  : 25 API calls
>
>
> Attaching the actual S3 Access Log results for the select command, along
> with Hive DEBUG logs for the select statement.
>
> Please let us know why so many API calls are happening for the Create
> External / select statements, because if we execute the same select
> statement *via Athena*, the number of API calls is much lower, i.e. *2*.
>
>
>
>
> *Tools / S3 library details :-*
> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
> aws-java-sdk-core-1.10.6.jar
>
> Regards,
> Anup Tiwari
>
