Too many S3 API calls for simple queries in hive and databricks spark like select and create external table

Anup Tiwari Fri, 18 Feb 2022 03:07:22 -0800

Hi Team,

We are using Hive/Spark(DataBricks) heavily for our ETL where our data is
stored on S3 and so we have seen a strange behaviour between Hive/Spark &
S3 interaction in terms of S3 API calls i.e. Actual number of API calls for
simple select statements were too much compared to expected and *since
Hive/Spark internally uses hadoop provided libraries which are
hadoop-aws*jar/aws-java-sdk*jar* so asking this question here.


Let us know why it is behaving like this because if we execute the same
select statement via Athena then the number of API calls are very less.



*Background :-*
We are incurring some S3 API cost and to understand each API call better,
we decided to do simple testing.

1. We have a non partition table containing a lot of objects in parquet
format on S3.

2. We copied one parquet file object(data) to a separate S3 bucket(target)
so now our target bucket contains one parquet file data in following
hierarchy on S3 :-
s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)

3. After that, we have executed following 3 command in Apache Hive 2.1.1
managed by us on EC2 cluster :-

(i) Create External table on top of above S3 location :-

CREATE EXTERNAL TABLE `anup.Test`(
  `id` int,
  `cname` varchar(45),
  `mef` decimal(10,3),
  `mlpr` int,
  `qperiod` int,
  `validity` int,
  `rpmult` decimal(10,3))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<BUCKET>/Test' ;

(ii) msck repair table Test(Just to test behaviour) ;
(iii) Simple select statement :- select * from anup.Test ;




*Results :-*
Ideally, we were *expecting max 5-10 API calls* with below breakdown


1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
create "Test/" object.
2. msck repair : 1-2 API calls ; since we have single object behind table
3. select *  : 1-2 API calls ; since we have single object behind table


But *Actual number of Total API calls was 37* and we have fetched this from
S3 Access Logs via Athena. Breakdown of these calls are as follows :-

1. Create External : 9 API calls
2. msck repair : 3 API calls
3. select *  : 25 API calls


Attaching actual results of S3 Access Logs for select command along with
DEBUG logs of Hive for select statement.

Let us know why so many API calls are happening for the Create External /
select statement because if we execute the same select statement *via
Athena* then the number of API calls are very less i.e. *2*.




*Tools / S3 library details :-*
Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
aws-java-sdk-core-1.10.6.jar


Regards,
Anup Tiwari

Hive commands _ S3 Access Logs.xlsx
Description: MS-Excel 2007 spreadsheet

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org

Too many S3 API calls for simple queries in hive and databricks spark like select and create external table

Reply via email to