Hi Team, We are using Hive/Spark(DataBricks) heavily for our ETL where our data is stored on S3 and so we have seen a strange behaviour between Hive/Spark & S3 interaction in terms of S3 API calls i.e. Actual number of API calls for simple select statements were too much compared to expected and *since Hive/Spark internally uses hadoop provided libraries which are hadoop-aws*jar/aws-java-sdk*jar* so asking this question here.
Let us know why it is behaving like this because if we execute the same select statement via Athena then the number of API calls are very less. *Background :-* We are incurring some S3 API cost and to understand each API call better, we decided to do simple testing. 1. We have a non partition table containing a lot of objects in parquet format on S3. 2. We copied one parquet file object(data) to a separate S3 bucket(target) so now our target bucket contains one parquet file data in following hierarchy on S3 :- s3://<BUCKET>/Test/000000_0 (Size of object : 1218 Bytes) 3. After that, we have executed following 3 command in Apache Hive 2.1.1 managed by us on EC2 cluster :- (i) Create External table on top of above S3 location :- CREATE EXTERNAL TABLE `anup.Test`( `id` int, `cname` varchar(45), `mef` decimal(10,3), `mlpr` int, `qperiod` int, `validity` int, `rpmult` decimal(10,3)) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<BUCKET>/Test' ; (ii) msck repair table Test(Just to test behaviour) ; (iii) Simple select statement :- select * from anup.Test ; *Results :-* Ideally, we were *expecting max 5-10 API calls* with below breakdown 1. Create External : max 2-3 API calls ; which could be GET.BUCKET, HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to create "Test/" object. 2. msck repair : 1-2 API calls ; since we have single object behind table 3. select * : 1-2 API calls ; since we have single object behind table But *Actual number of Total API calls was 37* and we have fetched this from S3 Access Logs via Athena. Breakdown of these calls are as follows :- 1. Create External : 9 API calls 2. msck repair : 3 API calls 3. select * : 25 API calls Attaching actual results of S3 Access Logs for select command along with DEBUG logs of Hive for select statement. Let us know why so many API calls are happening for the Create External / select statement because if we execute the same select statement *via Athena* then the number of API calls are very less i.e. *2*. *Tools / S3 library details :-* Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar / aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar / aws-java-sdk-core-1.10.6.jar Regards, Anup Tiwari
Hive commands _ S3 Access Logs.xlsx
Description: MS-Excel 2007 spreadsheet
--------------------------------------------------------------------- To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-dev-h...@hadoop.apache.org