My understanding is that the additional S3 API calls are the price to pay for using the Hadoop library, which only emulates a FileSystem on top of S3. S3 is not a distributed file system like HDFS, so some of the API calls cannot be optimized in an ideal way.
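To see what this emulation costs, here is a minimal Java sketch against the
Hadoop FileSystem API (my illustration, not Hive code). The probe sequence in
the comments is what the S3A client of the Hadoop 2.8 line does for a single
metadata call; other versions behave differently:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AProbeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("s3a://<BUCKET>/Test");  // placeholder bucket, as in the thread
    FileSystem fs = dir.getFileSystem(conf);

    // One FileSystem call, but several S3 requests underneath. S3 has no
    // real directories, so S3A probes up to three times:
    //   HEAD <BUCKET>/Test    -- is the path a plain object (a file)?
    //   HEAD <BUCKET>/Test/   -- is it an empty-directory marker?
    //   LIST prefix=Test/     -- is it a non-empty "directory"?
    FileStatus st = fs.getFileStatus(dir);

    // Hive stacks its own exists()/isDirectory()/listing calls for DDL and
    // split computation on top of probes like this, which is how one simple
    // statement multiplies into dozens of S3 requests.
    System.out.println(st.getPath() + " isDirectory=" + st.isDirectory());
  }
}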
For (i), a more serious problem is the cost of traversing the entire directory, which is totally unnecessary. This was fixed in HIVE-24849:
https://issues.apache.org/jira/browse/HIVE-24849

You can find several JIRAs that try to reduce the overhead of calling S3 at a higher level, e.g.:
https://issues.apache.org/jira/browse/HIVE-25277
https://issues.apache.org/jira/browse/HIVE-24546

We can also remove some of the S3 calls with the technique described in HIVE-24546. However, unless some optimization is implemented at the Hadoop level, I think the overhead cannot be completely eliminated. (If you want to measure where the calls actually go, a small sketch that reads the S3A client's own request counters follows the quoted message below.)

--- Sungwoo

On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Hi Team,
>
> We are using Hive heavily for our ETL, where our data is stored on S3, and
> we have seen some strange behaviour in the Hive-S3 interaction in terms of
> S3 API calls, i.e. the actual number of API calls for simple select
> statements is much higher than expected. Please let us know why it behaves
> like this, because if we execute the same select statement via Athena, the
> number of API calls is far lower.
>
> *Background :-*
> We are incurring some S3 API cost, and to understand each API call better,
> we decided to do some simple testing.
>
> 1. We have a non-partitioned table containing a lot of objects in Parquet
> format on S3.
>
> 2. We copied one Parquet file object (data) to a separate S3 bucket
> (target), so our target bucket now contains a single Parquet file in the
> following hierarchy on S3 :-
> s3://<BUCKET>/Test/000000_0 (Size of object : 1218 Bytes)
>
> 3. After that, we executed the following 3 commands in Apache Hive 2.1.1,
> managed by us on an EC2 cluster :-
>
> (i) Create an external table on top of the above S3 location :-
>
> CREATE EXTERNAL TABLE `anup.Test`(
>   `id` int,
>   `cname` varchar(45),
>   `mef` decimal(10,3),
>   `mlpr` int,
>   `qperiod` int,
>   `validity` int,
>   `rpmult` decimal(10,3))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://<BUCKET>/Test' ;
>
> (ii) msck repair table Test ; (just to test the behaviour)
> (iii) A simple select statement :- select * from anup.Test ;
>
> *Results :-*
> Ideally, we were *expecting at most 5-10 API calls*, with the following
> breakdown :-
>
> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
> HEAD.OBJECTS (to check whether Test exists or not) and then maybe
> PUT.OBJECTS to create the "Test/" object.
> 2. msck repair : 1-2 API calls ; since we have a single object behind the
> table
> 3. select * : 1-2 API calls ; since we have a single object behind the
> table
>
> But the *actual number of total API calls was 37*, as fetched from the S3
> access logs via Athena. The breakdown of these calls is as follows :-
>
> 1. Create External : 9 API calls
> 2. msck repair : 3 API calls
> 3. select * : 25 API calls
>
> Attaching the actual S3 access log results for the select command, along
> with Hive DEBUG logs for the select statement.
>
> Let us know why so many API calls are happening for the Create External /
> select statements, because if we execute the same select statement *via
> Athena*, the number of API calls is very low, i.e. *2*.
>
> *Tools / S3 library details :-*
> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
> aws-java-sdk-core-1.10.6.jar
>
> Regards,
> Anup Tiwari
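P.S. Here is the sketch mentioned above. If you want to see where calls like
the 37 you counted actually go, without parsing S3 access logs, the S3A client
keeps its own request counters. This is a minimal sketch, assuming a Hadoop
version where FileSystem.getStorageStatistics() is available (2.8+) and where
S3A publishes counters such as object_list_requests and
object_metadata_requests; a 2.8.0 client may expose fewer or differently named
counters:

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageStatistics;
import org.apache.hadoop.fs.StorageStatistics.LongStatistic;

public class S3ARequestCounters {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path table = new Path("s3a://<BUCKET>/Test");  // placeholder bucket, as in the thread
    FileSystem fs = table.getFileSystem(conf);

    // Reproduce roughly what a "select *" does at the storage layer:
    // list the table directory, then open and read each file found.
    for (FileStatus st : fs.listStatus(table)) {
      FSDataInputStream in = fs.open(st.getPath());
      in.read();  // force at least one GET; open() alone may be lazy
      in.close();
    }

    // Dump every long-valued counter the filesystem accumulated; for S3A
    // this includes per-request-type counts of the HEAD/LIST/GET calls.
    StorageStatistics stats = fs.getStorageStatistics();
    Iterator<LongStatistic> it = stats.getLongStatistics();
    while (it.hasNext()) {
      LongStatistic s = it.next();
      System.out.println(s.getName() + " = " + s.getValue());
    }
  }
}

Comparing these counter values before and after a single operation gives the
per-operation S3 request cost directly, which is an easy way to check how much
a fix like HIVE-24849 actually saves.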