If you are using parquet format, HIVE-25827 <https://issues.apache.org/jira/browse/HIVE-25827>would be causing additional calls to s3 as the footer is read atleast twice. Add to this atleast 9+ list_status calls being made for split gen.
~Rajesh.B On Mon, Feb 21, 2022 at 10:16 AM Sungwoo Park <glap...@gmail.com> wrote: > My understanding is that additional calls to S3 APi is the price to pay > for using the Hadoop library which only emulates FileSystem on top of S3. > S3 is not a distributed file system like HDFS, so some of the API calls > cannot be optimized in an ideal way. > > For (i), a more serious problem is the cost of traversing the entire > directory, which is totally unnecessary. This was fixed in HIVE-24849: > > https://issues.apache.org/jira/browse/HIVE-24849 > > You can find several JIRAs that try to reduce the overhead of calling S3 > at a higher level, e.g.: > > https://issues.apache.org/jira/browse/HIVE-25277 > https://issues.apache.org/jira/browse/HIVE-24546 > > We can also remove some of S3 calls with a technique described in > HIVE-24546. However, I think, unless some optimization is implemented at > the level of Hadoop, the overhead cannot be completely eliminated. > > --- Sungwoo > > On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <anupsdtiw...@gmail.com> > wrote: > >> Hi Team, >> >> We are using Hive heavily for our ETL where our data is stored on S3 and >> so we have seen a strange behaviour between Hive & S3 interaction in terms >> of S3 API calls i.e. Actual number of API calls for simple select >> statements were too much compared to expected so let us know why it is >> behaving like this because if we execute the same select statement via >> Athena then the number of API calls are very less. >> >> >> >> *Background :-* >> We are incurring some S3 API cost and to understand each API call better, >> we decided to do simple testing. >> >> 1. We have a non partition table containing a lot of objects in parquet >> format on S3. >> >> 2. We copied one parquet file object(data) to a separate S3 >> bucket(target) so now our target bucket contains one parquet file data in >> following hierarchy on S3 :- >> s3://<BUCKET>/Test/000000_0 (Size of object : 1218 Bytes) >> >> 3. After that, we have executed following 3 command in Apache Hive 2.1.1 >> managed by us on EC2 cluster :- >> >> (i) Create External table on top of above S3 location :- >> >> CREATE EXTERNAL TABLE `anup.Test`( >> `id` int, >> `cname` varchar(45), >> `mef` decimal(10,3), >> `mlpr` int, >> `qperiod` int, >> `validity` int, >> `rpmult` decimal(10,3)) >> ROW FORMAT SERDE >> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' >> STORED AS INPUTFORMAT >> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' >> OUTPUTFORMAT >> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' >> LOCATION >> 's3a://<BUCKET>/Test' ; >> >> (ii) msck repair table Test(Just to test behaviour) ; >> (iii) Simple select statement :- select * from anup.Test ; >> >> >> >> >> *Results :-* >> Ideally, we were *expecting max 5-10 API calls* with below breakdown >> >> >> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET, >> HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to >> create "Test/" object. >> 2. msck repair : 1-2 API calls ; since we have single object behind table >> 3. select * : 1-2 API calls ; since we have single object behind table >> >> >> But *Actual number of Total API calls was 37* and we have fetched this >> from S3 Access Logs via Athena. Breakdown of these calls are as follows :- >> >> 1. Create External : 9 API calls >> 2. msck repair : 3 API calls >> 3. select * : 25 API calls >> >> >> Attaching actual results of S3 Access Logs for select command along with >> DEBUG logs of Hive for select statement. >> >> Let us know why so many API calls are happening for the Create External / >> select statement because if we execute the same select statement *via >> Athena* then the number of API calls are very less i.e. *2*. >> >> >> >> >> *Tools / S3 library details :-* >> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar / >> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar / >> aws-java-sdk-core-1.10.6.jar >> >> Regards, >> Anup Tiwari >> >