If you are using parquet format, HIVE-25827
<https://issues.apache.org/jira/browse/HIVE-25827>would be causing
additional calls to s3 as the footer is read atleast twice. Add to this
atleast 9+ list_status calls being made for split gen.

~Rajesh.B

On Mon, Feb 21, 2022 at 10:16 AM Sungwoo Park <glap...@gmail.com> wrote:

> My understanding is that additional calls to S3 APi is the price to pay
> for using the Hadoop library which only emulates FileSystem on top of S3.
> S3 is not a distributed file system like HDFS, so some of the API calls
> cannot be optimized in an ideal way.
>
> For (i), a more serious problem is the cost of traversing the entire
> directory, which is totally unnecessary. This was fixed in HIVE-24849:
>
> https://issues.apache.org/jira/browse/HIVE-24849
>
> You can find several JIRAs that try to reduce the overhead of calling S3
> at a higher level, e.g.:
>
> https://issues.apache.org/jira/browse/HIVE-25277
> https://issues.apache.org/jira/browse/HIVE-24546
>
> We can also remove some of S3 calls with a technique described in
> HIVE-24546. However, I think, unless some optimization is implemented at
> the level of Hadoop, the overhead cannot be completely eliminated.
>
> --- Sungwoo
>
> On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <anupsdtiw...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> We are using Hive heavily for our ETL where our data is stored on S3 and
>> so we have seen a strange behaviour between Hive & S3 interaction in terms
>> of S3 API calls i.e. Actual number of API calls for simple select
>> statements were too much compared to expected so let us know why it is
>> behaving like this because if we execute the same select statement via
>> Athena then the number of API calls are very less.
>>
>>
>>
>> *Background :-*
>> We are incurring some S3 API cost and to understand each API call better,
>> we decided to do simple testing.
>>
>> 1. We have a non partition table containing a lot of objects in parquet
>> format on S3.
>>
>> 2. We copied one parquet file object(data) to a separate S3
>> bucket(target) so now our target bucket contains one parquet file data in
>> following hierarchy on S3 :-
>> s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)
>>
>> 3. After that, we have executed following 3 command in Apache Hive 2.1.1
>> managed by us on EC2 cluster :-
>>
>> (i) Create External table on top of above S3 location :-
>>
>> CREATE EXTERNAL TABLE `anup.Test`(
>>   `id` int,
>>   `cname` varchar(45),
>>   `mef` decimal(10,3),
>>   `mlpr` int,
>>   `qperiod` int,
>>   `validity` int,
>>   `rpmult` decimal(10,3))
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>> LOCATION
>>   's3a://<BUCKET>/Test' ;
>>
>> (ii) msck repair table Test(Just to test behaviour) ;
>> (iii) Simple select statement :- select * from anup.Test ;
>>
>>
>>
>>
>> *Results :-*
>> Ideally, we were *expecting max 5-10 API calls* with below breakdown
>>
>>
>> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
>> HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
>> create "Test/" object.
>> 2. msck repair : 1-2 API calls ; since we have single object behind table
>> 3. select *  : 1-2 API calls ; since we have single object behind table
>>
>>
>> But *Actual number of Total API calls was 37* and we have fetched this
>> from S3 Access Logs via Athena. Breakdown of these calls are as follows :-
>>
>> 1. Create External : 9 API calls
>> 2. msck repair : 3 API calls
>> 3. select *  : 25 API calls
>>
>>
>> Attaching actual results of S3 Access Logs for select command along with
>> DEBUG logs of Hive for select statement.
>>
>> Let us know why so many API calls are happening for the Create External /
>> select statement because if we execute the same select statement *via
>> Athena* then the number of API calls are very less i.e. *2*.
>>
>>
>>
>>
>> *Tools / S3 library details :-*
>> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
>> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
>> aws-java-sdk-core-1.10.6.jar
>>
>> Regards,
>> Anup Tiwari
>>
>

Reply via email to