slow create external table on s3

Bartek Siudeja Thu, 11 Feb 2021 11:58:53 -0800

Hello,

I was running some create partitioned external table queries looking like:
# 30 partitions inside
CREATE EXTERNAL TABLE table1 (value string) PARTITIONED BY (shard string) 
LOCATION 's3a://path/date=2021-02-01/';
INFO  : Completed compiling command(queryId=); Time taken: 7.753 seconds
# 60 dates * 30 shards partitions inside
CREATE EXTERNAL TABLE table2 (value string) PARTITIONED BY (`date` string, 
shard string) LOCATION 's3a://path/';
INFO  : Completed compiling command(queryId=); Time taken: 488.435 seconds
I also tried without location first, then alter table, and the same slowness. 
So like this:
CREATE EXTERNAL TABLE t (value string) PARTITIONED BY (`date` string, shard 
string);
ALTER TABLE t SET LOCATION 's3a://path/‘;


It seems that create partitioned external table should be nearly instantaneous, 
given that data does not need to be examined, not even partitions. Or at least 
it should not scale with amount of data/folders inside location. All these 
commands give empty tables that need to be repaired, or partitions manually 
added. As expected. And msck repair table table2 runs in 20 seconds, and then I 
can query actual csv files stored in those partitions.

I managed to find https://issues.apache.org/jira/browse/HIVE-20316 
<https://issues.apache.org/jira/browse/HIVE-20316>, which seems to address this 
exact issue. But I tried hive 3.1.2 and hive 4.0.0 (using hive4 on mr3 as a 
quick deploy option), and both were showing the same slowdowns.

Is this expected? Maybe there is some other reason hive lists the whole s3 path 
when doing create partitioned external table? Maybe this is some 
misconfiguration on my side, but I am not sure where to look.

Thanks,
- Bartek

slow create external table on s3

Reply via email to