Hello, I was running some create partitioned external table queries looking like: # 30 partitions inside CREATE EXTERNAL TABLE table1 (value string) PARTITIONED BY (shard string) LOCATION 's3a://path/date=2021-02-01/'; INFO : Completed compiling command(queryId=); Time taken: 7.753 seconds # 60 dates * 30 shards partitions inside CREATE EXTERNAL TABLE table2 (value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://path/'; INFO : Completed compiling command(queryId=); Time taken: 488.435 seconds I also tried without location first, then alter table, and the same slowness. So like this: CREATE EXTERNAL TABLE t (value string) PARTITIONED BY (`date` string, shard string); ALTER TABLE t SET LOCATION 's3a://path/‘;
It seems that create partitioned external table should be nearly instantaneous, given that data does not need to be examined, not even partitions. Or at least it should not scale with amount of data/folders inside location. All these commands give empty tables that need to be repaired, or partitions manually added. As expected. And msck repair table table2 runs in 20 seconds, and then I can query actual csv files stored in those partitions. I managed to find https://issues.apache.org/jira/browse/HIVE-20316 <https://issues.apache.org/jira/browse/HIVE-20316>, which seems to address this exact issue. But I tried hive 3.1.2 and hive 4.0.0 (using hive4 on mr3 as a quick deploy option), and both were showing the same slowdowns. Is this expected? Maybe there is some other reason hive lists the whole s3 path when doing create partitioned external table? Maybe this is some misconfiguration on my side, but I am not sure where to look. Thanks, - Bartek