I am trying to figure out the best and safest way to swap data in a
production Kudu table with data from a staging table.
Every few months we need to perform a full reload of some tables. These
tables are large, with billions of rows, and we want to minimize the risk
and downtime for users if something goes wrong in the middle of that
process.
With Hive and Impala on HDFS, we can use the handy LOAD DATA INPATH
command. We can prepare the data for the reload in a staging table up
front, which might take many hours. Once the staging table is ready, we
issue LOAD DATA INPATH, which moves the underlying HDFS files to the
production table. This operation is almost instant and is the very last
step in our pipeline.
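
To make this concrete, the final step looks roughly like the statement below. The table name and HDFS path are hypothetical placeholders, and the sketch assumes an unpartitioned target table.

```sql
-- Hypothetical names: prod_db.events is the production table and the path
-- is the HDFS directory that holds the staging table's data files.
-- OVERWRITE replaces the data files currently in the target table.
LOAD DATA INPATH '/user/hive/warehouse/staging_db.db/events_staging'
OVERWRITE INTO TABLE prod_db.events;
```
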
Alternatively, we can swap partitions using ALTER TABLE EXCHANGE PARTITION.
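
A rough sketch of the partition swap in Hive, with hypothetical table names and a hypothetical partition key (load_month):

```sql
-- Hive syntax; all names and values here are made up for illustration.
-- The destination table must not already contain the partition, so the
-- old one is dropped first.
ALTER TABLE prod_db.events DROP IF EXISTS PARTITION (load_month='2019-01');
-- Moves the partition's files from the staging table into production.
ALTER TABLE prod_db.events EXCHANGE PARTITION (load_month='2019-01')
  WITH TABLE staging_db.events_staging;
```

The partition is briefly missing from the production table between the two statements.
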
Now with Kudu, I cannot seem to find a good strategy. The only thing that
comes to mind is to drop the production table and rename the staging table
to the production table as the last step of the job, but in that case we
are going to lose statistics and security permissions.
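
For reference, the drop-and-rename approach would look roughly like this through Impala. Table and role names are hypothetical, and the last two statements are only my guess at how the lost statistics and permissions would have to be restored afterwards.

```sql
-- Impala syntax; hypothetical table and role names.
DROP TABLE prod_db.events;
ALTER TABLE prod_db.events_staging RENAME TO prod_db.events;
-- Recompute statistics on the renamed table.
COMPUTE STATS prod_db.events;
-- Privileges attached to the dropped table would have to be granted again.
GRANT SELECT ON TABLE prod_db.events TO ROLE analysts;
```

Besides the extra steps, there is a short window between DROP and RENAME during which the production table does not exist.
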
Any other ideas?