swap data in Kudu table

Boris Tyukin Thu, 22 Feb 2018 06:40:53 -0800

Hello,

I am trying to figure out the best and safest way to swap data in a
production Kudu table with data from a staging table.


Basically, every few months we need to perform a full reload of some
tables. These tables are pretty large, with billions of rows, and we want
to minimize the risk and downtime for users if something bad happens in
the middle of that process.

With Hive and Impala on HDFS, we can use the very handy LOAD DATA INPATH
command. We can prepare the data for the reload in a staging table upfront,
and this process might take many hours. Once the staging table is ready, we
issue a LOAD DATA INPATH command, which moves the underlying HDFS files to
the production table. This operation is almost instant and is the very last
step in our pipeline.
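
For illustration, that last step looks roughly like this. The path and
table names here are made up, and the staged files have to be in the same
file format as the production table:

```sql
-- Hypothetical names: staging files under /data/staging/orders,
-- production table prod.orders.
-- The files are moved (not copied) into the production table's directory;
-- OVERWRITE replaces whatever data was there before.
LOAD DATA INPATH '/data/staging/orders'
OVERWRITE INTO TABLE prod.orders;
```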

Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
PARTITION command.
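
A sketch of that in Hive, again with made-up table and partition names. Both
tables need the same schema, and the destination must not already contain
the partition being exchanged:

```sql
-- Hypothetical names: moves partition load_month='2018-02' from the
-- staging table into the production table.
ALTER TABLE prod.orders
EXCHANGE PARTITION (load_month = '2018-02')
WITH TABLE staging.orders_stage;
```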

Now with Kudu, I cannot seem to find a good strategy. The only thing that
comes to mind is to drop the production table and rename the staging table
to the production table name as the last step of the job, but in that case
we are going to lose statistics and security permissions.

Any other ideas?

Thanks!
Boris
