Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
load capabilities or staging abilities. In theory, renaming a partition
atomically shouldn't be that hard to implement, since it's just a master
metadata operation, but it's not yet implemented.
There is a JIRA to track a generic bulk load API here:
Since I couldn't find anything to track the specific features you
mentioned, I just filed the following improvement JIRAs so we can track them:
- KUDU-2326: Support atomic bulk load operation
- KUDU-2327: Support atomic swap of tables or partitions
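
To illustrate why a swap can be atomic when it is purely a metadata
operation, here is a minimal Python sketch. It does not use any Kudu API:
the Catalog class, its methods, and the table ids are all hypothetical
stand-ins for a master's name-to-table mapping, and a real implementation
would also have to deal with persistence and with clients that cache
metadata.

```python
import threading


class Catalog:
    """Toy stand-in for a master's table-name -> table-id map.

    Hypothetical: this is not Kudu's API, only an illustration.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # table name -> opaque table id

    def create(self, name, table_id):
        with self._lock:
            if name in self._tables:
                raise KeyError(f"table {name!r} already exists")
            self._tables[name] = table_id

    def lookup(self, name):
        with self._lock:
            return self._tables[name]

    def swap(self, name_a, name_b):
        """Exchange the ids behind two names in one critical section."""
        with self._lock:
            # Validate before mutating so a failed swap changes nothing.
            if name_a not in self._tables or name_b not in self._tables:
                raise KeyError("both tables must exist")
            self._tables[name_a], self._tables[name_b] = (
                self._tables[name_b],
                self._tables[name_a],
            )


catalog = Catalog()
catalog.create("prod", "tablet-set-1")      # assumed ids, for illustration
catalog.create("staging", "tablet-set-2")
catalog.swap("prod", "staging")
print(catalog.lookup("prod"))     # tablet-set-2
print(catalog.lookup("staging"))  # tablet-set-1
```

Because both names are updated under the same lock, a reader sees either
the old mapping or the new one, never a state where "prod" is missing. The
drop-then-rename approach has exactly that gap between its two steps.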
On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <bo...@boristyukin.com> wrote:
> I am trying to figure out the best and safest way to swap data in a
> production Kudu table with data from a staging table.
> Basically, once every few months we need to perform a full reload of some
> tables. These tables are pretty large, with billions of rows, and we want
> to minimize the risk and downtime for users if something bad happens in
> the middle of that process.
> With Hive and Impala on HDFS, we can use a very handy command, LOAD DATA
> INPATH. We can prepare the data for reload in a staging table upfront, and
> this process might take many hours. Once the staging table is ready, we
> can issue a LOAD DATA INPATH command, which moves the underlying HDFS
> files to the production table. This operation is almost instant and is
> the very last step in our pipeline.
> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE PARTITION.
> Now with Kudu, I cannot seem to find a good strategy. The only thing that
> came to my mind is to drop the production table and rename the staging
> table to the production table as the last step of the job, but in this
> case we are going to lose statistics and security permissions.
> Any other ideas?