Hi guys, thanks again for your help! I just blogged about this https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
BTW, I did not have to invalidate or refresh metadata; the ALTER TABLE TBLPROPERTIES idea just worked. We have one Kudu master on our dev cluster, so I am not sure if that is the reason, but the Impala/Kudu docs also do not mention anything about a metadata refresh. It looks like Impala keeps a reference to the UUID of the Kudu table, not its actual name.

One thing I am still puzzled by is how Impala was able to finish a long-running SELECT statement that I had kicked off right before the swap. I did not get any error messages, and I could clearly see that the Kudu tables were getting renamed and dropped while the query was still running in a different session; it completed 10 seconds after the swap. This is still a mystery to me. The only explanation I have is that the data was already in the Impala daemons' memory and the query did not need the Kudu tables at that point.

Boris

On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <[email protected]> wrote:

> you guys are awesome, thanks!
>
> Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> Views might work as well, but for a number of reasons I want to keep them
> as my last resort :)
>
> On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <[email protected]> wrote:
>
>> A couple other ideas from the Impala side:
>>
>> - Could you use a view and alter the view to point to a different table?
>> Then all readers would be pointed at the view, and security permissions
>> could be on that view rather than on the underlying tables.
>>
>> - I think if you use an external table in Impala you could use an ALTER
>> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
>> different table. Then issue a 'refresh' on the impalads so that they load
>> the new metadata. Subsequent queries would hit the new underlying Kudu
>> table, but permissions and stats would be unchanged.
>>
>> -Todd
>>
>> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <[email protected]> wrote:
>>
>>> Hi Boris, those are good ideas.
>>> Currently Kudu does not have atomic bulk
>>> load capabilities or staging abilities. Theoretically, renaming a partition
>>> atomically should not be that hard to implement, since it is just a master
>>> metadata operation which can be done atomically, but it is not yet
>>> implemented.
>>>
>>> There is a JIRA to track a generic bulk load API here:
>>> https://issues.apache.org/jira/browse/KUDU-1370
>>>
>>> Since I could not find anything to track the specific features you
>>> mentioned, I just filed the following improvement JIRAs so we can track them:
>>>
>>> - KUDU-2326: Support atomic bulk load operation
>>> <https://issues.apache.org/jira/browse/KUDU-2326>
>>> - KUDU-2327: Support atomic swap of tables or partitions
>>> <https://issues.apache.org/jira/browse/KUDU-2327>
>>>
>>> Mike
>>>
>>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to figure out the best and safest way to swap data in a
>>>> production Kudu table with data from a staging table.
>>>>
>>>> Basically, once in a while we need to perform a full reload of some
>>>> tables (once every few months). These tables are pretty large, with
>>>> billions of rows, and we want to minimize the risk and downtime for
>>>> users if something bad happens in the middle of that process.
>>>>
>>>> With Hive and Impala on HDFS, we can use the very handy command LOAD
>>>> DATA INPATH. We can prepare data for the reload in a staging table
>>>> upfront, and this process might take many hours. Once the staging table
>>>> is ready, we can issue the LOAD DATA INPATH command, which moves the
>>>> underlying HDFS files to the production table - this operation is
>>>> almost instant and is the very last step in our pipeline.
>>>>
>>>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
>>>> PARTITION command.
>>>>
>>>> Now with Kudu, I cannot seem to find a good strategy.
>>>> The only thing that
>>>> came to mind is to drop the production table and rename the staging
>>>> table to the production table name as the last step of the job, but in
>>>> that case we are going to lose statistics and security permissions.
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks!
>>>> Boris
>>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
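
For readers following the thread, the external-table swap Todd describes (and Boris tested) can be sketched in Impala SQL roughly as below. The table names (prod_events, prod_events_v2, and the Kudu-side names) are hypothetical, and this is a sketch under the thread's assumptions, not a verified recipe; test it on your own cluster before relying on it.

```sql
-- One-time setup: an external Impala table mapped onto a Kudu table.
-- Stats and permissions are granted on this Impala table and survive the swap.
CREATE EXTERNAL TABLE prod_events
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'impala::default.prod_events_v1');

-- ... hours later, after fully loading a new Kudu table
-- 'impala::default.prod_events_v2' via a staging pipeline ...

-- The swap itself: repoint the external table at the new Kudu table.
ALTER TABLE prod_events
SET TBLPROPERTIES ('kudu.table_name' = 'impala::default.prod_events_v2');

-- Todd suggests refreshing metadata on the impalads; in Boris's test on a
-- single-master dev cluster this turned out not to be required, but it is
-- cheap insurance:
REFRESH prod_events;

-- Once no in-flight queries need it, drop the old Kudu table
-- (e.g. via the Kudu client API or an Impala table mapped onto it).
```

The view-based alternative mentioned in the thread works similarly: point consumers at `CREATE VIEW prod_events_view AS SELECT * FROM prod_events_v1;` and swap with `ALTER VIEW ... AS SELECT * FROM prod_events_v2;`, keeping grants on the view.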
