Thanks Boris for a great article!

Tomas

On 2018/07/25 19:56:10, Boris Tyukin <bo...@boristyukin.com> wrote:
> Hi guys,
>
> thanks again for your help! I just blogged about this
> https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
>
> BTW I did not have to invalidate or refresh metadata - it just worked with
> the ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev
> cluster so not sure if it is because of that, but Impala/Kudu docs also do
> not mention anything about metadata refresh. Looks like Impala is keeping a
> reference to the uuid of the Kudu table, not its actual name.
>
> One thing I am still puzzled by is how Impala was able to finish my
> long-running SELECT statement that I had kicked off right before the swap.
> I did not get any error messages and I could clearly see that Kudu tables
> were getting renamed and dropped, while the query was still running in a
> different session and completed 10 seconds after the swap. This is still a
> mystery to me. The only explanation I have is that data was already in
> Impala daemons' memory and did not need Kudu tables at that point.
>
> Boris
>
> On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>
> > you guys are awesome, thanks!
> >
> > Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> > Views might work as well but for a number of reasons I want to keep that
> > as my last resort :)
> >
> > On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> >> A couple other ideas from the Impala side:
> >>
> >> - could you use a view and alter the view to point to a different table?
> >> Then all readers would be pointed at the view, and security permissions
> >> could be on that view rather than the underlying tables?
> >>
> >> - I think if you use an external table in Impala you could use an ALTER
> >> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
> >> different table. Then issue a 'refresh' on the impalads so that they load
> >> the new metadata. Subsequent queries would hit the new underlying Kudu
> >> table, but permissions and stats would be unchanged.
> >>
> >> -Todd
> >>
> >> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mpe...@apache.org> wrote:
> >>
> >>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> >>> load capabilities or staging abilities. Theoretically renaming a
> >>> partition atomically shouldn't be that hard to implement, since it's
> >>> just a master metadata operation which can be done atomically, but it's
> >>> not yet implemented.
> >>>
> >>> There is a JIRA to track a generic bulk load API here:
> >>> https://issues.apache.org/jira/browse/KUDU-1370
> >>>
> >>> Since I couldn't find anything to track the specific features you
> >>> mentioned, I just filed the following improvement JIRAs so we can track
> >>> them:
> >>>
> >>> - KUDU-2326: Support atomic bulk load operation
> >>> <https://issues.apache.org/jira/browse/KUDU-2326>
> >>> - KUDU-2327: Support atomic swap of tables or partitions
> >>> <https://issues.apache.org/jira/browse/KUDU-2327>
> >>>
> >>> Mike
> >>>
> >>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <bo...@boristyukin.com>
> >>> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I am trying to figure out the best and safest way to swap data in a
> >>>> production Kudu table with data from a staging table.
> >>>>
> >>>> Basically, once in a while we need to perform a full reload of some
> >>>> tables (once in a few months). These tables are pretty large, with
> >>>> billions of rows, and we want to minimize the risk and downtime for
> >>>> users if something bad happens in the middle of that process.
> >>>>
> >>>> With Hive and Impala on HDFS, we can use a very cool handy command LOAD
> >>>> DATA INPATH. We can prepare data for reload in a staging table upfront
> >>>> and this process might take many hours. Once the staging table is
> >>>> ready, we can issue the LOAD DATA INPATH command, which will move the
> >>>> underlying HDFS files to a production table - this operation is almost
> >>>> instant and the very last step in our pipeline.
> >>>>
> >>>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
> >>>> PARTITION command.
> >>>>
> >>>> Now with Kudu, I cannot seem to find a good strategy. The only thing
> >>>> that came to my mind is to drop the production table and rename a
> >>>> staging table to the production table as the last step of the job, but
> >>>> in this case we are going to lose statistics and security permissions.
> >>>>
> >>>> Any other ideas?
> >>>>
> >>>> Thanks!
> >>>> Boris
> >>>>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
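
For readers skimming the thread, the two Impala-side ideas discussed above (repointing an external table through the kudu.table_name property, or repointing a view) can be sketched as a small Python helper that only builds the SQL text. This is a rough illustration, not the exact procedure from the blog post: the function names and the table names in the usage section are hypothetical, and whether the REFRESH step is needed is left as a flag because the thread reports it both ways.

```python
import re

# Conservative identifier check so the generated SQL cannot be altered by
# stray quotes or whitespace. The allowed character set is an assumption.
_IDENT = re.compile(r"^[A-Za-z0-9_.:]+$")


def _check(name):
    if not _IDENT.match(name):
        raise ValueError(f"unexpected characters in identifier: {name!r}")
    return name


def repoint_external_table(impala_table, new_kudu_table, refresh=False):
    """Build statements that point an external Impala table at another Kudu table.

    refresh=True appends the 'refresh' step suggested in the thread; the
    follow-up message reports that the swap worked without it.
    """
    stmts = [
        f"ALTER TABLE {_check(impala_table)} "
        f"SET TBLPROPERTIES('kudu.table_name' = '{_check(new_kudu_table)}')"
    ]
    if refresh:
        stmts.append(f"REFRESH {impala_table}")
    return stmts


def repoint_view(view_name, new_table):
    """Build the statement for the view-based alternative mentioned in the thread."""
    return [f"ALTER VIEW {_check(view_name)} AS SELECT * FROM {_check(new_table)}"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in repoint_external_table(
        "prod.events", "impala::staging.events_reload", refresh=True
    ):
        print(stmt + ";")
    for stmt in repoint_view("prod.events_v", "staging.events_reload"):
        print(stmt + ";")
```

The helper returns plain strings so they can be reviewed or run through whatever Impala client is already in use. Permissions and statistics stay attached to the Impala table or view, which is the reason both ideas were proposed instead of dropping and renaming tables.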