Re: swap data in Kudu table

farkas Thu, 02 Aug 2018 07:16:05 -0700

Thanks Boris for a great article!
Tomas


On 2018/07/25 19:56:10, Boris Tyukin <[email protected]> wrote: 
> Hi guys,
> 
> thanks again for your help!  I just blogged about this
> https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
> 
> BTW I did not have to invalidate or refresh metadata - it just worked with
> the ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev
> cluster, so I am not sure if it is because of that, but the Impala/Kudu docs
> also do not mention anything about a metadata refresh. It looks like Impala
> is keeping a reference to the UUID of the Kudu table, not its actual name.
> 
> One thing I am still puzzled about is how Impala was able to finish my
> long-running SELECT statement, which I had kicked off right before the swap.
> I did not get any error messages, and I could clearly see that the Kudu
> tables were getting renamed and dropped while the query was still running in
> a different session; it completed 10 seconds after the swap. This is still a
> mystery to me. The only explanation I have is that the data was already in
> the Impala daemons' memory and the query did not need the Kudu tables at
> that point.
> 
> Boris
> 
> 
> 
> On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <[email protected]> wrote:
> 
> > you guys are awesome, thanks!
> >
> > Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> > Views might work as well, but for a number of reasons I want to keep them
> > as my last resort :)
> >
> > On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <[email protected]> wrote:
> >
> >> A couple other ideas from the Impala side:
> >>
> >> - could you use a view and alter the view to point to a different table?
> >> Then all readers would be pointed at the view, and security permissions
> >> could be on that view rather than the underlying tables?
> >>
> >> - I think if you use an external table in Impala you could use an ALTER
> >> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
> >> different table. Then issue a 'refresh' on the impalads so that they load
> >> the new metadata. Subsequent queries would hit the new underlying Kudu
> >> table, but permissions and stats would be unchanged.
> >>
> >> -Todd
> >>
> >> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <[email protected]> wrote:
> >>
> >>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> >>> load capabilities or staging abilities. Theoretically renaming a partition
> >>> atomically shouldn't be that hard to implement, since it's just a master
> >>> metadata operation which can be done atomically, but it's not yet
> >>> implemented.
> >>>
> >>> There is a JIRA to track a generic bulk load API here:
> >>> https://issues.apache.org/jira/browse/KUDU-1370
> >>>
> >>> Since I couldn't find anything to track the specific features you
> >>> mentioned, I just filed the following improvement JIRAs so we can track
> >>> them:
> >>>
> >>>    - KUDU-2326: Support atomic bulk load operation
> >>>    <https://issues.apache.org/jira/browse/KUDU-2326>
> >>>    - KUDU-2327: Support atomic swap of tables or partitions
> >>>    <https://issues.apache.org/jira/browse/KUDU-2327>
> >>>
> >>> Mike
> >>>
> >>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <[email protected]>
> >>> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I am trying to figure out the best and safest way to swap data in a
> >>>> production Kudu table with data from a staging table.
> >>>>
> >>>> Basically, once in a while we need to perform a full reload of some
> >>>> tables (once every few months). These tables are pretty large, with
> >>>> billions of rows, and we want to minimize the risk and downtime for
> >>>> users if something bad happens in the middle of that process.
> >>>>
> >>>> With Hive and Impala on HDFS, we can use a very handy command, LOAD
> >>>> DATA INPATH. We can prepare data for the reload in a staging table
> >>>> upfront, and this process might take many hours. Once the staging table
> >>>> is ready, we can issue the LOAD DATA INPATH command, which will move the
> >>>> underlying HDFS files to a production table - this operation is almost
> >>>> instant and is the very last step in our pipeline.
> >>>>
> >>>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
> >>>> PARTITION command.
> >>>>
> >>>> Now with Kudu, I cannot seem to find a good strategy. The only thing
> >>>> that came to my mind is to drop the production table and rename a
> >>>> staging table to the production table as the last step of the job, but
> >>>> in this case we are going to lose statistics and security permissions.
> >>>>
> >>>> Any other ideas?
> >>>>
> >>>> Thanks!
> >>>> Boris
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
> >
>
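
For reference, the two Impala-side approaches discussed in this thread (repointing an external table through kudu.table_name, and repointing a view) can be sketched as a small Python helper that only builds the statement text. This is a sketch under assumptions, not a tested procedure: the table and view names are hypothetical, the functions repoint_external_table and repoint_view are made up for illustration, and the REFRESH step is left as a flag because Todd suggested it while Boris reported not needing it on his dev cluster.

```python
def repoint_external_table(impala_table, kudu_table, refresh=False):
    """Build the Impala statements that point an EXTERNAL Impala table
    at a different underlying Kudu table.

    Assumes both names are trusted identifiers (no escaping is done here).
    """
    statements = [
        f"ALTER TABLE {impala_table} "
        f"SET TBLPROPERTIES('kudu.table_name' = '{kudu_table}')"
    ]
    if refresh:
        # Optional: suggested in the thread, reported as unnecessary
        # in one dev-cluster test.
        statements.append(f"REFRESH {impala_table}")
    return statements


def repoint_view(view, new_table):
    """Build the statement that makes an existing view read from another table."""
    return f"ALTER VIEW {view} AS SELECT * FROM {new_table}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in repoint_external_table("prod_db.events", "events_20180725", refresh=True):
        print(stmt + ";")
    print(repoint_view("prod_db.events_v", "prod_db.events_20180725") + ";")
```

The generated statements would be run through impala-shell or whichever Impala client is in use, as the last step of the pipeline and only after the staging Kudu table is fully loaded. Permissions and stats stay attached to the Impala table or view that readers query, which is the point of both approaches.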
