Hi guys,

thanks again for your help! I just blogged about this:
https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/

BTW, I did not have to invalidate or refresh metadata - the ALTER TABLE
TBLPROPERTIES approach just worked. We have a single Kudu master on our dev
cluster, so I am not sure if that is the reason, but the Impala/Kudu docs
also do not mention anything about a metadata refresh. It looks like Impala
keeps a reference to the UUID of the Kudu table rather than its actual
name.
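
The core of the swap was an ALTER TABLE ... SET TBLPROPERTIES statement
along these lines (just a sketch - the database and table names are made
up, and prod_table is an external Impala table backed by Kudu):

  -- repoint the external Impala table at the freshly loaded Kudu table
  ALTER TABLE mydb.prod_table
  SET TBLPROPERTIES ('kudu.table_name' = 'impala::mydb.staging_table');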

One thing I am still puzzled by is how Impala was able to finish my
long-running SELECT statement, which I had kicked off right before the
swap. I did not get any error messages, and I could clearly see that the
Kudu tables were being renamed and dropped while the query was still
running in a different session; it completed 10 seconds after the swap.
This is still a mystery to me. The only explanation I have is that the data
was already in the Impala daemons' memory, so the query no longer needed
the Kudu tables at that point.

Boris



On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <[email protected]> wrote:

> you guys are awesome, thanks!
>
> Todd, I like the ALTER TABLE TBLPROPERTIES idea - I will test it next week.
> Views might work as well, but for a number of reasons I want to keep them
> as my last resort :)
>
> On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <[email protected]> wrote:
>
>> A couple other ideas from the Impala side:
>>
>> - could you use a view and alter the view to point to a different table
>> (sketched below)? Then all readers would be pointed at the view, and
>> security permissions could be on that view rather than the underlying
>> tables?
>>
>> - I think if you use an external table in Impala you could use an ALTER
>> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
>> different table. Then issue a 'refresh' on the impalads so that they load
>> the new metadata. Subsequent queries would hit the new underlying Kudu
>> table, but permissions and stats would be unchanged.
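>>
>> The view idea could look roughly like this - just a sketch with made-up
>> names:
>>
>>   CREATE VIEW mydb.prod_v AS SELECT * FROM mydb.my_table_v1;
>>   -- later, once the new table has been loaded, repoint readers:
>>   ALTER VIEW mydb.prod_v AS SELECT * FROM mydb.my_table_v2;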
>>
>> -Todd
>>
>> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <[email protected]> wrote:
>>
>>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
>>> load capabilities or staging abilities. Theoretically renaming a partition
>>> atomically shouldn't be that hard to implement, since it's just a master
>>> metadata operation which can be done atomically, but it's not yet
>>> implemented.
>>>
>>> There is a JIRA to track a generic bulk load API here:
>>> https://issues.apache.org/jira/browse/KUDU-1370
>>>
>>> Since I couldn't find anything to track the specific features you
>>> mentioned, I just filed the following improvement JIRAs so we can track them:
>>>
>>>    - KUDU-2326: Support atomic bulk load operation
>>>    <https://issues.apache.org/jira/browse/KUDU-2326>
>>>    - KUDU-2327: Support atomic swap of tables or partitions
>>>    <https://issues.apache.org/jira/browse/KUDU-2327>
>>>
>>> Mike
>>>
>>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to figure out the best and safest way to swap data in a
>>>> production Kudu table with data from a staging table.
>>>>
>>>> Basically, once in a while (every few months) we need to perform a full
>>>> reload of some tables. These tables are pretty large, with billions of
>>>> rows, and we want to minimize the risk and downtime for users if
>>>> something bad happens in the middle of that process.
>>>>
>>>> With Hive and Impala on HDFS, we can use the very handy LOAD DATA INPATH
>>>> command. We can prepare the data for the reload in a staging table
>>>> upfront, and that process might take many hours. Once the staging table
>>>> is ready, we issue a LOAD DATA INPATH command, which moves the underlying
>>>> HDFS files into the production table - this operation is almost instant
>>>> and is the very last step in our pipeline.
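>>>>
>>>> For example, something along these lines (made-up path and table names):
>>>>
>>>>   -- replace the production table's files with the prepared staging files
>>>>   LOAD DATA INPATH '/staging/my_table'
>>>>   OVERWRITE INTO TABLE mydb.prod_table;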
>>>>
>>>> Alternatively, we can swap partitions using the ALTER TABLE ... EXCHANGE
>>>> PARTITION command.
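>>>>
>>>> In Hive that looks roughly like this (made-up names and partition column):
>>>>
>>>>   -- swap the named partition from the staging table into production
>>>>   ALTER TABLE mydb.prod_table EXCHANGE PARTITION (load_month = '2018-02')
>>>>   WITH TABLE mydb.staging_table;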
>>>>
>>>> Now with Kudu, I cannot seem to find a good strategy. The only thing that
>>>> came to my mind is to drop the production table and rename the staging
>>>> table to the production table's name as the last step of the job, but in
>>>> that case we are going to lose statistics and security permissions.
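>>>>
>>>> In other words, something like this as the last step (made-up names):
>>>>
>>>>   DROP TABLE mydb.prod_table;
>>>>   -- stats and grants on the old production table do not carry over
>>>>   ALTER TABLE mydb.staging_table RENAME TO mydb.prod_table;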
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks!
>>>> Boris
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
