Re: swap data in Kudu table

2018-08-04 Thread Boris
Thanks so much, Tomas, glad you liked it. But as you might have seen in
another thread already, the workaround I described won't work with Impala
2.12 due to a breaking change.

On Thu, Aug 2, 2018, 07:18 far...@tf-bic.sk  wrote:

> Thanks Boris for a great article!
> Tomas

Re: swap data in Kudu table

2018-08-02 Thread farkas
Thanks Boris for a great article!
Tomas



Re: swap data in Kudu table

2018-07-25 Thread Boris Tyukin
Hi guys,

thanks again for your help! I just blogged about this:
https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/

BTW I did not have to invalidate or refresh metadata - it just worked with
the ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev
cluster, so I am not sure if it is because of that, but the Impala/Kudu docs
also do not mention anything about a metadata refresh. It looks like Impala
keeps a reference to the UUID of the Kudu table, not its actual name.

One thing I am still puzzled about is how Impala was able to finish my
long-running SELECT statement, which I had kicked off right before the swap.
I did not get any error messages, and I could clearly see that the Kudu
tables were getting renamed and dropped while the query was still running in
a different session; it completed 10 seconds after the swap. This is still a
mystery to me. The only explanation I have is that the data was already in
the Impala daemons' memory and the query did not need the Kudu tables at
that point.

Boris
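
For reference, a minimal sketch of the TBLPROPERTIES-based swap described
above, assuming the Impala table was created as an external table backed by
Kudu (all database, table, and Kudu table names are illustrative):

    -- Production queries go against this external table; it points at a Kudu
    -- table by name through the kudu.table_name property.
    CREATE EXTERNAL TABLE my_db.prod_table
    STORED AS KUDU
    TBLPROPERTIES ('kudu.table_name' = 'prod_table_v1');

    -- Prepare and validate the new data in a separate Kudu table
    -- (prod_table_v2), then repoint the external table at it:
    ALTER TABLE my_db.prod_table
    SET TBLPROPERTIES ('kudu.table_name' = 'prod_table_v2');

    -- Reportedly not needed above, but a refresh should not hurt:
    REFRESH my_db.prod_table;

    -- Once everything checks out, the old Kudu table (prod_table_v1) can be
    -- dropped.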



On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin  wrote:

> you guys are awesome, thanks!
>
> Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> Views might work as well, but for a number of reasons I want to keep that
> as my last resort :)
>
> On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon  wrote:
>
>> A couple other ideas from the Impala side:
>>
>> - could you use a view and alter the view to point to a different table?
>> Then all readers would be pointed at the view, and security permissions
>> could be on that view rather than the underlying tables?
>>
>> - I think if you use an external table in Impala you could use an ALTER
>> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
>> different table. Then issue a 'refresh' on the impalads so that they load
>> the new metadata. Subsequent queries would hit the new underlying Kudu
>> table, but permissions and stats would be unchanged.
>>
>> -Todd
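
A minimal sketch of the view-based alternative Todd describes above (view
and table names are made up); readers and grants stay on the view while the
underlying table is swapped:

    -- Readers query the view; permissions are granted on the view rather
    -- than on the underlying tables.
    CREATE VIEW my_db.prod_view AS SELECT * FROM my_db.prod_table_v1;

    -- After the replacement table is fully loaded, repoint the view:
    ALTER VIEW my_db.prod_view AS SELECT * FROM my_db.prod_table_v2;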


Re: swap data in Kudu table

2018-02-23 Thread Mike Percy
Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
load capabilities or staging abilities. Theoretically renaming a partition
atomically shouldn't be that hard to implement, since it's just a master
metadata operation which can be done atomically, but it's not yet
implemented.

There is a JIRA to track a generic bulk load API here:
https://issues.apache.org/jira/browse/KUDU-1370

Since I couldn't find anything to track the specific features you
mentioned, I just filed the following improvement JIRAs so we can track them:

   - KUDU-2326: Support atomic bulk load operation
   
   - KUDU-2327: Support atomic swap of tables or partitions
   

Mike

On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin  wrote:

> Hello,
>
> I am trying to figure out the best and safest way to swap data in a
> production Kudu table with data from a staging table.
>
> Basically, once in a while we need to perform a full reload of some tables
> (once in a few months). These tables are pretty large with billions of rows
> and we want to minimize the risk and downtime for users if something bad
> happens in the middle of that process.
>
> With Hive and Impala on HDFS, we can use a very handy command, LOAD DATA
> INPATH. We can prepare data for a reload in a staging table upfront, and
> this process might take many hours. Once the staging table is ready, we can
> issue the LOAD DATA INPATH command, which moves the underlying HDFS files to
> the production table - this operation is almost instant and is the very last
> step in our pipeline.
>
> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
> PARTITION command.
>
> Now with Kudu, I cannot seem to find a good strategy. The only thing that
> came to mind is to drop the production table and rename the staging table to
> the production table as the last step of the job, but in this case we are
> going to lose statistics and security permissions.
>
> Any other ideas?
>
> Thanks!
> Boris
>
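
For comparison, a rough sketch of the HDFS-side commands mentioned in the
quoted post above (paths, table names, and the partition column are made up;
EXCHANGE PARTITION is Hive syntax):

    -- Stage the data first (this may take hours), then move the underlying
    -- files into the production table in one quick step:
    LOAD DATA INPATH '/user/etl/staging/my_table' INTO TABLE my_db.prod_table;

    -- Or, in Hive, move an entire partition from the staging table into the
    -- production table:
    ALTER TABLE my_db.prod_table EXCHANGE PARTITION (load_month='2018-02')
    WITH TABLE my_db.staging_table;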