We implemented this by upserting changed elements and dropping others. On a 
given cluster it takes 4.5 hours to load HBase, but the trim and cleanup as 
currently implemented takes 4 days. Back to the drawing board.

I’ve read the references but still don’t grok what to do. I have a table with 
an event stream, containing duplicates and expired data. I’d like to find the 
most time-efficient way to remove duplicates and drop expired data from what 
I’ll call the main_table. This is being queried and added to all the time.

My first thought was to create a new clean_table with Spark by reading 
main_table, processing it, and writing clean_table; then renaming main_table 
to old_table and clean_table to main_table. I can then drop old_table. 
Ignoring what happens to events that arrive during the renames, this would be 
efficient because it is equivalent to a fresh load: no complex updates to 
tables in place and under load. 
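The "processing" step itself is just dedup plus expiry. A minimal sketch in plain Python rather than Spark (field names `event_id`, `ts` and the TTL are hypothetical; in Spark this would be a filter followed by a reduceByKey on the event key):

```python
import time

# Assumed record shape: (event_id, timestamp, payload).
TTL_SECONDS = 7 * 24 * 3600  # assumed retention window

def clean(events, now=None):
    """Drop expired records, then keep only the newest record per event_id."""
    now = now if now is not None else time.time()
    latest = {}
    for event_id, ts, payload in events:
        if now - ts > TTL_SECONDS:
            continue  # expired: drop instead of deleting in place
        # dedup: keep the most recent record per key
        if event_id not in latest or ts > latest[event_id][0]:
            latest[event_id] = (ts, payload)
    return [(eid, ts, payload) for eid, (ts, payload) in latest.items()]
```

The output is written once, append-only, into clean_table; no upserts or deletes ever touch the live table.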

Snapshots and clones seem to miss the issue, which is writing the cleaned data 
to some place that can then act like main_table, but clearly I don’t understand 
snapshots and clones. They seem to be a way to alias a table so that only 
changes are logged, without actually copying the data. I’m not sure I care 
about copying the data into an RDD, which will then undergo some transforms 
into a final RDD. This can be written efficiently into clean_table with no 
upserts or dropping of elements, which seem to be what causes things to slow 
to a halt.

So assuming I have clean_table, how do I get all queries to go to it, instead 
of main_table? Elasticsearch has an alias that I can just point somewhere new. 
Do I need to keep track of something like this outside of HBase and change it 
after creating clean_table, or am I missing how to do this with snapshots and 
clones?
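Since HBase has no equivalent of an Elasticsearch alias, the pointer would indeed have to live outside HBase (ZooKeeper, a config service, or even a one-row meta table all clients consult). A minimal sketch of the idea, with a hypothetical AliasStore standing in for whatever external store is chosen:

```python
import threading

class AliasStore:
    """Hypothetical external alias registry (in practice this could be a
    ZooKeeper znode, a config service, or a one-row meta table)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._aliases = {}

    def set(self, alias, table):
        # the cut-over is a single atomic pointer update
        with self._lock:
            self._aliases[alias] = table

    def resolve(self, alias):
        with self._lock:
            return self._aliases[alias]

aliases = AliasStore()
aliases.set("main_table", "events_v1")  # all readers/writers resolve the alias first

# after Spark has written the cleaned data to a new physical table:
aliases.set("main_table", "events_v2")  # atomic cut-over; drop events_v1 later
```

The same caveat raised about snapshots applies here too: writes that land in the old physical table between the Spark read and the cut-over still have to be replayed into the new one.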



From: Ted Yu <[email protected]>
Subject: Re: Rename tables or swap alias
Date: February 16, 2016 at 6:48:53 AM PST
To: [email protected]
Reply-To: [email protected]

Please see http://hbase.apache.org/book.html#ops.snapshots for background
on snapshots.

In Anil's description, table_old is the result of cloning the snapshot
which is taken in step #1. See
http://hbase.apache.org/book.html#ops.snapshots.clone

Cheers

On Tue, Feb 16, 2016 at 6:35 AM, Pat Ferrel <[email protected]> wrote:

> I think I can work out the algorithm if I knew precisely what a “snapshot”
> does. From my reading it seems to be a lightweight fast alias (for lack of
> a better word) since it creates something that refers to the same physical
> data. So if I create a new table with cleaned data, call it table_new, then
> I drop table_old and “snapshot” table_new into table_old? Is this what is
> suggested?
> 
> This leaves me with a small time where there is no table_old, which is the
> time between dropping table_old and creating a snapshot. Is it feasible to
> lock the DB for this time?
> 
>> On Feb 15, 2016, at 7:13 PM, Ted Yu <[email protected]> wrote:
>> 
>> Keep in mind that if the writes to this table are not paused, there would
>> be some data coming in between steps #1 and #2 which would not be in the
>> snapshot.
>> 
>> Cheers
>> 
>> On Mon, Feb 15, 2016 at 6:21 PM, Anil Gupta <[email protected]> wrote:
>> 
>>> I don't think there are any atomic operations in HBase to support DDL
>>> across 2 tables.
>>> 
>>> But, maybe you can use hbase snapshots.
>>> 1. Create an HBase snapshot.
>>> 2. Truncate the table.
>>> 3. Write data to the table.
>>> 4. Create a table from the snapshot taken in step #1 as table_old.
>>> 
>>> Now you have two tables: one with the current run's data and the other
>>> with the last run's data.
>>> I think the above process will suffice. But keep in mind that it is not
>>> atomic.
>>> 
>>> HTH,
>>> Anil
>>> Sent from my iPhone
>>> 
>>>> On Feb 15, 2016, at 4:25 PM, Pat Ferrel <[email protected]> wrote:
>>>> 
>>>> Any other way to do what I was asking? With Spark it is a very normal
>>>> thing to treat a table as immutable and create another to replace the
>>>> old.
>>>> 
>>>> Can you lock two tables and rename them in 2 actions then unlock in a
>>> very short period of time?
>>>> 
>>>> Or an alias for table names?
>>>> 
>>>> Didn’t see these in any docs or by Googling; any help is appreciated.
>>>> Writing all this data back to the original table would be a huge load on
>>>> a table being written to by external processes, and therefore under large
>>>> load to begin with.
>>>> 
>>>>> On Feb 14, 2016, at 5:03 PM, Ted Yu <[email protected]> wrote:
>>>>> 
>>>>> There is currently no native support for renaming two tables in one
>>>>> atomic action.
>>>>> 
>>>>> FYI
>>>>> 
>>>>>> On Sun, Feb 14, 2016 at 4:18 PM, Pat Ferrel <[email protected]> wrote:
>>>>>> 
>>>>>> I use Spark to take an old table and clean it up to create an RDD of
>>>>>> cleaned data. What I’d like to do is write all of the data to a new
>>>>>> table in HBase, then rename the table to the old name. If possible it
>>>>>> could be done by changing an alias to point to the new table, as long
>>>>>> as all external code uses the alias, or by a 2-table rename operation.
>>>>>> But I don’t see how to do this for HBase. I am dealing with a lot of
>>>>>> data, so I don’t want to do table modifications with deletes and
>>>>>> upserts; this would be incredibly slow. Furthermore, I don’t want to
>>>>>> disable the table for more than a tiny span of time.
>>>>>> 
>>>>>> Is it possible to have 2 tables and rename both in an atomic action, or
>>>>>> change some alias to point to the new table in an atomic action? If
>>>>>> not, what is the quickest way to achieve this to minimize time
>>>>>> disabled?
>>>> 
>>> 
> 
> 

