I think I could work out the algorithm if I knew precisely what a “snapshot” does. From my reading it seems to be a lightweight, fast alias (for lack of a better word), since it creates something that refers to the same physical data. So suppose I create a new table with cleaned data, call it table_new. Then I drop table_old and “snapshot” table_new into table_old? Is this what is suggested?
This leaves me with a short window in which there is no table_old: the time between dropping table_old and creating the snapshot. Is it feasible to lock the DB for this time?

> On Feb 15, 2016, at 7:13 PM, Ted Yu <[email protected]> wrote:
>
> Keep in mind that if the writes to this table are not paused, there would
> be some data coming in between steps #1 and #2 which would not be in the
> snapshot.
>
> Cheers
>
> On Mon, Feb 15, 2016 at 6:21 PM, Anil Gupta <[email protected]> wrote:
>
>> I dont think there is any atomic operations in hbase to support ddl across
>> 2 tables.
>>
>> But, maybe you can use hbase snapshots.
>> 1. Create a hbase snapshot.
>> 2. Truncate the table.
>> 3. Write data to the table.
>> 4. Create a table from snapshot taken in step #1 as table_old.
>>
>> Now you have two tables. One with current run data and other with last run
>> data.
>> I think above process will suffice. But, keep in mind that it is not
>> atomic.
>>
>> HTH,
>> Anil
>> Sent from my iPhone
>>
>>> On Feb 15, 2016, at 4:25 PM, Pat Ferrel <[email protected]> wrote:
>>>
>>> Any other way to do what I was asking. With Spark this is a very normal
>>> thing to treat a table as immutable and create another to replace the old.
>>>
>>> Can you lock two tables and rename them in 2 actions then unlock in a
>>> very short period of time?
>>>
>>> Or an alias for table names?
>>>
>>> Didn’t see these in any docs or Googling, any help is appreciated.
>>> Writing all this data back to the original table would be a huge load on a
>>> table being written to by external processes and therefore under large load
>>> to begin with.
>>>
>>>> On Feb 14, 2016, at 5:03 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>> There is currently no native support for renaming two tables in one
>>>> atomic action.
>>>>
>>>> FYI
>>>>
>>>>> On Sun, Feb 14, 2016 at 4:18 PM, Pat Ferrel <[email protected]> wrote:
>>>>>
>>>>> I use Spark to take an old table, clean it up to create an RDD of
>>>>> cleaned data. What I’d like to do is write all of the data to a new
>>>>> table in HBase, then rename the table to the old name. If possible it
>>>>> could be done by changing an alias to point to the new table as long as
>>>>> all external code uses the alias, or by a 2 table rename operation. But
>>>>> I don’t see how to do this for HBase. I am dealing with a lot of data so
>>>>> don’t want to do table modifications with deletes and upserts, this
>>>>> would be incredibly slow. Furthermore I don’t want to disable the table
>>>>> for more than a tiny span of time.
>>>>>
>>>>> Is it possible to have 2 tables and rename both in an atomic action, or
>>>>> change some alias to point to the new table in an atomic action. If not
>>>>> what is the quickest way to achieve this to minimize time disabled.
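
To make the snapshot question concrete, here is a rough toy model of the four snapshot steps quoted above. It is plain Python and does not use or imitate the HBase API: `ToyCluster`, `live_table`, `last_run` and the row values are all made-up names for illustration. The one assumption it encodes is that a snapshot is a point-in-time set of references to the table's existing data files rather than a live alias, so a write that arrives after step 1 and before step 2 ends up in neither table.

```python
class ToyCluster:
    """Toy in-memory stand-in for a cluster. NOT the HBase API."""

    def __init__(self):
        self.tables = {}     # table name -> list of immutable "files" (tuples of rows)
        self.snapshots = {}  # snapshot name -> tuple of references to those files

    def write(self, table, *rows):
        # Each write adds a new immutable file to the table.
        self.tables.setdefault(table, []).append(tuple(rows))

    def snapshot(self, table, name):
        # Record references to the current files; no row data is copied.
        self.snapshots[name] = tuple(self.tables[table])

    def truncate(self, table):
        # The table forgets its files; a snapshot still references them.
        self.tables[table] = []

    def clone_snapshot(self, name, table):
        # The new table starts out pointing at the snapshot's files.
        self.tables[table] = list(self.snapshots[name])

    def scan(self, table):
        return [row for data_file in self.tables[table] for row in data_file]


cluster = ToyCluster()
cluster.write("live_table", "old1", "old2")

cluster.snapshot("live_table", "last_run")       # step 1
cluster.write("live_table", "late_write")        # arrives between steps 1 and 2
cluster.truncate("live_table")                   # step 2
cluster.write("live_table", "clean1", "clean2")  # step 3
cluster.clone_snapshot("last_run", "table_old")  # step 4

print(cluster.scan("live_table"))  # ['clean1', 'clean2']
print(cluster.scan("table_old"))   # ['old1', 'old2']
```

In this sketch `late_write` is silently lost, which is the caveat about unpaused writes, and `table_old` does not exist at all until step 4 completes, so readers of `table_old` would need to tolerate that window or be paused for it.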
