We're using Trumpet (http://verisign.github.io/trumpet/), an iNotify-like system for HDFS, as the foundation of this kind of inter-cluster replication. In a nutshell, every new file created in Cluster A notifies a replication system, which copies the file to Cluster B (see https://github.com/verisign/trumpet/blob/master/examples/src/main/java/com/verisign/vscc/hdfs/trumpet/client/example/TestApp.java for an example). For keeping Hive partitions in sync, https://github.com/daplab/hive-auto-partitioner should do it (it also relies on Trumpet).
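The notify-then-copy flow can be sketched roughly as below. To be clear, this is not Trumpet's actual client API (see the linked TestApp for that) -- it is a minimal Python illustration of the pattern, where the event stream, the `copy_to_b` callback, and the event fields are all hypothetical stand-ins:

```python
# Sketch of an event-driven replicator: every CLOSE event for a file
# written on Cluster A triggers a copy to Cluster B.
# The event source and copy callback are hypothetical stand-ins for
# Trumpet's client and an HDFS copy; all names are illustrative.

def replicate(events, copy_to_b, copied_log):
    """Consume inotify-style events and mirror closed files to Cluster B."""
    for event in events:
        # Act only once the writer has closed the file, so we never
        # ship a half-written file to the backup cluster.
        if event["type"] == "CLOSE":
            copy_to_b(event["path"])
            copied_log.append(event["path"])

# Simulated event stream standing in for Trumpet notifications.
fake_events = [
    {"type": "CREATE", "path": "/data/events/part-0000"},
    {"type": "CLOSE",  "path": "/data/events/part-0000"},
    {"type": "CLOSE",  "path": "/data/events/part-0001"},
]
copied = []
replicate(fake_events, lambda path: None, copied)
print(copied)  # ['/data/events/part-0000', '/data/events/part-0001']
```

In a real deployment the callback would shell out to distcp or use the HDFS FileSystem API; the point is only that replication is driven by per-file events rather than by periodic full scans.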
Benoit

On Wed, Feb 10, 2016 at 7:37 PM David Whitmore <[email protected]> wrote:

> Vivek,
>
> You are correct, distcp will overwrite a file if it has changed or is new.
> As for running this in real time (i.e. as soon as data is deposited on the
> source cluster), you will have to handle that yourself.
>
> Please be aware that if you are talking about Hive tables, you will also
> need the Hive metastore.
>
> We copy our critical data from a Production Cluster to another Production
> Cluster and to a Test Cluster on a daily basis, along with the contents of
> the Hive Metastore database.
>
> Be aware that if you restore the Hive Metastore database on the destination
> cluster, any tables created solely on the destination cluster may disappear.
>
> David
>
> *From:* Vivek Singh Raghuwanshi [mailto:[email protected]]
> *Sent:* Wednesday, February 10, 2016 1:28 PM
> *To:* [email protected]
> *Subject:* Re: Hadoop Backup and Archival Cluster
>
> Thanks David,
>
> I want to replicate the data once it has reached the cluster, and delete it
> from the source cluster after one year. I want Cluster B to work as a hot
> backup and archive, with Cluster A holding only the latest data.
>
> As per my information, distcp copies all the data and overwrites. Please
> correct me if I am wrong.
>
> On Wed, Feb 10, 2016 at 12:21 PM, David Whitmore
> <[email protected]> wrote:
>
> Yes, you can run a distcp to copy data from one cluster to another. distcp
> also has an option that tells it to delete files on the destination if they
> are NOT on the source.
>
> *From:* Vivek Singh Raghuwanshi [mailto:[email protected]]
> *Sent:* Wednesday, February 10, 2016 1:16 PM
> *To:* [email protected]
> *Subject:* Hadoop Backup and Archival Cluster
>
> Hi Friends,
>
> I am planning to set up a Hadoop Cluster (A) with cluster replication (B),
> so that once data reaches Cluster A it is replicated to Cluster B.
> I have one question: if I delete data from Cluster A on the basis of time
> (e.g. data older than one month), is it also removed from Cluster B? If
> yes, how can I avoid this?
>
> What I want to achieve:
>
> 1. Once data reaches Cluster A, it is automatically replicated to
> Cluster B.
>
> 2. After one year, old data is removed from Cluster A automatically, but
> not from Cluster B.
>
> 3. If anyone wants to query the latest data, Cluster A is available; for
> older data, Cluster B is available.
>
> Regards
>
> --
> ViVek Raghuwanshi
> Mobile - +91-09595950504
> Skype - vivek_raghuwanshi
> IRC - vivekraghuwanshi
> http://vivekraghuwanshi.wordpress.com/
> http://in.linkedin.com/in/vivekraghuwanshi
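The three requirements quoted above amount to replicate-without-delete plus a one-sided retention policy: sync A to B with `distcp -update` while omitting the `-delete` option (so purged files survive on B), then remove files older than one year from Cluster A only. A minimal Python sketch of that retention decision follows; the paths and timestamps are made up for illustration, and in practice the creation times would come from an HDFS listing:

```python
from datetime import datetime, timedelta

ONE_YEAR = timedelta(days=365)

def purge_candidates(cluster_a_files, now):
    """Return paths on Cluster A whose data is older than one year.

    cluster_a_files maps path -> creation time. Only Cluster A is ever
    purged; Cluster B keeps its copies, so old data stays queryable there.
    """
    return [path for path, created in cluster_a_files.items()
            if now - created > ONE_YEAR]

now = datetime(2016, 2, 10)
files_a = {
    "/data/2015-01-01/part-0000": datetime(2015, 1, 1),   # older than 1 year
    "/data/2015-12-01/part-0000": datetime(2015, 12, 1),  # recent, kept on A
}
to_delete = purge_candidates(files_a, now)
print(to_delete)  # ['/data/2015-01-01/part-0000']
```

Because the purge only ever runs against Cluster A and the sync never propagates deletions, Cluster B accumulates the full history, which satisfies the hot-backup-plus-archive split described above.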
