Hi Anu,

Thanks for the information. The link you provided does not work.
@Hari, let me do some quick research on what you guys can provide and get back to you.

On Wed, Jun 15, 2016, 10:59 AM Anu Engineer <[email protected]> wrote:

> Hi Max,
>
> Unfortunately, we don’t have a better solution at the moment. I am
> wondering if the right approach might be to use user-defined metadata
> (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and
> put that information along with the object that we are backing up.
>
> However, that would be a code change in DistCp, and not as easy as a
> script. But that would address the scalability issue that you are worried
> about.
>
> Thanks
> Anu
>
> *From: *max scalf <[email protected]>
> *Date: *Wednesday, June 15, 2016 at 7:15 AM
> *To: *HDP mailing list <[email protected]>
> *Subject: *HDFS backup to S3
>
> Hello Hadoop community,
>
> We are running Hadoop in AWS (not EMR), using the Hortonworks distro on
> EC2 instances. Everything is set up and working as expected. Our design
> calls for running HDFS/data nodes on local/ephemeral storage, and we have
> 3X replication enabled by default. All of the metastores (Hive, Oozie,
> Ranger, Ambari, etc.) are external to the cluster, using RDS/MySQL.
>
> The question that I have is with regard to backups. We want to run a
> nightly job that copies data from HDFS into S3. Since our cluster lives
> in AWS, the obvious choice is to run our backup to S3. We do not want a
> warm backup (backing up this cluster to another cluster); our RTO/RPO is
> 5 days for this cluster. So we can run distcp (something like the link
> below) to back up our HDFS to S3. We have tested this and it works just
> fine, but how do we go about storing the ownership/permissions of these
> files?
>
> http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
>
> As S3 is a blob store and does not store any ownership/permissions, how
> do we go about backing that up? One of the ideas I had was to run
> hdfs dfs -lsr (to recursively get the permissions/ownership of all files
> and folders), dump that into a file, and send that file over to S3 as
> well. I am guessing it will work now, but as the cluster grows it might
> not scale...
>
> So I wanted to find out how people manage backing up the
> ownership/permissions of HDFS files/folders when sending backups to a
> blob store like S3.
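
For reference, the listing idea from the original message can be sketched roughly as below. This is only an illustration, not a tested backup tool. It assumes the usual eight-column output of `hdfs dfs -ls -R` (the non-deprecated spelling of `-lsr`): permissions, replication, owner, group, size, date, time, path. The function names `symbolic_to_octal`, `parse_ls_line` and `restore_commands` are made up for this sketch, and the sample path and owner in the comments are hypothetical.

```python
import shlex


def symbolic_to_octal(sym):
    """Convert a 9-character string such as 'rwxr-xr-x' to octal text ('755')."""
    mode = 0
    for i, ch in enumerate(sym):
        # Any letter other than '-' or 'T' sets the bit at this position.
        # 't' means sticky plus execute; 'T' means sticky without execute.
        if ch not in "-T":
            mode |= 1 << (8 - i)
    if sym[8] in "tT":
        mode |= 0o1000
    return format(mode, "o").zfill(3)


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls -R` output; return None for other lines.

    Assumed layout (an assumption of this sketch):
    permissions replication owner group size date time path
    """
    parts = line.strip().split(None, 7)  # maxsplit keeps spaces inside the path
    if len(parts) < 8 or len(parts[0]) < 10 or parts[0][0] not in "-dl":
        return None  # e.g. "Found 2 items" or a blank line
    perms, _repl, owner, group, _size, _date, _time, path = parts
    return {
        "path": path,
        "is_dir": perms[0] == "d",
        "owner": owner,
        "group": group,
        # perms[1:10] ignores a trailing '+' that marks an ACL entry
        "mode": symbolic_to_octal(perms[1:10]),
    }


def restore_commands(entry):
    """Build the chown/chmod commands that would reapply one manifest entry."""
    path = shlex.quote(entry["path"])
    owner_group = shlex.quote(entry["owner"] + ":" + entry["group"])
    return [
        "hdfs dfs -chown {} {}".format(owner_group, path),
        "hdfs dfs -chmod {} {}".format(entry["mode"], path),
    ]
```

Each line is handled on its own, so the listing can be streamed into a manifest file without holding it all in memory. The time taken by the recursive listing itself still grows with the number of files, which is the scaling concern raised above.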
