Hi Max,

Unfortunately, we don’t have a better solution at the moment. I am wondering if 
the right approach might be to use user-defined metadata 
(http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and put 
that information along with the object that we are backing up.

However, that would be a code change in DistCp, and not as easy as a script. 
But that would address the scalability issue that you are worried about.
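
As a rough illustration of the metadata idea, the HDFS attributes could be 
packed into a small dictionary that travels with each object. This is only a 
sketch: the function names and the "hdfs-" key prefix are made up for this 
example, and nothing like it exists in DistCp today.

```python
# Hypothetical helpers; not part of DistCp or any AWS SDK.
META_PREFIX = "hdfs-"  # assumed naming convention for the metadata keys


def to_s3_metadata(owner, group, permission):
    """Pack HDFS owner, group and octal permission into a metadata dict."""
    return {
        META_PREFIX + "owner": owner,
        META_PREFIX + "group": group,
        META_PREFIX + "permission": permission,  # octal string, e.g. "750"
    }


def from_s3_metadata(metadata):
    """Recover (owner, group, permission) from an object's metadata on restore."""
    return (
        metadata[META_PREFIX + "owner"],
        metadata[META_PREFIX + "group"],
        metadata[META_PREFIX + "permission"],
    )
```

With boto3, a dictionary like this could be passed as the Metadata argument of 
put_object when the file is uploaded, and read back from the object on restore.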

Thanks
Anu



From: max scalf <[email protected]>
Date: Wednesday, June 15, 2016 at 7:15 AM
To: HDP mailing list <[email protected]>
Subject: HDFS backup to S3

Hello Hadoop community,

We are running Hadoop in AWS (not EMR), using the Hortonworks distribution on 
EC2 instances.  Everything is set up and working as expected.  Our design calls 
for running HDFS/data nodes on local/ephemeral storage, with 3x replication 
enabled by default.  All of the metastores (Hive, Oozie, Ranger, Ambari, etc.) 
are external to the cluster, using RDS/MySQL.

The question that I have is with regard to backups.  We want to run a nightly 
job that copies data from HDFS into S3.  Since our cluster lives in AWS, the 
obvious choice is to back up to S3.  We do not want a warm backup (backing up 
this cluster to another cluster); our RTO/RPO is 5 days for this cluster.  So 
we can run distcp (something like the link below) to back up our HDFS to S3. 
We have tested this and it works just fine, but how do we go about storing the 
ownership/permissions of these files?

http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script

As S3 is blob storage and does not store any ownership/permissions, how do we 
go about backing that up?  One of the ideas I had was to run hdfs dfs -lsr (to 
recursively get the permissions/ownership of all files and folders), dump that 
into a file, and send that file over to S3 as well.  I am guessing it will work 
for now, but as the cluster grows it might not scale...
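
A minimal sketch of that manifest idea follows, assuming the usual eight-column 
listing format (mode, replication, owner, group, size, date, time, path) 
printed by hdfs dfs -ls -R, which is the non-deprecated form of -lsr. The 
function names are hypothetical.

```python
def parse_ls_line(line):
    """Parse one listing line into a dict; return None for non-entry lines."""
    parts = line.rstrip("\n").split(None, 7)
    # Skip blank lines and headers such as "Found 2 items".
    if len(parts) < 8 or parts[0][0] not in "-dl":
        return None
    mode, _replication, owner, group, _size, _date, _time, path = parts
    return {"path": path, "mode": mode, "owner": owner, "group": group}


def build_manifest(lines):
    """Collect the ownership/permission entries from a full recursive listing."""
    return [entry for entry in map(parse_ls_line, lines) if entry is not None]
```

The resulting list could be written out (for example as JSON) and uploaded next 
to the data, then replayed with chown/chmod on restore.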

So I wanted to find out how people manage backing up the ownership/permissions 
of HDFS files/folders when sending backups to blob storage like S3.

