Hi Max,

Unfortunately, we don’t have a better solution at the moment. I am wondering if the right approach might be to use user-defined metadata (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and put that information along with the object that we are backing up.
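To make the metadata idea concrete, here is a minimal Python sketch of how HDFS ownership and permissions could be packed into S3 user-defined metadata and read back on restore. The key names (hdfs-owner, hdfs-group, hdfs-permission) and both helper functions are hypothetical, chosen only for illustration; nothing like this exists in DistCp today. S3 exposes user-defined metadata as x-amz-meta-* headers and limits its total size to 2 KB per object, which is plenty for these three fields.

```python
from typing import Dict, Tuple

# Hypothetical key names (an assumption, not an existing convention).
# S3 stores user-defined metadata keys in lowercase and exposes them
# as x-amz-meta-<key> headers on the object.
OWNER_KEY = "hdfs-owner"
GROUP_KEY = "hdfs-group"
PERM_KEY = "hdfs-permission"


def to_s3_metadata(owner: str, group: str, permission: str) -> Dict[str, str]:
    """Pack HDFS ownership/permission into a dict usable as S3 user metadata."""
    return {OWNER_KEY: owner, GROUP_KEY: group, PERM_KEY: permission}


def from_s3_metadata(metadata: Dict[str, str]) -> Tuple[str, str, str]:
    """Recover (owner, group, permission) from an object's user metadata."""
    return metadata[OWNER_KEY], metadata[GROUP_KEY], metadata[PERM_KEY]


# Example values are made up for illustration.
meta = to_s3_metadata("hive", "hadoop", "rwxr-x---")
print(meta)
```

With boto3, a dict like this could be passed as the Metadata argument of put_object when uploading the file, and read back from the Metadata field of head_object during a restore.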
However, that would be a code change in DistCp, and not as easy as a script. But it would address the scalability issue that you are worried about.

Thanks
Anu

From: max scalf <[email protected]>
Date: Wednesday, June 15, 2016 at 7:15 AM
To: HDP mailing list <[email protected]>
Subject: HDFS backup to S3

Hello Hadoop community,

We are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2 instances. Everything is set up and working as expected. Our design calls for running HDFS/data nodes on local/ephemeral storage, and we have 3X replication enabled by default. All of the metastores (Hive, Oozie, Ranger, Ambari, etc.) are external to the cluster, using RDS/MySQL.

The question that I have is with regard to backups. We want to run a nightly job that copies data from HDFS into S3. Since our cluster lives in AWS, the obvious choice is to back up to S3. We do not want a warm backup (backing up this cluster to another cluster); our RTO/RPO is 5 days for this cluster. So we can run distcp (something like the link below) to back up our HDFS to S3. We have tested this and it works just fine, but how do we go about storing the ownership/permissions of these files?

http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script

As S3 is a blob store and does not store any ownership/permissions, how do we go about backing that up? One idea I had was to run hdfs dfs -lsr (to recursively get the permissions/ownership of all files and folders), dump that into a file, and send that file over to S3 as well. I am guessing it will work for now, but as the cluster grows it might not scale...

So I wanted to find out how people manage backing up the ownership/permissions of HDFS files/folders when sending backups to a blob store like S3.
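
For the listing-based idea above, a minimal Python sketch of turning recursive listing output into an ownership/permission manifest might look like the following. It assumes the usual eight-column layout printed by hdfs dfs -ls -R (the non-deprecated form of -lsr): permissions, replication, owner, group, size, date, time, path. The Entry type, the parse_listing helper, and the sample rows are all made up for illustration. The parser is a generator, so it handles one line at a time and never holds the whole listing in memory, although the listing itself still has to walk the full namespace.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    permission: str
    owner: str
    group: str
    path: str


def parse_listing(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per listing row, skipping lines that are not full rows."""
    for line in lines:
        # Split into at most 8 fields so that paths containing spaces stay intact.
        parts = line.strip().split(None, 7)
        if len(parts) < 8:
            continue  # blank lines or any other non-listing output
        perm, _repl, owner, group, _size, _date, _time, path = parts
        yield Entry(perm, owner, group, path)


# Sample rows in the assumed layout; the values are invented.
SAMPLE = """\
drwxr-xr-x   - hdfs supergroup          0 2016-06-15 07:15 /data

-rw-r--r--   3 max  hadoop           1024 2016-06-15 07:15 /data/part 00000
"""

entries = list(parse_listing(SAMPLE.splitlines()))

# Tab-separated manifest that could be uploaded to S3 next to the backup.
manifest = "\n".join("\t".join(e) for e in entries)
print(manifest)
```

In a real job the lines would come from the command's standard output (for example by piping it into the script and iterating over sys.stdin) rather than from a sample string, and the restore side would replay the manifest with hdfs dfs -chown and hdfs dfs -chmod.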
