I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 TB of storage, using distcp. The dataset is a collection of tar files of about 1.7 TB each. Nothing else was stored in HDFS, but after the download completed, the namenode page says that 11.59 TB are in use. When I run hdfs dfs -du -s -h, I see that the dataset only takes up 3.8 TB, as expected. I navigated through the entire HDFS hierarchy from /, and don't see where the missing space is. Any ideas what is going on and how to rectify it?
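(One hypothesis worth checking, though I haven't confirmed it: HDFS block replication. `hdfs dfs -du` reports logical file sizes, while the namenode's "DFS Used" counts every replica. Assuming the stock default of dfs.replication = 3, which spark-ec2 may or may not override, the arithmetic roughly matches:)

```python
# Sanity check: does 3x replication explain the gap between
# the logical size (hdfs dfs -du) and the namenode's "DFS Used"?
logical_tb = 3.8      # size reported by hdfs dfs -du -s -h
replication = 3       # assumed dfs.replication default; not verified on this cluster
raw_tb = logical_tb * replication
print(round(raw_tb, 2))  # 11.4 -- close to the 11.59 TB the namenode reports
```

(The effective replication of a stored file can be checked with `hdfs dfs -stat %r <path>`; the small remainder over 11.4 TB would be non-HDFS disk usage and filesystem overhead counted by the namenode.)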
I'm using the spark-ec2 script to launch the cluster, with the command

spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge --placement-group=pcavariants --copy-aws-credentials --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch conversioncluster

and I am not modifying any Hadoop configuration files.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.