I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 TB of storage, using distcp. The dataset is a collection of tar
files of about 1.7 TB each. Nothing else was stored in HDFS, but after the
download completed, the namenode page says that 11.59 TB are in use. When I
run hdfs dfs -du -s -h, I see that the dataset only takes up 3.8 TB, as
expected. I navigated through the entire HDFS hierarchy from /, and I don't
see where the missing space is. Any ideas what is going on and how to
rectify it?
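One possibility I'm wondering about (my own guess, not verified on the cluster): HDFS replicates every block, dfs.replication defaults to 3, and the namenode's "DFS Used" figure counts raw bytes across all replicas, whereas du reports logical file size. The arithmetic roughly fits:

```python
# Sanity check of the replication hypothesis: raw usage on the namenode
# page would be the logical size times the replication factor.
logical_tb = 3.8     # size reported by hdfs dfs -du -s -h
replication = 3      # HDFS default value of dfs.replication
raw_tb = logical_tb * replication
print(raw_tb)        # 11.4 -- close to the 11.59 TB the namenode reports
```

The small remainder could plausibly be non-HDFS files on the same volumes, but the bulk of the gap matches a 3x replication factor.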

I'm using the spark-ec2 script to launch, with the command

spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge \
  --placement-group=pcavariants --copy-aws-credentials \
  --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 \
  launch conversioncluster

and am not modifying any configuration files for Hadoop.
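For what it's worth, here is how I'd check what replication the cluster actually ended up with, run on the master node (the dataset path below is just an example, not my real path):

```shell
# Effective default replication factor from the live configuration:
hdfs getconf -confKey dfs.replication

# Replication factor and size of one of the tar files:
hdfs dfs -stat 'replication=%r size=%b' /data/part-00000.tar

# Overall health report, including average block replication:
hdfs fsck /data
```

If those report a replication factor of 1 rather than 3, then replication would not explain the discrepancy and something else is consuming the space.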



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
