Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3, so the raw usage the namenode reports is roughly three times the logical size of the dataset.
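A quick way to confirm the setting on a given cluster (a sketch; it assumes the hdfs CLI is on the PATH and pointed at your cluster's configuration):

# Default replication factor applied to newly written files:
hdfs getconf -confKey dfs.replication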






Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread ajackson92
Most Hadoop installations use a block replication factor of 3. What you're
seeing is your dataset (3.8 TB) replicated 3 times (roughly 11.4 TB; the
namenode reports raw usage across all replicas, and the extra ~0.2 TB is
just the 3.8 TB figure being rounded down).
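Both numbers are visible from the command line (a sketch; /dataset stands in for wherever the files actually live):

# Logical size of the data, i.e. a single copy - this matches the
# ~3.8 TB figure (recent Hadoop releases also print a second column
# that includes replicas):
hdfs dfs -du -s -h /dataset

# Raw usage across the whole cluster, counting every replica - this is
# the number the namenode web page reports:
hdfs dfsadmin -report | grep "DFS Used"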






Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices the durability of your data: with a
single copy of each block, if a node goes down, the blocks stored on it
are lost - in case that's not obvious.
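If only part of the data is re-creatable, a per-path setting is a possible middle ground (a sketch; the path is hypothetical): leave the cluster-wide default at 3 and drop just the re-downloadable dataset to a single copy.

# Keep dfs.replication at 3 for everything else, but store one copy of
# the dataset that can be pulled from S3 again; -w waits until the
# extra replicas have actually been removed:
hdfs dfs -setrep -w 1 /data/s3-mirror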


Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs.replication factor. To fix it without
entirely clearing out HDFS and rebooting, I first ran
hdfs dfs -setrep -R -w 1 /
to reduce all existing files' replication factor to 1, recursively from
the root. Then I changed dfs.replication in
ephemeral-hdfs/conf/hdfs-site.xml (which only affects files written from
then on) and ran ephemeral-hdfs/sbin/stop-all.sh and start-all.sh.

Alex
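For reference, the whole fix in one place (a sketch based on the steps above; the paths follow the spark-ec2 ephemeral-hdfs layout):

# 1. Reduce existing files to a single replica; -w blocks until the
#    extra copies are gone (-R is accepted for compatibility - setrep
#    is recursive over directories anyway):
hdfs dfs -setrep -R -w 1 /

# 2. Set dfs.replication to 1 in ephemeral-hdfs/conf/hdfs-site.xml so
#    files written from now on also get a single copy.

# 3. Restart HDFS so the daemons pick up the new configuration:
ephemeral-hdfs/sbin/stop-all.sh
ephemeral-hdfs/sbin/start-all.sh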



Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default, so
3.8 × 3 ≈ 11.4 TB.
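This is easy to confirm per file (a sketch; the path is a placeholder): %r in the stat format string prints a file's replication factor.

# Prints 3 for files written with the default replication:
hdfs dfs -stat %r /path/to/one/of/the/tar/files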

-- 
Ye Xianjin





Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Koert Kuipers
What is your HDFS replication set to?
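For what it's worth, one way to avoid the surprise up front is to set the replication factor at copy time (a sketch; the bucket and target paths are placeholders - distcp is a Tool, so it accepts generic -D options):

# Write each block with a single replica while copying in from S3:
hadoop distcp -D dfs.replication=1 s3n://my-bucket/dataset hdfs:///dataset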

On Wed, Nov 25, 2015 at 1:31 AM, AlexG  wrote:

> I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2
> cluster with 16.73 TB of storage, using distcp. The dataset is a
> collection of tar files of about 1.7 TB each. Nothing else was stored
> in HDFS, but after completing the download, the namenode page says
> that 11.59 TB are in use. When I run hdfs dfs -du -s -h, I see that
> the dataset only takes up 3.8 TB as expected. I navigated through the
> entire HDFS hierarchy from / and don't see where the missing space
> is. Any ideas what is going on and how to rectify it?
>
> I'm using the spark-ec2 script to launch, with the command
>
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
>
> and am not modifying any configuration files for Hadoop.