a disk doesn't lose the node.
> But
> > if we have to replace that lost disk, it's again scheduling the whole node
> > down, which kicks off replication
> >
> > From: Matt Foley [mailto:mfo...@hortonworks.com]
> > Sent: Friday, November 11, 2011 1:58 AM
To: hdfs-user@hadoop.apache.org
Subject: Re: Sizing help
I agree with Ted's argument that 3x replication is way better than 2x. But
I do have to point out that, since 0.20.204, the loss of a disk no longer
causes the loss of a whole node (thankfully!) unless it's the system disk.
So in the example given, if you estimate a disk failure every 2 hours,
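The rest of that example is cut off in the archive, but a back-of-the-envelope
version of the point is easy to sketch. All of the numbers below (disk size,
disks per node, aggregate re-replication bandwidth) are assumptions for
illustration, not figures from this thread:

    # Rough sketch: time to re-replicate the data lost in a failure, comparing
    # a single bad disk against a whole downed node. All numbers are assumed.
    DISK_TB = 2.0                    # assumed capacity of one data disk
    DISKS_PER_NODE = 12              # assumed disks per datanode
    REREPLICATION_GB_PER_S = 1.0     # assumed aggregate copy rate for the cluster

    def hours_to_rereplicate(data_tb):
        return (data_tb * 1024) / REREPLICATION_GB_PER_S / 3600

    print("one disk  : %.1f hours" % hours_to_rereplicate(DISK_TB))
    print("whole node: %.1f hours" % hours_to_rereplicate(DISK_TB * DISKS_PER_NODE))

Per-volume failure handling keeps each incident small; losing the whole node
multiplies the re-replication work by the number of disks in it.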
Matt,
Thanks for pointing that out. I was talking about machine chassis failure
since it is the more serious case, but should have pointed out that losing
single disks is subject to the same logic with smaller amounts of data.
If, however, an installation uses RAID-0 for higher read speed then a
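That sentence is cut off above, but the RAID-0 concern is presumably that
striping ties the disks together, so any single-disk failure invalidates the
whole array and you are back to re-replicating a full node's worth of data.
A small sketch, with the node geometry assumed:

    # Sketch: data put at risk by one disk failure, JBOD vs RAID-0 striping.
    # Node geometry below is assumed (12 x 2 TB).
    disks, disk_tb = 12, 2.0

    jbod_loss_tb = disk_tb           # only the failed disk's blocks are lost
    raid0_loss_tb = disks * disk_tb  # one failure takes out the whole stripe set

    print("JBOD  : %.0f TB to re-replicate" % jbod_loss_tb)
    print("RAID-0: %.0f TB to re-replicate" % raid0_loss_tb)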
Another factor to consider: when a disk is going bad you may have corrupted
blocks which may only get detected by the periodic DataBlockScanner check.
I believe each datanode tries to finish the entire scan within
dfs.datanode.scan.period.hours (3-week default).
So with 2x replication and some undetected
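A quick sketch of why that scan period interacts badly with 2x replication.
The 3-week default is from the message above; the "half a period on average"
reasoning is a simplification:

    # Sketch: expected window during which a silently corrupted replica can
    # go unnoticed by the periodic DataNode block scan.
    SCAN_PERIOD_HOURS = 21 * 24          # dfs.datanode.scan.period.hours default

    # On average the corruption happens halfway through a scan cycle, so the
    # expected detection delay is roughly half the period.
    avg_delay_days = (SCAN_PERIOD_HOURS / 2) / 24
    print("average detection delay: ~%.1f days" % avg_delay_days)

With 2x replication, that is roughly a week and a half during which only one
good copy of the block may exist, so a single disk failure in that window can
lose data; with 3x there is still a second good replica.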
For archival purposes, you don't need speed (mostly). That eliminates one
argument for 3x replication.
If you have RAID-5 or RAID-6 on your storage nodes, then you eliminate most
of your disk failure costs at the cluster level. This gives you something
like 2.2x replication cost.
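One way a figure like 2.2x can come about (a sketch; the exact number depends
on the RAID group width, and both geometries below are assumptions):

    # Sketch: effective storage cost of 2x HDFS replication layered on top of
    # parity RAID, for two assumed group geometries.
    def effective_cost(replication, data_disks, parity_disks):
        raid_overhead = (data_disks + parity_disks) / float(data_disks)
        return replication * raid_overhead

    print("2x over RAID-5 (11+1): %.2f" % effective_cost(2, 11, 1))   # ~2.18
    print("2x over RAID-6 (10+2): %.2f" % effective_cost(2, 10, 2))   # ~2.40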
You can also u
That's a good point. What if HDFS is used as an archive? We don't really use
it for MapReduce, more for archival purposes.
On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning wrote:
> 3x replication has two effects. One is reliability. This is probably
> more important in large clusters than small.
>
>
3x replication has two effects. One is reliability. This is probably more
important in large clusters than small.
Another important effect is data locality during map-reduce. Having 3x
replication allows mappers to have almost all invocations read from local
disk. 2x replication compromises this.
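A crude way to see the locality effect (purely illustrative: real block
placement and task scheduling are smarter than this independence assumption,
and the cluster size and free-slot count below are made up):

    # Sketch: chance that none of a block's replicas lives on a node that has
    # a free map slot at scheduling time, treating placements as independent.
    def p_no_local_read(replicas, free_nodes, total_nodes):
        return ((total_nodes - free_nodes) / float(total_nodes)) ** replicas

    nodes, free = 100, 50      # assumed: busy cluster, half the nodes have a slot
    for r in (2, 3):
        print("%dx replication: ~%.0f%% of map tasks miss locality"
              % (r, 100 * p_no_local_read(r, free, nodes)))

Each extra replica roughly multiplies the non-local fraction by another
(busy-nodes / total-nodes) factor, which is why dropping from 3x to 2x shows
up as noticeably more remote reads on a loaded cluster.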
I have been running with 2x replication on a 500 TB cluster. No issues
whatsoever. 3x is for the super paranoid.
On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning wrote:
> Depending on which distribution and what your data center power limits are
> you may save a lot of money by going with machines that h
Depending on which distribution and what your data center power limits are,
you may save a lot of money by going with machines that have 12 x 2 or 3 TB
drives. With suitable engineering margins and 3x replication you can have
5 TB net data per node and 20 nodes per rack. If you want to go all cow
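Working that per-node figure through (the drive count and replication factor
are from the message; the ~35% headroom for MapReduce spill, OS, and growth is
an assumption chosen to show how 24 TB raw lands near 5 TB net):

    # Sketch: from raw drive capacity to net HDFS data per node and per rack.
    drives, drive_tb = 12, 2.0
    raw_tb = drives * drive_tb              # 24 TB raw per node
    headroom = 0.35                         # assumed reserve for temp/spill/OS
    replication = 3

    net_tb = raw_tb * (1 - headroom) / replication
    print("net data per node: ~%.1f TB" % net_tb)          # ~5.2 TB
    print("net data per rack: ~%.0f TB" % (net_tb * 20))   # 20 nodes per rack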
For a 1 PB installation you would need close to 170 servers with a 12 TB disk
pack installed on each (with a replication factor of 2). That's a conservative
estimate.
Datanodes: 4 cores with 16 GB of memory.
Namenode: 4 cores with 32 GB of memory should be OK.
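The server count works out as follows (taking the 12 TB of usable HDFS
capacity per server, as stated above, at face value):

    # Sketch: rough server count for 1 PB of net data at replication factor 2.
    import math

    net_tb = 1000.0          # 1 PB of user data
    replication = 2
    per_server_tb = 12.0     # usable HDFS capacity per server, as stated

    raw_tb = net_tb * replication
    print(math.ceil(raw_tb / per_server_tb), "servers")   # 167, i.e. close to 170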
On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed wrote