On Tue, Sep 30, 2008 at 09:54:04PM -0400, Miles Nordin wrote:
> ok, I get that S3 went down due to corruption, and that the network
> checksums I mentioned failed to prevent the corruption.  The missing
> piece is: belief that the corruption occurred on the network rather
> than somewhere else.
> 
> Their post-mortem sounds to me as though a bit flipped inside the
> memory of one server could be spread via this ``gossip'' protocol to
> infect the entire cluster.  The replication and spreadability of the
> data makes their cluster into a many-terabyte gamma ray detector.

A bit flipped inside one of the ends of an end-to-end system will not
be detected by that system.  So the CPU, memory and memory bus of each
end have to be trusted, and therefore need their own corruption
detection mechanisms (e.g., ECC memory).
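
To make that limit concrete, here is a minimal sketch in Python.  The
message text, the helper names (send, verify, flip_bit) and the choice
of SHA-256 are all assumptions made for illustration; none of this is
how S3 or its gossip protocol actually works.  A flip that lands before
the sending end computes its digest sails through verification, while
the same flip in transit is caught.

```python
import hashlib

def send(payload):
    """Sending end (hypothetical): attach a SHA-256 digest to the payload."""
    return payload, hashlib.sha256(payload).digest()

def verify(payload, digest):
    """Receiving end (hypothetical): recompute the digest and compare."""
    return hashlib.sha256(payload).digest() == digest

def flip_bit(data, bit):
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

original = b"gossip: node 17 is healthy"  # made-up message

# Case 1: the bit flips in the sender's memory BEFORE the digest is computed.
# The digest faithfully covers the wrong data, so the check passes.
bad_msg, bad_digest = send(flip_bit(original, 3))
passes_despite_corruption = verify(bad_msg, bad_digest)

# Case 2: the bit flips in transit, AFTER the digest is computed.
# The receiver's recomputed digest no longer matches, so the check fails.
good_msg, good_digest = send(original)
caught_in_transit = not verify(flip_bit(good_msg, 3), good_digest)
```

In case 1 nothing downstream can tell the data is wrong, which is why
the end itself needs something like ECC.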

In the S3 case it sounds like there's a lot of networking involved, and
that they weren't providing integrity protection for the gossip
protocol.  Given the two-bit flip that passed every Ethernet CRC and TCP
checksum, which we had within Sun a few years ago (much alluded to
elsewhere in this thread) and which happened in one faulty switch, I
would suspect the switch.  Also, years ago when 100Mbps Ethernet first came on
the market I saw lots of bad cat-5 wiring issues, where a wire would go
bad and start introducing errors just a few months into its useful life.
I don't trust the networking equipment -- I prefer end-to-end
protection.
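
As a rough illustration of why a transport checksum alone is not
enough, the sketch below computes a 16-bit ones'-complement sum in the
style TCP uses and shows one pair of bit flips that leaves it
unchanged, while a SHA-256 digest computed by the ends still notices.
The payload bytes and the function name are made up, the real TCP
checksum also covers a pseudo-header and the TCP header, and this does
not model the Ethernet CRC or reproduce the incident at Sun.

```python
import hashlib

def internet_checksum(data):
    """16-bit ones'-complement checksum over data (TCP/IP style)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = bytes([0x12, 0x35, 0x56, 0x78]) + b"gossip state"

# Flip the lowest bit of two different 16-bit words: 1 -> 0 in the first,
# 0 -> 1 in the second.  The two changes cancel in the sum.
damaged = bytearray(payload)
damaged[1] ^= 0x01   # 0x1235 becomes 0x1234
damaged[3] ^= 0x01   # 0x5678 becomes 0x5679
damaged = bytes(damaged)

checksum_fooled = internet_checksum(payload) == internet_checksum(damaged)
digest_notices = hashlib.sha256(payload).digest() != hashlib.sha256(damaged).digest()
```

Here checksum_fooled and digest_notices both come out true: the
checksum cannot tell the two payloads apart, and a strong digest
carried end to end can.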

Just because you have to trust that the ends behave correctly doesn't
mean that you should have to trust everything in the middle too.

Nico
-- 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
