Quoting Bill Broadley ([EMAIL PROTECTED]):

> ECC memory doesn't protect from a dead dimm, it protects from a silent
> corruption of data.
I saw an example of that, back in 1989. I was working in what was then called the MIS Department at Blyth Software in Foster City: The VP of Engineering passed along a requirement for MIS to build a new engineering NetWare 3.12 server. He wanted that server to run DOS and MacOS namespaces (to do SMB and AppleTalk-based file and print services), be an NFS server, run the source code repository (whatever that was; can't remember), _and_ run prototyping installations of the Oracle and Sybase RDBMSes, _and_ handle all Engineering e-mail.

The task was handed to me, with a budget of something like $20k. Even though I was just the PFY, I balked: I countered that it would be smarter to divide those functions among about five or six servers, at no more total dollars and possibly fewer. The VP told me to never mind my opinion and just implement his plan. I politely dug in my heels, talked about the advantages of doing it the other way, and alluded to eggs and baskets. The VP was annoyed (and complained to my boss), but couldn't claim I'd refused, because I'd carefully never said "no", not exactly.

Losing patience, the VP took his specs to an outside VAR in Burlingame, who was quite happy to spec a do-it-all HP NetServer something-or-other with immensely large amounts of disk and RAM (for those days). The VAR deployed it. Backups (weekly full on Friday, differential daily M-Th) occurred per MIS Dept.'s standard practice onto 8mm Exabyte tapes.

Months passed. And then they started noticing that the data stored on the array were corrupted. Test restores were done from various tapes: It emerged that _all_ of the tape sets featured data corruption in incrementally increasing degrees, going back about four months to the new server's deployment. Engineering thus got to decide how much random file corruption it was willing to tolerate, versus how many months' work it was willing to throw away.
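That failure mode -- corruption quietly propagating into every backup set until someone happens to look -- is exactly what a routine checksum-verification pass catches early. Here's a minimal sketch in modern Python, purely illustrative (nothing like this existed on that NetWare box; the function names and manifest scheme are my own invention): record a digest of every file, then later flag any file whose digest has drifted.

```python
#!/usr/bin/env python3
"""Sketch: catch silent data corruption with a checksum manifest.

Hypothetical illustration only -- build a manifest of SHA-256 digests
once, then re-verify on a schedule (e.g. before each backup run).
"""
import hashlib
import os


def sha256_of(path, bufsize=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            manifest[os.path.relpath(p, root)] = sha256_of(p)
    return manifest


def verify(root, manifest):
    """Return the files whose current digest no longer matches."""
    return [rel for rel, digest in manifest.items()
            if sha256_of(os.path.join(root, rel)) != digest]
```

Run `verify()` against a stored manifest before cutting each tape: any nonempty result means the rot is on disk _now_, not four months from now in every backup set you own.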
After a few days' debate, they decided to jettison _all_ of those four months of everyone's work -- plus the VP of Engineering. I did my best to not even look like I wanted to say "I told you so" -- not least because I hadn't actually anticipated that particular scenario at all.

The HP NetServer was subjected to extensive testing, in an effort to save it. The VAR used, among other things, all available memory-testing software tools in an effort to isolate the problem -- and I believe I remember them actually swapping out all of the RAM, at one point. I vaguely recall that it was still a useless hulk when I left the firm in 1994.

It was a very striking experience. And it's also something I've never seen since then. (I've seen plenty of bad sticks of RAM on *ix servers, but never progressive & silent data corruption without signs that there's bad RAM needing immediate replacement.) If I _had_ been seeing that, even rarely, my current view would be different -- and of course I _will_ change my view if and when what I see changes.

_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
