Rick Moen wrote:
> You have a small point, but only for trivial values of "survive": The
> lion's share of those bit flips will turn out to be harmless for any of
> sundry reasons. (I'd speculate that some non-zero percentage of
> prematurely deceased httpd instances owed to that, for example -- but

Sure, killing a corrupted process is the best thing that could happen: that way the corruption can't spread and the error is contained. Corruption can cascade. Granted, not all of it does, but a bad bit in memory could be a pointer, which then corrupts another region of memory. If one of those is written to disk and then used for future operations, you could quickly have millions if not billions of corrupt bytes (i.e. a dead filesystem). I'm quite grateful every time I see an ECC error: one potential major issue stopped in its tracks. I think AMD was quite smart to include ECC support on all their processors, and I wish Intel did the same. After all, the importance of your data isn't always related to the cost of the machine manipulating it.

It's the bit flips that don't cause a process crash that you worry about, since you now have a corrupt process with (generally) the ability to read and write part of the disk. Same with the kernel: a bit flip on a kernel-owned page resulting in an immediate panic is the best-case scenario, and is much more likely if you have ECC.

> those just respawn.)

Dunno. For instance, Linux caches the filesystem aggressively; if any dirty page has a bit flip when it is written back, you have corruption on disk. If any of the cached metadata has a bit flip, you potentially have a corrupt filesystem. Every disk write is at risk. Most open processes could do something bad. Granted, limited permissions per process/daemon help... unless of course the error is in the kernel.

> If that were a concern meriting real-world concern in situations where
> the RAM _doesn't_ give unmistakeable signs of defects, my data would have
> gone to mush a decade ago.

Er, ECC exists exactly because of real-world concerns. I've seen entire clusters replaced to add ECC because of exactly those real-world concerns. Researchers don't like to hear that the results are usually right and that it's unlikely their results are wrong.
I've heard of cases where a month-long calculation on 64 nodes gave an exciting answer, and to be sure they repeated it and got a second answer. They were pretty sure one was a memory problem... till they got a third. Not sure they ever figured it out, but they did end up adding ECC.

Sure, if 99% of your data is disposable, say MP3 files that you can re-rip or JPEGs where you aren't going to notice a pixel being off (at least until the viewer crashes), then fine. Then again, other systems have a greater percentage of valuable data.

> Frankly, HD defects are a many orders of
> magnitude more significant threat.

Not sure where this comes from. How many orders are you suggesting? Disks also use ECC; single-bit errors are on the order of 1 per 10^14 sectors, and the annual failure rate is 1-3%. RAID is relatively common among servers and, in my mind, provides similar protection against a similar risk, and is similarly justified for those who want reliability and higher uptimes.

So you are saying that HD defects are 10 or 100 times more likely than the 1 bit per GB per month? Frankly, if that were true I would expect your "data would have gone to mush a decade ago". Try flipping 10 or 100 random bits on your disk once a month and report back ;-).

So, anyway, sure, don't run ECC if you don't want to, and sure, many desktop users won't notice. But since the original goal of the thread was a machine that "will be up for months between reboots", spending an extra $10 [1] for ECC DIMMs is reasonable. I'd also suggest that running redundant disks would be worth it.

BTW, out of 180 nodes with 4GB RAM I did manage to find quite a few ECC errors. I'd consider it a major deal if I had to contact everyone who had run jobs in the last month about potentially erroneous results.
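
For a rough sense of scale, here is a back-of-the-envelope sketch in Python of what the "1 bit per GB per month" figure would imply. The rate is only the ballpark number quoted in this thread, not a measurement. The Poisson (independent events) model, the function names, and the single 1GB machine are illustrative assumptions.

```python
import math

# Assumed rate, taken from the ballpark figure in the thread:
# roughly 1 bit flip per GB of RAM per month. Not a measured value.
FLIPS_PER_GB_MONTH = 1.0


def expected_flips(gb_ram, months, rate=FLIPS_PER_GB_MONTH):
    """Expected number of bit flips for a given amount of RAM and time."""
    return gb_ram * months * rate


def prob_at_least_one(gb_ram, months, rate=FLIPS_PER_GB_MONTH):
    """P(at least one flip), assuming flips are independent (Poisson) events."""
    return 1.0 - math.exp(-expected_flips(gb_ram, months, rate))


# The cluster from the post: 180 nodes with 4GB each, over one month.
cluster = expected_flips(180 * 4, 1)

# A hypothetical single machine with 1GB of RAM, up for 3 months.
single = prob_at_least_one(1, 3)

print(f"cluster: ~{cluster:.0f} expected flips per month")
print(f"single 1GB box, 3 months: P(at least one flip) = {single:.2f}")
```

Real error rates vary a lot with hardware, altitude, and DIMM age, so treat the output as an order-of-magnitude estimate at best. The point is only that the expected count scales linearly with both RAM size and uptime.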

[1] Random data point on the price difference, both Kingston 1GB modules:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820144153
http://www.newegg.com/Product/Product.aspx?Item=N82E16820134045
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
