On Wed, Aug 04, 2004 at 11:16:25AM +0300, Ehab Heikal wrote:
> Avery Pennarun wrote:
>
> >We always run a memory test on every server before it ships, and we return
> >about 25% of motherboards to the manufacturer before shipping because of
> >this :( Some computers do go bad about a year later, eg. because of the
> >"exploding capacitor" problem that started a couple of years ago.
>
> I know this is not the core of this list but could you elaborate on how
> is hardware bad these days. What kinds of tests do you run to reduce
> this. I see that you have very very valueable know-how and would really
> appreciate it :)

Okay, you asked for it: I'll try to keep it clean, but this will turn into
a bit of a commercial plug for our products.

The most important test you can run is memtest86
(http://www.memtest86.com/) or another heavy memory testing tool. Many
motherboards (not particular models; just individual boards of many
models) fail this outright, especially phases 4 and 5. If a board fails
memtest86, it *will* corrupt data, end of story.

We always test our servers for 24 hours with memtest86 before shipping.
(You have to do this again every time you make a change. For example,
adding extra memory or a PCI card can upset the electronics and make the
test fail, even if the memory tests out fine on another system.)

We also put a lot of work into our general "burn-in" diagnostics tools,
which stress the system by copying 1/6-disk-sized files around on a
reiserfs while blasting data through the network. We run this for at
least 24 hours as well before shipping, and it often finds
harder-to-identify problems (for example, some of the IDE drivers in
Linux have been known to "rarely" corrupt data, but our tests discovered
the bugs and we talked to people until they were fixed).

Our diagnostic tools are proprietary, but they come on our free bootable
Nitix trial CD image (about 38 megs) and you can use them whenever you
want: http://nitix.com. It includes memtest86, too. (Note that if you do
the disk test, it wipes out all data on your disk!!)

Hardware engineers can also hook an oscilloscope to the various critical
signals on the motherboard to check how clean they are; this is usually
depressing, because *most* power supplies, CPU voltages, and PCI bus
signals are at least partly out-of-spec. You can find hardware that is
actually designed correctly, but it's difficult, because of course that
costs a few extra dollars per board, and companies that charge a few
extra dollars per motherboard quickly go out of business.
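
For anyone who wants a feel for the copy-and-verify idea behind the disk
part of a burn-in test, here is a rough Python sketch. It is not the
Nitix diagnostics (those are proprietary) and it does not generate any
network load; the function names, the burnin-N.bin file names, and the
sizes are all made up for illustration. It writes one random file, copies
it in a chain, and checks each copy against the original checksum. Note
that with small files the reads may be served from the page cache rather
than the disk, so a real test would use files much larger than RAM and
run for many hours.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(workdir, size_bytes, passes):
    """Write one random file, copy it `passes` times in a chain, and
    return how many copies did not match the original checksum."""
    # Hypothetical file naming; any scratch directory will do.
    src = os.path.join(workdir, "burnin-0.bin")
    with open(src, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            block = os.urandom(min(remaining, 1 << 20))
            f.write(block)
            remaining -= len(block)
    expected = sha256_of(src)

    mismatches = 0
    for i in range(1, passes + 1):
        dst = os.path.join(workdir, "burnin-%d.bin" % i)
        shutil.copyfile(src, dst)
        if sha256_of(dst) != expected:
            mismatches += 1
        # Each copy becomes the source of the next one.
        os.remove(src)
        src = dst
    os.remove(src)
    return mismatches


if __name__ == "__main__":
    # Small illustrative sizes; a real burn-in would be far heavier.
    with tempfile.TemporaryDirectory() as workdir:
        bad = copy_and_verify(workdir, size_bytes=8 * 1024 * 1024, passes=5)
        print("mismatched copies:", bad)
```

On healthy hardware the count should stay at zero; any mismatch means
data changed somewhere between the write and the read-back.
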
Usually you have to settle for hardware that's at least "mostly" within
specifications...

When people ask us to "certify" their hardware as compatible with our
Linux distribution, we do all the hardware-engineering tests too, and the
complete set of tests can take several weeks. That's for certifying
*types* of hardware, so you know the majority of boxes made like that
will work.

Even without engineering certification, though, you're usually pretty
safe if you run our full (software) burn-in test for a couple of days on
each box you ship. I don't know how many of our customers do, but we try
to make it easy for them.

To learn more about the joys of our hardware certification process, visit
the third-party discussion forums at http://nitix.net-itech.com/.

Have fun,

Avery

P.S. ObOnTopicNote: I'm on this list because we're thinking of using
linux-vserver in an upcoming version of Nitix :)

P.P.S. I'd be more than happy if everyone downloaded and ran our burn-in
diagnostics tests on their hardware for free. The more hardware that gets
returned because it's crap, the smarter it becomes, economically, for the
manufacturer to not ship crap in the first place.

_______________________________________________
Vserver mailing list
[EMAIL PROTECTED]
http://list.linux-vserver.org/mailman/listinfo/vserver
