Hal Murray wrote:
The reason for the fans is to prevent premature failures of the
silicon devices due to thermal degradation. The life of a silicon
chip is halved for every 10C temperature increase, more or less.

I was going to make a similar comment, but got sidetracked poking around google. I didn't find a good/clean article. Does anybody have a good URL?

Doubling every 10C is the normal recipe for chemical reactions. I think that translated to IC failure rate back in the old days. Is that still correct? Has modern quality control tracked down and eliminated most of the temperature dependent failure mechanisms?

I remember reading a paper 5 or 10 years ago.  The context was disks.  I think 
the main failure was electronics rather than mechanical.  It really really 
really helped to keep them cool.
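
The halving-per-10-C rule quoted above is essentially an Arrhenius acceleration factor. A minimal sketch in Python, assuming an activation energy of about 0.7 eV, which is a commonly quoted figure rather than anything specific to a given process:

import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    """Acceleration factor between two temperatures given in Celsius.
    AF = exp((Ea/k) * (1/T_use - 1/T_stress)), temperatures in kelvin.
    Ea = 0.7 eV is an assumed, commonly quoted activation energy."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

print(arrhenius_af(55.0, 65.0))   # ~2.1: a 10 C rise roughly doubles the failure rate
print(arrhenius_af(55.0, 85.0))   # ~8: a 30 C rise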


There are books being written about this. One that I have found to be fairly short but useful is the AT&T Reliability Manual. A criticism of reliability calculations (from Bob Pease, for example) is that if you remove protection circuits from a design, the MTBF calculation says the design improves (as there are fewer devices contributing their FITs), while the actual design is less reliable, since the protection would have avoided premature failures. This criticism is valid only if blind belief in MTBF is allowed to rule the judgement of reliability, since the methodology assumes otherwise sound engineering practices to avoid over-voltage, over-current, over-heating and other forms of excess stress beyond the limits within which the design is intended to operate and be stored.
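
To make that criticism concrete, here is a toy parts-count calculation (the FIT numbers are invented purely for illustration): dropping the protection parts lowers the summed FIT, so the predicted MTBF goes up even though the field reliability goes down.

# FIT = failures per 1e9 device-hours; parts-count MTBF (hours) = 1e9 / total FIT.
design_with_protection = {
    "regulator": 50,
    "microcontroller": 120,
    "crystal_oscillator": 30,
    "tvs_clamp": 15,      # protection device
    "input_fuse": 10,     # protection device
}

design_without_protection = {
    k: v for k, v in design_with_protection.items()
    if k not in ("tvs_clamp", "input_fuse")
}

def mtbf_hours(parts):
    return 1e9 / sum(parts.values())

print(f"with protection:    {mtbf_hours(design_with_protection):,.0f} h")   # ~4,444,444 h
print(f"without protection: {mtbf_hours(design_without_protection):,.0f} h") # 5,000,000 h
# The second number is larger, yet the unprotected design dies first in the
# field from the over-stress events the clamp and fuse would have absorbed.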

Anyway, there has been much research into the reliability of electrical devices, and in general, keeping a sufficiently low temperature is among the things which improve reliability. For silicon, the junction temperature limit needs to be ensured by limiting the component ambient temperature (usually to 70 degrees, as measured in between two DIP packages, for instance), which in turn allows for yet another temperature drop through the (self-)convected air down to the ambient temperature of a rack of electronics (as measured 1 meter from the floor, 3 dm from the rack), limited to a maximum of 45 degrees. The 19" rack standard was originally designed for a total of 300 W per rack, so that self-convection up through the installed boxes would work. Having 1-10 kW per rack is not uncommon these days, so forced convection is needed, which requires different manufacturers to follow a common air-flow discipline, which also has to ensure that no heat is exhausted at the front for safety reasons (you should never hesitate to hit the on/off switch).
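
As a rough sketch of how those temperature drops stack up on the way to the junction (all the thermal-resistance and power figures below are invented for illustration; the real numbers come from the datasheet and from measurement):

# Rough junction-temperature stack-up: Tj = T_room + dT_rack + dT_board + P * theta_ja.
t_room_c = 25.0        # room ambient, 1 m above the floor
dt_rack_c = 15.0       # rise inside the rack relative to the room
dt_board_c = 10.0      # rise from rack air to local component ambient
power_w = 0.5          # dissipation in the package
theta_ja_c_per_w = 60  # junction-to-ambient thermal resistance

t_component_ambient = t_room_c + dt_rack_c + dt_board_c
t_junction = t_component_ambient + power_w * theta_ja_c_per_w

print(f"component ambient: {t_component_ambient:.0f} C")  # 50 C
print(f"junction:          {t_junction:.0f} C")           # 80 C
# If the component ambient is allowed to reach the 70 C limit instead,
# the same part runs at 100 C at the junction.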

It is unfortunately common to see racks where one box has a left-to-right airflow while the one on top of it has right-to-left, and the rack has very narrow space between the sides of the boxes and the sides of the rack, so effectively the boxes feed each other with pre-heated air until one of them dies. Another example involved a particular switch with a left-to-right airflow, sitting at the top of each rack in a row of computing racks for a parallel computing setup. The switches were fed an airflow whose temperature increased step-wise as it passed through the racks, and the last switch died prematurely.

In parallel computing, heat management and power management can be much more troublesome issues than load balancing between the CPUs, which is the kind of luxury problem you can deal with at a later stage.

Cray was a refrigeration company, which also delivered a lot of CPU cycles along the way.

Cheers,
Magnus
