Re: [gentoo-user] Dell Precision Workstation Overheating

2018-05-07 Thread mad.scientist.at.large
One last thought: it could be a problem with the thermal regulation itself.  If 
you can figure out the fan wiring you could set the fans so they are always full 
on; it might help, might not.  You could also put in fans of the same size but 
higher current/airflow.  As this is a server, apparently rack mounted, you 
probably can't do much else about the cooling.

mad.scientist.at.large (a good madscientist)
--
Read, Scream, Fight >



20. Apr 2018 06:21 by michaelkintz...@gmail.com:


> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>> has numerous heat failures.
>>
>> Due to poor cooling ... surprised?
>>
>> The cooling is not working right. Something is still wrong.
>>
>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>> > cards and a Tesla card.
>> > 
>> > The system is a few years old at this point. Old enough that the
>> > thermal compound could have hardened, which is why I replaced it.
>
> If the problem started suddenly, rather than getting progressively worse over 
> time, it may have something to do with kernel drivers, or some change in 
> firmware.
>
> If the cause is mechanical, I'd also suggest checking the heat sink contact 
> surface.  Some heat sinks are poorly manufactured and require flattening with 
> wet 'n dry sandpaper to get a flat enough surface and improve their contact 
> with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess 
> metal was removed from copper pipes, which were manufactured proud.  Hardcore 
> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
> badly wrong if you remove more than the surface varnish from the chip.
>
> In the interim, opening the side panel may also help in hot weather.
>
> -- 
> Regards,
> Mick

Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Corbin Bird
Suggestion :

 ... upgrade the cooling capacity.

The CPU in my box is an AMD FX-9590. TDP is 220 watts. Running at 4.7 GHz.

With cooling for TDP 250 watts, it ran hot under load.

With cooling for TDP 900 watts, it rarely gets close to 110 °F (about 43 °C)
under heavy load.


On 04/20/2018 09:11 AM, R0b0t1 wrote:
> On Fri, Apr 20, 2018 at 7:21 AM, Mick  wrote:
>> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>>> has numerous heat failures.
>>>
>>> Due to poor cooling ... surprised?
>>>
>>> The cooling is not working right. Something is still wrong.
>>>
>>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>>>> Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>>>> cards and a Tesla card.
>>>>
>>>> The system is a few years old at this point. Old enough that the
>>>> thermal compound could have hardened, which is why I replaced it.
>> If the problem started suddenly, rather than getting progressively worse over
>> time, it may have something to do with kernel drivers, or some change in
>> firmware.
>>
> As far as I know it has always been like this. It may be why it was
> hardly used before it came into my care. Looking at the server I could
> blame poor design; the inside is rather cramped, despite the care
> taken with the internal baffles. They may not have run a good flow
> simulation.
>
> Mr. Bird's observation seems to support this.
>
>> If the cause is mechanical, I'd also suggest checking the heat sink contact
>> surface.  Some heat sinks are poorly manufactured and require flattening with
>> wet 'n dry sandpaper to get a flat enough surface and improve their contact
>> with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess
>> metal was removed from copper pipes, which were manufactured proud.  Hardcore
>> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
>> badly wrong if you remove more than the surface varnish from the chip.
>>
>> In the interim, opening the side panel may also help in hot weather.
>>
> The internals are custom made to fit the motherboard, cards, and drive
> slots. It may work better if I move it to another tower but it will be
> a while before I can find one. I will look at the interface between
> the heatsink and processor again, but it looked fine.
>
>
> How concerned should I be about overheating machine check errors? I
> used to think that it was best to avoid them, as the threshold was
> high enough that very small parts of the die could overshoot and fail,
> but I was informed that is not the case. Besides the throttling (which
> is fairly bad) I am not sure if there are any drawbacks to the
> overheating.
>
> I am wondering what the point of 32 threads is if you can't use them at 100%.
>
> Cheers,
>  R0b0t1




Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Dale
R0b0t1 wrote:
> On Fri, Apr 20, 2018 at 7:21 AM, Mick  wrote:
>> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>>> has numerous heat failures.
>>>
>>> Due to poor cooling ... surprised?
>>>
>>> The cooling is not working right. Something is still wrong.
>>>
>>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>>>> Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>>>> cards and a Tesla card.
>>>>
>>>> The system is a few years old at this point. Old enough that the
>>>> thermal compound could have hardened, which is why I replaced it.
>> If the problem started suddenly, rather than getting progressively worse over
>> time, it may have something to do with kernel drivers, or some change in
>> firmware.
>>
> As far as I know it has always been like this. It may be why it was
> hardly used before it came into my care. Looking at the server I could
> blame poor design; the inside is rather cramped, despite the care
> taken with the internal baffles. They may not have run a good flow
> simulation.
>
> Mr. Bird's observation seems to support this.
>
>> If the cause is mechanical, I'd also suggest checking the heat sink contact
>> surface.  Some heat sinks are poorly manufactured and require flattening with
>> wet 'n dry sandpaper to get a flat enough surface and improve their contact
>> with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess
>> metal was removed from copper pipes, which were manufactured proud.  Hardcore
>> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
>> badly wrong if you remove more than the surface varnish from the chip.
>>
>> In the interim, opening the side panel may also help in hot weather.
>>
> The internals are custom made to fit the motherboard, cards, and drive
> slots. It may work better if I move it to another tower but it will be
> a while before I can find one. I will look at the interface between
> the heatsink and processor again, but it looked fine.
>
>
> How concerned should I be about overheating machine check errors? I
> used to think that it was best to avoid them, as the threshold was
> high enough that very small parts of the die could overshoot and fail,
> but I was informed that is not the case. Besides the throttling (which
> is fairly bad) I am not sure if there are any drawbacks to the
> overheating.
>
> I am wondering what the point of 32 threads is if you can't use them at 100%.
>
> Cheers,
>  R0b0t1
>
>


Just a thought.  It may be worth checking, if you remove it again, the
heat sink compound/grease/paste or whatever you call it between the CPU
and the heat sink.  I used Arctic Silver 5 on mine several years ago but
they may have something even better than that nowadays.  Some of the
stuff will work OK for low powered chips but fail badly in high powered
chips.  I found this link that is fairly recent.  Plan to read it myself
here shortly.  I still have a little nerd/geek in me. 

https://www.tomshardware.com/reviews/thermal-paste-comparison,5108.html

Dale

:-)  :-)



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Mick
On Friday, 20 April 2018 15:11:43 BST R0b0t1 wrote:
> On Fri, Apr 20, 2018 at 7:21 AM, Mick  wrote:
> > On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
> >> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
> >> has numerous heat failures.
> >> 
> >> Due to poor cooling ... surprised?
> >> 
> >> The cooling is not working right. Something is still wrong.
> >> 
> >> On 04/19/2018 09:33 PM, R0b0t1 wrote:
> >> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> >> > cards and a Tesla card.
> >> > 
> >> > The system is a few years old at this point. Old enough that the
> >> > thermal compound could have hardened, which is why I replaced it.
> > 
> > If the problem started suddenly, rather than getting progressively worse
> > over time, it may have something to do with kernel drivers, or some
> > change in firmware.
> 
> As far as I know it has always been like this. It may be why it was
> hardly used before it came into my care. Looking at the server I could
> blame poor design; the inside is rather cramped, despite the care
> taken with the internal baffles. They may not have run a good flow
> simulation.
> 
> Mr. Bird's observation seems to support this.
> 
> > If the cause is mechanical, I'd also suggest checking the heat sink
> > contact
> > surface.  Some heat sinks are poorly manufactured and require flattening
> > with wet 'n dry sandpaper to get a flat enough surface and improve their
> > contact with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler
> > after excess metal was removed from copper pipes, which were manufactured
> > proud.  Hardcore O/C's flatten the CPU too, but I'd avoid anything as
> > radical because it can go badly wrong if you remove more than the surface
> > varnish from the chip.
> > 
> > In the interim, opening the side panel may also help in hot weather.
> 
> The internals are custom made to fit the motherboard, cards, and drive
> slots. It may work better if I move it to another tower but it will be
> a while before I can find one. I will look at the interface between
> the heatsink and processor again, but it looked fine.
> 
> 
> How concerned should I be about overheating machine check errors? I
> used to think that it was best to avoid them, as the threshold was
> high enough that very small parts of the die could overshoot and fail,
> but I was informed that is not the case. Besides the throttling (which
> is fairly bad) I am not sure if there are any drawbacks to the
> overheating.

Semiconductors eventually fail when overheated.  So it is not a good idea to 
continue trying to fry your CPU.

You can confirm the cause of these exceptions by installing and running 
'app-admin/mcelog'.  If the tower design is poor and hot air is recirculating 
within the case, your choices would typically be:

1. Install more effective aftermarket CPU coolers.  This means you have to 
spend money, which may be better spent on a new tower/PC.  It may also be that 
there isn't enough space in the case to fit them, although low profile/compact 
CPU coolers exist and you may have better luck with them.

2. Install bigger or additional case fans, to help get the heated air out 
of the case and minimise hot spots and hot air recirculation.  You could try 
forcing some more air through the case with a small desktop fan to see if this 
option has any legs.

3. Modify the case, by drilling/cutting holes to improve air flow, e.g. at the 
top of the case.

4. Migrate components to a different case/MoBo, which you have already 
considered.
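
As a complement to mcelog, the kernel also keeps per-core thermal throttle
counters, which make it easy to see whether one of the options above actually
helps.  A minimal sketch in Python, assuming an Intel CPU and a kernel that
exposes /sys/devices/system/cpu/cpuN/thermal_throttle/core_throttle_count (not
every configuration does).  The function names are made up for illustration:

```python
from pathlib import Path
from typing import Dict

# Assumption: Intel CPU and a kernel that exposes thermal_throttle counters here.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_throttle_counts(root: Path = CPU_ROOT) -> Dict[str, int]:
    """Return {"cpu0": count, ...} for every core that exposes a counter."""
    counts = {}
    for counter in sorted(root.glob("cpu[0-9]*/thermal_throttle/core_throttle_count")):
        cpu_name = counter.parent.parent.name  # e.g. "cpu7"
        counts[cpu_name] = int(counter.read_text())
    return counts


def throttle_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the cores whose throttle count rose between two snapshots."""
    delta = {}
    for cpu, count in after.items():
        rise = count - before.get(cpu, 0)
        if rise > 0:
            delta[cpu] = rise
    return delta
```

Take one snapshot with read_throttle_counts() before an emerge and one after;
throttle_delta() then lists the cores that throttled in between.  Comparing a
run with the side panel closed against one with it open gives a rough measure
of how much the airflow change is worth.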


> I am wondering what the point of 32 threads is if you can't use them at
> 100%.
> 
> Cheers,
>  R0b0t1

Quite, but the box may not have been intended to cope with the pressures of 
running Gentoo to compile software on a regular basis.  I've found many 
cheaper laptops in particular are so poorly designed from a cooling 
perspective that they struggle to run a lengthy Gentoo emerge.  I've also had 
desktops which struggled, although nothing as critical as yours.  The 
permanent solutions I came up with involved aftermarket cooling fans.  With 
boxen I was not keen to spend money on for cooling improvements, I would just 
open the side panel during an emerge, which allowed the CPU temperature to 
drop sufficiently to avoid further thermal throttling.

-- 
Regards,
Mick



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Rich Freeman
On Thu, Apr 19, 2018 at 9:22 PM, R0b0t1  wrote:
>
> Is there a way to at least mimic the conservative CPU usage that
> Windows exhibits?
>

I'll set aside the obvious hardware issues and touch on workarounds.
I had an old system that had some heating issues (my guess is due to
heatsink seating or something like that), and I used cpufreqd at the
time.  This is a daemon that polls CPU temp (among other things) and
can be set to change the frequency policy of the kernel based on
thresholds.  I had it set to force the kernel into a powersave
governor when the temp exceeded a threshold.  The resulting behavior
was that the clock speed would cycle between min/max and maintain the
temp at the threshold when the system was under a sustained load.
However, for short bursts of activity the CPU was unconstrained.

That software is obsolete today and was removed from the Gentoo repos
a while ago.
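
The polling behaviour described above can be sketched in a few lines of
Python.  This is not cpufreqd's code, only an illustration of the decision it
made on each poll.  The two thresholds are invented numbers, and the gap
between them (hysteresis) is an addition of this sketch so that the governor
does not flip on every reading:

```python
def next_governor(temp_c: float, current: str,
                  high_c: float = 85.0, low_c: float = 75.0) -> str:
    """Pick the cpufreq governor for the next polling interval.

    At or above high_c, force "powersave".  At or below low_c, go back to
    "performance".  In between, keep whatever is currently set.
    The threshold defaults are assumptions, not recommended values.
    """
    if temp_c >= high_c:
        return "powersave"
    if temp_c <= low_c:
        return "performance"
    return current
```

A real daemon would read the temperature from something like
/sys/class/thermal/thermal_zone*/temp (reported in millidegrees Celsius) and
write the result to each core's cpufreq/scaling_governor file as root.  How the
"powersave" and "performance" governors behave depends on the cpufreq driver
in use, so treat the names as placeholders to check on your own system.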

I'd take a look at thermald or ncpufreqd in the repo as starting
points.  The docs I googled on thermald suggest that it might do the
job with zero config tweaks.

Without using any software you could also just directly set the
frequency limits in the kernel (accessible somewhere in the bowels of
/proc or /sys).  However, that is going to cause a performance hit
even for short demands that the heatsink can handle.
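
For reference, the limits in question live under
/sys/devices/system/cpu/cpuN/cpufreq/ on systems with cpufreq enabled.  A
minimal Python sketch that caps every core part-way between its hardware
minimum and maximum; the helper names and the idea of expressing the cap as a
fraction are assumptions made for illustration, and writing the file needs
root:

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def capped_khz(min_khz: int, max_khz: int, fraction: float) -> int:
    """Return a frequency (kHz) at `fraction` of the way from min to max."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(min_khz + (max_khz - min_khz) * fraction)


def apply_cap(fraction: float, root: Path = CPU_ROOT) -> None:
    """Write a new scaling_max_freq for every core.  Needs root privileges."""
    for policy in sorted(root.glob("cpu[0-9]*/cpufreq")):
        lo = int((policy / "cpuinfo_min_freq").read_text())
        hi = int((policy / "cpuinfo_max_freq").read_text())
        (policy / "scaling_max_freq").write_text(str(capped_khz(lo, hi, fraction)))
```

Calling apply_cap(0.5) would hold every core to the midpoint of its range
until the value is changed again or the machine reboots.  As noted, a static
cap like this also slows down short bursts that the heatsink could have
absorbed.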

Ultimately though you're going to be better off if you can actually
fix the thermal issues themselves.

-- 
Rich



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread R0b0t1
On Fri, Apr 20, 2018 at 7:21 AM, Mick  wrote:
> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>> has numerous heat failures.
>>
>> Due to poor cooling ... surprised?
>>
>> The cooling is not working right. Something is still wrong.
>>
>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>> > cards and a Tesla card.
>> >
>> > The system is a few years old at this point. Old enough that the
>> > thermal compound could have hardened, which is why I replaced it.
>
> If the problem started suddenly, rather than getting progressively worse over
> time, it may have something to do with kernel drivers, or some change in
> firmware.
>

As far as I know it has always been like this. It may be why it was
hardly used before it came into my care. Looking at the server I could
blame poor design; the inside is rather cramped, despite the care
taken with the internal baffles. They may not have run a good flow
simulation.

Mr. Bird's observation seems to support this.

> If the cause is mechanical, I'd also suggest checking the heat sink contact
> surface.  Some heat sinks are poorly manufactured and require flattening with
> wet 'n dry sandpaper to get a flat enough surface and improve their contact
> with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess
> metal was removed from copper pipes, which were manufactured proud.  Hardcore
> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
> badly wrong if you remove more than the surface varnish from the chip.
>
> In the interim, opening the side panel may also help in hot weather.
>

The internals are custom made to fit the motherboard, cards, and drive
slots. It may work better if I move it to another tower but it will be
a while before I can find one. I will look at the interface between
the heatsink and processor again, but it looked fine.


How concerned should I be about overheating machine check errors? I
used to think that it was best to avoid them, as the threshold was
high enough that very small parts of the die could overshoot and fail,
but I was informed that is not the case. Besides the throttling (which
is fairly bad) I am not sure if there are any drawbacks to the
overheating.

I am wondering what the point of 32 threads is if you can't use them at 100%.

Cheers,
 R0b0t1



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Mick
On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
> has numerous heat failures.
> 
> Due to poor cooling ... surprised?
> 
> The cooling is not working right. Something is still wrong.
> 
> On 04/19/2018 09:33 PM, R0b0t1 wrote:
> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> > cards and a Tesla card.
> > 
> > The system is a few years old at this point. Old enough that the
> > thermal compound could have hardened, which is why I replaced it.

If the problem started suddenly, rather than getting progressively worse over 
time, it may have something to do with kernel drivers, or some change in 
firmware.

If the cause is mechanical, I'd also suggest checking the heat sink contact 
surface.  Some heat sinks are poorly manufactured and require flattening with 
wet 'n dry sandpaper to get a flat enough surface and improve their contact 
with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess 
metal was removed from copper pipes, which were manufactured proud.  Hardcore 
O/C's flatten the CPU too, but I'd avoid anything as radical because it can go 
badly wrong if you remove more than the surface varnish from the chip.

In the interim, opening the side panel may also help in hot weather.

-- 
Regards,
Mick



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-20 Thread Corbin Bird
Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
has numerous heat failures.

Due to poor cooling ... surprised?

The cooling is not working right. Something is still wrong.

On 04/19/2018 09:33 PM, R0b0t1 wrote:
> Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> cards and a Tesla card.
>
> The system is a few years old at this point. Old enough that the
> thermal compound could have hardened, which is why I replaced it.



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-19 Thread mad.scientist.at.large
What thermal grease did you use?  Did you clean all the old stuff off?  Be 
aware that the lids on CPU chips are usually somewhat concave on the top.  I 
recently had to add a drop of thermal grease to a regreased CPU that was 
running too hot even when idling, and that solved it completely.  Current 
processors are large enough and concave enough to require more grease than 
seems right to someone experienced with older electronics.

There is a huge difference between common silicone grease suitable for amps 
and higher performance products.  Basic thermal grease is a sick joke for a 
CPU.  Some of the better stuff is stiff, and overpriced thanks to mad gamers 
and other overclockers.  I love the PK-3 stuff with nanoaluminum, but it has to 
be fairly hot to spread well; then again it takes only 2-3 minutes to settle 
and markedly lowers CPU temps when heavily loaded.  Previous favorite was 
ceramax.

Ignore the hype and any brand that doesn't provide the thermal resistance or 
conductivity (which are inverses, of course, so either is fine for choosing: 
thermal conductivity should be high, thermal resistance should be low).  Most 
are insulators at low voltages due to surface oxidation, the oil, etc., but 
using them on power wiring is probably not a good idea.
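
To put rough numbers on that, the temperature drop across a uniform paste
layer is dT = P * t / (k * A): power times layer thickness, divided by
conductivity times contact area.  A small Python sketch; the 130 W load, 50
micron layer, 30 mm x 30 mm contact patch and the two conductivity values are
all assumptions picked for illustration, not measurements of any product:

```python
def paste_delta_t(power_w: float, thickness_m: float,
                  conductivity_w_mk: float, area_m2: float) -> float:
    """Temperature drop in kelvin across a uniform layer: dT = P * t / (k * A)."""
    if conductivity_w_mk <= 0 or area_m2 <= 0:
        raise ValueError("conductivity and area must be positive")
    return power_w * thickness_m / (conductivity_w_mk * area_m2)


# Illustrative values only: 130 W through a 50 micron layer over 30 mm x 30 mm.
for k in (1.0, 8.0):
    drop = paste_delta_t(130.0, 50e-6, k, 9e-4)
    print(f"k = {k} W/(m*K): {drop:.1f} K across the paste")
```

With these assumed numbers the drop falls from about 7 K to under 1 K when the
conductivity goes from 1 to 8 W/(m*K).  The model ignores contact resistance
and uneven layer thickness, so real differences between pastes are usually
smaller than the datasheet figures suggest.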

Most greases need to bake in and go through a few power cycles for optimum 
performance; they thin when warm.  With my particular favorite grease it only 
takes 2-3 minutes powered on to reach optimum performance.  My preference also 
avoids wasting silver and just uses extremely fine aluminum powder and some 
additives.

And what CPU has anyone seen working without a heatsink in the last ten years, 
other than some system on a chip or very low end embedded processor?  
Seriously, unless I've been in hibernation somehow, I doubt it.

mad.scientist.at.large (a good madscientist)
--
God bless the rich, the greedy and the corrupt politicians they have put into 
office.   God bless them for helping me do the right thing by giving the rich 
my little pile of cash.  After all, the rich know what to do with money.  




19. Apr 2018 20:33 by r03...@gmail.com:


> On Thu, Apr 19, 2018 at 9:26 PM, Corbin Bird <corbinb...@charter.net> wrote:
>> What are the Dell system specs?
>>
>> ( Heatsink on a CPU?  How old is this system ? )
>>
>
> Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> cards and a Tesla card.
>
> The system is a few years old at this point. Old enough that the
> thermal compound could have hardened, which is why I replaced it.
>
>> On 04/19/2018 08:22 PM, R0b0t1 wrote:
>>> I was compiling Gentoo, as is custom, but found my old new server to
>>> be thermal cycling wildly. The fans will turn on full blast and
>>> machine check errors will be generated if I use approximately more
>>> than one third to half of the cores. The cores then throttle
>>> themselves, only to immediately overheat once the throttle lifts. This
>>> seems to persist on Windows, though Windows seems to be much more
>>> conservative in its CPU usage, and triggers MCEs less.
>>>
>>> Any suggestions? I repasted the CPU and heatsink interface, and the
>>> machine is not loaded with dust. It was hardly ever used. The MCEs
>>> seem to be a "normal" part of operation, though less normal on
>>> Windows.
>>>
>>> Is there a way to at least mimic the conservative CPU usage that
>>> Windows exhibits?
>>>
>>> Cheers,
>>>  R0b0t1
>>>

Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-19 Thread Adam Carter
On Fri, Apr 20, 2018 at 11:22 AM, R0b0t1  wrote:

> I was compiling Gentoo, as is custom, but found my old new server to
> be thermal cycling wildly. The fans will turn on full blast and
> machine check errors will be generated if I use approximately more
> than one third to half of the cores.
>

Very odd that you're not even close to running it hard and getting these
issues. No good ideas...but;
BIOS updates?
Do the temperatures reported by the sensors seem sane?
Powersave CPU freq governor
Underclock CPU in BIOS
BIOS power settings
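
On the sensor sanity question, one quick check is to dump everything the
kernel's hwmon interface reports and flag readings that cannot be right.  A
minimal Python sketch, assuming the relevant sensor drivers (e.g. coretemp)
are loaded so that /sys/class/hwmon/hwmon*/temp*_input files exist; the
10-110 C "plausible" window is an arbitrary assumption:

```python
from pathlib import Path
from typing import List, Tuple

HWMON_ROOT = Path("/sys/class/hwmon")


def millideg_to_c(raw: str) -> float:
    """hwmon temp*_input files report millidegrees Celsius as text."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c: float, low_c: float = 10.0, high_c: float = 110.0) -> bool:
    """Flag readings outside a window a running CPU could realistically show."""
    return low_c <= temp_c <= high_c


def read_temps(root: Path = HWMON_ROOT) -> List[Tuple[str, float]]:
    """Return (path, degrees C) for every readable hwmon temperature input."""
    readings = []
    for sensor in sorted(root.glob("hwmon*/temp*_input")):
        try:
            readings.append((str(sensor), millideg_to_c(sensor.read_text())))
        except (OSError, ValueError):
            continue  # some inputs exist but cannot be read; skip them
    return readings
```

Looping over read_temps() at idle and again under load, and printing any entry
for which is_plausible() is false, shows whether a throttle event lines up
with a believable temperature or with a sensor that reads nonsense.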


Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-19 Thread R0b0t1
On Thu, Apr 19, 2018 at 9:26 PM, Corbin Bird  wrote:
> What are the Dell system specs?
>
> ( Heatsink on a CPU?  How old is this system ? )
>

Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
cards and a Tesla card.

The system is a few years old at this point. Old enough that the
thermal compound could have hardened, which is why I replaced it.

> On 04/19/2018 08:22 PM, R0b0t1 wrote:
>> I was compiling Gentoo, as is custom, but found my old new server to
>> be thermal cycling wildly. The fans will turn on full blast and
>> machine check errors will be generated if I use approximately more
>> than one third to half of the cores. The cores then throttle
>> themselves, only to immediately overheat once the throttle lifts. This
>> seems to persist on Windows, though Windows seems to be much more
>> conservative in its CPU usage, and triggers MCEs less.
>>
>> Any suggestions? I repasted the CPU and heatsink interface, and the
>> machine is not loaded with dust. It was hardly ever used. The MCEs
>> seem to be a "normal" part of operation, though less normal on
>> Windows.
>>
>> Is there a way to at least mimic the conservative CPU usage that
>> Windows exhibits?
>>
>> Cheers,
>>  R0b0t1
>>
>



Re: [gentoo-user] Dell Precision Workstation Overheating

2018-04-19 Thread Corbin Bird
What are the Dell system specs?

( Heatsink on a CPU?  How old is this system ? )

On 04/19/2018 08:22 PM, R0b0t1 wrote:
> I was compiling Gentoo, as is custom, but found my old new server to
> be thermal cycling wildly. The fans will turn on full blast and
> machine check errors will be generated if I use approximately more
> than one third to half of the cores. The cores then throttle
> themselves, only to immediately overheat once the throttle lifts. This
> seems to persist on Windows, though Windows seems to be much more
> conservative in its CPU usage, and triggers MCEs less.
>
> Any suggestions? I repasted the CPU and heatsink interface, and the
> machine is not loaded with dust. It was hardly ever used. The MCEs
> seem to be a "normal" part of operation, though less normal on
> Windows.
>
> Is there a way to at least mimic the conservative CPU usage that
> Windows exhibits?
>
> Cheers,
>  R0b0t1
>




[gentoo-user] Dell Precision Workstation Overheating

2018-04-19 Thread R0b0t1
I was compiling Gentoo, as is custom, but found my old new server to
be thermal cycling wildly. The fans will turn on full blast and
machine check errors will be generated if I use approximately more
than one third to half of the cores. The cores then throttle
themselves, only to immediately overheat once the throttle lifts. This
seems to persist on Windows, though Windows seems to be much more
conservative in its CPU usage, and triggers MCEs less.

Any suggestions? I repasted the CPU and heatsink interface, and the
machine is not loaded with dust. It was hardly ever used. The MCEs
seem to be a "normal" part of operation, though less normal on
Windows.

Is there a way to at least mimic the conservative CPU usage that
Windows exhibits?

Cheers,
 R0b0t1