Re: 9.0 spontaneously reboots

2012-03-14 Thread jb
Volodymyr Kostyrko  gmail.com> writes:

> 
> Adam Vande More wrote:
> >> I have one machine that behaves unstably. This was happening before 9.0
> >> too. After upgrading to 9.0 the machine was given a light load, and now
> >> it reboots. The memory was already tested (without any errors) and
> >> replaced after another reboot.
> >>
> >
> > So your RAM is good enough to pass a memory test.  It doesn't mean it's not
> > the culprit.  Way too many false negatives from those things.
> 
> True. At first the server was populated with Kingston memory, and now I
> have moved to Hynix. It still sometimes gives me ECC errors.
> 

You mentioned that "it survives an hour in memtest".

Update the BIOS: the BIOS in some computers allows counting of detected and
corrected memory errors, in part to help identify failing memory modules before
the problem becomes catastrophic.
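The operating system can keep a similar count. Corrected errors that the hardware reports through the machine-check architecture are typically logged by the kernel, so tallying them per MCA bank shows whether they cluster in one place. A minimal sketch, assuming log lines of the form `MCA: Bank 8, Status 0x...` (modelled on what FreeBSD's machine-check handler prints; check against your own /var/log/messages) and using bit 61 (UC) of the IA32_MCi_STATUS value to separate corrected from uncorrected errors. The sample lines are hypothetical, and which bank corresponds to which memory controller is CPU-specific.

```python
import re
from collections import Counter

# ASSUMED line format, modelled on FreeBSD's machine-check log output:
#   "... kernel: MCA: Bank 8, Status 0x8c00004000010090"
MCA_LINE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")

# Bit 61 of IA32_MCi_STATUS (UC) is set when the error was NOT corrected.
UC_BIT = 1 << 61


def tally_mca(lines):
    """Return (corrected, uncorrected) Counters keyed by MCA bank number."""
    corrected, uncorrected = Counter(), Counter()
    for line in lines:
        match = MCA_LINE.search(line)
        if match is None:
            continue  # not a machine-check status line
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        if status & UC_BIT:
            uncorrected[bank] += 1
        else:
            corrected[bank] += 1
    return corrected, uncorrected


if __name__ == "__main__":
    # HYPOTHETICAL sample lines; not taken from the server in this thread.
    sample = [
        "Mar 12 07:10:01 beeb kernel: MCA: Bank 8, Status 0x8c00004000010090",
        "Mar 12 07:30:44 beeb kernel: MCA: Bank 8, Status 0x8c00004000010090",
        "Mar 12 07:51:50 beeb kernel: MCA: Bank 9, Status 0xb200000000070005",
    ]
    corrected, uncorrected = tally_mca(sample)
    print("corrected per bank:  ", dict(corrected))
    print("uncorrected per bank:", dict(uncorrected))
```

Against a real system the function can be fed the log file directly, e.g. `tally_mca(open("/var/log/messages"))`; a bank whose corrected count keeps growing is the one worth investigating.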

Some BIOSes have a built-in memory check tool. Try it.

Some refs:
http://en.wikipedia.org/wiki/ECC_memory

jb



___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"


Re: 9.0 spontaneously reboots

2012-03-13 Thread Matthew Seaman
On 13/03/2012 10:28, Volodymyr Kostyrko wrote:
> Matthew Seaman wrote:
>> On 13/03/2012 08:59, Volodymyr Kostyrko wrote:
>>> The only other weird thing about this server is:
>>>
>>> dev.cpu.0.temperature: 37,0C
>>> dev.cpu.1.temperature: 37,0C
>>> dev.cpu.2.temperature: 35,0C
>>> dev.cpu.3.temperature: 35,0C
>>> dev.cpu.4.temperature: 43,0C
>>> dev.cpu.5.temperature: 43,0C
>>> dev.cpu.6.temperature: 38,0C
>>> dev.cpu.7.temperature: 38,0C
>>> dev.cpu.8.temperature: 38,0C
>>> dev.cpu.9.temperature: 38,0C
>>> dev.cpu.10.temperature: 37,0C
>>> dev.cpu.11.temperature: 37,0C
>>> dev.cpu.12.temperature: 33,0C
>>> dev.cpu.13.temperature: 33,0C
>>> dev.cpu.14.temperature: 34,0C
>>> dev.cpu.15.temperature: 34,0C
>>>
>>> And it's consistent: cores 4 and 5 are always hotter than any other.
>>> This could be something to do with the scheduler, but it started before
>>> there was any actual load. Though the numbers are normal, I have never
>>> seen anything like it...
>>
>> Two cores per socket, and 8 sockets on the board?  If so, that looks
>> absolutely fine to me.  The average temperature is about 36.9C but 43.0C
>> is still well within spec.  That difference of just over 6 degrees is
>> not really significant and probably entirely due to different airflow
>> patterns over the different CPU sockets.  If you swap the CPU package in
>> that socket with one of the other ones, you'll find the hot spot stays
>> put.  You might be able to even things out by rerouting cables, but
>> really it's not worth the hassle and won't make any perceptible
>> difference to performance.
> 
> Nope:
> 
> CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz (2394.05-MHz K8-class CPU)
> FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs
> FreeBSD/SMP: 2 package(s) x 4 core(s) x 2 SMT threads
> 
> So the difference is confined to one physical core with two SMT threads.
> 

Which explains why the numbers go in pairs -- there are only 8 physical cores.

Even so, I don't think there's any great problem there.  Different cores
in the same package can have different temperatures -- that's perfectly
normal, and due to the physical properties of the CPU package and the
local environment rather than any difference in processing load between
cores.
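The pairing can be checked mechanically. The sketch below groups the 16 logical CPUs into 8 physical cores using the "2 package(s) x 4 core(s) x 2 SMT threads" boot line. It ASSUMES, without confirmation from the thread, that SMT siblings have adjacent IDs (2k, 2k+1) and that packages are numbered contiguously; on FreeBSD, `sysctl kern.sched.topology_spec` should show the real layout. Under that assumption every sibling pair reports the same reading, consistent with one sensor per physical core, and the hot spot is core 2 of the first package.

```python
from collections import defaultdict

# Readings from the sysctl output in this thread, keyed by logical CPU id.
TEMPS = {0: 37.0, 1: 37.0, 2: 35.0, 3: 35.0, 4: 43.0, 5: 43.0,
         6: 38.0, 7: 38.0, 8: 38.0, 9: 38.0, 10: 37.0, 11: 37.0,
         12: 33.0, 13: 33.0, 14: 34.0, 15: 34.0}

# From the boot line "2 package(s) x 4 core(s) x 2 SMT threads".
THREADS_PER_CORE = 2
CORES_PER_PACKAGE = 4


def group_by_core(temps):
    """Group logical CPUs into physical cores.

    ASSUMPTION: SMT siblings have adjacent ids (2k, 2k+1) and packages are
    numbered contiguously. Confirm against kern.sched.topology_spec.
    """
    cores = defaultdict(list)
    for cpu in sorted(temps):
        cores[cpu // THREADS_PER_CORE].append(temps[cpu])
    return dict(cores)


if __name__ == "__main__":
    for core, readings in group_by_core(TEMPS).items():
        package, local = divmod(core, CORES_PER_PACKAGE)
        note = "siblings match" if len(set(readings)) == 1 else "siblings differ"
        print(f"package {package} core {local}: {readings} ({note})")
```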

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey




signature.asc
Description: OpenPGP digital signature


Re: 9.0 spontaneously reboots

2012-03-13 Thread Matthew Seaman
On 13/03/2012 09:08, Da Rock wrote:
> On 03/13/12 19:11, Edward M. wrote:
>> On 03/13/2012 01:59 AM, Volodymyr Kostyrko wrote:

>>> I have already moved from Kingston to Hynix with no luck. My next guess
>>> is a motherboard problem (as the memory is split between the
>>> processors) or a processor problem. I'm going to pop one processor
>>> out, leaving all the memory on the other one.
>>
>> I had a motherboard that was also rebooting constantly; it turned
>> out it was suffering from capacitor plague.
>> I suggest inspecting each capacitor for any signs of leakage and for
>> broken traces.
>
> I have to agree. I've also seen this behaviour on other systems and OSes.

Yes.  A replacement motherboard would be my next step too.  While bad
capacitors are a fairly common cause, there can be other reasons: a
crack in one of the traces, or a dry-soldered joint that loses its
electrical connection because of thermal expansion, or because of the
extra vibration when the fans go to full power, or even when there is a
lot of disk IO activity.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey
JID: matt...@infracaninophile.co.uk
7 Priory Courtyard, Flat 3, Ramsgate, Kent, CT11 9PW





Re: 9.0 spontaneously reboots

2012-03-13 Thread Volodymyr Kostyrko

Matthew Seaman wrote:
> On 13/03/2012 08:59, Volodymyr Kostyrko wrote:
>> The only other weird thing about this server is:
>>
>> dev.cpu.0.temperature: 37,0C
>> dev.cpu.1.temperature: 37,0C
>> dev.cpu.2.temperature: 35,0C
>> dev.cpu.3.temperature: 35,0C
>> dev.cpu.4.temperature: 43,0C
>> dev.cpu.5.temperature: 43,0C
>> dev.cpu.6.temperature: 38,0C
>> dev.cpu.7.temperature: 38,0C
>> dev.cpu.8.temperature: 38,0C
>> dev.cpu.9.temperature: 38,0C
>> dev.cpu.10.temperature: 37,0C
>> dev.cpu.11.temperature: 37,0C
>> dev.cpu.12.temperature: 33,0C
>> dev.cpu.13.temperature: 33,0C
>> dev.cpu.14.temperature: 34,0C
>> dev.cpu.15.temperature: 34,0C
>>
>> And it's consistent: cores 4 and 5 are always hotter than any other.
>> This could be something to do with the scheduler, but it started before
>> there was any actual load. Though the numbers are normal, I have never
>> seen anything like it...
>
> Two cores per socket, and 8 sockets on the board?  If so, that looks
> absolutely fine to me.  The average temperature is about 36.9C but 43.0C is
> still well within spec.  That difference of just over 6 degrees is not
> really significant and probably entirely due to different airflow
> patterns over the different CPU sockets.  If you swap the CPU package in
> that socket with one of the other ones, you'll find the hot spot stays
> put.  You might be able to even things out by rerouting cables, but
> really it's not worth the hassle and won't make any perceptible
> difference to performance.

Nope:

CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz (2394.05-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs
FreeBSD/SMP: 2 package(s) x 4 core(s) x 2 SMT threads

So the difference is confined to one physical core with two SMT threads.

--
Sphinx of black quartz judge my vow.


Re: 9.0 spontaneously reboots

2012-03-13 Thread Matthew Seaman
On 13/03/2012 08:59, Volodymyr Kostyrko wrote:
> The only other weird thing about this server is:
> 
> dev.cpu.0.temperature: 37,0C
> dev.cpu.1.temperature: 37,0C
> dev.cpu.2.temperature: 35,0C
> dev.cpu.3.temperature: 35,0C
> dev.cpu.4.temperature: 43,0C
> dev.cpu.5.temperature: 43,0C
> dev.cpu.6.temperature: 38,0C
> dev.cpu.7.temperature: 38,0C
> dev.cpu.8.temperature: 38,0C
> dev.cpu.9.temperature: 38,0C
> dev.cpu.10.temperature: 37,0C
> dev.cpu.11.temperature: 37,0C
> dev.cpu.12.temperature: 33,0C
> dev.cpu.13.temperature: 33,0C
> dev.cpu.14.temperature: 34,0C
> dev.cpu.15.temperature: 34,0C
> 
> And it's consistent: cores 4 and 5 are always hotter than any other.
> This could be something to do with the scheduler, but it started before
> there was any actual load. Though the numbers are normal, I have never
> seen anything like it...

Two cores per socket, and 8 sockets on the board?  If so, that looks
absolutely fine to me.  The average temperature is about 36.9C but 43.0C is
still well within spec.  That difference of just over 6 degrees is not
really significant and probably entirely due to different airflow
patterns over the different CPU sockets.  If you swap the CPU package in
that socket with one of the other ones, you'll find the hot spot stays
put.  You might be able to even things out by rerouting cables, but
really it's not worth the hassle and won't make any perceptible
difference to performance.
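The arithmetic can be reproduced from the posted sysctl output. A minimal sketch that parses the `dev.cpu.N.temperature` lines (accepting either a comma or a dot as the decimal separator, since the output here uses a comma) and reports the mean and how far the hottest reading sits above it:

```python
import re
from statistics import mean

# The sysctl output posted in this thread (comma used as decimal separator).
SYSCTL_OUTPUT = """\
dev.cpu.0.temperature: 37,0C
dev.cpu.1.temperature: 37,0C
dev.cpu.2.temperature: 35,0C
dev.cpu.3.temperature: 35,0C
dev.cpu.4.temperature: 43,0C
dev.cpu.5.temperature: 43,0C
dev.cpu.6.temperature: 38,0C
dev.cpu.7.temperature: 38,0C
dev.cpu.8.temperature: 38,0C
dev.cpu.9.temperature: 38,0C
dev.cpu.10.temperature: 37,0C
dev.cpu.11.temperature: 37,0C
dev.cpu.12.temperature: 33,0C
dev.cpu.13.temperature: 33,0C
dev.cpu.14.temperature: 34,0C
dev.cpu.15.temperature: 34,0C
"""

# Accept either ',' or '.' as the decimal separator.
TEMP_LINE = re.compile(r"dev\.cpu\.(\d+)\.temperature:\s*(-?\d+)[,.](\d+)C")


def parse_temps(text):
    """Return {logical_cpu_id: degrees_celsius} for every matching line."""
    return {int(cpu): float(f"{whole}.{frac}")
            for cpu, whole, frac in TEMP_LINE.findall(text)}


if __name__ == "__main__":
    temps = parse_temps(SYSCTL_OUTPUT)
    avg = mean(temps.values())
    hottest = max(temps, key=temps.get)  # first CPU with the highest reading
    print(f"mean {avg:.3f}C; cpu{hottest} at {temps[hottest]:.1f}C is "
          f"{temps[hottest] - avg:.3f}C above the mean")
```

For the posted figures the mean works out to 36.875C, with cpu4 6.125C above it, which is the "just over 6 degrees" discussed above.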

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey






Re: 9.0 spontaneously reboots

2012-03-13 Thread Da Rock

On 03/13/12 19:11, Edward M. wrote:
> On 03/13/2012 01:59 AM, Volodymyr Kostyrko wrote:
>> I have already moved from Kingston to Hynix with no luck. My next guess
>> is a motherboard problem (as the memory is split between the
>> processors) or a processor problem. I'm going to pop one processor
>> out, leaving all the memory on the other one.
>
> I had a motherboard that was also rebooting constantly; it turned
> out it was suffering from capacitor plague.
> I suggest inspecting each capacitor for any signs of leakage and for
> broken traces.

I have to agree. I've also seen this behaviour on other systems and OSes.


Re: 9.0 spontaneously reboots

2012-03-13 Thread Edward M.

On 03/13/2012 01:59 AM, Volodymyr Kostyrko wrote:
> I have already moved from Kingston to Hynix with no luck. My next guess
> is a motherboard problem (as the memory is split between the
> processors) or a processor problem. I'm going to pop one processor
> out, leaving all the memory on the other one.

I had a motherboard that was also rebooting constantly; it turned
out it was suffering from capacitor plague.
I suggest inspecting each capacitor for any signs of leakage and for
broken traces.






Re: 9.0 spontaneously reboots

2012-03-13 Thread Volodymyr Kostyrko

Matthew Seaman wrote:
>> The only load I know of that reliably causes a lockup within a few hours
>> is memcached. The project has now been migrated to redis, and the
>> machine survives for two weeks. The most common problem behind the
>> lockups is an ECC error.
>
> I see.  That puts a different complexion on things.  Although it is
> application-specific, it doesn't rule out hardware problems.  In fact,
> given the nature of the error -- ECC problems -- it pretty much nails it
> as something wrong with the RAM in that machine.
>
> Given that memtest86 doesn't show any problems, and that you can run a
> similar workload with different software, it suggests that one or more
> of your memory sticks are marginal.  Something like extra heat due to
> higher rates of memory access from a particular application could be
> tipping them over the edge into failure.
>
> The 'marginal' behaviour need not be a fault in the memory stick per se.
> It could simply be that the particular characteristics of the memory you
> have installed are not exactly compatible with your motherboard.  In
> theory, memory conforming to a particular standard should avoid this
> sort of problem, but unfortunately that is not infallible.  Swapping out
> memory sticks for an equivalent specification from a different
> manufacturer should give good results.

I have already moved from Kingston to Hynix with no luck. My next guess
is a motherboard problem (as the memory is split between the
processors) or a processor problem. I'm going to pop one processor
out, leaving all the memory on the other one.

The only other weird thing about this server is:

dev.cpu.0.temperature: 37,0C
dev.cpu.1.temperature: 37,0C
dev.cpu.2.temperature: 35,0C
dev.cpu.3.temperature: 35,0C
dev.cpu.4.temperature: 43,0C
dev.cpu.5.temperature: 43,0C
dev.cpu.6.temperature: 38,0C
dev.cpu.7.temperature: 38,0C
dev.cpu.8.temperature: 38,0C
dev.cpu.9.temperature: 38,0C
dev.cpu.10.temperature: 37,0C
dev.cpu.11.temperature: 37,0C
dev.cpu.12.temperature: 33,0C
dev.cpu.13.temperature: 33,0C
dev.cpu.14.temperature: 34,0C
dev.cpu.15.temperature: 34,0C

And it's consistent: cores 4 and 5 are always hotter than any other.
This could be something to do with the scheduler, but it started before
there was any actual load. Though the numbers are normal, I have never
seen anything like it...


--
Sphinx of black quartz judge my vow.


Re: 9.0 spontaneously reboots

2012-03-13 Thread Matthew Seaman
On 13/03/2012 08:09, Volodymyr Kostyrko wrote:
> The only load I know of that reliably causes a lockup within a few hours
> is memcached. The project has now been migrated to redis, and the
> machine survives for two weeks. The most common problem behind the
> lockups is an ECC error.

I see.  That puts a different complexion on things.  Although it is
application-specific, it doesn't rule out hardware problems.  In fact,
given the nature of the error -- ECC problems -- it pretty much nails it
as something wrong with the RAM in that machine.

Given that memtest86 doesn't show any problems, and that you can run a
similar workload with different software, it suggests that one or more
of your memory sticks are marginal.  Something like extra heat due to
higher rates of memory access from a particular application could be
tipping them over the edge into failure.

The 'marginal' behaviour need not be a fault in the memory stick per se.
It could simply be that the particular characteristics of the memory you
have installed are not exactly compatible with your motherboard.  In
theory, memory conforming to a particular standard should avoid this
sort of problem, but unfortunately that is not infallible.  Swapping out
memory sticks for an equivalent specification from a different
manufacturer should give good results.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey






Re: 9.0 spontaneously reboots

2012-03-13 Thread Volodymyr Kostyrko

Da Rock wrote:
>>> I have one machine that behaves unstably. This was happening before
>>> 9.0 too. After upgrading to 9.0 the machine was given a light load,
>>> and now it reboots. The memory was already tested (without any
>>> errors) and replaced after another reboot.
>>
>> So your RAM is good enough to pass a memory test. It doesn't mean it's
>> not the culprit. Way too many false negatives from those things.
>
> Overnight soak test with memtest possible?

I'm currently thinking of moving the projects off this server so I can
look at it more closely. I can't take the server down for that long. But
it survives an hour in memtest.


--
Sphinx of black quartz judge my vow.


Re: 9.0 spontaneously reboots

2012-03-13 Thread Volodymyr Kostyrko

Adam Vande More wrote:
>> I have one machine that behaves unstably. This was happening before 9.0
>> too. After upgrading to 9.0 the machine was given a light load, and now
>> it reboots. The memory was already tested (without any errors) and
>> replaced after another reboot.
>
> So your RAM is good enough to pass a memory test.  It doesn't mean it's not
> the culprit.  Way too many false negatives from those things.

True. At first the server was populated with Kingston memory, and now I
have moved to Hynix. It still sometimes gives me ECC errors.


--
Sphinx of black quartz judge my vow.


Re: 9.0 spontaneously reboots

2012-03-13 Thread Volodymyr Kostyrko

Matthew Seaman wrote:
> On 12/03/2012 14:07, Volodymyr Kostyrko wrote:
>> What should I blame now? Is it some programming error, or should I
>> continue testing/replacing the motherboard and CPU?
>
> Instability that appears spontaneously (and especially if it persists
> across system updates) is almost always caused by hardware problems.
> So, yes, carry on swapping out components until you can isolate where
> the problem is.
>
> Some common hardware faults that might cause the symptoms you've seen:
>
> * PSU going flaky.  If you have the right measuring equipment, this
>   is pretty easy to detect by looking at the output voltages -- if
>   they've drifted out of spec, or if you've got mains frequency
>   jitter leaking through, then it's no wonder your system crashes.

The sensors report that everything is fine.

> * Similarly, if the crashing is associated with system load
>   (particularly at startup, when disks are spinning up and so on),
>   this can indicate a power supply fading under load.  That can
>   happen due to age, or because you've been adding extra hardware
>   and haven't considered the power requirements.

The only load I know of that reliably causes a lockup within a few hours
is memcached. The project has now been migrated to redis, and the
machine survives for two weeks. The most common problem behind the
lockups is an ECC error.

> * The other reason for crashing under load is overheating.
>   Sometimes this can be cured easily by cleaning dust out of vents
>   and heat-sinks.  Check too for fans that have seized or are running
>   slowly.

The sensors report normal temperatures.

> * You may need to clean off any old heat-sink compound and re-apply
>   a fresh layer, especially if you've taken CPU coolers off at
>   some point.
>
> * There's also the old capacitor problem: electrolytic capacitors
>   have a failure mode that builds up positive pressure inside
>   them.  This is detectable by the end of the capacitor being bowed
>   out, rather than slightly concave.  (Generally this means a new
>   motherboard, although I've heard of people being able to solder in
>   replacements successfully.)

It's a fully serviced SuperMicro server without any other problems.

> Other than that, try disconnecting and reconnecting peripherals like
> disks or DVD drives and so forth in various combinations to see if that
> improves system stability.  One faulty component can knock the whole
> machine over.


--
Sphinx of black quartz judge my vow.


Re: 9.0 spontaneously reboots

2012-03-12 Thread Da Rock

On 03/13/12 06:07, Adam Vande More wrote:
> On Mon, Mar 12, 2012 at 9:07 AM, Volodymyr Kostyrko wrote:
>> Hi all.
>>
>> I have one machine that behaves unstably. This was happening before 9.0
>> too. After upgrading to 9.0 the machine was given a light load, and now
>> it reboots. The memory was already tested (without any errors) and
>> replaced after another reboot.
>
> So your RAM is good enough to pass a memory test.  It doesn't mean it's not
> the culprit.  Way too many false negatives from those things.

Overnight soak test with memtest possible?


Re: 9.0 spontaneously reboots

2012-03-12 Thread Da Rock

On 03/13/12 02:56, Matthew Seaman wrote:
> On 12/03/2012 14:07, Volodymyr Kostyrko wrote:
>> What should I blame now? Is it some programming error, or should I
>> continue testing/replacing the motherboard and CPU?
>
> Instability that appears spontaneously (and especially if it persists
> across system updates) is almost always caused by hardware problems.
> So, yes, carry on swapping out components until you can isolate where
> the problem is.
>
> Some common hardware faults that might cause the symptoms you've seen:
>
> * PSU going flaky.  If you have the right measuring equipment, this
>   is pretty easy to detect by looking at the output voltages -- if
>   they've drifted out of spec, or if you've got mains frequency
>   jitter leaking through, then it's no wonder your system crashes.
>
> * Similarly, if the crashing is associated with system load
>   (particularly at startup, when disks are spinning up and so on),
>   this can indicate a power supply fading under load.  That can
>   happen due to age, or because you've been adding extra hardware
>   and haven't considered the power requirements.
>
> * The other reason for crashing under load is overheating.
>   Sometimes this can be cured easily by cleaning dust out of vents
>   and heat-sinks.  Check too for fans that have seized or are running
>   slowly.
>
> * You may need to clean off any old heat-sink compound and re-apply
>   a fresh layer, especially if you've taken CPU coolers off at
>   some point.
>
> * There's also the old capacitor problem: electrolytic capacitors
>   have a failure mode that builds up positive pressure inside
>   them.  This is detectable by the end of the capacitor being bowed
>   out, rather than slightly concave.  (Generally this means a new
>   motherboard, although I've heard of people being able to solder in
>   replacements successfully.)

Yes, that works (relatively easily), but you need to be good with a
soldering iron and be able to remove the cap without breaking tracks or
shorting them. If you're not that skilled or confident, I wouldn't try,
although if the MB is cactus anyway you may have nothing to lose :)

> Other than that, try disconnecting and reconnecting peripherals like
> disks or DVD drives and so forth in various combinations to see if that
> improves system stability.  One faulty component can knock the whole
> machine over.
>
> Cheers,
>
> Matthew





Re: 9.0 spontaneously reboots

2012-03-12 Thread Adam Vande More
On Mon, Mar 12, 2012 at 9:07 AM, Volodymyr Kostyrko wrote:

> Hi all.
>
> I have one machine that behaves unstably. This was happening before 9.0
> too. After upgrading to 9.0 the machine was given a light load, and now
> it reboots. The memory was already tested (without any errors) and
> replaced after another reboot.
>

So your RAM is good enough to pass a memory test.  It doesn't mean it's not
the culprit.  Way too many false negatives from those things.

-- 
Adam Vande More


Re: 9.0 spontaneously reboots

2012-03-12 Thread Al Plant

Matthew Seaman wrote:
> On 12/03/2012 14:07, Volodymyr Kostyrko wrote:
>> What should I blame now? Is it some programming error, or should I
>> continue testing/replacing the motherboard and CPU?
>
> Instability that appears spontaneously (and especially if it persists
> across system updates) is almost always caused by hardware problems.
> So, yes, carry on swapping out components until you can isolate where
> the problem is.
>
> Some common hardware faults that might cause the symptoms you've seen:
>
> * PSU going flaky.  If you have the right measuring equipment, this
>   is pretty easy to detect by looking at the output voltages -- if
>   they've drifted out of spec, or if you've got mains frequency
>   jitter leaking through, then it's no wonder your system crashes.
>
> * Similarly, if the crashing is associated with system load
>   (particularly at startup, when disks are spinning up and so on),
>   this can indicate a power supply fading under load.  That can
>   happen due to age, or because you've been adding extra hardware
>   and haven't considered the power requirements.
>
> * The other reason for crashing under load is overheating.
>   Sometimes this can be cured easily by cleaning dust out of vents
>   and heat-sinks.  Check too for fans that have seized or are running
>   slowly.
>
> * You may need to clean off any old heat-sink compound and re-apply
>   a fresh layer, especially if you've taken CPU coolers off at
>   some point.
>
> * There's also the old capacitor problem: electrolytic capacitors
>   have a failure mode that builds up positive pressure inside
>   them.  This is detectable by the end of the capacitor being bowed
>   out, rather than slightly concave.  (Generally this means a new
>   motherboard, although I've heard of people being able to solder in
>   replacements successfully.)
>
> Other than that, try disconnecting and reconnecting peripherals like
> disks or DVD drives and so forth in various combinations to see if that
> improves system stability.  One faulty component can knock the whole
> machine over.
>
> Cheers,
>
> Matthew

Aloha,

I have seen the problems Matthew is addressing here in Hawaii. If your
equipment is in a room without climate control, check for corrosion on
the board and on any plug-ins. Clean all the cables and components that
can be removed. (There's no air-con for my systems here in Hawaii, and
humidity is normally around 60-70%, so we have to clean the contacts and
apply Teflon to them about twice a year.) Corrosion is worse if you're
near the ocean or a brackish river.

Happy hunting.

~Al Plant - Honolulu, Hawaii -  Phone:  808-284-2740
  + http://hawaiidakine.com + http://freebsdinfo.org +
  + http://aloha50.net   - Supporting - FreeBSD  7.2 - 8.0 - 9* +
  < email: n...@hdk5.net >
"All that's really worth doing is what we do for others."- Lewis Carroll



Re: 9.0 spontaneously reboots

2012-03-12 Thread Matthew Seaman
On 12/03/2012 14:07, Volodymyr Kostyrko wrote:
> What should I blame now? Is it some programming error or should I
> continue with testing/changing motherboard and cpu?

Instability that appears spontaneously (and especially if it persists
across system updates) is almost always caused by hardware problems.
So, yes, carry on swapping out components until you can isolate where
the problem is.

Some common hardware faults that might cause the symptoms you've seen:

* PSU going flaky.  If you have the right measuring equipment, this
  is pretty easy to detect by looking at the output voltages -- if
  they've drifted out of spec, or if you've got mains frequency
  jitter leaking through, then it's no wonder your system crashes.

* Similarly, if the crashing is associated with system load
  (particularly at startup, when disks are spinning up and so on),
  this can indicate a power supply fading under load.  That can
  happen due to age, or because you've been adding extra hardware
  and haven't considered the power requirements.

* The other reason for crashing under load is overheating.
  Sometimes this can be cured easily by cleaning dust out of vents
  and heat-sinks.  Check too for fans that have seized or are running
  slowly.

* You may need to clean off any old heat-sink compound and re-apply
  a fresh layer, especially if you've taken CPU coolers off at
  some point.

* There's also the old capacitor problem: electrolytic capacitors
  have a failure mode that builds up positive pressure inside
  them.  This is detectable by the end of the capacitor being bowed
  out, rather than slightly concave.  (Generally this means a new
  motherboard, although I've heard of people being able to solder in
  replacements successfully.)
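The first item in the list can be made concrete when the board's sensors (or a meter) report rail voltages. A minimal sketch that flags rails outside tolerance; the nominal values and the +/-5% (positive rails) and +/-10% (-12V) limits are ASSUMPTIONS taken from the ATX12V power supply design guide, server power supplies may be specified differently, and the example readings are made up.

```python
# ASSUMED nominal rails and tolerances, taken from the ATX12V design guide
# (+/-5% on the positive rails, +/-10% on -12V). Server PSUs may differ.
RAILS = {
    "+12V": (12.0, 0.05),
    "+5V": (5.0, 0.05),
    "+3.3V": (3.3, 0.05),
    "-12V": (-12.0, 0.10),
}


def out_of_spec(readings):
    """Return {rail: (reading, low, high)} for each rail outside tolerance."""
    bad = {}
    for rail, value in readings.items():
        nominal, tol = RAILS[rail]
        # sorted() keeps low <= high for the negative rail as well.
        low, high = sorted((nominal * (1 - tol), nominal * (1 + tol)))
        if not low <= value <= high:
            bad[rail] = (value, round(low, 3), round(high, 3))
    return bad


if __name__ == "__main__":
    # HYPOTHETICAL readings, e.g. copied from the board's sensor output.
    readings = {"+12V": 11.31, "+5V": 5.02, "+3.3V": 3.31, "-12V": -11.9}
    for rail, (value, low, high) in out_of_spec(readings).items():
        print(f"{rail}: {value} V is outside {low}..{high} V")
```

A single snapshot only catches gross drift; a supply that sags under load (the second item) shows up better if readings are taken while the machine is busy.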

Other than that, try disconnecting and reconnecting peripherals like
disks or DVD drives and so forth in various combinations to see if that
improves system stability.  One faulty component can knock the whole
machine over.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey






9.0 spontaneously reboots

2012-03-12 Thread Volodymyr Kostyrko

Hi all.

I have one machine that behaves unstably. This was happening before 9.0
too. After upgrading to 9.0 the machine was given a light load, and now
it reboots. The memory was already tested (without any errors) and
replaced after another reboot.


The only thing I have is this snippet from the logs:

Mar 12 07:51:56 beeb kernel: interrupt                   total
Mar 12 07:51:56 beeb kernel: irq18: ehci0 uhci5+           325
Mar 12 07:51:56 beeb kernel: irq19: uhci2 uhci4           4350
Mar 12 07:51:56 beeb kernel: irq23: ehci1 uhci3         272776
Mar 12 07:51:56 beeb kernel: cpu0:timer              306304013
Mar 12 07:51:56 beeb kernel: irq256: mpt0            106758743
Mar 12 07:51:56 beeb kernel: cpu1:timer               50588836
Mar 12 07:51:56 beeb kernel: cpu14:timer              40862828
Mar 12 07:51:56 beeb kernel: cpu12:timer                  0057
Mar 12 07:51:56 beeb kernel: cpu6:timer               51650325
Mar 12 07:51:56 beeb kernel: cpu13:timer              35826328
Mar 12 07:51:56 beeb kernel: cpu3:timer               47414874
Mar 12 07:51:56 beeb kernel: cpu10:timer             101158759
Mar 12 07:51:56 beeb kernel: cpu2:timer              116817563
Mar 12 07:51:56 beeb kernel: cpu8:timer              137051223
Mar 12 07:51:56 beeb kernel: cpu7:timer               31732225
Mar 12 07:51:56 beeb kernel: cpu11:timer              43244351
Mar 12 07:51:56 beeb kernel: cpu4:timer               83143936
Mar 12 07:51:56 beeb kernel: cpu9:timer               49622770
Mar 12 07:51:56 beeb kernel: cpu5:timer               40662969
Mar 12 07:51:56 beeb kernel: cpu15:timer              27434472
Mar 12 07:51:56 beeb kernel: irq257: igb0:que 0       20058599
Mar 12 07:51:56 beeb kernel: irq258: igb0:que 1       15054525
Mar 12 07:51:56 beeb kernel: irq259: igb0:que 2       14738762
Mar 12 07:51:56 beeb kernel: irq260: igb0:que 3       14702046
Mar 12 07:51:56 beeb kernel: irq261: igb0:que 4       14842310
Mar 12 07:51:56 beeb kernel: irq262: igb0:que 5       15035818
Mar 12 07:51:56 beeb kernel: irq263: igb0:que 6       14826606
Mar 12 07:51:56 beeb kernel: irq264: igb0:que 7       14924631
Mar 12 07:51:56 beeb kernel: irq265: igb0:link               2
Mar 12 07:51:56 beeb kernel: Total                  1461395023
Mar 12 07:51:56 beeb kernel: KDB: stack backtrace:
Mar 12 07:51:56 beeb kernel: #0 0x8038d458 at kdb_backtrace+0x58
Mar 12 07:51:56 beeb kernel: #1 0x80315b4b at watchdog_fire+0x8b
Mar 12 07:51:56 beeb kernel: #2 0x80315e10 at hardclock_anycpu+0x2a0
Mar 12 07:51:56 beeb kernel: #3 0x80583278 at handleevents+0xd8
Mar 12 07:51:56 beeb kernel: #4 0x80583e36 at timercb+0x2d6
Mar 12 07:51:56 beeb kernel: #5 0x805aec46 at lapic_handle_timer+0xb6
Mar 12 07:51:56 beeb kernel: #6 0x80557f2c at Xtimerint+0x8c
Mar 12 07:51:56 beeb kernel: #7 0x8055c348 at cpu_idle_acpi+0x38
Mar 12 07:51:56 beeb kernel: #8 0x8055c402 at cpu_idle+0xa2
Mar 12 07:51:56 beeb kernel: #9 0x80380b7f at sched_idletd+0x37f
Mar 12 07:51:56 beeb kernel: #10 0x80331d36 at fork_exit+0x76
Mar 12 07:51:56 beeb kernel: #11 0x8055790e at fork_trampoline+0xe

What should I blame now? Is it some programming error, or should I
continue testing/replacing the motherboard and CPU?
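The backtrace in the snippet lists the innermost frame first (#0). Reading it outermost-first makes the chain easier to follow: the CPU was in the idle thread (`sched_idletd`, `cpu_idle`, `cpu_idle_acpi`) when the local APIC timer interrupt arrived, the clock code (`hardclock_anycpu`) called `watchdog_fire`, and that called `kdb_backtrace`. A minimal sketch that extracts and reverses the frames, assuming the `#N 0xADDR at function+0xOFF` format shown in the log; the frames are copied from the snippet, with the wrapped frame #5 rejoined.

```python
import re

# Frame format as it appears in the log snippet: "#N 0xADDR at func+0xOFF".
FRAME = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\w+)\+(0x[0-9a-fA-F]+)")

LOG = """\
Mar 12 07:51:56 beeb kernel: #0 0x8038d458 at kdb_backtrace+0x58
Mar 12 07:51:56 beeb kernel: #1 0x80315b4b at watchdog_fire+0x8b
Mar 12 07:51:56 beeb kernel: #2 0x80315e10 at hardclock_anycpu+0x2a0
Mar 12 07:51:56 beeb kernel: #3 0x80583278 at handleevents+0xd8
Mar 12 07:51:56 beeb kernel: #4 0x80583e36 at timercb+0x2d6
Mar 12 07:51:56 beeb kernel: #5 0x805aec46 at lapic_handle_timer+0xb6
Mar 12 07:51:56 beeb kernel: #6 0x80557f2c at Xtimerint+0x8c
Mar 12 07:51:56 beeb kernel: #7 0x8055c348 at cpu_idle_acpi+0x38
Mar 12 07:51:56 beeb kernel: #8 0x8055c402 at cpu_idle+0xa2
Mar 12 07:51:56 beeb kernel: #9 0x80380b7f at sched_idletd+0x37f
Mar 12 07:51:56 beeb kernel: #10 0x80331d36 at fork_exit+0x76
Mar 12 07:51:56 beeb kernel: #11 0x8055790e at fork_trampoline+0xe
"""


def frames(text):
    """Return (depth, function, offset) tuples, innermost frame (#0) first."""
    found = [(int(m.group(1)), m.group(3), int(m.group(4), 16))
             for m in FRAME.finditer(text)]
    return sorted(found)


if __name__ == "__main__":
    # Outermost first, i.e. in the order the calls were made.
    for depth, func, offset in reversed(frames(LOG)):
        print(f"#{depth:<2} {func}+{offset:#x}")
```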


--
Sphinx of black quartz judge my vow.