Re: Stability Issues on 5.4-RELEASE Box

2007-03-01 Thread Kris Kennaway
On Wed, Feb 28, 2007 at 05:39:48PM -0500, [EMAIL PROTECTED] wrote:
 I'm not running your application mix, but I've never seen random reboots
 unless there were hardware issues.  With 5.X these included having
 hyperthreading turned on, which I know caused problems with my dual XEON
 system.
 
 Hmmm...two answers so far, two people saying hyperthreading can be an 
 issue. I'll definitely have that turned off ASAP.

HTT isn't expected to cause problems, at least on modern versions of
FreeBSD (I don't remember whether old versions like 5.4 have a bug), but
it could if you are running older hardware with broken BIOS support.

 * Issues with files that are not found on startup sometimes, but are 
 other times. Prime example: the Zope CMS system that's been 
 installed failed to find libmysqlclient.so after a planned soft 
 reboot, but found it with no trouble on a subsequent boot a few 
 minutes later, with no config changes in between.
 
 Haven't seen that; are there any messages indicating you're having
 filesystem problems?
 
 Thanks for asking; I see some new nasties in /var/log/messages:
 
 Feb 27 09:05:37 www fsck: /dev/ad4s1f: PARTIALLY TRUNCATED INODE I=9397392
 Feb 27 09:05:37 www fsck: /dev/ad4s1f: UNEXPECTED SOFT UPDATE 
 INCONSISTENCY; RUN fsck MANUALLY.

You definitely need to drop to single-user mode and run fsck -f:
filesystem corruption can cause many problems.
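
To find which filesystems fsck has flagged before scheduling the
single-user session, something like the following rough sketch could be
run against /var/log/messages. It assumes the log lines look like the two
quoted above (with the wrapped line joined back into one); the function
name and regular expression are illustrative only, not part of any
FreeBSD tool.

```python
import re

# Matches fsck lines as they appear in /var/log/messages, for example:
# "Feb 27 09:05:37 www fsck: /dev/ad4s1f: PARTIALLY TRUNCATED INODE I=9397392"
FSCK_LINE = re.compile(r"fsck: (?P<device>/dev/\S+): (?P<message>.+)$")


def devices_needing_manual_fsck(log_lines):
    """Return the sorted devices for which fsck asked to be run manually."""
    devices = set()
    for line in log_lines:
        match = FSCK_LINE.search(line.rstrip("\n"))
        # Only keep devices where fsck gave up and asked for a manual run.
        if match and "RUN fsck MANUALLY" in match.group("message"):
            devices.add(match.group("device"))
    return sorted(devices)
```

Passing an open file works, e.g.
devices_needing_manual_fsck(open("/var/log/messages")); each device it
returns is a candidate for fsck -f from single-user mode.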

 * Given my dmesg below, do you see any specific problems?
 
 The interrupt storm on uhci1+ is not a good thing.
 
 Any thoughts on how to fix it?

You can disable USB support if you do not need it, otherwise a fix
will probably involve an upgrade to a newer version as a first step.
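
To judge whether a change actually helps, it may be useful to count how
often the storm is reported before and after. The sketch below is only
that: it assumes the kernel logs the storm in the form shown in the
comment, and the irq number in the example is hypothetical, since the
dmesg posted in this thread is cut off before the relevant lines.

```python
import re
from collections import Counter

# Assumed message shape (not taken from the truncated dmesg in this thread):
#   Interrupt storm detected on "irq19: uhci1+"; throttling interrupt source
STORM = re.compile(r'interrupt storm detected on "([^"]+)"', re.IGNORECASE)


def count_interrupt_storms(log_lines):
    """Count interrupt storm reports per interrupt source."""
    counts = Counter()
    for line in log_lines:
        match = STORM.search(line)
        if match:
            # The captured group is the source name, e.g. "irq19: uhci1+".
            counts[match.group(1)] += 1
    return counts
```

If the count for the uhci source drops to zero after USB is disabled or
the system is upgraded, that is reasonable evidence the change took
effect.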

Kris




Stability Issues on 5.4-RELEASE Box

2007-02-28 Thread alex

Hello All,

I've recently fallen into the task of administering a FreeBSD 
5.4-RELEASE box that acts as the web server for a small non-profit that 
I volunteer for. Unfortunately, the system has been having some 
extremely vexing stability issues over the last month or so, which even 
my 6+ years of experience as an OpenBSD admin have not helped me track 
down.


First things first, let me say explicitly that I'm not trying to say 
that FreeBSD sucks, that it's not stable, or anything like that. It's a 
fine OS, and I'm sure that it's either faulty hardware or a 
misconfiguration of some sort causing these problems. :-)


That said, here are some of the symptoms the box has been experiencing:

* Occasional random reboots. I've only personally witnessed one, and 
they don't happen often, but any time a *NIX box just reboots for no 
apparent reason (there was no indication of a problem in any of the 
logs, at least that I could see), something really bad is going on.


* Random extreme slowness when logging in via SSH, with the time to get 
a shell ranging from a second or two all the way up to 80 seconds. The 
box isn't busy enough that it's just slow due to load (especially 
since, once you're in, things fly), and it's not just a reverse DNS 
issue like I've seen on OpenBSD (this occurs even when logging in from 
locations listed in /etc/hosts that resolve properly out of that file). 
Until I upgraded to the current version of OpenSSL/OpenSSH, the box 
would occasionally just become unresponsive altogether over SSH, not 
allowing logins for 15+ minutes at a time.


* Issues with files that are not found on startup sometimes, but are 
other times. Prime example: the Zope CMS system that's been installed 
failed to find libmysqlclient.so after a planned soft reboot, but found 
it with no trouble on a subsequent boot a few minutes later, with no 
config changes in between.


* A warning in /var/log/messages that the root filesystem was full, 
when it was at 60% capacity (and something like 2% inode capacity); the 
problem has yet to repeat, though no files have been cleared off of 
that filesystem.


* Random crashes of the Zope/Plone system that's running the main part 
of the web site. While I realize that, in and of itself, this means 
nothing about the stability of the underlying OS, in the context of all 
of the other things going on (as well as the fact that the Zope list 
has been unable to help figure out why it's crashing), it seems like it 
might be further evidence of a larger problem.
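
For the "filesystem full" warning in particular, a minimal sketch like
the one below could be run from cron to record block and inode usage over
time, which might show whether the root filesystem fills up briefly and
then drains again. The function name is made up for illustration, and the
block percentage is computed roughly the way df does it (used space over
the space available to unprivileged users).

```python
import os


def usage_percent(path="/"):
    """Return (block_pct, inode_pct) in use for the filesystem holding path."""
    st = os.statvfs(path)
    # Blocks: measure against what an unprivileged user could use.
    used_blocks = st.f_blocks - st.f_bfree
    usable_blocks = used_blocks + st.f_bavail
    block_pct = 100.0 * used_blocks / usable_blocks if usable_blocks else 0.0
    # Inodes: some filesystems report zero total inodes, so guard the division.
    used_inodes = st.f_files - st.f_ffree
    inode_pct = 100.0 * used_inodes / st.f_files if st.f_files else 0.0
    return block_pct, inode_pct
```

Logging the two numbers every few minutes alongside a timestamp would
make it possible to line them up with the warning in /var/log/messages if
it appears again.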


Thus far, besides simply scanning log files, constantly watching top 
and ps, etc., I've not been able to do much with the box. As I said, 
I upgraded OpenSSL/OpenSSH to current versions, and I installed pf as 
the firewall (there was none before I arrived...don't even get me 
started on that). This weekend the guy who was the previous admin will 
be running a Memtest for me and disabling hyperthreading (which there's 
no performance justification for, and which has caused me stability 
issues at least on Linux in the past), since the server is in Oregon 
and I'm in the DC area. That's about the extent of what I've been able 
to do to date, since this is a production box.


What I'd like to know from you guys is:

* Am I justified in suspecting hyperthreading as a potential cause of 
instability?


* Does 5.4-RELEASE have any known bugs that might cause stability 
issues like the ones I've described here? More importantly, would an 
upgrade to 6.2-RELEASE be worthwhile (as is my instinct), in terms of 
being generally more stable and/or having better hardware support? 
Would such an upgrade be possible/relatively painless to perform 
without being physically at a console, as has been the case with 
OpenBSD over the years?


* Given my dmesg below, do you see any specific problems?

* Do you have any other suggestions for debugging this problem?

Thanks in advance for any help you can provide. :-)

Alex Kirk

dmesg:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE #0: Sun May  8 10:21:06 UTC 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC
ACPI APIC Table: <INTEL  D945GTP >
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz (3200.01-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf43  Stepping = 3
  
Features=0xbfebfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE

  Hyperthreading: 2 logical CPUs
real memory  = 2137509888 (2038 MB)
avail memory = 2086207488 (1989 MB)
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <INTEL D945GTP> on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 

Re: Stability Issues on 5.4-RELEASE Box

2007-02-28 Thread DAve


I would certainly think hardware is the place to look.

Just so you know, we still run a server on FBSD 4.8, and it runs very 
well. We have 4.8, 4.11, 5.2.1, 5.4, 6.1, and 6.2. Oh, and a couple 
Linux, NetBSD, and Solaris boxen too.


I prefer not to chase versions on high load production equipment, 
certainly not as a problem resolution strategy. For the record, I have 
never had a blind upgrade fix an unidentified problem, and if it did I 
would be very worried.


I would guess memory; at least that is where I would look first. I would 
also wonder what environment the server runs in: heat is a killer, and so 
is vibration. Loose racks and humming floors can and will cause 
connections to slip. I have fixed servers that ran for months and 
suddenly showed odd behavior simply by powering down, removing all 
cards/RAM/cables, and then reattaching everything.


Mysterious failures, 3000 miles to the console, I don't envy you ;^)

DAve


--
Three years now I've asked Google why they don't have a

Re: Stability Issues on 5.4-RELEASE Box

2007-02-28 Thread alex

I'm not running your application mix, but I've never seen random reboots
unless there were hardware issues.  With 5.X these included having
hyperthreading turned on, which I know caused problems with my dual XEON
system.


Hmmm...two answers so far, two people saying hyperthreading can be an 
issue. I'll definitely have that turned off ASAP.


* Issues with files that are not found on startup sometimes, but are 
other times. Prime example: the Zope CMS system that's been 
installed failed to find libmysqlclient.so after a planned soft 
reboot, but found it with no trouble on a subsequent boot a few 
minutes later, with no config changes in between.



Haven't seen that; are there any messages indicating you're having
filesystem problems?


Thanks for asking; I see some new nasties in /var/log/messages:

Feb 27 09:05:37 www fsck: /dev/ad4s1f: PARTIALLY TRUNCATED INODE I=9397392
Feb 27 09:05:37 www fsck: /dev/ad4s1f: UNEXPECTED SOFT UPDATE 
INCONSISTENCY; RUN fsck MANUALLY.


Thus far, besides simply scanning log files, constantly watching 
top and ps, etc., I've not been able to do much with the box. As 
I said, I upgraded OpenSSL/OpenSSH to current versions, and I 
installed pf as the firewall (there was none before I 
arrived...don't even get me started on that). This weekend the guy 
who was the previous admin will be running a Memtest for me and 
disabling hyperthreading (which there's no performance justification 
for, and which has caused me stability issues at least on Linux in 
the past), since the server is in Oregon and I'm in the DC area. 
That's about the extent of what I've been able to do to date, since 
this is a production box.




How did you upgrade OpenSSL/SSH - cvsup + buildworld, etc., from the
ports, or some other way?  I've always done this using cvsup, since
OpenSSL/SSH is now built into the OS.


From source on openssl.org and openssh.org. I'm not yet familiar with 
all of the cool helper things that FreeBSD has such as cvsup, and I was 
concerned about getting those two specifically fixed fast when I first 
started this task (as the version on there was old enough that there 
were known remote exploits).


* Does 5.4-RELEASE have any known bugs that might cause stability 
issues like the ones I've described here? More importantly, would an 
upgrade to 6.2-RELEASE be worthwhile (as is my instinct), in terms 
of being generally more stable and/or having better hardware 
support? Would such an upgrade be possible/relatively painless to 
perform without being physically at a console, as has been the case 
with OpenBSD over the years?




I am not running your mix, but I have a 4.11 box, a 5.5-STABLE box, and a
6.2-STABLE box, and haven't seen anything like this. The 4.11 box looks a
lot like yours - Supermicro P4DC6 with dual Xeons, RAID1 with the onboard
Adaptec controller + the add-in RAID board, RAID5 using vinum with an
Adaptec (39160?) board. 5.X plus hyperthreading was definitely a problem
with that system.  I have another similar box which I'm testing with
6.2-STABLE and haven't seen any problems yet.


Again, since the two answers I have so far say essentially the same 
thing here, it looks like my concerns about hardware are valid.



* Given my dmesg below, do you see any specific problems?


The interrupt storm on uhci1+ is not a good thing.


Any thoughts on how to fix it?

Thanks,
Alex Kirk

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]