Re: Hardware or OS problem? System Crashing...

2005-01-06 Thread Joseph Koenig (jWeb)
 On 1/5/2005 at 09:14 Joseph Koenig (jWeb) wrote:
 
 Hi,
 
 We have a system that is currently giving us some trouble. The system
 is FreeBSD 4.9. It's a 2 GHz system with 1MB RAM and (here's the
 kicker) 73GB RAID 1 ATA drives. The system serves as a web/database
 server dedicated to 1 site. Daily the system goes out and downloads
 real estate listings (via shell scripts and cURL) and processes them
 (via PHP into MySQL). Also, nightly the system downloads a zipped set
 of images (probably around 400-500) and processes them into thumbnails
 (PHP scripts calling ImageMagick). Over the last week or two, the
 system is crashing and rebooting into single user mode. It's not
 consistently during updates, or resizing of images, or anything like
 that. Yesterday, it crashed with 99% processor idle and load averages
 of 0.00 0.00 0.00 -- I was watching a 'top' when the machine died.
 When it boots into single user mode, an fsck must be run, which
 identified a few corrupt JPEG files -- however, the sysadmin who
 reboots it never tells me which files they are. The sysadmin is
 convinced it is a FreeBSD problem and says that Linux will not crash
 because of a corrupt file and if it does, will not boot into single
 user mode and he will be able to access it remotely to do the fsck.
 About 3-4 weeks ago, one of the drives in the mirror set crashed and
 had to be replaced. I'm not convinced that drives are not to blame for
 these issues. Is there any way to verify that? Is it possible a
 corrupt JPEG on the drive could cause the system to crash randomly?
 What can I do to correctly identify the problem so that we can fix it
 and not change the OS? Thanks,
 
 The sysadmin has no clue about either linux or freebsd!
 
 A corrupt JPEG cannot cause a crash of the OS, for any real OS.  (If it
 does, it is a bug in the OS, but I doubt one exists)  Real OS includes
 Windows XP, linux, and FreeBSD.
 
 However, an OS crash can cause a corrupt JPEG!
 
 Either Linux or FreeBSD may boot into single user mode when the
 filesystem is corrupt. What your sysadmin means is that the newer
 Linux filesystems use journaling, which makes this situation much less
 likely, though it can still happen. With soft updates, FreeBSD is in
 the same situation as Linux, and soft updates are (generally, there
 are exceptions) better than journaling. Soft updates exist in FreeBSD
 4.9, but I'm not sure how to enable them, or how good they are there.
 (In 5.3 they are awesome!)
 
 I suspect hardware.
 
 I'd burn memtest to a CD, and run that for a few hours to see if
 something is identified.   Memtest won't catch everything, but it does
 a pretty good job.
 
 Also look at other factors.  Does the HVAC kick in when this happens?
 Is someone hitting the panic stop switch?  Situations like that have
 happened, and they can take a while to debug.  They are not likely, but
 don't rule them out.
 
 FreeBSD 4.9 is fairly old at this point.   You should seriously
 consider upgrading to 4.11 (due out in a few weeks), or 5.3 (my
 recommendation, but a much more involved upgrade).
 
 
 In addition to the original problem stated above, we are seeing a number of
 problems like ...in free(): warning: modified (page-) pointer and ...in
 free(): warning: chunk is already free. I have the admin running a memtest
 today, but wanted to make sure these errors were not indicative of something
 else going on. Thanks,
 

Well, the sysadmin tells me that memtest passed. Does anyone have any
suggestions as to what could be causing the crashes? Thanks,

Joe

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Hardware or OS problem? System Crashing...

2005-01-06 Thread Vulpes Velox
On Thu, 06 Jan 2005 13:33:41 -0600
Joseph Koenig (jWeb) [EMAIL PROTECTED] wrote:

 Well, the sysadmin tells me that memtest passed. Any one have any
 suggestions as to what could be causing the crashes? Thanks,

Don't trust memtest. I've seen it fail to identify faulty hardware in
this area.

FreeBSD does not crash because of bad files, and I would be seriously
suspicious of an admin who is trying to feed you this. On top of that,
he does not appear to be concerned about it whatsoever.

Yeah, in 4.x a major file system problem is a lot more likely to need
fsck run manually than in 5.x. 5.x will boot and run a background fsck,
so you will still have network access, etc.

The best way to test a drive is this: run lots of transactions across
all parts of the disk for a good long while. Smartmontools is also
available in the ports.
http://www.freebsd.org/cgi/url.cgi?ports/sysutils/smartmontools/pkg-descr
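
As a rough illustration of touching "all parts of the disk", here is a
minimal read-only sketch in Python. It is only an assumption about how
one might do this, not a standard tool: the function name scan, the
1 MiB chunk size, and the max_errors cutoff are all made up for the
example. It reads a file or raw device from the first byte to the last
and records the offsets where the OS reports a read error.

```python
import os

CHUNK = 1024 * 1024  # read 1 MiB per request (arbitrary choice)


def scan(path, chunk=CHUNK, max_errors=100):
    """Read `path` front to back; return (bytes_read, failed_offsets)."""
    failed = []
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Give up after max_errors failures so a dead device cannot
        # keep the loop running forever.
        while len(failed) < max_errors:
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, chunk)
            except OSError:
                # The OS reported a read error: record where, then skip on.
                failed.append(offset)
                offset += chunk
                continue
            if not data:  # end of file or device
                break
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, failed
```

Run as root against the raw disk, for example scan("/dev/ad0") (the
device name is a guess; use whatever your system calls the disk). A
clean pass proves little, since a mirror may serve the read from the
healthy member, but any failed offsets, or ATA errors showing up in the
kernel log while it runs, point at the drives or cabling. It only
reads, so it does not exercise writes.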

I would be suspicious of nearly any possible chunk of the hardware in
that box, since you've not listed anything that would allow any piece
of hardware to be ruled out as a problem, regardless of the OS being
used.

The places I would focus my attention are the PSU, RAM, CPU,
motherboard, cables, any PCI card or the like, and the drives
themselves.

BTW you may want to check this out...
http://www.freshports.org/graphics/ImageMagick/

Re: Hardware or OS problem? System Crashing...

2005-01-06 Thread Anthony Atkielski
I had a very similar problem over the holidays. After a power failure
over a month ago, I noticed some anomalies in FreeBSD, but they were
very insidious and didn't seem like hardware (and the system was on a
UPS plus a surge protector, so I didn't think the PF alone could have
done damage, unless the power cycled many times over a short period).
I'd get strange faults in programs from time to time, usually some type
of memory faults--usually in Apache (since it uses most of the processor
time), but sometimes in system programs that had never given trouble
before. As time passed, the system would occasionally freeze, or I would
even get kernel panics. There never seemed to be any information left
behind that could help me find out why the system was crashing (fault
type, processes running, etc.), and error messages in logs were scarce.
(If there is a way to debug FreeBSD crashes without running a kernel
specifically set up for the purpose, I'd like to know what it is.)
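
One rough heuristic, short of a debugging kernel, is to tally which
programs emit the free() warnings quoted earlier in the thread. If a
single program produces all of them, a bug in that program is
plausible; if unrelated programs do, hardware looks more likely. The
sketch below is in Python and rests on assumptions: that the warnings
were captured somewhere as text lines, and that the word just before
"in free()" is the program name. The names tally and PATTERN, and the
sample program names, are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "httpd in free(): warning: chunk is already free".
# Whatever precedes the program name (timestamps, host names) is ignored;
# only the word directly before " in free()" is captured.
PATTERN = re.compile(r"(\S+) in free\(\): warning: (.+?)\.?\s*$")


def tally(lines):
    """Count free() warnings per (program, message) pair."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "httpd in free(): warning: chunk is already free",
        "php in free(): warning: modified (page-) pointer",
    ]
    for (prog, msg), n in sorted(tally(sample).items()):
        print(n, prog, msg)
```

To use it on real data, pass an open log file instead of the sample
list, since tally accepts any iterable of lines.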

Anyway, I suspected a virus--I had seen a virus infection on the Web
server, but it had apparently never been activated because the firewall
prevented it from calling home. FreeBSD had never faulted before, so
the OS was excluded (it would not _suddenly_ develop a bug). I
reinstalled everything just to see. It wasn't until I reinstalled and
upgraded to FreeBSD 5.3 and got even more frequent mystery crashes that
I felt sure that hardware was causing a problem.

It turned out that (I think) something had been damaged before or during
the power failures. A motherboard failure earlier on had turned off the
CPU fan. The fan worked, but the MB had stopped powering it, so it
wasn't running. The AMD processor stayed cool enough to operate most of
the time because the system is very lightly loaded processor-wise.
However, at some point, something got the system into a tight loop, and
the processor reached something above 120° C (around 300° F at one
point, I think--I could _smell_ the system when I got into the room).
Amazingly, it still ran most of the time, but I think some part of the
virtual memory logic was damaged, because most of the mystery faults
were segment violations. The problem very gradually got worse, with the
OS faulting more and more often, until it eventually got so bad that it
would fault before the bootload completed.

I finally replaced the entire machine--this time with _seven_ fans, and
with an Intel processor that will simply shut down if it gets too hot,
instead of cooking itself to death. I also upgraded to FreeBSD 5.3, and
I updated all the other system software as well. There have been no
problems since ... except for a panic in sysinstall during the first
installation, which I think was an honest-to-goodness OS bug (it
happened only once, and reminded me vaguely of a similar problem on my
first installation of 4.3, years earlier). The gigabit Ethernet on the
MB doesn't work reliably under FreeBSD, though, so I just reinstalled
the 100 Mbps card from the old server, which works perfectly.

In summary, this was a hardware problem, but so subtle in the beginning
that it wasn't at all clear that hardware was at fault--for a long time
I suspected traces of a virus infection or something.

Obviously, running Linux would not have made any difference.  I did see
filesystem corruption after the panics, which was to be expected, but as
far as I know I never lost any actual data; fsck corrected the structure
errors each time (sometimes from single-user mode, since it wouldn't
always succeed in automatic checks).  No OS can guarantee against data
corruption on unreliable hardware, not even all-knowing, all-seeing
Linux.

Maybe you need a new sysadmin.

--
Anthony

