Re: Hardware or OS problem? System Crashing...
On 1/5/2005 at 09:14 Joseph Koenig (jWeb) wrote:

Hi,

We have a system that is currently giving us some trouble. The system is FreeBSD 4.9. It's a 2 GHz system with 1MB RAM and (here's the kicker) 73GB RAID 1 ATA drives. The system serves as a web/database server dedicated to 1 site. Daily the system goes out and downloads real estate listings (via shell scripts and cURL) and processes them (via PHP into MySQL). Also, nightly the system downloads a zipped set of images (probably around 400-500) and processes them into thumbnails (PHP scripts calling ImageMagick).

Over the last week or two, the system has been crashing and rebooting into single user mode. It's not consistently during updates, or resizing of images, or anything like that. Yesterday, it crashed with 99% processor idle and load averages of 0.00 0.00 0.00 -- I was watching a 'top' when the machine died. When it boots into single user mode, an fsck must be run, which identified a few corrupt JPEG files -- however, the sysadmin who reboots it never tells me which files they are. The sysadmin is convinced it is a FreeBSD problem and says that Linux will not crash because of a corrupt file and if it does, will not boot into single user mode and he will be able to access it remotely to do the fsck.

About 3-4 weeks ago, one of the drives in the mirror set crashed and had to be replaced. I'm not convinced that drives are not to blame for these issues. Is there any way to verify that? Is it possible a corrupt JPEG on the drive could cause the system to crash randomly? What can I do to correctly identify the problem so that we can fix it and not change the OS?

Thanks,

The sysadmin has no clue about either Linux or FreeBSD! A corrupt JPEG cannot cause a crash of the OS, for any real OS. (If it does, it is a bug in the OS, but I doubt one exists.) Real OSes include Windows XP, Linux, and FreeBSD. However, an OS crash can cause a corrupt JPEG!
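Since fsck reported corrupt JPEGs but the file names were never passed along, one rough way to get a list is to scan the image directory yourself. The following is only a sketch, assuming Python 3 is available on the box; the function names are made up for illustration. It relies on the fact that a JPEG stream starts with the bytes FF D8 and ends with FF D9. It is a heuristic: it catches truncated or zeroed files, but it will not notice damage in the middle of a file, and it may flag valid files that carry extra bytes after the end marker.

```python
import os

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_intact(path):
    """Return True if the file starts with SOI and ends with EOI."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as fh:
        head = fh.read(2)
        fh.seek(-2, os.SEEK_END)
        tail = fh.read(2)
    return head == JPEG_SOI and tail == JPEG_EOI


def find_suspect_jpegs(root):
    """Walk root and return sorted paths of .jpg/.jpeg files failing the check."""
    suspects = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(dirpath, name)
                if not looks_intact(path):
                    suspects.append(path)
    return sorted(suspects)
```

Calling `find_suspect_jpegs()` on the directory that holds the downloaded images returns the paths worth a closer look. Comparing that list before and after a crash would also show whether the crashes are producing the damaged files, rather than the other way round.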
Either Linux or FreeBSD may boot into single user mode when the filesystem is corrupt. What your sysadmin means is that with one of the newer filesystems, Linux uses journaling, which is much less likely to enter this situation, but it still can happen. With soft updates FreeBSD is in the same situation as Linux, but soft updates is (generally, there are exceptions) better than journaling. There is soft updates in FreeBSD 4.9, but I'm not sure how to enable it, or how good it is. (In 5.3 it is awesome!)

I suspect hardware. I'd burn memtest to a CD and run that for a few hours to see if something is identified. Memtest won't catch everything, but it does a pretty good job. Also look at other factors. Does the HVAC kick in when this happens? Is someone hitting the panic stop switch? Situations like that have happened, and they can take a while to debug. They are not likely, but don't rule them out.

FreeBSD 4.9 is fairly old at this point. You should seriously consider upgrading to 4.11 (due out in a few weeks), or 5.3 (my recommendation, but a much more involved upgrade).

In addition to the original problem stated above, we are seeing a number of problems like "...in free(): warning: modified (page-) pointer" and "...in free(): warning: chunk is already free". I have the admin running a memtest today, but wanted to make sure these errors were not indicative of something else going on.

Thanks,

Well, the sysadmin tells me that memtest passed. Anyone have any suggestions as to what could be causing the crashes?

Thanks,
Joe

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Hardware or OS problem? System Crashing...
On Thu, 06 Jan 2005 13:33:41 -0600 Joseph Koenig (jWeb) [EMAIL PROTECTED] wrote:

Well, the sysadmin tells me that memtest passed. Anyone have any suggestions as to what could be causing the crashes?

Thanks,

Don't trust memtest. I've seen it fail to identify faulty hardware in this area. FreeBSD does not crash because of bad files, and I would be seriously suspicious of the admin who is trying to feed you this. That, and he does not appear to be concerned with it whatsoever.

Yeah, in 4.x a major file system problem is a lot more likely to need fsck run manually than on 5.x. 5.x will boot and run a background fsck, so you will still have network access, etc.
The best way to test a drive is this: run lots of transactions across all parts of the disk for a good length of time. Smartmontools is also available in the ports. http://www.freebsd.org/cgi/url.cgi?ports/sysutils/smartmontools/pkg-descr

I would be suspicious of nearly any possible chunk of the hardware in that box, since you've not listed anything that would allow any piece of hardware to be ruled out as a problem regardless of the OS being used. The places I would focus my attention on are the PSU, RAM, CPU, motherboard, cables, any PCI card or the like, and the drives themselves.

BTW you may want to check this out... http://www.freshports.org/graphics/ImageMagick/
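The suggestion above to run lots of transactions across the disk can be roughed out as a write-then-verify loop. This is a minimal sketch, assuming Python 3 is installed; `write_and_verify` and its parameters are hypothetical names chosen for illustration, not part of any existing tool. It writes files of random data, records a SHA-256 digest of what was written, then re-reads each file and reports any file whose contents no longer match.

```python
import hashlib
import os


def write_and_verify(directory, file_count=8, file_size=1 << 20, passes=2):
    """Write random files, re-read them, and compare SHA-256 digests.

    Returns a list of (pass_number, path) tuples, one per mismatch.
    """
    expected = {}
    for i in range(file_count):
        path = os.path.join(directory, f"stress_{i:04d}.bin")
        data = os.urandom(file_size)
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to flush this file to the device
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for pass_number in range(1, passes + 1):
        for path, digest in expected.items():
            with open(path, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
            if actual != digest:
                mismatches.append((pass_number, path))

    # Clean up the scratch files once all passes are done.
    for path in expected:
        os.remove(path)
    return mismatches
```

For a meaningful run, point it at a scratch directory on the suspect filesystem and raise `file_count` and `file_size` until the total is well above the amount of RAM; otherwise the re-reads are likely to be served from the buffer cache instead of the disk. An empty result does not prove the hardware is good, and a mismatch could equally come from bad RAM or a bad controller, so treat it as one data point alongside smartmontools.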
Re: Hardware or OS problem? System Crashing...
I had a very similar problem over the holidays. After a power failure over a month ago, I noticed some anomalies in FreeBSD, but they were very insidious and didn't seem like hardware (and the system was on a UPS plus a surge protector, so I didn't think the PF alone could have done damage, unless the power cycled many times over a short period). I'd get strange faults in programs from time to time, usually some type of memory faults--usually in Apache (since it uses most of the processor time), but sometimes in system programs that had never given trouble before. As time passed, the system would occasionally freeze, or I would even get kernel panics.

There never seemed to be any information left behind that could help me find out why the system was crashing (fault type, processes running, etc.), and error messages in logs were scarce. (If there is a way to debug FreeBSD crashes without running a kernel specifically set up for the purpose, I'd like to know what it is.)

Anyway, I suspected a virus--I had seen a virus infection on the Web server, but it had apparently never been activated because the firewall prevented it from calling home. FreeBSD had never faulted before, so the OS was excluded (it would not _suddenly_ develop a bug). I reinstalled everything just to see. It wasn't until I reinstalled and upgraded to FreeBSD 5.3 and got even more frequent mystery crashes that I felt sure that hardware was causing a problem.

It turned out that (I think) something had been damaged before or during the power failures. A motherboard failure earlier on had turned off the CPU fan. The fan worked, but the MB had stopped powering it, so it wasn't running. The AMD processor stayed cool enough to operate most of the time because the system is very lightly loaded processor-wise. However, at some point, something got the system into a tight loop, and the processor reached something above 120° C (around 300° F at one point, I think--I could _smell_ the system when I got into the room). Amazingly, it still ran most of the time, but I think some part of the virtual memory logic was damaged, because most of the mystery faults were segment violations. The problem very gradually got worse, with the OS faulting more and more often, until it eventually got so bad that it would fault before the bootload completed.

I finally replaced the entire machine--this time with _seven_ fans, and with an Intel processor that will simply shut down if it gets too hot, instead of cooking itself to death. I also upgraded to FreeBSD 5.3, and I updated all the other system software as well. There have been no problems since ... except for a panic in sysinstall during the first installation, which I think was an honest-to-goodness OS bug (it happened only once, and reminded me vaguely of a similar problem on my first installation of 4.3, years earlier). The gigabit Ethernet on the MB doesn't work reliably under FreeBSD, though, so I just reinstalled the 100 Mbps card from the old server, which works perfectly.

In summary, this was a hardware problem, but so subtle in the beginning that it wasn't at all clear that hardware was at fault--for a long time I suspected traces of a virus infection or something. Obviously, running Linux would not have made any difference. I did see filesystem corruption after the panics, which was to be expected, but as far as I know I never lost any actual data; fsck corrected the structure errors each time (sometimes from single-user mode, since it wouldn't always succeed in automatic checks).

No OS can guarantee against data corruption on unreliable hardware, not even all-knowing, all-seeing Linux. Maybe you need a new sysadmin.
-- Anthony