Re: Computer stops responding (freezes up) during uncorrectable data error
On Thu, 27 Jan 2011, Gordon Ferris wrote:

2. What utilities will show which sectors are occupied by specific files? Ideally I could specify a range of sectors and a list of files using those sectors would be provided. It would also be nice to specify files and be shown which sectors they occupy. I've heard of the Coroner's Toolkit; are there any other recommendations?

Try looking at sleuthkit. Handy set of tools. Hopefully it could be used in your case too.

Sleuthkit seems to have some limitations on OpenBSD. I needed to use it recently, but it did not recognise the FFS filesystem size when run on OpenBSD. I had to compile and run it on a Linux live CD, and it worked well (on OpenBSD FFS) from there. I was able to get a listing of blocks occupied by individual files.

Regards,
David
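Once a per-file listing of blocks is available (for example from a sleuthkit run like the one described above), both directions of the lookup reduce to building an inverse index. The following is only a minimal Python sketch under that assumption: the listings are taken to be already parsed into a dict, and the file names, block numbers, and function names are made up for illustration.

```python
from collections import defaultdict

def build_block_index(file_blocks):
    """Invert {filename: [block, ...]} into {block: {filename, ...}}."""
    index = defaultdict(set)
    for name, blocks in file_blocks.items():
        for block in blocks:
            index[block].add(name)
    return index

def files_in_range(index, first, last):
    """Sorted names of files owning any block in the inclusive range."""
    hits = set()
    for block, names in index.items():
        if first <= block <= last:
            hits |= names
    return sorted(hits)

# Hypothetical listings; real ones would come from a filesystem tool.
file_blocks = {
    "a.txt": [100, 101, 102],
    "b.txt": [200, 201],
    "c.txt": [300, 301],
}
index = build_block_index(file_blocks)
print(files_in_range(index, 101, 200))
```

The forward direction (file to blocks) is just `file_blocks[name]`; the index answers the reverse question for a range of suspect blocks.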
Re: Computer stops responding (freezes up) during uncorrectable data error
Thank you for the interest so far in my post. I never meant to imply "someone fix this now". If that's how it came across, then I do apologize - that's not what I intended.

I am looking for more than the standard "disks break, live with it" answer. I am surprised that the disk retry code doesn't time out after 5 minutes or 100 retries or something like that. Also, it seems odd that the system is still responsive when the first few error messages are written to console but then stops responding a few messages later. Also, I expected the unresponsiveness when the failed disk was mounted as part of the root filesystem - not when it is mounted as an auxiliary filesystem or not even mounted at all but simply accessed as a raw device.

I have trouble believing that I'm the first one to run into this, or at least the need to go back and forth between filesystem blocks and filenames. But maybe I am.

Thanks Lee, for the dd_rescue suggestion.
Thanks David, for the sleuthkit suggestion.

Sincerely,
Gordon

On Thu, Jan 27, 2011 at 03:01:40PM +0100, Benny Lofgren wrote:
> On 2011-01-27 14.11, Ted Unangst wrote:
> > On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren wrote:
> >> It's a matter of uptime.
> >>
> >> The indicated behaviour, that the system more or less freezes when
> >> encountering a simple sector read error is indeed disturbing. For
> >> example, my own reasons for using mirroring are exclusively so that a
> >> system can remain online and operational in case of a disk failure.
> >
> > If that's why you're investigating, I'll save you some time. The disk
> > retry code will basically lock the system up while it's retrying. If
> > you don't like it, send a patch.
>
> Well, fwiw I wasn't the one investigating this particular problem, but I
> have no problem submitting patches in cases where I'm able to do
> meaningful work. (The problem I mentioned investigating is in all
> likelihood either driver-related or a hardware problem.)
>
> I absolutely didn't mean to imply that "hey this is broken, 'someone'
> need to spend time to fix it" - I fully realize that that someone may
> very well be me. I apologize if I came across that way.
>
> I was merely pointing out that the standard response of "disks break,
> live with it", while ever true, is sometimes irrelevant to the problem.
>
> Yes, disks break (I currently have approximately two dozen broken ones
> in a box at the office waiting for an appointment with a sledgehammer),
> and yes, we diligently keep backups (or are sorry we didn't) but that
> doesn't solve the situation where you have a critical system that causes
> pain if it goes offline.
>
> I have never in almost thirty years in this business lost a single byte
> of customer data to disk failure. I have however had cases of unplanned
> downtime, and every time that happens is also a failure.
>
> Designing redundancy into our systems helps only as far as to the
> nearest single point of failure, and if that point is the OS then I'd
> say that is a problem (since it's not always feasible to build
> redundancy using multiple servers).
>
> I know I'm preaching to the choir here, and my only interest here is to
> improve the robustness of an already incredibly robust system. I'll
> certainly contribute to the best of my ability whenever I find fixable
> problems.
>
>
> Best regards,
>
> /Benny
>
> --
> internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
> Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
>                     /   fax:    +46 8 551 124 89    /    not counted."
>                    /    email:  benny -at- internetlabbet.se

----- End forwarded message -----

--
Gordon Ferris
W.F. Engineering
Phone: +1 801-455-6108
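The kind of bound Gordon describes (give up after a number of retries or a time budget) can be sketched in a few lines. This is purely illustrative user-space Python, not the OpenBSD driver's retry code; `read_with_limits`, `read_once`, and the default limits are hypothetical names and values taken from the figures mentioned in the message above.

```python
import time

def read_with_limits(read_once, max_attempts=100, max_seconds=300.0,
                     clock=time.monotonic):
    """Call read_once() until it succeeds or a budget is exhausted.

    read_once is a hypothetical callable that returns data or raises
    OSError. Returns (data, attempts). Re-raises the last OSError once
    either the attempt limit or the time limit has been reached.
    """
    deadline = clock() + max_seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            return read_once(), attempts
        except OSError:
            if attempts >= max_attempts or clock() >= deadline:
                raise

def always_fails():
    # Stand-in for a read that hits an unrecoverable sector.
    raise OSError("uncorrectable data error")

try:
    read_with_limits(always_fails, max_attempts=5)
except OSError as exc:
    print("gave up:", exc)
```

A caller in user space could apply such a bound around its own reads, but it would not help if the blocking happens inside the kernel before the error is ever returned.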
Re: Computer stops responding (freezes up) during uncorrectable data error
On Thu, Jan 27, 2011 at 2:16 AM, Gordon Ferris wrote:
> 1. Is it normal for the operating system to freeze when accessing damaged sectors - even if the only access is via a raw, unmounted partition? This seems like a hardware problem to me, except that errors are logged to /var/log/messages as I described in the original post.

Yes. It may not be desirable, but the retry code basically puts everything else on hold while it's running. It is a hardware problem the operating system is trying to overcome.

> 2. What utilities will show which sectors are occupied by specific files? Ideally I could specify a range of sectors and a list of files using those sectors would be provided. It would also be nice to specify files and be shown which sectors they occupy. I've heard of the Coroner's Toolkit; are there any other recommendations?

I don't know of any. If I needed to do something like this, I'd probably start with fsck_ffs and modify as needed. Actually, that's what fsdb does already. You probably just need it to print a little more info and to walk the tree automatically.
Re: Computer stops responding (freezes up) during uncorrectable data error
On Thu, 27 Jan 2011, Gordon Ferris wrote:
> We waited too long to replace the failed drive, so there were errors on
> both drives in the mirror, so the data was not completely restored.
> Backups were not as recent as we would have liked. Since the drive
> didn't completely fail, it seemed worth trying to retrieve some data
> where possible from it.
>

dd_rescue will give you the best chance of recovering bad sectors.

Lee
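For readers unfamiliar with this style of recovery, the general idea can be sketched as: read fixed-size blocks, and when a read fails, record the block, write zero padding in its place, and move on. The Python below is an illustration of that idea only, not dd_rescue's actual algorithm; `rescue_copy`, its parameters, and the 512-byte block size are assumptions, and `read_block` exists solely so the error path can be exercised without a failing disk.

```python
import os
import tempfile

def rescue_copy(src_path, dst_path, block_size=512, read_block=os.pread):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the list of block indices that raised an I/O error.
    """
    bad = []
    fd = os.open(src_path, os.O_RDONLY)
    try:
        # Assumption: seeking to the end reports the size. That holds for
        # regular files and image files; raw devices may need another way.
        size = os.lseek(fd, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            for offset in range(0, size, block_size):
                want = min(block_size, size - offset)
                try:
                    data = read_block(fd, want, offset)
                except OSError:
                    bad.append(offset // block_size)
                    data = b""
                # Zero-fill what could not be read so offsets stay aligned.
                dst.write(data.ljust(want, b"\0"))
    finally:
        os.close(fd)
    return bad

def flaky_read(fd, size, offset):
    # Simulate an unreadable third block (index 2).
    if offset == 2 * 512:
        raise OSError("simulated uncorrectable data error")
    return os.pread(fd, size, offset)

original = bytes(range(256)) * 8          # 2048 bytes = 4 blocks
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "disk.img")
    dst = os.path.join(tmp, "copy.img")
    with open(src, "wb") as f:
        f.write(original)
    bad_blocks = rescue_copy(src, dst, read_block=flaky_read)
    with open(dst, "rb") as f:
        copy = f.read()
print("unreadable blocks:", bad_blocks)
```

The returned list of bad block indices is what would later be matched against per-file block listings. Note that a loop like this only helps once the read error is actually returned to the program; it does nothing about time spent retrying inside the kernel.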
Re: Computer stops responding (freezes up) during uncorrectable data error
On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren wrote:
> It's a matter of uptime.
>
> The indicated behaviour, that the system more or less freezes when
> encountering a simple sector read error is indeed disturbing. For
> example, my own reasons for using mirroring are exclusively so that a
> system can remain online and operational in case of a disk failure.

If that's why you're investigating, I'll save you some time. The disk retry code will basically lock the system up while it's retrying. If you don't like it, send a patch.
Re: Computer stops responding (freezes up) during uncorrectable data error
On 2011-01-27 14.11, Ted Unangst wrote:
> On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren wrote:
>> It's a matter of uptime.
>>
>> The indicated behaviour, that the system more or less freezes when
>> encountering a simple sector read error is indeed disturbing. For
>> example, my own reasons for using mirroring are exclusively so that a
>> system can remain online and operational in case of a disk failure.
>
> If that's why you're investigating, I'll save you some time. The disk
> retry code will basically lock the system up while it's retrying. If
> you don't like it, send a patch.

Well, fwiw I wasn't the one investigating this particular problem, but I have no problem submitting patches in cases where I'm able to do meaningful work. (The problem I mentioned investigating is in all likelihood either driver-related or a hardware problem.)

I absolutely didn't mean to imply that "hey this is broken, 'someone' need to spend time to fix it" - I fully realize that that someone may very well be me. I apologize if I came across that way.

I was merely pointing out that the standard response of "disks break, live with it", while ever true, is sometimes irrelevant to the problem.

Yes, disks break (I currently have approximately two dozen broken ones in a box at the office waiting for an appointment with a sledgehammer), and yes, we diligently keep backups (or are sorry we didn't) but that doesn't solve the situation where you have a critical system that causes pain if it goes offline.

I have never in almost thirty years in this business lost a single byte of customer data to disk failure. I have however had cases of unplanned downtime, and every time that happens is also a failure.

Designing redundancy into our systems helps only as far as to the nearest single point of failure, and if that point is the OS then I'd say that is a problem (since it's not always feasible to build redundancy using multiple servers).

I know I'm preaching to the choir here, and my only interest here is to improve the robustness of an already incredibly robust system. I'll certainly contribute to the best of my ability whenever I find fixable problems.

Best regards,

/Benny

--
internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
                    /   fax:    +46 8 551 124 89    /    not counted."
                   /    email:  benny -at- internetlabbet.se
Re: Computer stops responding (freezes up) during uncorrectable data error
On 2011-01-27 06.02, Ted Unangst wrote:
> On Wed, Jan 26, 2011 at 10:00 PM, Amit Kulkarni wrote:
>> pardon my ignorance but if you restored your data already, why bother
>> investigating disk failure?
> Unless they are all the same person, there seems to be a sudden rash
> of people who want to bring a disk back from the dead because they are
> unwilling or unable to do the math on how much disks cost, how much
> time costs, and what the future integrity of their data is worth. I
> don't know why this is, but I do know "disks die, buy new ones" is the
> correct answer to give them.

I fully understand the OP:s need to investigate this problem further, regardless of whether there was any significant data loss or not.

It's a matter of uptime.

The indicated behaviour, that the system more or less freezes when encountering a simple sector read error is indeed disturbing. For example, my own reasons for using mirroring are exclusively so that a system can remain online and operational in case of a disk failure.

If a disk in a mirror or redundant stripe set fails in a hotpluggable hardware environment there really should be no need for service interruption. The disk should be able to be replaced on the fly, or at the very least during a controlled service window. In this case, that obviously wouldn't work.

(The reason I'm butting in to this thread is that I'm currently investigating a similar but probably totally unrelated problem, where a system under high load (disk activity) claims there are sector read errors, and then stops responding in a similar fashion to the OP:s system. Saturate one, two or three disks with reads - no problem. Add a fourth disk and after a while the problem appears. If I can determine beyond reasonable doubt that this isn't a hardware problem, I'll submit a bug report.)

Regards,

/Benny

--
internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
                    /   fax:    +46 8 551 124 89    /    not counted."
                   /    email:  benny -at- internetlabbet.se
Re: Computer stops responding (freezes up) during uncorrectable data error
We waited too long to replace the failed drive, so there were errors on both drives in the mirror, so the data was not completely restored. Backups were not as recent as we would have liked. Since the drive didn't completely fail, it seemed worth trying to retrieve some data where possible from it.

1. Is it normal for the operating system to freeze when accessing damaged sectors - even if the only access is via a raw, unmounted partition? This seems like a hardware problem to me, except that errors are logged to /var/log/messages as I described in the original post.

2. What utilities will show which sectors are occupied by specific files? Ideally I could specify a range of sectors and a list of files using those sectors would be provided. It would also be nice to specify files and be shown which sectors they occupy. I've heard of the Coroner's Toolkit; are there any other recommendations?

On Thu, Jan 27, 2011 at 12:02:44AM -0500, Ted Unangst wrote:
> On Wed, Jan 26, 2011 at 10:00 PM, Amit Kulkarni wrote:
> > pardon my ignorance but if you restored your data already, why bother
> > investigating disk failure?
>
> Unless they are all the same person, there seems to be a sudden rash
> of people who want to bring a disk back from the dead because they are
> unwilling or unable to do the math on how much disks cost, how much
> time costs, and what the future integrity of their data is worth. I
> don't know why this is, but I do know "disks die, buy new ones" is the
> correct answer to give them.
Re: Computer stops responding (freezes up) during uncorrectable data error
On Wed, Jan 26, 2011 at 10:00 PM, Amit Kulkarni wrote:
> pardon my ignorance but if you restored your data already, why bother
> investigating disk failure?

Unless they are all the same person, there seems to be a sudden rash of people who want to bring a disk back from the dead because they are unwilling or unable to do the math on how much disks cost, how much time costs, and what the future integrity of their data is worth. I don't know why this is, but I do know "disks die, buy new ones" is the correct answer to give them.
Re: Computer stops responding (freezes up) during uncorrectable data error
pardon my ignorance but if you restored your data already, why bother investigating disk failure?

On Wed, Jan 26, 2011 at 6:50 PM, Gordon Ferris wrote:
> I have a disk that has failed; there seem to be damaged areas that cause errors when specific files are accessed. This disk was one of a two-disk mirror running raidframe. The disk has been replaced and the original machine is back up and running again.
> However as I use a second computer to investigate the failed disk, I have been puzzled that this second computer locks up and stops responding when I try copying files that include various damaged areas of the disk.
>
> This second computer has an installation of OpenBSD 4.6, with the kernel recompiled to support raidframe (so I can access the data on the partition); I have also adjusted the drive numbering so that the failed drive believes it is the only disk present in its mirror. On this second computer, the operating system is on a completely different physical disk; the failed disk is not necessary for a completely functional system.
> However, even though this computer doesn't use the failed disk for its root filesystem - the computer still freezes up and stops responding when the bad sectors are accessed.
> I even tried using the "dump" and "dd" utilities to access the disk with a raw, unmounted partition - but the host computer still freezes up and stops responding after adding a few lines to /var/log/messages.
>
> I was expecting the error messages, but not expecting the host system to freeze up - even the mouse stops responding. It's irritating to have to reboot the computer each time I access one of the damaged sectors.
> I thought this problem might be caused if the drive controller hardware never returns control back to the operating system once the disk error occurs too many times. But the error messages do end up in /var/log/messages, so control does return to the operating system for at least a little while.
>
> And yes, repeatedly accessing the same file generates the error messages referring to the same sectors.
>
> 1. How can I attempt to access the damaged sectors without causing the entire computer to freeze up and stop responding?
>
> 2. I have used stat, ncheck, and fsdb to find and examine the inodes for various files. Is there a utility to show which sectors of the filesystem and/or the drive are actually used by various files?
>
> 3. How can I identify all the files that contain bad sectors without freezing up the computer on each file that contains one?
>
> # mount
> /dev/wd1a on / type ffs (local)
> /dev/wd1e on /usr type ffs (local, read-only)
> /dev/wd1g on /mnt3 type ffs (local, read-only)
> /dev/wd1f on /mnt type ffs (local, read-only)
> # fsck -f /dev/rraid2d
> ** /dev/rraid2d
> ** File system is already clean
> ** Last Mounted on /home-big
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 452600 files, 69774853 used, 43730370 free (26658 frags, 5462964 blocks, 0.0% fragmentation)
>
> # mount -r /dev/raid2d /mnt2
> # mount
> /dev/wd1a on / type ffs (local)
> /dev/wd1e on /usr type ffs (local, read-only)
> /dev/wd1g on /mnt3 type ffs (local, read-only)
> /dev/wd1f on /mnt type ffs (local, read-only)
> /dev/raid2d on /mnt2 type ffs (local, read-only)
>
> # dd conv=noerror,notrunc,sync \
> > if=/mnt2/.../20198332.txt of=/dev/null count=1
>
> The computer stopped responding but these messages were on the console and in /var/log/messages on rebooting:
> /var/log/messages
> Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:18 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
> Jan 26 08:23:18 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
> Jan 26 08:23:18 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:20 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 3
> Jan 26 08:23:20 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 3
> Jan 26 08:23:20 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:22 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 2
> Jan 26 08:23:22 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 2
> Jan 26 08:23:22 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
> Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
>
> And the error messages are repeatable (especially the failed b
Computer stops responding (freezes up) during uncorrectable data error
I have a disk that has failed; there seem to be damaged areas that cause errors when specific files are accessed. This disk was one of a two-disk mirror running raidframe. The disk has been replaced and the original machine is back up and running again.
However as I use a second computer to investigate the failed disk, I have been puzzled that this second computer locks up and stops responding when I try copying files that include various damaged areas of the disk.

This second computer has an installation of OpenBSD 4.6, with the kernel recompiled to support raidframe (so I can access the data on the partition); I have also adjusted the drive numbering so that the failed drive believes it is the only disk present in its mirror. On this second computer, the operating system is on a completely different physical disk; the failed disk is not necessary for a completely functional system.
However, even though this computer doesn't use the failed disk for its root filesystem - the computer still freezes up and stops responding when the bad sectors are accessed.
I even tried using the "dump" and "dd" utilities to access the disk with a raw, unmounted partition - but the host computer still freezes up and stops responding after adding a few lines to /var/log/messages.

I was expecting the error messages, but not expecting the host system to freeze up - even the mouse stops responding. It's irritating to have to reboot the computer each time I access one of the damaged sectors.
I thought this problem might be caused if the drive controller hardware never returns control back to the operating system once the disk error occurs too many times. But the error messages do end up in /var/log/messages, so control does return to the operating system for at least a little while.

And yes, repeatedly accessing the same file generates the error messages referring to the same sectors.

1. How can I attempt to access the damaged sectors without causing the entire computer to freeze up and stop responding?

2. I have used stat, ncheck, and fsdb to find and examine the inodes for various files. Is there a utility to show which sectors of the filesystem and/or the drive are actually used by various files?

3. How can I identify all the files that contain bad sectors without freezing up the computer on each file that contains one?

# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
# fsck -f /dev/rraid2d
** /dev/rraid2d
** File system is already clean
** Last Mounted on /home-big
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
452600 files, 69774853 used, 43730370 free (26658 frags, 5462964 blocks, 0.0% fragmentation)

# mount -r /dev/raid2d /mnt2
# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
/dev/raid2d on /mnt2 type ffs (local, read-only)

# dd conv=noerror,notrunc,sync \
> if=/mnt2/.../20198332.txt of=/dev/null count=1

The computer stopped responding but these messages were on the console and in /var/log/messages on rebooting:

/var/log/messages
Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:18 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:20 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:22 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying

And the error messages are repeatable (especially the failed block numbers) if I repeat the command:

Jan 26 10:40:19 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:21 one /bsd: wd0f: uncorrectable
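The console messages quoted in this post carry two numbers per error: a partition-relative block (fsbn) and a whole-disk block (wd0 bn). A short Python sketch can collect the distinct failing blocks and show that the difference between the two is constant, which is presumably the starting sector of wd0f on the disk. The regular expression and the function name are assumptions inferred only from the log lines in this thread; other drivers or releases may format the message differently.

```python
import re

# Pattern inferred from the log lines quoted in this thread.
LOG_RE = re.compile(
    r"(?P<part>\w+): uncorrectable data error reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+) "
    r"\((?P<disk>\w+) bn (?P<bn>\d+);"
)

def parse_errors(lines):
    """Return a sorted list of distinct (partition, fsbn, disk, bn)."""
    found = set()
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            found.add((m["part"], int(m["fsbn"]), m["disk"], int(m["bn"])))
    return sorted(found)

SAMPLE = [
    "Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading "
    "fsbn 40104952 of 40104952-40104983 "
    "(wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying",
    "Jan 26 08:23:18 one /bsd: wd0: transfer error, "
    "downgrading to Ultra-DMA mode 4",
    "Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading "
    "fsbn 40104976 of 40104952-40104983 "
    "(wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying",
]

errors = parse_errors(SAMPLE)
# bn - fsbn should be identical for every entry on the same partition.
offsets = {bn - fsbn for _, fsbn, _, bn in errors}
for part, fsbn, disk, bn in errors:
    print(f"{part} block {fsbn} = {disk} block {bn}")
print("partition offset(s):", sorted(offsets))
```

For the two distinct errors above, 67174477 - 40104952 and 67174501 - 40104976 both equal 27069525, so a tool that reports blocks relative to the partition and one that reports them relative to the disk can be reconciled by adding or subtracting that constant.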