[Kernel-packages] [Bug 309901] Re: SATA timeout causing soft lockup during heavy disk activity

Bug Watch Updater Thu, 26 Oct 2017 15:26:06 -0700

Launchpad has imported 28 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=465838.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2008-10-06T17:22:32+00:00 Johan wrote:

Description of problem:
After a few minutes (varies from <1 to ~10 minutes) the IDE (PATA) drive totally
stops responding. dmesg shows timeouts and retries. Processes go into D
state when doing anything requiring disk activity.

Version-Release number of selected component (if applicable):
2.6.27-0.392.rc8.git7.fc10.x86_64 bad
2.6.27-0.391.rc8.git7.fc10.x86_64 bad
2.6.27-0.382.rc8.git4.fc10.x86_64 bad
2.6.27-0.354.rc7.git3.fc10.x86_64 good

How reproducible:
100%

Steps to Reproduce:
Cannot find pattern. With or without gui, disk activity, CPU pressure. It
happens after a seemingly random number of minutes (<1 - ~10minutes).

Actual results:
A non-working system.

Expected results:
A working system.

Additional info:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
cdb 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.01: qc timeout (cmd 0x27)
ata1.01: failed to read native max address (err_mask=0x4)
ata1.01: HPA support seems broken, skipping HPA handling
ata1.01: revalidation failed (errno=-5)
ata1: soft resetting link
ata1: nv_mode_filter: 0x1f01f&0x1f01f->0x1f01f, BIOS=0x1f000 (0xc5c60000)
ACPI=0x1f01f (30:20:0x15)
ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc5c60000)
ACPI=0x3f01f (30:20:0x15)
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
cdb 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: nv_mode_filter: 0x1f01f&0x1f01f->0x1f01f, BIOS=0x1f000 (0xc5c60000)
ACPI=0x1f01f (30:20:0x15)
ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc5c60000)
ACPI=0x3f01f (30:20:0x15)
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete

Repeats with UDMA/44, UDMA/33, PIO4, PIO3, PIO0. Again and again. See
file for a few more.
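
To follow the mode step-down described above across a longer capture, something like the following can help. It is a minimal sketch, not part of the original report: it assumes the log lines have the same shape as the excerpt above, and the function name summarize is hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "ata1.00: configured for UDMA/66" (a leading timestamp is tolerated).
CONFIGURED = re.compile(r"(ata\d+\.\d+): configured for (\S+)")
# Matches the result line that ends in "Emask 0x4 (timeout)".
TIMEOUT = re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)")


def summarize(lines):
    """Return (number of timeout results, {device: [modes in the order seen]})."""
    timeouts = 0
    modes = defaultdict(list)
    for line in lines:
        if TIMEOUT.search(line):
            timeouts += 1
        match = CONFIGURED.search(line)
        if match:
            modes[match.group(1)].append(match.group(2))
    return timeouts, dict(modes)


# A few lines copied from the excerpt above.
SAMPLE = """\
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata1: soft resetting link
ata1.00: configured for UDMA/66
ata1.01: configured for UDMA/100
ata1: EH complete
""".splitlines()

if __name__ == "__main__":
    count, modes = summarize(SAMPLE)
    print("timeouts:", count)
    for device, seen in sorted(modes.items()):
        print(device, "->", ", ".join(seen))
```

Feeding it the full dmesg output would list, per device, the sequence of transfer modes the error handler fell back through.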

It is a bit hard to capture after a while, as almost everything goes into
D state once it apparently does something requiring disk access. Output
more or less identical to the above keeps repeating.

Some slightly different stuff after a while, copied by hand to another
computer:

SR0: cdrom (IOCTL) ERROR, COMMAND: GET EVENT STATUS NOTIFICATION 4A 01
00 00 10 00 00 00 08 00

...

sr 0:0:0:0: ioctl_internal_command return code = 8000002
: Sense Key : Aborted Command [current] [descriptor]
: Add. Sense: No additional sense information

...

sd 0:0:1:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
00 00 00 00
sd 0:0:1:0: [sda] Add. Sense: No additional sense information
end_request: I/O error, dev sda, sector 519537
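
The hex bytes in the descriptor sense dump above can be checked against the kernel's decoded text. The following is a minimal sketch, assuming the standard SCSI descriptor-format sense header (response code in byte 0, sense key in the low nibble of byte 1, ASC and ASCQ in bytes 2 and 3, additional length in byte 7). The function name is hypothetical and the sense key table is deliberately partial.

```python
# Partial table of SCSI sense keys; only common values are listed.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
    0xB: "Aborted Command",
}


def decode_descriptor_sense(hex_dump):
    """Decode the 8-byte header of descriptor-format sense data."""
    data = bytes.fromhex("".join(hex_dump.split()))
    if len(data) < 8:
        raise ValueError("descriptor sense data needs at least 8 bytes")
    response_code = data[0] & 0x7F
    if response_code not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    key = data[1] & 0x0F
    return {
        "timing": "current" if response_code == 0x72 else "deferred",
        "sense_key": SENSE_KEYS.get(key, "unknown (0x%x)" % key),
        "asc": data[2],
        "ascq": data[3],
        "additional_length": data[7],
        "total_bytes": len(data),
    }


# The two hex lines copied from the log above.
DUMP = """
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
00 00 00 00
"""

if __name__ == "__main__":
    for field, value in decode_descriptor_sense(DUMP).items():
        print(field, "=", value)
```

For the dump above this yields a current error with sense key Aborted Command and ASC/ASCQ of 00/00, which agrees with the "No additional sense information" line printed by the kernel.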

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/0

------------------------------------------------------------------------
On 2008-10-06T17:23:25+00:00 Johan wrote:

Created attachment 319573
Boot dmesg of 2.6.27-0.392.rc8.git7.fc10.x86_64 (bad)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/1

------------------------------------------------------------------------
On 2008-10-06T17:24:15+00:00 Johan wrote:

Created attachment 319574
Boot dmesg of 2.6.27-0.354.rc7.git3.fc10.x86_64 (good)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/2

------------------------------------------------------------------------
On 2008-10-06T17:25:33+00:00 Johan wrote:

Created attachment 319575
Boot dmesg of 2.6.27-0.382.rc8.git4.fc10.x86_64 (bad)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/3

------------------------------------------------------------------------
On 2008-10-06T17:26:54+00:00 Johan wrote:

Created attachment 319576
Some dmesg ata errors

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/4

------------------------------------------------------------------------
On 2008-10-06T21:10:12+00:00 Johan wrote:

Tried a few more kernels in between, and unfortunately (?) it seems the
difference between a working and non-working kernel is if it includes
debug code or not (with debug code = no bug).

This includes latest kernel-debug
(2.6.27-0.392.rc8.git7.fc10.x86_64.debug) which seems to be working,
where the non-debug version bugs out within minutes.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/5

------------------------------------------------------------------------
On 2008-10-08T18:41:15+00:00 Alan wrote:

Looks like another stuck DRQ case - good news if so as I'm currently
testing kernel changes to do DRQ data draining
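
For readers who want to check the status byte in the res lines of the original report, here is a minimal sketch. It assumes the first two hex fields of a libata res line are the ATA status and error registers (an assumption, though the log does pair 40 with status: { DRDY }). The bit masks are the standard ATA status register bits; bits not listed are ignored, and the function names are hypothetical.

```python
import re

# Standard ATA status register bits, highest bit first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Assumed layout: "res <status>/<error>:..." with two-digit hex fields.
RES_LINE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):")


def status_from_res(line):
    """Return the status byte of a libata 'res' line, or None if absent."""
    match = RES_LINE.search(line)
    return int(match.group(1), 16) if match else None


def decode_status(status):
    """Return the names of the listed status bits set in 'status'."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


if __name__ == "__main__":
    line = "res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)"
    status = status_from_res(line)
    print(hex(status), decode_status(status))
```

In the posted log the status byte is 0x40, so only DRDY is set in the result taken at the timeout; a status with DRQ still asserted would have bit 0x08 set as well.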

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/6

------------------------------------------------------------------------
On 2008-10-08T20:20:54+00:00 Johan wrote:

Ok. Please advise if you need any further information or if there is
anything that needs testing.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/7

------------------------------------------------------------------------
On 2008-10-18T10:00:39+00:00 Johan wrote:

2.6.27-3.fc10.x86_64 -- bad (3 min uptime)
2.6.27-3.fc10.x86_64.debug -- good
2.6.27.2-23.rc1.fc10.x86_64 -- bad (7 min uptime)
2.6.27.2-23.rc1.fc10.x86_64.debug -- good

Some older kernels
2.6.26.6-79.fc9.x86_64 -- bad (4 min uptime)
2.6.25.14-108.fc9.x86-64 -- bad (2 min uptime)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/8

------------------------------------------------------------------------
On 2008-10-21T08:25:32+00:00 Alan wrote:

If it's predictably the case that only the debug kernels work after
multiple tests (and I assume you've been running the working debug kernels
for a few days now?), that points outside the ATA layer. It could, I
suppose, be timing, but it sounds almost like a compiler bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/9

------------------------------------------------------------------------
On 2008-10-27T12:30:54+00:00 Johan wrote:

(sorry for the delay)

I have run with debug kernels for a number of hours without problems (the
machine otherwise runs windows from a sata drive).

I've tried with a minimal .config kernel 2.6.28.rc2 latest git -- same
result, though it survived bonnie++ and took all of 21 minutes before
locking up. Same config with the debug options from fedora enabled seems
to be working (1h+) though I'll test it some more.

I'll look into trying a different compiler. The rc2 test was with gcc Red
Hat 4.3.2-6 and Ubuntu 4.3.2-1ubuntu11 in a distcc thing.

Could it be hardware related? The HD in question is oldish -- rest of
machine is new. Still, it seems 100% stable with those debug options
turned on.

Is there anything I can do to find out what is happening here? I can
patch the kernel easily enough, or look into using kgdb, but I really
have no idea what to look for.

For the record:
2.6.28.4-47.rc3.fc10.x86_64 -- bad (3 min uptime)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/10

------------------------------------------------------------------------
On 2008-10-27T15:55:01+00:00 Johan wrote:

non-debug 2.6.28.rc2 kernel compiled with gcc Red Hat 3.4.6-9 locked up
after ~4 minutes uptime.

debug 2.6.28.rc2 was still ok after 4h+ of hard testing.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/11

------------------------------------------------------------------------
On 2008-10-29T03:37:46+00:00 Johan wrote:

New lockup, with a complete break in pattern:
1. 2.6.27.4-51.fc10.x86_64.debug (all debug kernels so far had worked)
2. SATA (sata_nv) dmraid (raid-0 nvidia ntfs ro) instead of main PATA-IDE ext3
(which kept working)

Lockup was for ~15-20 minutes(?), then worked again for ~8 minutes, then
locked up again.

I cannot seem to get main drive to lock up like that with this kernel.

I cannot tell if this is new to this kernel, I only recently set this
up. (The dmraid did not activate out of the box). It locked up within
minutes on this kernel after working for a few hours on a 28.rc2-git
thing.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/12

------------------------------------------------------------------------
On 2008-10-29T03:39:32+00:00 Johan wrote:

Created attachment 321742
dmesg of 2.6.28.4-51.fc10.x86_64.debug (bad-sata)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/13

------------------------------------------------------------------------
On 2008-11-26T03:36:40+00:00 Bug wrote:

This bug appears to have been reported against 'rawhide' during the Fedora 10
development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/14

------------------------------------------------------------------------
On 2008-11-29T01:41:35+00:00 Joe wrote:

I have had a similar problem since Fedora 8, as have a few others.
Please see bug 440408 as well. These appear to be the same thing.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/15

------------------------------------------------------------------------
On 2008-12-30T11:33:19+00:00 Jeff wrote:

I have also experienced this problem since Fedora 8. I'm pleased it is
getting some attention. The only way to get it working again is a reboot.

Since Fedora 10 and the DBus issue, I now get this message from the GUI:

Unable to mount location
Cannot invoke CheckForMedia on HAL: org.freedesktop.DBus.Error.NoReply: Did not
receive a reply. Possible causes include: the remote application did not send a
reply, the message bus security policy blocked the reply, the reply timeout
expired, or the network connection was broken.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/21

------------------------------------------------------------------------
On 2009-01-22T18:56:21+00:00 scott wrote:

I am having the same problem. I have built a machine around an Asus
P5N-EM motherboard. I am running a 32-bit kernel, not 64-bit.

Originally, I had an IDE boot drive with Fedora 10 installed on it, and
a spare SATA drive for server disk space.

I had this install of F10 completely updated with the latest "yum
update" but can't access that drive now to see what kernel version it
had - it was whatever version was current this week (ca. Jan 20 - I
worked on this problem off and on all week).

The machine would run for a few minutes, and then the disk light would
come on and stay on. The machine still ran, but disk I/O quit working.
Anything in memory (cached, I guess) would still work - I could open
xterm, and read files like "messages" that were not yet written to disk.
But new commands that were not in cache would not work (said
"Input/output error." at the shell prompt), and I could not ctl-alt-F8
to the text console and log in as root.

Suspecting a hardware problem, I spent a lot of time running diagnostics
like smartctl and booting from Hiren's boot disk and running Seagate
and Maxtor utilities. Every single disk diagnostic comes back clean. The
problem is not in the IDE drive itself, or if it is, it's a problem that
diagnostics can't find.

Today, I unplugged the IDE drive, and put a base install of F10 on the
SATA drive which has 2.6.27.5-117.fc10.i686. The machine no longer locks
up while it is running. But it prints this message every 10-20 seconds:

ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
ata5.01: status: { DRDY }
ata5: soft resetting link
ata5: nv_mode_filter: 0x1&0x1f01f->0x1, BIOS=0x1f000 (0xc50000) ACPI=0x1f01f
(600:30:0x1c)
ata5.01: configured for PIO0
ata5: EH complete

This machine was built back in November/December, and worked fine for a
while - but I never had time to finish doing anything with it. This
week, I booted and ran the "yum update" and it was the first time I
noticed these problems. However, they could have been there all along,
but it certainly didn't lock up as it was running.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/23

------------------------------------------------------------------------
On 2009-01-27T17:09:28+00:00 scott wrote:

I have installed CentOS 5.2 on this machine, and have no ATA errors like
this.

What bothers me is this is the third SATA bug I've encountered where a
working system breaks for no apparent reason because of an upgrade. One
was fixed, and the other two are open. I have a Blu-Ray burner that is a
brick because of one of these errors. All three of these are situations
where Fedora worked fine on the hardware, but an upgrade broke the
existing system. The first one was a year or two ago, and was eventually
fixed. But these two are open and inactive. I can't live with this any
more, so this is the end of the line with Fedora for me. I have to run
Fedora for some IBM software I develop with, but will try to see if I
can get it to run on SuSE, or CentOS. I've used Red Hat since 5.2, but
can't deal with these broken systems any longer.

I could understand if Fedora had bleeding-edge new stuff that wasn't
working. I don't have any problems with that. But I have problems with
existing, working code suddenly breaking to the point the systems aren't
usable.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/24

------------------------------------------------------------------------
On 2009-04-05T18:49:49+00:00 Martin wrote:

Hi to all,

I'm having the same troubles as in comments 14, 16, 17 and 18.

Regarding note 12, I know for "sure" that the main disk never gets
locked that way. In my experience, the disk which contains '/' never
gets involved in this. I say for "sure" because I shuffle disks, file
systems and partitioning schemes around a lot.

In my case, the "normal" disk set-up is:

sda, sdb, sdc = built-in SATA HDs
ata5 = DVD-RW (Aopen)

'/' is on sdb;
/tmp, /var/tmp, /var/spool, /var/cache/yum, /usr, /usr/lib64, /usr/share
and some other subdirectories each have their own partition, divided over
the 3 disks. All of this is for flexibility, performance (by using
parallelism), etc.

Since the update of 2009/02/23, disks sda and sdc no longer lock up as
in message 17. But ata5 (DVD) does.

Because I also use other disks and file systems from time to time, I'm
"nearly sure" there is no hardware or file system issue.

Also noteworthy is that "something, somewhere" is polling via the D-bus
all of the time, which seriously slows down other system functions. It
seems the D-bus is kept busy by this problem, but I can't find a clue.

OS = F10 x86_64, all ext4 except /boot, latest update yesterday.

I also have a machine built around an Asus P5VDC-MX motherboard.

Is there a solution somehow?

Thanks a lot

martin

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/27

------------------------------------------------------------------------
On 2009-04-17T09:14:39+00:00 Stanislaw wrote:

The only ata-related change between 2.6.27-0.392.rc8.git7.fc10 and
2.6.27-0.354.rc7.git3.fc10 is the sata_nv hardreset commit:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commitdiff;h=4c1eb90a0908c0c60db2169dce08fb672e7582f1

It is known that this commit causes problems, which were reported in two
places:

http://bugzilla.kernel.org/show_bug.cgi?id=12176
http://bugzilla.kernel.org/show_bug.cgi?id=11195

And fixed in two further commits:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commitdiff;h=2da462eba7e5b585d54c17d76c6a662e4fbb3c32

So the bug should be fixed in the newest updates of the Fedora kernel
(2.6.27.21 based). Johan, could you confirm that?

BTW: Johan, in your dmesg there are a lot of messages like this:

attempt to access beyond end of device
sdc: rw=0, want=625160072, limit=312581808
Buffer I/O error on device sdc1, logical block 78144752
attempt to access beyond end of device
sdc: rw=0, want=625160072, limit=312581808
Buffer I/O error on device sdc1, logical block 78144752
attempt to access beyond end of device

It is a serious problem which can cause data corruption. It can be something wrong
in the software working on top of the ata devices (filesystem, device mapper) or
maybe in ata itself, or a hardware problem (memory corruption, chipsets etc...)
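
To put the want and limit figures from those messages in perspective, here is a minimal arithmetic sketch. It assumes both values are counted in 512-byte sectors; the constant and function names are hypothetical.

```python
# Assumption: want/limit in the kernel message are 512-byte sector counts.
SECTOR_BYTES = 512


def describe(want, limit):
    """Convert the want/limit sector counts to bytes and compare them."""
    return {
        "want_bytes": want * SECTOR_BYTES,
        "limit_bytes": limit * SECTOR_BYTES,
        "overshoot_sectors": want - limit,
        "ratio": want / limit,
    }


if __name__ == "__main__":
    # Values copied from the messages quoted above.
    info = describe(want=625160072, limit=312581808)
    print("device size  : %.1f GB" % (info["limit_bytes"] / 1e9))
    print("requested end: %.1f GB" % (info["want_bytes"] / 1e9))
    print("ratio        : %.4f" % info["ratio"])
```

Under that assumption the device is about 160 GB while the request ends at about 320 GB, so the access reaches almost exactly twice the size of sdc. One possible reading, not confirmed anywhere in this thread, is that something is addressing sdc with a layout that describes a device twice as large.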

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/28

------------------------------------------------------------------------
On 2009-07-12T19:05:03+00:00 Jeff wrote:

When will this problem get a resolution?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/29

------------------------------------------------------------------------
On 2009-07-14T17:52:46+00:00 Stanislaw wrote:

(In reply to comment #22)
> When will this problem get a resolution?

As the base kernel version for Fedora 10 and 11 is now 2.6.29, I believe
this problem is already solved. Jeff, can you reproduce this issue on
Fedora 10 or 11?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/30

------------------------------------------------------------------------
On 2009-07-15T18:27:13+00:00 Jeff wrote:

I checked my version. It is:
uname -r
2.6.27.25-170.2.72.fc10.x86_64

I will upgrade the kernel and give an update.

Regards
Jeff

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/31

------------------------------------------------------------------------
On 2009-08-15T18:10:03+00:00 Jeff wrote:

I have upgraded to F11:
$ uname -r
2.6.29.6-217.2.3.fc11.x86_64
The CD-ROM and DVD drives appear to work for a longer period of time before
locking up, but they still lock up. In the past I would get a DBUS error. Now
I get no error at all on eject or rescan. I cannot eject the disc manually
from outside either. Tell me what logs or traces I can provide to help resolve
this issue. I'm attempting to reboot to capture a screenshot of the working
scenario.

regards
Jeff

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/32

------------------------------------------------------------------------
On 2009-08-16T04:20:12+00:00 Jeff wrote:

It appears that after a firefox file download, when I perform an "open
containing folder", any .ISOs present are automatically mounted. In my
case that is about 15 ISOs. Then mplayer immediately runs as well. This
kills the cdrom/DVD devices.

Regards
Jeff

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/33

------------------------------------------------------------------------
On 2009-11-18T07:56:09+00:00 Bug wrote:

This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 10 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/34

------------------------------------------------------------------------
On 2009-12-18T06:31:07+00:00 Bug wrote:

Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/comments/35

** Changed in: linux (Fedora)
Importance: Unknown => High

** Bug watch added: Linux Kernel Bug Tracker #12176
https://bugzilla.kernel.org/show_bug.cgi?id=12176

** Bug watch added: Linux Kernel Bug Tracker #11195
https://bugzilla.kernel.org/show_bug.cgi?id=11195

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/309901

Title:
SATA timeout causing soft lockup during heavy disk activity

Status in linux package in Ubuntu:
Won't Fix
Status in linux package in Fedora:
Won't Fix

Bug description:
Binary package hint: linux-source-2.6.27

During heavy disk activity, SATA drive will timeout and the system
will stop responding. The system rarely recovers and normally needs a
cold boot.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/309901/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
