Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Tue, Feb 12, 2008 at 01:03:35AM +0100, Torfinn Ingolfsen wrote:
> On Mon, 11 Feb 2008 13:00:57 +0100 [EMAIL PROTECTED] (Remco van Bekkum) wrote:
> > here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU,
> > 2GB 800MHz RAM.
>
> FWIW, I have almost the same motherboard (m2a-vm hdmi) with an AMD
> Phenom 9500 and 4GB RAM[1]. Different disk, though. The (single) disk
> drive has worked without problems so far. I'm using standard ufs2
> filesystems on that disk. I'm running RELENG_7:
>
> [EMAIL PROTECTED] uname -a
> FreeBSD kg-vm.kg4.no 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #6: Sat Jan 26 20:58:51 CET 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64
> [EMAIL PROTECTED] atacontrol list
> ATA channel 0:
>     Master: acd0 Optiarc DVD RW AD-5170A/1.12 ATA/ATAPI revision 0
>     Slave:  no device present
> ATA channel 2:
>     Master: ad4 SAMSUNG HD501LJ/CR100-12 Serial ATA II
>     Slave:  no device present
> ATA channel 3:
>     Master: no device present
>     Slave:  no device present
> ATA channel 4:
>     Master: no device present
>     Slave:  no device present
> ATA channel 5:
>     Master: no device present
>     Slave:  no device present
>
> References:
> 1) http://tingox.googlepages.com/asus_m2a-vm_hdmi_freebsd
> --
> Regards,
> Torfinn Ingolfsen

Thanks, here is some more detailed info from me:

xaero# dmesg | grep atapci
atapci0: ATI AHCI controller port 0xfc00-0xfc07,0xf800-0xf803,0xf400-0xf407,0xf000-0xf003,0xec00-0xec0f mem 0xfe02f000-0xfe02f3ff irq 22 at device 18.0 on pci0
atapci0: [ITHREAD]
atapci0: AHCI Version 01.10 controller with 4 ports detected
ata2: ATA channel 0 on atapci0
ata3: ATA channel 1 on atapci0
ata4: ATA channel 2 on atapci0
ata5: ATA channel 3 on atapci0
atapci1: ATI IXP600 UDMA133 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe400-0xe40f at device 20.1 on pci0
ata0: ATA channel 0 on atapci1

xaero# atacontrol list
ATA channel 0:
    Master: no device present
    Slave:  no device present
ATA channel 2:
    Master: ad4 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 3:
    Master: ad6 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 4:
    Master: ad8 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 5:
    Master: ad10 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present

xaero# uname -a
FreeBSD xaero.spacemarines.us 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #1: Sun Feb 10 16:07:39 CET 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64

I'm using BIOS 1603 and a Seasonic 330W PSU. The errors appear to
happen at random; heavy I/O doesn't trigger them. I can rebuild world
without problems, so I guess the CPU is ok. The memory has been tested
and showed no errors. What's left is cables and mainboard. But how
error-prone are SATA cables? Considering that I've got 50% failing...
Okay, maybe that should prove that the mainboard is faulty :)

-Remco
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Mon, Feb 11, 2008 at 07:24:55AM -1000, Clifton Royston wrote:
> On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
> > On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> > After having replaced my first SATA disk with one of the same type,
> > still having the same errors, I replaced this 1TB drive with 4x500GB
> > Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
> > cvsupped and rebuilt world. This afternoon everything is breaking
> > down again with the same errors:
> >
> > Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> > Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> > Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> > Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> > Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> > Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274
>
> Did you try replacing cabling as a previous poster recommended? I've
> had similar problems with both traditional parallel ATA and SATA due
> to marginal cables, which of course are not solved by swapping drives.
>
> Not saying there's not a software problem here, just that there is
> still one area to eliminate.
>
>   -- Clifton
>
> --
> Clifton Royston -- [EMAIL PROTECTED] / [EMAIL PROTECTED]
> President - I and I Computing * http://www.iandicomputing.com/
> Custom programming, network design, systems and network consulting services

Hi Clifton,

I don't recall exactly anymore, but at least 3 cables have been used
without problems on other systems. I'm wondering, the mainboard acts
weird sometimes as well: when I press the reset button, it sometimes
powers down. Also, I just did a reset after it deadlocked on shutdown
because of the errors, and when the system booted, 2 disks were not
seen by the BIOS. I had to power down the box, and when it came up
again the disks were back. Can software leave the disks in a state
where the BIOS doesn't detect them after pressing the reset button?

I'm 100% certain that on my previous installation, in a 100% different
system, I got the same errors. That should normally mean either
software or disk. The disk has been replaced, the OS is the same. I'm
either having really bad luck or something else is wrong. What is a
good way of stress testing disks?

Thanks!

- Remco
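On the stress-testing question, one rough approach is to write a large amount of data, force it to disk, read it back, and compare checksums while watching dmesg for new timeouts. The following is only a minimal Python sketch of that idea, not an established tool; the function name write_verify, the chunk size, and the file name are all made up for illustration. Unless the test file is larger than RAM, the read-back may be served from the buffer cache rather than from the disk.

```python
import hashlib
import os
import tempfile


def write_verify(path, total_bytes, chunk_size=1 << 20):
    """Write random data to `path`, fsync it, read it back, and
    return True if the SHA-256 of what was read matches what was written."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS write cache

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()


# Tiny demonstration run in a temporary directory; a real test would
# point at the suspect filesystem and use a much larger size.
with tempfile.TemporaryDirectory() as tmp:
    ok = write_verify(os.path.join(tmp, "stress.bin"), 4 * 1024 * 1024)
    print("data verified" if ok else "MISMATCH")
```

A checksum mismatch or an I/O error raised during the write or read would point at the storage path (disk, cable, controller, or driver) without saying which of them is at fault.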
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> Joe,
>
> I wanted to send you a note about something that I'm still in the
> process of dealing with. The timing couldn't be more ironic.
>
> I decided it would be worthwhile to migrate from my two-disk ZFS
> stripe with a non-ZFS disk for nightly backups, to a RAIDZ pool of
> all 3 disks combined (since they're all the same size).
>
> I had another terminal with gstat -I500ms running in it, so I could
> see overall I/O.
>
> All was going well until about the 81GB mark of the copy. gstat
> started showing 0KB in/out on all the drives, and the rsync was
> stalled. ^Z did nothing, which is usually a bad sign. :-)
>
> I ssh'd in and did a dmesg (summarised):
>
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
>
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
> blocks.
>
> Actually, after letting things go for a while, I realised the box
> just locked up. Probably kernel panic'd due to the I/O problem.
>
> I'll have to poke at SMART stats later to see what showed up.
>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |

Hi all,

After having replaced my first SATA disk with one of the same type,
still having the same errors, I replaced this 1TB drive with 4x500GB
Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
cvsupped and rebuilt world. This afternoon everything is breaking down
again with the same errors:

Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274
Feb 11 12:34:29 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:33 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:37 xaero kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 11 12:34:41 xaero kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 11 12:34:45 xaero kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 11 12:34:45 xaero kernel: ad8: FAILURE - WRITE_DMA48 timed out LBA=298013590

So of 6 new disks I have 4 with the same errors. It would be quite
safe then to not blame the disks imho. I've tested the second drive in
another machine, but still got these timeout errors. What's wrong
here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU,
2GB 800MHz RAM.

Regards,

Remco
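As a side note on the dmesg output Jeremy quoted above: the WRITE_DMA LBAs and the g_vfs_done() byte offsets describe the same requests. The short Python sketch below shows the arithmetic. It assumes 512-byte sectors and that the ad6s1d partition starts 63 sectors into the disk; neither is stated in the thread, and 63 is simply the value that makes the logged numbers agree.

```python
SECTOR_SIZE = 512      # assumption: 512-byte sectors
PARTITION_START = 63   # assumption: ad6s1d begins at sector 63 of ad6


def offset_to_lba(byte_offset, start=PARTITION_START, sector=SECTOR_SIZE):
    """Map a byte offset inside the partition to an absolute disk LBA."""
    return start + byte_offset // sector


# Byte offsets from the g_vfs_done() lines, in log order.
offsets = [7142916096, 7143047168, 7143178240, 7143309312, 7143440384]
# LBAs from the first five distinct WRITE_DMA lines, in log order.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095]

for off, lba in zip(offsets, lbas):
    print(off, "->", offset_to_lba(off), "logged:", lba)
```

Under those assumptions every offset lands on the logged LBA, and the 131072-byte request length equals the 256-sector spacing between consecutive LBAs. So the filesystem-level errors are the same failed writes reported by the ATA driver, not a second, independent problem.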
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Well it looks like in my case it is hardware related after all. It
failed to read the boot block several times now. 2nd sort of DOA of
this disk...

Remco

On Sat, Jan 26, 2008 at 04:28:29PM -0700, Joe Peterson wrote:
> Remco van Bekkum wrote:
> > Same here. On an amd64 system with 1x SATA disk (Western Digital
> > Caviar Green Power) on an amd690G chipset, with UFS and intensive
> > disk activity the system hangs and in the end it may panic. I've
> > csupped today and rebuilt world and generic kernel, but it's still
> > very unstable; sometimes it even hangs when activating geom volumes
> > at boot time... I must add that this is a new system so I'm not
> > 100% sure the hardware is sane. Using ZFS it also crashed when
> > doing intensive I/O.
>
> This is very interesting. It seems there are several of us who are
> experiencing something that *looks* like hardware (disk) issues when
> using 7.0. Could this be related to the mouse freeze issue? Could
> some process be locking/grabbing the CPU at inopportune times and
> causing not only the freezing symptoms but also read/write problems?
>
> Can anyone else using 7.0 who hasn't already (especially those using
> ZFS) check his/her /var/log/messages for disk TIMEOUTs or other disk
> error messages? If this is widespread, I think the chances are slim
> that it is a hardware problem in every case.
>
> -Joe
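For anyone following Joe's suggestion to check /var/log/messages, the sketch below counts ATA warning, timeout, and failure lines per device. It is only an illustration: the regular expression is modelled on the log lines quoted in this thread, not on any documented log format, and the names ATA_EVENT and summarise are hypothetical.

```python
import re
from collections import Counter

# Modelled on lines quoted in this thread, for example:
#   "... kernel: ad8: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2415"
ATA_EVENT = re.compile(r"\b(ad\d+): (WARNING|TIMEOUT|FAILURE) - (.+)$")


def summarise(lines):
    """Count ATA events per (device, severity) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            device, severity, _detail = match.groups()
            counts[(device, severity)] += 1
    return counts


# A few lines copied from the messages in this thread.
sample = [
    "Jan 26 19:55:36 athlon kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly",
    "Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly",
    "Jan 26 19:55:36 athlon kernel: ad8: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2415",
    "Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274",
    "g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5",
]

for (device, severity), n in sorted(summarise(sample).items()):
    print(device, severity, n)
```

To scan a real log, pass an open file instead of the sample list, for example summarise(open("/var/log/messages", errors="replace")). The g_vfs_done() line is deliberately not matched, since it reports the same failure at the filesystem level.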
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Well the problem is that things *sometimes* work... sometimes not. But
my backup is on a system running FreeBSD 6.3 with a Promise SATA
controller, and that one has crashed too while backing up. So I'm kind
of cautious with testing. First I need some reliable storage to try to
recover my data.

Remco

On Sun, Jan 27, 2008 at 10:27:38AM -0700, Joe Peterson wrote:
> Remco van Bekkum wrote:
> > Well it looks like in my case it is hardware related after all. It
> > failed to read the boot block several times now. 2nd sort of DOA of
> > this disk...
>
> Have you tried reading the block in another OS or using SeaTools?
> That would at least verify that it's hardware.
>
> -Joe
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Same here. On an amd64 system with 1x SATA disk (Western Digital
Caviar Green Power) on an amd690G chipset, with UFS and intensive disk
activity the system hangs and in the end it may panic. I've csupped
today and rebuilt world and generic kernel, but it's still very
unstable; sometimes it even hangs when activating geom volumes at boot
time... I must add that this is a new system so I'm not 100% sure the
hardware is sane. Using ZFS it also crashed when doing intensive I/O.
I can supply additional info later if that may help.

Cheers,

Remco

On Sat, Jan 26, 2008 at 10:54:17PM +0100, Nikolaj Farrell wrote:
> After upgrading from RELENG_6_2 to 7.0-RC1 I am experiencing the
> system becoming unresponsive during intensive disk operations. The
> only solution is a power off. These hangs occurred pretty much within
> a few hours of first running 7.0-RC1. 6.2R was fine. I am not running
> ZFS.
>
> The hangs are easy to reproduce by an unrar of an archive ~4.4GB. The
> system will not hang during normal operations.
>
> Below are the errors I get, output from dmesg, atacontrol cap and
> smartctl tests. (The long test is aborted below, but as indicated
> previously completed without errors)
>
> These are the errors:
> ---
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2415
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
> Jan 26 19:55:36 athlon kernel: ad8: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2543
> ---
> and so on...
>
> This computer has only the one SATA hard drive.
>
> dmesg info:
> atapci1: VIA 8237A SATA150 controller port 0xcc00-0xcc07,0xc880-0xc883,0xc800-0xc807,0xc480-0xc483,0xc400-0xc40f,0xc000-0xc0ff irq 21 at device 15.0 on pci0
> atapci1: [ITHREAD]
> ata4: ATA channel 0 on atapci1
> ad8: 238475MB SAMSUNG SP2504C VT100-50 at ata4-master SATA150
> ---
> atacontrol cap ad8
>
> Protocol              Serial ATA II
> device model          SAMSUNG SP2504C
> serial number         S09QJ1CP201268
> firmware revision     VT100-50
> cylinders             16383
> heads                 16
> sectors/track         63
> lba supported         268435455 sectors
> lba48 supported       488397168 sectors
> dma supported
> overlap not supported
>
> Feature                        Support  Enable  Value    Vendor
> write cache                    yes      yes
> read ahead                     yes      yes
> Native Command Queuing (NCQ)   yes      -       31/0x1F
> Tagged Command Queuing (TCQ)   no       no      31/0x1F
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      no       no      0/0x00
> automatic acoustic management  yes      no      0/0x00   254/0xFE
> ---
> smartctl -a /dev/ad8:
>
> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint P120 series
> Device Model:     SAMSUNG SP2504C
> Serial Number:    S09QJ1CP201268
> Firmware Version: VT100-50
> User Capacity:    250 059 350 016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   7
> ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
> Local Time is:    Sat Jan 26 22:44:43 2008 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x02) Offline data collection activity was completed without error. Auto Offline Data Collection: Disabled.
> Self-test execution status: ( 25) The self-test routine was aborted by the host.
> Total time to complete Offline data collection: (5028) seconds.
> Offline data collection capabilities: (0x5b) SMART execute Offline