Re: PROBLEM: Buffer I/O error on device hdg1, system freeze.
Hi Bartlomiej,

Thanks for your link.

# > hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
# > hdg: dma_intr: error=0x40 { UncorrectableError }, LBAsect=262311, high=0, low=262311, sector=262311
# > ide: failed opcode was: unknown
# > end_request: I/O error, dev hdg, sector 262311
# > Buffer I/O error on device hdg1, logical block 131124
# >
# > fscking this disk freezes the entire system.
# >
# > The disk was remounted ro afterwards.
# > Disk itself is ok. Is a new one.
# http://smartmontools.sf.net

Extract from /usr/share/doc/smartmontools/WARNINGS.gz:

SYSTEM:   Promise 20265 IDE-controller
PROBLEM:  Smartctl locks system solid when used on CDROM/DVD device
REPORTER: see link below
LINK:     http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208964
NOTE:     Problem seems to affect kernel 2.4.21 only.

SYSTEM:   Promise IDE-controllers and perhaps others also
PROBLEM:  System freezes under heavy load, perhaps when running SMART commands
REPORTER: Mario 'BitKoenig' Holbe [EMAIL PROTECTED]
LINK:     http://groups.google.de/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=1wUXW-2FA-9%40gated-at.bofh.it
NOTE:     Before freezing, SYSLOG shows the following message(s)
          kernel: hdf: dma timer expiry: dma status == 0xXX
          where XX is two hexidecimal digits. This may be a kernel bug
          or an underlying hardware problem. It's not clear if smartmontools
          plays a role in provoking this problem. FINAL NOTE: Problem was
          COMPLETELY resolved by replacing the power supply. See URL above,
          entry on May 29, 2004 by Holbe. Other things to try are exchanging
          cables, and cleaning PCI slots.

This sounds very familiar and suggests at least a possible correlation
between this kind of error and the Promise PDC controller drivers.
OK, maybe I'm prejudiced by now. We'll see.

A year ago, other disks (IBM/WD) also had trouble on the PDC, but not on
onboard controllers. And they are still spinning today.
(That is, they did not have to be replaced because of hard disk errors.)

The fact is, however, that as I mailed last year, even after a complete
exchange of mainboard and processor, the problem persists across every
kernel version. Furthermore, countless posts report the same or similar
symptoms.

Nevertheless, I will keep the list up to date in case of new info.

smartctl -a /dev/hdc gives:

Error 18 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 a8 05 c3 e0  Error: UNC at LBA = 0x00c305a8 = 12780968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --
  24 00 f8 a7 05 c3 06 00      00:08:14.850  READ SECTOR(S) EXT
  25 00 00 9f 05 c3 06 00      00:08:14.850  READ DMA EXT
  25 00 00 9f 04 c3 06 00      00:08:14.850  READ DMA EXT
  25 00 00 9f 03 c3 06 00      00:08:14.850  READ DMA EXT
  25 00 00 9f 02 c3 06 00      00:08:14.850  READ DMA EXT

Error 17 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 47 06 c3 e0  Error: UNC at LBA = 0x00c30647 = 12781127

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --
  25 00 00 9f 05 c3 06 00      00:07:48.550  READ DMA EXT
  25 00 00 9f 04 c3 06 00      00:07:48.550  READ DMA EXT
  25 00 00 9f 03 c3 06 00      00:07:48.550  READ DMA EXT
  25 00 00 9f 02 c3 06 00      00:07:48.550  READ DMA EXT
  25 00 00 9f 01 c3 06 00      00:07:48.550  READ DMA EXT

Error 16 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 b0 f2 57 e0  Error: UNC at LBA = 0x0057f2b0 = 5763760

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --
  24 00 20 af f2 57 10 00      00:43:45.600  READ SECTOR(S) EXT
  25 00 28 a7 f2 57 10 00      00:43:45.600  READ DMA EXT
  25 00 18 77 f2 57 10 00      00:43:45.600  READ DMA EXT
  25 00 18 5f 28 57 11 00      00:43:45.600  READ DMA EXT
  25 00 08 7f 10 54 10 00      00:43:45.600  READ DMA EXT

Error 15 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was doing SM
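The LBA that smartctl prints for each complete error entry above can be cross-checked against the register dump. What follows is only a rough sketch: it assumes the 28-bit LBA packing (low nibble of DH, then CH, CL, SN), and the function name `lba_from_registers` is made up for illustration; it is not part of smartmontools.

```python
def lba_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA taskfile registers into a 28-bit LBA (assumed layout)."""
    # Only the low four bits of DH belong to the address; the upper bits
    # carry the drive-select and LBA-mode flags.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# SN, CL, CH, DH copied from the three complete error entries in the log.
ERROR_REGISTERS = {
    18: (0xA8, 0x05, 0xC3, 0xE0),
    17: (0x47, 0x06, 0xC3, 0xE0),
    16: (0xB0, 0xF2, 0x57, 0xE0),
}

for number, regs in ERROR_REGISTERS.items():
    lba = lba_from_registers(*regs)
    print(f"Error {number}: UNC at LBA = 0x{lba:08x} = {lba}")
```

Under that assumption the three results match the "UNC at LBA" values in the log, which suggests the register dump and the reported sectors are at least consistent with each other.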
Re: ethX interface rx errors
Hi Omer, hi others,

# i'm wondering if you ever found a solution to the problem you
# have described here: http://lkml.org/lkml/2004/12/5/81

I'll send a small update today with this email.

# i'm having the exact same issue with one of my linux machines,
# and i would really appreciate any advice you can give.

The only cure, and it only treats the symptom: reboot.
But: make sure you power off and then boot up again!! A plain reboot will
not work (at least for me). Give your box a complete power cycle.

uptime:
 10:55:35 up 6 days, 19:06, 1 user, load average: 1.11, 0.89, 0.85

So far, no rx errors. Yet.

Linux service 2.6.10service #1 Thu Jan 6 21:53:31 CET 2005 i686 GNU/Linux
and
Linux service 2.6.11tf #1 Thu Jan 6 21:53:31 CET 2005 i686 GNU/Linux

[EMAIL PROTECTED]:~# ifconfig eth0
eth0  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

[EMAIL PROTECTED]:~# ifconfig eth1
eth1  RX packets:27521716 errors:0 dropped:0 overruns:0 frame:0
      TX packets:43011137 errors:0 dropped:0 overruns:10 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:2384780308 (2.2 GiB) TX bytes:2096012953 (1.9 GiB)

Using bridge:
br0   RX packets:27490318 errors:0 dropped:0 overruns:0 frame:0
      TX packets:43026044 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:2009612925 (1.8 GiB) TX bytes:2082138120 (1.9 GiB)

BTW (OT): It would be great if vpn/racoon could be attached to network
devices. That way you could implement a transparent vpn.
VPN: WLAN <--> LAN, all in the same (sub)net but encrypted on the wireless
half. (vrf? hm, did not check that so far)

Use the latest (vanilla) kernels. I do not have any experience with other
kernels, be it -ac, -mm, rh, suse, debian. Honestly, I do not want to use
them any longer. (Yes, ok, there _is_ experience, but, you know..;-)

2.6.10 had some serious trouble. E.g. (OT) suspend/resume on notebooks:
it did not restore the hw clock time.
2.6.11 does. 2.6.10 had a problem with the ati driver; 2.6.11 does not.

There seems to be a strong connection with USB. Having any usb device
connected (it is _not_ important whether the modules for this device are
loaded!) makes the rx error count climb faster.
At first I suspected the binary module for the philips webcams. But now I
tend to say that was rather hasty. That binary module is responsible for
kernel Oopses in the USB context, which might be the link between the two
observations. I no longer think that module is responsible for the rx
errors. And I do not use the webcam any longer.. Pity, quality is quite good.

An attached usb device also makes the first rx error show up sooner. If,
w/o a usb device, the first occurrence comes after an hour under heavy
(net or not?) load, it will almost certainly come after 2 minutes with a
usb device attached..

There might be a correlation with samba! smbd/nmbd seem prone to cause
similar effects.

It is easy to recognise remotely that you _have_ the "rx errors" problem:
just try to transfer a file of at least 10 MB to the problematic box
running the samba server. Have a coffee or sex or read a book or whatever
in the meantime. When you come back to the upload you started so long ago:
do not be stunned, it will not have finished yet. In fact, it never will..

# thanks very much.

You're welcome. At least we're not alone ;-)

Actually, there are about 4 to 5 known people out there having this
problem. Not too many, it seems. But the number also seems to be (slowly)
increasing. AND: they are all using rtl chipset cards.
But: know that the rx error problem occurred with a 3com 3c905 as well!

One more detail: in circumstances without any rx errors, the maximum
throughput (using samba) reached on site is about 4-5 MB per second.

Hope this helps somehow..

Nils

-- 
A+ * N.Radtke@ * University of Stuttgart *icq / lc * * www.Think-Future.de *dep.comp.science * 9336272/92045 * :xUTM 32 0515651 5394088 :)
Overdrawn?
But I still have checks left!
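The rx error counts discussed in this message can also be watched without ifconfig, by reading the receive "errs" column of `/proc/net/dev`. This is only a minimal sketch, assuming the usual two header lines followed by one `name: counters` row per interface; the helper names `rx_errors`, `read_rx_errors` and `growth`, and the sample text, are made up for illustration and are not output from the machine discussed here.

```python
from typing import Dict


def rx_errors(proc_net_dev: str) -> Dict[str, int]:
    """Return {interface: rx error count} from the text of /proc/net/dev."""
    counters = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 3:
            continue
        # Receive columns start with: bytes, packets, errs, ...
        counters[name.strip()] = int(fields[2])
    return counters


def read_rx_errors(path: str = "/proc/net/dev") -> Dict[str, int]:
    """Read the live counters (Linux only)."""
    with open(path) as handle:
        return rx_errors(handle.read())


def growth(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-interface increase in rx errors between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth1:  500000    4000    7    0    0     0          0         0   600000    5000    0    0    0     0       0          0
"""

print(rx_errors(SAMPLE))
```

Taking one snapshot with `read_rx_errors()` before plugging in a usb device and one afterwards, then passing both to `growth()`, would give a rough number for how fast the counter climbs in each case.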
Re: ethX interface rx errors AND RE: Promise module (old) broken
Hi,

a month and a half ago I asked about a solution for the bad performance of
8139too _and_ 3c509 cards. Another email was about the bad performance of
the PROMISE20565 controller, which made hard disks go mad (and me, too).

Thx to Nick Warne, Bernd Eckenfels (both about RX errors),
Bartlomiej Zolnierkiewicz, Alan Pope (both about PROMISE20565)
for answers and suggestions.

Both topics have more or less been solved (at least they are handled in a
way that lets the system run stably again). Read on:

The main problem (ethX RX errors) has not been fully solved yet. There are
still RX errors. But: the number of RX errors has decreased dramatically!

What did I do?

1) Plug the cards into different PCI slots.
   Mainly, I moved every plugged-in PCI card up by one slot. Previously
   the cards were in PCI slots 4 and 5, so they are now in 3 and 4. I did
   this because on the Asus P3BF mainboard slots 4 and 5 share interrupts
   with each other. Slot 5 also shares its IRQ with USB. Moreover, I made
   sure that the slots holding the NICs and the PROMISE20565 card do not
   share interrupts. The PROMISE20565 controller card driver is also
   something special and causes trouble of its own. More on that below.
   The problem was the same with one 8139too and one 3c509 card plugged
   in. Now there are only two 8139too cards in the system again.

2) Switch to kernel 2.6.10.
   That took some convincing of myself, as 2.6.8-2.6.9 had caused some
   serious issues with the beloved PROMISE20565 driver.. Hey, who dares
   wins! (sometimes)

With 2.6.10 (and the PCI slots swapped) there are far fewer RX errors AND
the PROMISE20565 controller works (almost) out of the box. The controller
driver previously made me believe a brand-new WD 120GB HDD was defective.
But it isn't. It's the driver that made the HDD act up. Or rather, it was.
Now the HDD is running fast (as it wasn't before) and without the constant
disk drive seek errors that caused the kernel to remount the disk
read-only. That was nasty.
Better now. (The PROMISE20565 controller card is in a different PCI slot as well now.)

Thanks for all your suggestions and answers!

With kind regards,

Nils Radtke

-- 
A+ * N.Radtke@ * University of Stuttgart *icq / lc * * www.Think-Future.de *dep.comp.science * 9336272/92045 * :xUTM 32 0515651 5394088 :)
You just wait, I'll sin till I blow up! -- Dylan Thomas
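The interrupt sharing described in step 1 can be checked from `/proc/interrupts` before and after moving cards. This is a rough sketch only, assuming the common layout (a header row naming the CPUs, then one row per IRQ with per-CPU counts, the controller type and a comma-separated device list); the function name `shared_irqs` and the sample text are made up for illustration and are not taken from the Asus P3BF box in question.

```python
from typing import Dict, List


def shared_irqs(proc_interrupts: str) -> Dict[str, List[str]]:
    """Return {irq: [devices]} for IRQ rows that list more than one device."""
    lines = proc_interrupts.splitlines()
    if not lines:
        return {}
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        tokens = rest.split()[cpu_count:]  # drop the per-CPU counters
        names = " ".join(tokens).split(", ")
        # The first chunk still carries the controller type in front of
        # the first device name; keep only its last word (an assumption
        # that device names contain no spaces).
        first = names[0].split()
        names[0] = first[-1] if first else ""
        devices = [name.strip() for name in names if name.strip()]
        if len(devices) > 1:
            shared[irq] = devices
    return shared


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
           CPU0
  0:    1234567    XT-PIC  timer
  1:       4321    XT-PIC  i8042
  9:     987654    XT-PIC  eth0, uhci_hcd
 10:     555555    XT-PIC  ide2, ide3
 11:     222222    XT-PIC  eth1
NMI:          0
"""

for irq, devices in shared_irqs(SAMPLE).items():
    print(f"IRQ {irq} is shared by: {', '.join(devices)}")
```

On a live system the same function could be fed the contents of `/proc/interrupts`; a NIC showing up next to a USB or IDE controller driver on one IRQ would correspond to the situation the slot swap was meant to avoid.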