Re: PROBLEM: Buffer I/O error on device hdg1, system freeze.

2005-03-18 Thread Nils Radtke

Hi Bartlomiej,

Thanks for your link.

# >  hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
# >  hdg: dma_intr: error=0x40 { UncorrectableError }, LBAsect=262311, high=0, low=262311, sector=262311
# >  ide: failed opcode was: unknown
# >  end_request: I/O error, dev hdg, sector 262311
# >  Buffer I/O error on device hdg1, logical block 131124
# > 
# >   fscking this disk freezes the entire system.
# > 
# >  The disk was remounted ro afterwards.
# >  Disk itself is ok. Is a new one.
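
The numbers in the quoted log can be checked by hand. The following is a minimal Python sketch that covers only the bit names appearing in the log. The names STATUS_BITS, ERROR_BITS, decode and sector_to_block are made up for this illustration. The partition start (sector 63) and the 1 KiB block size (2 sectors per block) are assumptions, not facts from the report; they merely reproduce the logged values.

```python
# Only the bits that show up in the quoted log are named here;
# any other set bit is printed as a raw hex value.
STATUS_BITS = {0x40: "DriveReady", 0x10: "SeekComplete", 0x01: "Error"}
ERROR_BITS = {0x40: "UncorrectableError"}


def decode(value, names):
    """Return the names of all set bits, highest bit first."""
    found = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            found.append(names.get(bit, "bit 0x%02x" % bit))
    return found


def sector_to_block(sector, part_start=63, sectors_per_block=2):
    """Map an absolute disk sector to a block number inside the partition.

    part_start and sectors_per_block are ASSUMED values (hdg1 starting
    at sector 63, 1 KiB blocks); check them against the real setup.
    """
    return (sector - part_start) // sectors_per_block


print(decode(0x51, STATUS_BITS))   # status=0x51
print(decode(0x40, ERROR_BITS))    # error=0x40
print(sector_to_block(262311))     # sector on hdg -> logical block on hdg1
```

Under these assumptions, sector 262311 on hdg maps to logical block 131124 on hdg1, which is the block named in the "Buffer I/O error" line.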

# http://smartmontools.sf.net
Extract from /usr/share/doc/smartmontools/WARNINGS.gz:

SYSTEM:   Promise 20265 IDE-controller
PROBLEM:  Smartctl locks system solid when used on CDROM/DVD device
REPORTER: see link below
LINK: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208964
NOTE: Problem seems to affect kernel 2.4.21 only.


SYSTEM:   Promise IDE-controllers and perhaps others also
PROBLEM:  System freezes under heavy load, perhaps when running SMART commands
REPORTER: Mario 'BitKoenig' Holbe [EMAIL PROTECTED]
LINK: http://groups.google.de/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=1wUXW-2FA-9%40gated-at.bofh.it
NOTE: Before freezing, SYSLOG shows the following message(s)
  kernel: hdf: dma timer expiry: dma status == 0xXX
  where XX is two hexadecimal digits. This may be a kernel bug
  or an underlying hardware problem.  It's not clear if
  smartmontools plays a role in provoking this problem.  FINAL
  NOTE: Problem was COMPLETELY resolved by replacing the power
  supply.  See URL above, entry on May 29, 2004 by Holbe.  Other
  things to try are exchanging cables, and cleaning PCI slots.


This sounds highly familiar and suggests at least a hidden correlation
between this kind of error and the Promise controller (PDC) drivers.
OK, maybe I'm prejudiced by now. We'll see.
A year ago, other disks (IBM/WD) also had trouble on the PDC, but not on
onboard controllers. And they are still spinning today (meaning they did
not have to be replaced because of hard disk errors).

The fact is, however, that, as I mailed last year, even after a complete
exchange of mainboard and processor, the problem persists across every
kernel version. Furthermore, countless posts report similar or identical
symptoms.

Nevertheless, I keep the list up-to-date in case of new info.

smartctl -a /dev/hdc gives:
Error 18 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 a8 05 c3 e0  Error: UNC at LBA = 0x00c305a8 = 12780968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  24 00 f8 a7 05 c3 06 00  00:08:14.850  READ SECTOR(S) EXT
  25 00 00 9f 05 c3 06 00  00:08:14.850  READ DMA EXT
  25 00 00 9f 04 c3 06 00  00:08:14.850  READ DMA EXT
  25 00 00 9f 03 c3 06 00  00:08:14.850  READ DMA EXT
  25 00 00 9f 02 c3 06 00  00:08:14.850  READ DMA EXT

Error 17 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 47 06 c3 e0  Error: UNC at LBA = 0x00c30647 = 12781127

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  25 00 00 9f 05 c3 06 00  00:07:48.550  READ DMA EXT
  25 00 00 9f 04 c3 06 00  00:07:48.550  READ DMA EXT
  25 00 00 9f 03 c3 06 00  00:07:48.550  READ DMA EXT
  25 00 00 9f 02 c3 06 00  00:07:48.550  READ DMA EXT
  25 00 00 9f 01 c3 06 00  00:07:48.550  READ DMA EXT

Error 16 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 b0 f2 57 e0  Error: UNC at LBA = 0x0057f2b0 = 5763760

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  24 00 20 af f2 57 10 00  00:43:45.600  READ SECTOR(S) EXT
  25 00 28 a7 f2 57 10 00  00:43:45.600  READ DMA EXT
  25 00 18 77 f2 57 10 00  00:43:45.600  READ DMA EXT
  25 00 18 5f 28 57 11 00  00:43:45.600  READ DMA EXT
  25 00 08 7f 10 54 10 00  00:43:45.600  READ DMA EXT

Error 15 occurred at disk power-on lifetime: 2249 hours (93 days + 17 hours)
  When the command that caused the error occurred, the device was doing
SM
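
The LBA that smartctl prints in each complete entry above can be recomputed from the register dump. This is a minimal Python sketch, assuming 28-bit addressing in which the low four bits of DH hold LBA bits 24-27 and CH, CL and SN hold the lower three bytes. The function name is hypothetical.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH register values into a 28-bit LBA.

    Assumes the upper four bits of DH are mode/drive-select flags and
    only the lower four bits belong to the address.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the "After command completion" lines
# of errors 18, 17 and 16 (columns SN CL CH DH).
for regs in ((0xA8, 0x05, 0xC3, 0xE0),
             (0x47, 0x06, 0xC3, 0xE0),
             (0xB0, 0xF2, 0x57, 0xE0)):
    lba = lba28_from_registers(*regs)
    print("0x%08x = %d" % (lba, lba))
```

For the three complete entries this reproduces the LBAs that smartctl reports: 12780968, 12781127 and 5763760.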

Re: ethX interface rx errors

2005-03-12 Thread Nils Radtke

Hi Omer, hi others,


# i'm wondering if you ever found a solution to the problem you
# have described here: http://lkml.org/lkml/2004/12/5/81
I'll send a small update today with this email.

# i'm having the exact same issue with one of my linux machines,
# and i would really appreciate any advice you can give.

The only cure I know treats the symptom: reboot.
But make sure you power off and then boot up again!

A plain reboot will not work (at least not for me). Give your box a
complete power cycle.

uptime: 10:55:35 up 6 days, 19:06,  1 user,  load average: 1.11, 0.89, 0.85
So far, no rx errors. Yet.
Linux service 2.6.10service #1 Thu Jan 6 21:53:31 CET 2005 i686 GNU/Linux
and
Linux service 2.6.11tf #1 Thu Jan 6 21:53:31 CET 2005 i686 GNU/Linux

[EMAIL PROTECTED]:~# ifconfig eth0 
eth0  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

[EMAIL PROTECTED]:~# ifconfig eth1 
eth1  RX packets:27521716 errors:0 dropped:0 overruns:0 frame:0
  TX packets:43011137 errors:0 dropped:0 overruns:10 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:2384780308 (2.2 GiB)  TX bytes:2096012953 (1.9 GiB)

Using bridge:
br0   RX packets:27490318 errors:0 dropped:0 overruns:0 frame:0
  TX packets:43026044 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0 
  RX bytes:2009612925 (1.8 GiB)  TX bytes:2082138120 (1.9 GiB)
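
To watch whether the rx error count climbs, the counter lines above can be parsed and compared between two snapshots. This is a minimal Python sketch that assumes the old net-tools output format shown here ("RX packets:N errors:N ..."). The function names are made up for this illustration, and newer ifconfig versions print a different layout.

```python
import re

COUNTER = re.compile(r"(\w+):(\d+)")


def parse_packet_line(line):
    """Parse an 'RX packets:... errors:...' or 'TX packets:...' line.

    Returns (direction, counters) or None if the line is not a
    packet counter line (e.g. the 'RX bytes' line).
    """
    match = re.search(r"\b(RX|TX) packets:", line)
    if match is None:
        return None
    fields = COUNTER.findall(line[match.start():])
    return match.group(1), {name: int(value) for name, value in fields}


def error_delta(before, after):
    """Number of errors accumulated between two counter snapshots."""
    return after["errors"] - before["errors"]


# Lines copied from the eth1 output above.
rx = parse_packet_line(
    "eth1  RX packets:27521716 errors:0 dropped:0 overruns:0 frame:0")
tx = parse_packet_line(
    "  TX packets:43011137 errors:0 dropped:0 overruns:10 carrier:0")
print(rx)
print(tx)
```

Taking a snapshot every few minutes and printing error_delta() for the RX counters would show when the errors start to accumulate.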

BTW (OT): It would be great if vpn/racoon could be attached to network
devices. That way you could implement a transparent VPN:
WLAN <--> LAN, all on the same (sub)net but encrypted on the wireless half.
(vrf? Hm, I have not checked that so far.)

Use latest (vanilla) kernels. I do not have any experiences with other
kernels, be it -ac, -mm, rh, suse, debian. Honestly, I do not want to
use them any longer. (Yes, ok, there _is_ experience, but, you know..;-)

2.6.10 had some serious trouble. E.g. (OT) suspend/resume on notebooks
did not restore the hw clock time; 2.6.11 does. 2.6.10 also had a problem
with the ati driver, which 2.6.11 does not have.

There seems to be a serious relation with USB. Having any USB device
connected (it is _not_ important whether the modules for this device are
loaded!) accelerates the increase of the rx error count.

I first suspected the binary webcam module for the Philips webcams.
But now I tend to say that was rather hasty. That binary module is
responsible for kernel Oopses in the USB context, which might be the
connection between these premises.
Don't think too much about that module being responsible for the rx
errors any more.
And I do not use the webcam any longer. Pity, the quality is quite good.

A USB device also makes the first rx error more likely. If, without a
USB device, the first occurrence comes after an hour under heavy (net or
not?) load, it will almost surely come after 2 minutes with a USB device
attached.

There might be a correlation with samba! smbd/nmbd seems prone to cause
similar effects. 

What is no problem is recognising remotely that you _have_ the "rx
errors" problem: just try to transfer a file of at least 10 MB to the
problematic box running the samba server. Have a coffee or sex or read a
book or whatever meanwhile. When you come back to the upload you started
so long ago: do not be stunned, it will not have finished yet. In fact,
it never will.
 
# thanks very much.
You're welcome. At least we're not alone ;-)
Actually, there are about 4 to 5 known people out there having this problem.
Not too many, it seems. But the number also seems to be (slowly) increasing.

AND: they are all using rtl chipset cards. But: Know that the rx error
problem occurred with 3com 3c905 also!

One more detail: in circumstances without any rx errors the maximum
throughput (using samba) reached on site is about 4-5MB per second.

Hope this helps somehow.


Nils


-- 
A+
* N.Radtke@ * University of Stuttgart *icq / lc   *
*  www.Think-Future.de  *dep.comp.science * 9336272/92045 *
:xUTM 32 0515651 5394088 :)
   Overdrawn?  But I still have checks left! 
   
   


Re: ethX interface rx errors AND RE: Promise module (old) broken

2005-01-16 Thread Nils Radtke

  Hi, 

A month and a half ago, I asked about a solution for the bad performance
of 8139too _and_ 3c509 cards.
Another email was about the bad performance of the PROMISE20565
controller, which was making hard disks go mad (and me, too).

Thx to Nick Warne, Bernd Eckenfels (both about RX errors), Bartlomiej
Zolnierkiewicz, Alan Pope (both about PROMISE20565) for answers and suggestions.

Both topics have more or less been solved (at least they are handled in
such a way that the system runs stably again); read on:

The main (ethx RX errors) problem has not been solved yet. Still, there are 
RX errors.

But: the amount of RX errors has dramatically decreased!


What did I do?

1) Plug the cards into different PCI slots. Mainly, I moved all of the
plugged PCI cards one slot up. Previously, the cards were plugged into
PCI slots 4 and 5, so they are now in 3 and 4. I did this because on the
Asus P3BF mainboard slots 4 and 5 share interrupts with each other.
Slot 5 also shares its IRQ with USB. Moreover, I took care that the
slots holding the NICs and the PROMISE20565 card do not share
interrupts. The PROMISE20565 controller card driver is also something
special and causes trouble on its own. More on that below.
The problem was the same with one 8139too and one 3c509 card plugged in.
Now there are only two 8139too cards in the system again.

2) Switch to kernel 2.6.10.
That took me some convincing of myself, as 2.6.8-2.6.9 had caused
some serious issues with the beloved PROMISE20565 driver..
Hey, who dares wins! (sometimes)
With 2.6.10 (and the PCI slots swapped) the RX errors are far fewer AND
the PROMISE20565 controller works (almost) out of the box.
The controller driver previously made me believe a brand-new WD 120GB HDD
was defective.
But it isn't. It's the driver that makes the HDD shaky.
Or rather, it was. Now the HDD runs fast (as it didn't before)
and without disk drive seek errors all the time, which caused the kernel
to mount the disk read-only. That was nasty. Better now.
(The PROMISE20565 controller card is in a different PCI slot as well, now.)
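
Whether two devices still share an interrupt after moving cards can be read from /proc/interrupts. This is a minimal Python sketch, assuming the usual layout of one or more per-CPU counters, a single-token controller name, then a comma-separated device list. The sample text and the function name are hypothetical and are not taken from the machine discussed here.

```python
# HYPOTHETICAL sample in the style of /proc/interrupts; on a real box
# use open("/proc/interrupts").read() instead.
SAMPLE = """\
           CPU0
  0:   12345678          XT-PIC  timer
  9:     123456          XT-PIC  eth0, uhci_hcd
 11:     654321          XT-PIC  ide2, eth1
NMI:          0
"""


def shared_irqs(text):
    """Map IRQ number -> device names for IRQs used by several devices."""
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header line, NMI/ERR rows, blank lines
        fields = rest.split()
        # Skip the per-CPU counters (pure digits).
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # ASSUMPTION: the controller name is one token (e.g. XT-PIC).
        devices = [d for d in " ".join(fields[i + 1:]).split(", ") if d]
        if len(devices) > 1:
            shared[int(head)] = devices
    return shared


for irq, devices in sorted(shared_irqs(SAMPLE).items()):
    print("IRQ %d shared by: %s" % (irq, ", ".join(devices)))
```

Running this before and after moving the cards would show whether the NICs still share a line with USB or the IDE controller.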


  Thanks for all your suggestions and answers!


With kind regards,


  Nils Radtke


-- 
A+
* N.Radtke@ * University of Stuttgart *icq / lc   *
*  www.Think-Future.de  *dep.comp.science * 9336272/92045 *
:xUTM 32 0515651 5394088 :)
   You just wait, I'll sin till I blow up!   -- Dylan Thomas 
   
   

