Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-18 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I would comment the message out.  I added it to see how often the recovery
> was triggering..

i'll probably do that eventually. so far it's triggered 97 times in
249 seconds.
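
for scale, a rough sketch of what those numbers mean. the 97 and 249 are
the figures above and the 250 ms idle interval is taken from Stephen's
description of the driver; the function name and the "fraction of idle
ticks" measure are made up here for illustration only.

```python
def recovery_stats(events: int, seconds: float, idle_interval_s: float = 0.25):
    """Return (recoveries per second, seconds per recovery, fraction of idle ticks)."""
    rate = events / seconds
    period = seconds / events
    # Upper bound: one recovery on every idle tick for the whole window.
    max_events = seconds / idle_interval_s
    return rate, period, events / max_events


rate, period, fraction = recovery_stats(97, 249)
print(f"{rate:.2f} recoveries/s, one every {period:.2f} s, "
      f"{fraction:.1%} of the possible 250 ms ticks")
```

so the recovery fires roughly once every two and a half seconds, well
below the once-per-250 ms ceiling.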

--alex--

-- 
| I believe the moment is at hand when, by a paranoiac and active |
|  advance of the mind, it will be possible (simultaneously with  |
|  automatism and other passive states) to systematize confusion  |
|  and thus to help to discredit completely the world of reality. |
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-18 Thread Stephen Hemminger
On Mon, 18 Dec 2006 10:24:59 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
> 
> > I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> > version see:
> > http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
> >
> > It is too noisy in the console log, because it shows how many times
> > the driver dope slaps itself senseless...  Basically every 250ms when
> > it is idle it resets, sorry it's the kind of code you write to "make it work"
> > and ship it which is why vendor drivers suck.
> 
> i am running now with your fixed version. indeed, it is very noisy, i
> get a constant stream of:
> 
> kernel: eth0: Attempting recovery
> kernel: eth0: receiver stuck?
> 
> but it works. let's see how long it takes to fill up the root
> partition... :-(
> 

I would comment the message out.  I added it to see how often the recovery
was triggering..

-- 
Stephen Hemminger <[EMAIL PROTECTED]>


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-18 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> version see:
>   http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log, because it shows how many times
> the driver dope slaps itself senseless...  Basically every 250ms when
> it is idle it resets, sorry it's the kind of code you write to "make it work"
> and ship it which is why vendor drivers suck.

i am running now with your fixed version. indeed, it is very noisy, i
get a constant stream of:

kernel: eth0: Attempting recovery
kernel: eth0: receiver stuck?

but it works. let's see how long it takes to fill up the root
partition... :-(

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-15 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> version see:
>   http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log, because it shows how many times
> the driver dope slaps itself senseless...  Basically every 250ms when
> it is idle it resets, sorry it's the kind of code you write to "make it work"
> and ship it which is why vendor drivers suck.

i'll give it a try on monday when i go back to work. in the meantime
i've been running with my "fixed" version of the vendor driver and so
far it's been working without any problems (i've been transferring
lots of data in and out of the computer the whole day). if there is
anything i can do to help debug the kernel sky2 driver let me know.

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-15 Thread Stephen Hemminger
On Thu, 14 Dec 2006 19:53:45 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
> 
> > I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
> 
> as per the include file, i ended up replacing all the CHECKSUM_HW with
> CHECKSUM_PARTIAL since the functions in question had to do with
> transmit. seems to be working so far without any lockups. we'll see
> how long this lasts.
> 
> --alex--
> 

I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
version see:
http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz

It is too noisy in the console log, because it shows how many times
the driver dope slaps itself senseless...  Basically every 250ms when
it is idle it resets, sorry it's the kind of code you write to "make it work"
and ship it which is why vendor drivers suck.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I have a fixed up version of the vendor driver, I'll repackage it tomorrow.

as per the include file, i ended up replacing all the CHECKSUM_HW with
CHECKSUM_PARTIAL since the functions in question had to do with
transmit. seems to be working so far without any lockups. we'll see
how long this lasts.

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Stephen Hemminger
On Fri, 15 Dec 2006 13:24:32 +1100
Herbert Xu <[EMAIL PROTECTED]> wrote:

> Alex Romosan <[EMAIL PROTECTED]> wrote:
> > /** does the HW need to evaluate checksum for TCP or UDP packets?
> > if (pMessage->ip_summed == CHECKSUM_HW)
> > 
> > maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> > 
> > /** TCP checksum offload
> > if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> > (SetOpcodePacketFlag == SK_TRUE)
> > 
> > i wonder if this is supposed to be CHECKSUM_COMPLETE
> 
> The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.
> 
> Cheers,

I have a fixed up version of the vendor driver, I'll repackage it tomorrow.


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Herbert Xu
Alex Romosan <[EMAIL PROTECTED]> wrote:
> /** does the HW need to evaluate checksum for TCP or UDP packets?
> if (pMessage->ip_summed == CHECKSUM_HW)
> 
> maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> 
> /** TCP checksum offload
> if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> (SetOpcodePacketFlag == SK_TRUE)
> 
> i wonder if this is supposed to be CHECKSUM_COMPLETE

The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.
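
As a rough illustration of applying that rule when porting an out-of-tree
driver, here is a small sketch that renames the old constant in a source
snippet. It is not part of any kernel tooling: the helper name and the
"rx"/"tx" labels are invented for this example, and deciding which path a
given snippet belongs to is still left to the person doing the port.

```python
import re

# Replacement per direction, following the RX/TX rule of thumb.
REPLACEMENT = {"rx": "CHECKSUM_COMPLETE", "tx": "CHECKSUM_PARTIAL"}


def port_checksum_hw(source: str, direction: str) -> str:
    """Rename the whole identifier CHECKSUM_HW for the given path ("rx" or "tx")."""
    # \b keeps longer identifiers such as CHECKSUM_HW_FOO untouched.
    return re.sub(r"\bCHECKSUM_HW\b", REPLACEMENT[direction], source)


tx_snippet = "if (pMessage->ip_summed == CHECKSUM_HW)"
print(port_checksum_hw(tx_snippet, "tx"))
```

A blind search-and-replace over the whole driver would only be safe if
every use sits on the same path, which is why the direction is an explicit
argument here.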

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> If this is repeatable... and mac_pause is always one then the
> problem is hardware flow control.  I saw bugs before in the bus
> interface where it would not resume on unaligned buffer, but
> that was on receive.

i tried to switch over to the latest vendor driver but unfortunately
it doesn't work with kernel 2.6.19+. it still uses CHECKSUM_HW which
looks like it was replaced by CHECKSUM_PARTIAL and CHECKSUM_COMPLETE
was also added. i think i can replace CHECKSUM_HW in the marvell
driver with CHECKSUM_PARTIAL, except for a couple of places where
i am not sure what i am supposed to do. the first instance it says (i
am kind of paraphrasing here since i am copying from the screen and
not cutting and pasting):

/** does the HW need to evaluate checksum for TCP or UDP packets?
if (pMessage->ip_summed == CHECKSUM_HW)

maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one

/** TCP checksum offload
if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
(SetOpcodePacketFlag == SK_TRUE)

i wonder if this is supposed to be CHECKSUM_COMPLETE

if you have any suggestions, i'll appreciate it.

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Stephen Hemminger
On Thu, 14 Dec 2006 15:21:00 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
> 
> > Another useful bit of information is the statistics (ethtool -S eth0).
> > When there were flow control bugs, they would show up as count of 1.
> 
> the driver locked up again, even with msi interrupts disabled and
> idle_timeout=10. the console message was pretty much as before:
> 
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
> kernel: sky2 hardware hung? flushing
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
> kernel: sky2 status report lost?
> 
> and this is the output from ethtool -S:
> 
> NIC statistics:
>  tx_bytes: 3092123897
>  rx_bytes: 546577898
>  tx_broadcast: 20
>  rx_broadcast: 4376
>  tx_multicast: 0
>  rx_multicast: 459
>  tx_unicast: 2585993
>  rx_unicast: 1550758
>  tx_mac_pause: 1

If this is repeatable... and mac_pause is always one then the
problem is hardware flow control.  I saw bugs before in the bus
interface where it would not resume on unaligned buffer, but
that was on receive.
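
To make that check repeatable, something along these lines could pull the
pause counters out of saved `ethtool -S` output. It is only a sketch: the
function names are made up, and it assumes the plain "name: value" layout
shown in the quoted output (with or without mail quoting in front).

```python
def parse_ethtool_stats(text: str) -> dict:
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def pause_counters(stats: dict) -> dict:
    """Return only the MAC pause counters that are non-zero."""
    return {k: v for k, v in stats.items() if k.endswith("_mac_pause") and v}


sample = """\
> NIC statistics:
>  tx_bytes: 3092123897
>  tx_unicast: 2585993
>  rx_unicast: 1550758
>  tx_mac_pause: 1
"""
print(pause_counters(parse_ethtool_stats(sample)))
```

Running it on a dump taken while the card still works and on one taken
after a hang shows whether the pause count really is 1 every time.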


Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

the driver locked up again, even with msi interrupts disabled and
idle_timeout=10. the console message was pretty much as before:

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
kernel: sky2 status report lost?

and this is the output from ethtool -S:

NIC statistics:
 tx_bytes: 3092123897
 rx_bytes: 546577898
 tx_broadcast: 20
 rx_broadcast: 4376
 tx_multicast: 0
 rx_multicast: 459
 tx_unicast: 2585993
 rx_unicast: 1550758
 tx_mac_pause: 1
 rx_mac_pause: 0
 collisions: 0
 late_collision: 0
 aborted: 0
 single_collisions: 0
 multi_collisions: 0
 rx_short: 0
 rx_runt: 0
 rx_64_byte_packets: 850693
 rx_65_to_127_byte_packets: 297029
 rx_128_to_255_byte_packets: 62116
 rx_256_to_511_byte_packets: 28795
 rx_512_to_1023_byte_packets: 31357
 rx_1024_to_1518_byte_packets: 285603
 rx_1518_to_max_byte_packets: 0
 rx_too_long: 0
 rx_fifo_overflow: 0
 rx_jabber: 0
 rx_fcs_error: 0
 tx_64_byte_packets: 194159
 tx_65_to_127_byte_packets: 239961
 tx_128_to_255_byte_packets: 48148
 tx_256_to_511_byte_packets: 27635
 tx_512_to_1023_byte_packets: 95557
 tx_1024_to_1518_byte_packets: 1980554
 tx_1519_to_max_byte_packets: 0
 tx_fifo_underrun: 0

time to try the vendor driver and see if that provides any clues.

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S
> eth0). When there were flow control bugs, they would show up as
> count of 1.
>
> Are you doing jumbo frames (MTU > 1500)?

i just did 'ethtool -S eth0' (the card is still working fine) and i
don't think there are any jumbo frames. anyway, this is the output:

NIC statistics:
 tx_bytes: 2697577533
 rx_bytes: 503104106
 tx_broadcast: 18
 rx_broadcast: 4068
 tx_multicast: 0
 rx_multicast: 416
 tx_unicast: 2276028
 rx_unicast: 1359009
 tx_mac_pause: 0
 rx_mac_pause: 0
 collisions: 0
 late_collision: 0
 aborted: 0
 single_collisions: 0
 multi_collisions: 0
 rx_short: 0
 rx_runt: 0
 rx_64_byte_packets: 713826
 rx_65_to_127_byte_packets: 271861
 rx_128_to_255_byte_packets: 57307
 rx_256_to_511_byte_packets: 25689
 rx_512_to_1023_byte_packets: 28502
 rx_1024_to_1518_byte_packets: 266308
 rx_1518_to_max_byte_packets: 0
 rx_too_long: 0
 rx_fifo_overflow: 0
 rx_jabber: 0
 rx_fcs_error: 0
 tx_64_byte_packets: 174188
 tx_65_to_127_byte_packets: 225242
 tx_128_to_255_byte_packets: 44294
 tx_256_to_511_byte_packets: 24475
 tx_512_to_1023_byte_packets: 80147
 tx_1024_to_1518_byte_packets: 1727700
 tx_1519_to_max_byte_packets: 0
 tx_fifo_underrun: 0

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

we'll see if the machine locks up again.

> Are you doing jumbo frames (MTU > 1500)?

no (or at least i don't think so). how can i tell?
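
one way to check, sketched below: the configured MTU of an interface can
be read from sysfs (`ip link show eth0` or `ifconfig eth0` print the same
value). the helper names are made up for this example, and the sysfs root
is a parameter only so the function can be tried against a scratch
directory.

```python
from pathlib import Path

STANDARD_MTU = 1500  # anything above this counts as jumbo frames here


def read_mtu(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the configured MTU of an interface from sysfs."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())


def uses_jumbo_frames(mtu: int) -> bool:
    """True if the MTU is larger than the standard Ethernet payload size."""
    return mtu > STANDARD_MTU
```

on the machine in question, `uses_jumbo_frames(read_mtu("eth0"))` would
answer the question directly.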

assuming the machine doesn't lock up with msi interrupts disabled, do
you want me to do anything to debug why the driver locks up when the
msi interrupts are enabled?

--alex--



Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Stephen Hemminger
On Thu, 14 Dec 2006 14:25:06 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
> 
> > 4) What is the IRQ routing?
> >    There are two issues here, first the driver will never work with edge
> >    trigger IRQ's, some motherboards also have busted BIOS and chipsets
> >    that don't do MSI properly. A couple of module parameters are available
> >    to help:
> >       disable_msi=1     avoids using MSI
> >       idle_timeout=10   polls for lost IRQ's every N ms (10)
> 
> it didn't take long to lock up the machine again. i've rebooted back
> into stock 2.6.20-rc1 and added the two module parameters above. cat
> /proc/interrupts now gives me:
> 
>  17:203   IO-APIC-fasteoi   eth0, CMI8738
> 
> so i guess the MSI interrupts are disabled. we'll see how this works.

probably won't do much but now the IRQ ends up shared.

> > 5) What are the messages in the console log when problem happens?
> 
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
> kernel: sky2 status report lost?

The transmit timeout code tries to be smart, but doesn't really
recover properly if hardware is stuck.


> > 7) Please get a current version of ethtool from:
> >    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> >    and run ethtool register dump after a problem occurs:
> >       ethtool -d eth0
> 
> this is the output after it stopped working:
> 
> 
> PCI config
> --
> 00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00
> 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
> 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
> 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
> 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
> 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
> 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Control Registers
> -
> Register Access Port 0x00
> LED Control/Status   0xA603164A
> Interrupt Source 0x4000
> Interrupt Mask   0xC01D
> Interrupt Hardware Error Source  0x
> Interrupt Hardware Error Mask0x2E003F3F
> 
> Bus Management Unit
> ---
> CSR Receive Queue 1  0x0001
> CSR Sync Queue 1 0x
> CSR Async Queue 10x
> 
> MAC Addresses
> ---
> Addr 1   00 11 09 DA 39 A3
> Addr 2   00 11 09 DA 39 A3
> Addr 3   00 00 00 00 00 00
> 
> Connector type   0x4A (J)
> PMD type 0x54 (T)
> PHY type 0x80
> Chip Id  0xB6 Yukon-2 EC
>  (rev 0)
> Ram Buffer   0x0C
> 
> Status BMU:
> ---
> Control0x0002220A
> Last Index 0x07FF
> Put Index  0x0601
> List Address   0x7FBF8000
> Transmit 1 done index  0x0196
> Transmit index threshold   0x000A
> 
> Status FIFO
>   Write Pointer0x16
>   Read Pointer 0x16
>   Level0x00
>   Watermark0x10
>   ISR Watermark0x10
> Status level
>   Init 0x30D4 Value 0x0D00
>   Test 0x04   Control 0x02
> TX status
>   Init 0x0001E848 Value 0x0001E848
>   Test 0x04   Control 0x02
> ISR
>   Init 0x09C4 Value 0x09C4
>   Test 0x04   Control 0x02
> 
> GMAC control 0x005A
> GPHY control 0x2002
> LINK control 0x02
> 
> GMAC 1
> Status   0xD000
> Control  0x1800
> Transmit 0x1000
> Receive  0xE000
> Transmit flow control0x
> Transmit parameter   0xD7C4
> Serial mode  0x221E
>   Source address:  00 11 09 DA 39 A3
> Physical address:  00 11 09 DA 39 A3
> 
> Rx GMAC 1
> End Address  0x007F
> Almost Full Thresh   0x0070
> Control/Test 0x0900228A
> FIFO Flush Mask  0x18FB
> FIFO Flush Threshold 0x000B
> Truncation Threshold 0x017C
> Upper Pause Threshold0x
> Lower Pause Threshold0x0081
> VLAN Tag 0x0074
> FIFO Write Pointer   0x
> FIFO Write Level 0x007B
> FIFO Read Pointer0x
> FIFO Read Level  0x0079
> 
> Tx GMAC 1
> End Address  0x007F
> Almost Full Thresh   0x0010
> Control/Test 0x0102220A
> FIFO Flush Mask  0x
> FIFO Flush Threshold 0x
> Truncation Thres

Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> 4) What is the IRQ routing?
>    There are two issues here, first the driver will never work with edge
>    trigger IRQ's, some motherboards also have busted BIOS and chipsets
>    that don't do MSI properly. A couple of module parameters are available
>    to help:
>       disable_msi=1     avoids using MSI
>       idle_timeout=10   polls for lost IRQ's every N ms (10)

it didn't take long to lock up the machine again. i've rebooted back
into stock 2.6.20-rc1 and added the two module parameters above. cat
/proc/interrupts now gives me:

 17:203   IO-APIC-fasteoi   eth0, CMI8738

so i guess the MSI interrupts are disabled. we'll see how this works.
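
a small sketch of how that line can be read mechanically. it assumes the
simple "irq: count controller devices" layout shown above, treats any
controller name containing "MSI" as message-signalled, and the function
name is invented for this example.

```python
def describe_irq_line(line: str) -> dict:
    """Summarise one /proc/interrupts line: IRQ, count, MSI or not, devices."""
    irq, _, rest = line.strip().partition(":")
    fields = rest.split()
    # Leading numeric fields are the per-CPU interrupt counts.
    counts = []
    while fields and fields[0].isdigit():
        counts.append(int(fields.pop(0)))
    controller = fields.pop(0) if fields else ""
    # Whatever is left is the comma-separated list of devices on this IRQ.
    devices = [d.strip() for d in " ".join(fields).split(",") if d.strip()]
    return {
        "irq": irq.strip(),
        "count": sum(counts),
        "msi": "MSI" in controller,
        "devices": devices,
        "shared": len(devices) > 1,
    }


SAMPLE = " 17:203   IO-APIC-fasteoi   eth0, CMI8738"
print(describe_irq_line(SAMPLE))
```

for the line above this reports a non-MSI interrupt that eth0 shares with
the CMI8738 sound device.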

> 5) What are the messages in the console log when problem happens?

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 361 .. 321 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 366 report=406 done=406
kernel: sky2 hardware hung? flushing
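
the two follow-up messages alternate in a regular way, and a short sketch
makes the pattern visible. this is inferred from the log lines in this
thread rather than copied from the driver source: it assumes the first
number after "transmit ring" is the driver's consumer index, and the
function name and the label for the report != done case (which never
shows up here) are made up.

```python
import re

RING_RE = re.compile(r"transmit ring (\d+) \.\. (\d+) report=(\d+) done=(\d+)")


def classify_tx_timeout(line: str) -> str:
    """Guess which follow-up message goes with a 'transmit ring' log line."""
    m = RING_RE.search(line)
    if m is None:
        raise ValueError("not a sky2 transmit ring line")
    cons, _prod, report, done = map(int, m.groups())
    if report != done:
        # Not observed in this thread; neutral placeholder label.
        return "report and done differ"
    if report != cons:
        # Hardware reported completions the driver has not processed yet.
        return "status report lost?"
    return "hardware hung? flushing"


for entry in (
    "kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406",
    "kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406",
):
    print(classify_tx_timeout(entry))
```

read this way, "status report lost?" appears when the ring start lags
behind report, and "hardware hung? flushing" when the two already match.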

> 7) Please get a current version of ethtool from:
>    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
>    and run ethtool register dump after a problem occurs:
>       ethtool -d eth0

this is the output after it stopped working:


PCI config
--
00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00
10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Control Registers
-
Register Access Port 0x00
LED Control/Status   0xA603164A
Interrupt Source 0x4000
Interrupt Mask   0xC01D
Interrupt Hardware Error Source  0x
Interrupt Hardware Error Mask0x2E003F3F

Bus Management Unit
---
CSR Receive Queue 1  0x0001
CSR Sync Queue 1 0x
CSR Async Queue 10x

MAC Addresses
---
Addr 1   00 11 09 DA 39 A3
Addr 2   00 11 09 DA 39 A3
Addr 3   00 00 00 00 00 00

Connector type   0x4A (J)
PMD type 0x54 (T)
PHY type 0x80
Chip Id  0xB6 Yukon-2 EC
 (rev 0)
Ram Buffer   0x0C

Status BMU:
---
Control0x0002220A
Last Index 0x07FF
Put Index  0x0601
List Address   0x7FBF8000
Transmit 1 done index  0x0196
Transmit index threshold   0x000A

Status FIFO
Write Pointer0x16
Read Pointer 0x16
Level0x00
Watermark0x10
ISR Watermark0x10
Status level
Init 0x30D4 Value 0x0D00
Test 0x04   Control 0x02
TX status
Init 0x0001E848 Value 0x0001E848
Test 0x04   Control 0x02
ISR
Init 0x09C4 Value 0x09C4
Test 0x04   Control 0x02

GMAC control 0x005A
GPHY control 0x2002
LINK control 0x02

GMAC 1
Status   0xD000
Control  0x1800
Transmit 0x1000
Receive  0xE000
Transmit flow control0x
Transmit parameter   0xD7C4
Serial mode  0x221E
  Source address:  00 11 09 DA 39 A3
Physical address:  00 11 09 DA 39 A3

Rx GMAC 1
End Address  0x007F
Almost Full Thresh   0x0070
Control/Test 0x0900228A
FIFO Flush Mask  0x18FB
FIFO Flush Threshold 0x000B
Truncation Threshold 0x017C
Upper Pause Threshold0x
Lower Pause Threshold0x0081
VLAN Tag 0x0074
FIFO Write Pointer   0x
FIFO Write Level 0x007B
FIFO Read Pointer0x
FIFO Read Level  0x0079

Tx GMAC 1
End Address  0x007F
Almost Full Thresh   0x0010
Control/Test 0x0102220A
FIFO Flush Mask  0x
FIF

Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Alex Romosan
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> On Thu, 14 Dec 2006 12:47:05 -0800
> Alex Romosan <[EMAIL PROTECTED]> wrote:
>
>> under heavy network load the sky2 driver (compiled in the kernel)
>> locks up and the only way i can get the network back is to reboot the
>> machine (bringing the network down and back up again doesn't help).
>> this happens on an amd64 machine (athlon 3500+ processor) and the card
>> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
>> Ethernet Controller (rev 15) (from lspci). this is what i see in the
>> syslog:
>> 
>> kernel: sky2 eth0: rx error, status 0x414a414a length 0
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:[] __skb_checksum_complete+0x4d/0x66
>> kernel:  [] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [] raw_rcv_skb+0x9/0x20
>> kernel:  [] raw_rcv+0xbe/0xc4
>> kernel:  [] ip_local_deliver+0x170/0x21b
>> kernel:  [] ip_rcv+0x478/0x4ab
>> kernel:  [] netif_receive_skb+0x184/0x20e
>> kernel:  [] sky2_poll+0x68f/0x93c
>> kernel:  [] scheduler_tick+0x23/0x2f9
>> kernel:  [] net_rx_action+0x61/0xf0
>> kernel:  [] __do_softirq+0x40/0x8a
>> kernel:  [] call_softirq+0x1c/0x28
>> kernel:  [] do_softirq+0x2c/0x7d
>> kernel:  [] irq_exit+0x36/0x42
>> kernel:  [] do_IRQ+0x8c/0x9e
>> kernel:  [] default_idle+0x0/0x3a
>> kernel:  [] ret_from_intr+0x0/0xa
>> kernel:[] default_idle+0x26/0x3a
>> kernel:  [] cpu_idle+0x42/0x75
>> kernel:  [] start_kernel+0x1ce/0x1d3
>> kernel:  [] _sinittext+0x140/0x144
>> kernel: 
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:[] __skb_checksum_complete+0x4d/0x66
>> kernel:  [] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [] raw_rcv_skb+0x9/0x20
>> kernel:  [] raw_rcv+0xbe/0xc4
>> kernel:  [] ip_local_deliver+0x170/0x21b
>> kernel:  [] ip_rcv+0x478/0x4ab
>> kernel:  [] netif_receive_skb+0x184/0x20e
>> kernel:  [] sky2_poll+0x68f/0x93c
>> kernel:  [] tcp_delack_timer+0x0/0x1b5
>> kernel:  [] net_rx_action+0x61/0xf0
>> kernel:  [] __do_softirq+0x40/0x8a
>> kernel:  [] call_softirq+0x1c/0x28
>> kernel:  [] do_softirq+0x2c/0x7d
>> kernel:  [] irq_exit+0x36/0x42
>> kernel:  [] do_IRQ+0x8c/0x9e
>> kernel:  [] ret_from_intr+0x0/0xa
>> kernel:[] inode2sd+0x104/0x117
>> kernel:  [] search_by_key+0xa08/0xbfe
>> kernel:  [] search_by_key+0x183/0xbfe
>> kernel:  [] ll_rw_block+0x89/0x9e
>> kernel:  [] search_by_key+0x183/0xbfe
>> kernel:  [] __find_get_block_slow+0x101/0x10d
>> kernel:  [] __find_get_block+0x197/0x1a5
>> kernel:  [] inode_get_bytes+0x2a/0x52
>> kernel:  [] reiserfs_update_sd_size+0x7e/0x284
>> kernel:  [] kthread+0xed/0xfd
>> kernel:  [] do_journal_end+0x34b/0xbdd
>> kernel:  [] reiserfs_dirty_inode+0x56/0x76
>> kernel:  [] block_prepare_write+0x1a/0x24
>> kernel:  [] __mark_inode_dirty+0x29/0x197
>> kernel:  [] reiserfs_commit_write+0x10d/0x19f
>> kernel:  [] block_prepare_write+0x1a/0x24
>> kernel:  [] generic_file_buffered_write+0x4ad/0x6c4
>> kernel:  [] __pollwait+0x0/0xe0
>> kernel:  [] current_fs_time+0x35/0x3b
>> kernel:  [] __generic_file_aio_write_nolock+0x379/0x3ec
>> kernel:  [] unix_dgram_recvmsg+0x1be/0x1d9
>> kernel:  [] __mutex_lock_slowpath+0x205/0x210
>> kernel:  [] generic_file_aio_write+0x61/0xc1
>> kernel:  [] generic_file_aio_write+0x0/0xc1
>> kernel:  [] do_sync_readv_writev+0xc0/0x107
>> kernel:  [] autoremove_wake_function+0x0/0x2e
>> kernel:  [] getnstimeofday+0x10/0x28
>> kernel:  [] rw_copy_check_uvector+0x6c/0xdc
>> kernel:  [] do_readv_writev+0xb2/0x18b
>> kernel:  [] sys_writev+0x45/0x93
>> kernel:  [] system_call+0x7e/0x83
>> 
>> and so on. sometimes i don't get this trace but instead i get:
>> 
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
>> kernel: sky2 status report lost?
>> kernel: NETDEV WATCHDOG: eth0: transmit timed out
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
>> kernel: sky2 hardware hung? flushing
>> 
> Please report these problems to netdev@vger.kernel.org, I rarely go
> looking in LKML.
>
> These are the things you need to debug a sky2 related problem.
>
> 1) What is exact kernel version in use?  This is important because
>    problems get fixed but it can be a long while until the fix bubbles down
>    to the vendor kernels.

this is stock kernel.org kernel version 2.6.20-rc1 i downloaded this
morning. 2.6.19 and 2.6.19-rc6 i referred to in my original message
were also downloaded from kernel.org.

> 2) What is the chip version?  The driver prints this out on boot up in
>    the console log.   (dmesg | grep sky2)
>    This matters because each chip version has different
>    bugs to deal with.

sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:09:da:39:a3
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
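
for anyone collecting this from several machines, a sketch of pulling the
fields out of the first line. the pattern is modelled only on the one
probe line above, so other driver versions may print something different;
the names are invented for this example.

```python
import re

PROBE_RE = re.compile(
    r"sky2 v(?P<driver>\S+) addr (?P<addr>0x[0-9a-f]+) irq (?P<irq>\d+) "
    r"(?P<chip>\S+) \((?P<chip_id>0x[0-9a-f]+)\) rev (?P<rev>\d+)"
)


def parse_probe_line(line: str):
    """Return the probe fields as a dict of strings, or None if no match."""
    m = PROBE_RE.search(line)
    return m.groupdict() if m else None


print(parse_probe_line("sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1"))
```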


> 3) Does it work with the vendor driver?
>The vendor driver does a number of things differently than 

Re: 2.6.20-rc1 sky2 problems (regression?)

2006-12-14 Thread Stephen Hemminger
On Thu, 14 Dec 2006 12:47:05 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> under heavy network load the sky2 driver (compiled in the kernel)
> locks up and the only way i can get the network back is to reboot the
> machine (bringing the network down and back up again doesn't help).
> this happens on an amd64 machine (athlon 3500+ processor) and the card
> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
> Ethernet Controller (rev 15) (from lspci). this is what i see in the
> syslog:
> 
> kernel: sky2 eth0: rx error, status 0x414a414a length 0
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:[] __skb_checksum_complete+0x4d/0x66
> kernel:  [] tcp_v4_rcv+0x147/0x8ea
> kernel:  [] raw_rcv_skb+0x9/0x20
> kernel:  [] raw_rcv+0xbe/0xc4
> kernel:  [] ip_local_deliver+0x170/0x21b
> kernel:  [] ip_rcv+0x478/0x4ab
> kernel:  [] netif_receive_skb+0x184/0x20e
> kernel:  [] sky2_poll+0x68f/0x93c
> kernel:  [] scheduler_tick+0x23/0x2f9
> kernel:  [] net_rx_action+0x61/0xf0
> kernel:  [] __do_softirq+0x40/0x8a
> kernel:  [] call_softirq+0x1c/0x28
> kernel:  [] do_softirq+0x2c/0x7d
> kernel:  [] irq_exit+0x36/0x42
> kernel:  [] do_IRQ+0x8c/0x9e
> kernel:  [] default_idle+0x0/0x3a
> kernel:  [] ret_from_intr+0x0/0xa
> kernel:[] default_idle+0x26/0x3a
> kernel:  [] cpu_idle+0x42/0x75
> kernel:  [] start_kernel+0x1ce/0x1d3
> kernel:  [] _sinittext+0x140/0x144
> kernel: 
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:[] __skb_checksum_complete+0x4d/0x66
> kernel:  [] tcp_v4_rcv+0x147/0x8ea
> kernel:  [] raw_rcv_skb+0x9/0x20
> kernel:  [] raw_rcv+0xbe/0xc4
> kernel:  [] ip_local_deliver+0x170/0x21b
> kernel:  [] ip_rcv+0x478/0x4ab
> kernel:  [] netif_receive_skb+0x184/0x20e
> kernel:  [] sky2_poll+0x68f/0x93c
> kernel:  [] tcp_delack_timer+0x0/0x1b5
> kernel:  [] net_rx_action+0x61/0xf0
> kernel:  [] __do_softirq+0x40/0x8a
> kernel:  [] call_softirq+0x1c/0x28
> kernel:  [] do_softirq+0x2c/0x7d
> kernel:  [] irq_exit+0x36/0x42
> kernel:  [] do_IRQ+0x8c/0x9e
> kernel:  [] ret_from_intr+0x0/0xa
> kernel:[] inode2sd+0x104/0x117
> kernel:  [] search_by_key+0xa08/0xbfe
> kernel:  [] search_by_key+0x183/0xbfe
> kernel:  [] ll_rw_block+0x89/0x9e
> kernel:  [] search_by_key+0x183/0xbfe
> kernel:  [] __find_get_block_slow+0x101/0x10d
> kernel:  [] __find_get_block+0x197/0x1a5
> kernel:  [] inode_get_bytes+0x2a/0x52
> kernel:  [] reiserfs_update_sd_size+0x7e/0x284
> kernel:  [] kthread+0xed/0xfd
> kernel:  [] do_journal_end+0x34b/0xbdd
> kernel:  [] reiserfs_dirty_inode+0x56/0x76
> kernel:  [] block_prepare_write+0x1a/0x24
> kernel:  [] __mark_inode_dirty+0x29/0x197
> kernel:  [] reiserfs_commit_write+0x10d/0x19f
> kernel:  [] block_prepare_write+0x1a/0x24
> kernel:  [] generic_file_buffered_write+0x4ad/0x6c4
> kernel:  [] __pollwait+0x0/0xe0
> kernel:  [] current_fs_time+0x35/0x3b
> kernel:  [] __generic_file_aio_write_nolock+0x379/0x3ec
> kernel:  [] unix_dgram_recvmsg+0x1be/0x1d9
> kernel:  [] __mutex_lock_slowpath+0x205/0x210
> kernel:  [] generic_file_aio_write+0x61/0xc1
> kernel:  [] generic_file_aio_write+0x0/0xc1
> kernel:  [] do_sync_readv_writev+0xc0/0x107
> kernel:  [] autoremove_wake_function+0x0/0x2e
> kernel:  [] getnstimeofday+0x10/0x28
> kernel:  [] rw_copy_check_uvector+0x6c/0xdc
> kernel:  [] do_readv_writev+0xb2/0x18b
> kernel:  [] sys_writev+0x45/0x93
> kernel:  [] system_call+0x7e/0x83
> 
> and so on. sometimes i don't get this trace but instead i get:
> 
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
> kernel: sky2 status report lost?
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
> kernel: sky2 hardware hung? flushing
> 
> but the end result is the same, the network card stops responding and
> i have to reboot the machine. i can reproduce this on a consistent
> basis so if there are any patches, i can try them out and see if they
> fix the problem.
> 
> this is probably not a regression per se as i saw it as well with
> 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to
> 2.6.19-rc6. suggestions, patches welcome. thanks.

Please report these problems to netdev@vger.kernel.org, I rarely go
looking in LKML.

These are the things you need to debug a sky2 related problem.

1) What is exact kernel version in use?  This is important because
   problems get fixed but it can be a long while until the fix bubbles down
   to the vendor kernels.

2) What is the chip version?  The driver prints this out on boot up in
   the console log.   (dmesg | grep sky2)
   This matters because each chip version has different
   bugs to deal with.

3) Does it work with the vendor driver?
   The vendor driver does a number of things differently than the sky2 driver
   and can mask problems, but if it doesn't work as well that is a useful
   data point.  If you want to know why the sky2