Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I would comment the message out. I added it to see how often the recovery
> was triggering..

i'll probably do that eventually. so far it's triggered 97 times in 249
seconds.

--alex--

--
| I believe the moment is at hand when, by a paranoiac and active |
| advance of the mind, it will be possible (simultaneously with   |
| automatism and other passive states) to systematize confusion   |
| and thus to help to discredit completely the world of reality.  |
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
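For scale, the figure above can be turned into a rate. This is only a back-of-the-envelope sketch: `recovery_rate` is a hypothetical helper, and the 250 ms default is an assumption taken from the remark elsewhere in this thread that the vendor driver resets "every 250ms when it is idle".

```python
def recovery_rate(count, seconds, poll_interval=0.25):
    """Summarise how often the recovery message fired.

    Returns (events per second, mean seconds between events,
    fraction of polls that ended in a recovery).
    poll_interval is an assumed 250 ms check period, not a measured value.
    """
    if count <= 0 or seconds <= 0:
        raise ValueError("count and seconds must be positive")
    per_second = count / seconds
    mean_gap = seconds / count
    polls = seconds / poll_interval
    return per_second, mean_gap, count / polls


# Figures reported in the message above: 97 recoveries in 249 seconds.
rate, gap, fraction = recovery_rate(97, 249)
print(f"{rate:.2f} recoveries/s, one every {gap:.2f} s, "
      f"{fraction:.1%} of assumed 250 ms polls")
```

Under that assumption roughly one poll in ten ends in a reset, which is enough to produce a steady stream of console messages.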
Re: 2.6.20-rc1 sky2 problems (regression?)
On Mon, 18 Dec 2006 10:24:59 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
>
> > I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> > version see:
> > http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
> >
> > It is too noisy in the console log, because it shows how many times
> > the driver dope slaps itself senseless... Basically every 250ms when
> > it is idle it resets, sorry it's the kind of code you write to "make it
> > work" and ship it which is why vendor drivers suck.
>
> i am running now with your fixed version. indeed, it is very noisy, i
> get a constant stream of:
>
> kernel: eth0: Attempting recovery
> kernel: eth0: receiver stuck?
>
> but it works. let's see how long it takes to fill up the root
> partition... :-(

I would comment the message out. I added it to see how often the recovery
was triggering..

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> version see:
> http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log, because it shows how many times
> the driver dope slaps itself senseless... Basically every 250ms when
> it is idle it resets, sorry it's the kind of code you write to "make it work"
> and ship it which is why vendor drivers suck.

i am running now with your fixed version. indeed, it is very noisy, i
get a constant stream of:

kernel: eth0: Attempting recovery
kernel: eth0: receiver stuck?

but it works. let's see how long it takes to fill up the root
partition... :-(

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> version see:
> http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log, because it shows how many times
> the driver dope slaps itself senseless... Basically every 250ms when
> it is idle it resets, sorry it's the kind of code you write to "make it work"
> and ship it which is why vendor drivers suck.

i'll give it a try on monday when i go back to work. in the meantime
i've been running with my "fixed" version of the vendor driver and so
far it's been working without any problems (i've been transferring
lots of data in and out of the computer the whole day).

if there is anything i can do to help debug the kernel sky2 driver let
me know.

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
On Thu, 14 Dec 2006 19:53:45 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
>
> > I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
>
> as per the include file, i ended up replacing all the CHECKSUM_HW with
> CHECKSUM_PARTIAL since the functions in question had to do with
> transmit. seems to be working so far without any lockups. we'll see
> how long this lasts.
>
> --alex--

I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
version see:
http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz

It is too noisy in the console log, because it shows how many times
the driver dope slaps itself senseless... Basically every 250ms when
it is idle it resets, sorry it's the kind of code you write to "make it work"
and ship it which is why vendor drivers suck.

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> I have a fixed up version of the vendor driver, I'll repackage it tomorrow.

as per the include file, i ended up replacing all the CHECKSUM_HW with
CHECKSUM_PARTIAL since the functions in question had to do with
transmit. seems to be working so far without any lockups. we'll see
how long this lasts.

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
On Fri, 15 Dec 2006 13:24:32 +1100
Herbert Xu <[EMAIL PROTECTED]> wrote:

> Alex Romosan <[EMAIL PROTECTED]> wrote:
> > /** does the HW need to evaluate checksum for TCP or UDP packets?
> > if (pMessage->ip_summed == CHECKSUM_HW)
> >
> > maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> >
> > /** TCP checksum offload
> > if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> >     (SetOpcodePacketFlag == SK_TRUE)
> >
> > i wonder if this is supposed to be CHECKSUM_COMPLETE
>
> The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.
>
> Cheers,

I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
Re: 2.6.20-rc1 sky2 problems (regression?)
Alex Romosan <[EMAIL PROTECTED]> wrote:
> /** does the HW need to evaluate checksum for TCP or UDP packets?
> if (pMessage->ip_summed == CHECKSUM_HW)
>
> maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
>
> /** TCP checksum offload
> if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
>     (SetOpcodePacketFlag == SK_TRUE)
>
> i wonder if this is supposed to be CHECKSUM_COMPLETE

The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> If this is repeatable... and mac_pause is always one then the
> problem is hardware flow control. I saw bugs before in the bus
> interface where it would not resume on unaligned buffer, but
> that was on receive.

i tried to switch over to the latest vendor driver but unfortunately
it doesn't work with kernel 2.6.19+. it still uses CHECKSUM_HW, which
looks like it was replaced by CHECKSUM_PARTIAL, and CHECKSUM_COMPLETE
was also added. i think i can replace CHECKSUM_HW in the marvell driver
with CHECKSUM_PARTIAL, except for a couple of places where i am not
sure what i am supposed to do. the first instance says (i am kind of
paraphrasing here since i am copying from the screen and not cutting
and pasting):

/** does the HW need to evaluate checksum for TCP or UDP packets?
if (pMessage->ip_summed == CHECKSUM_HW)

maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one

/** TCP checksum offload
if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
    (SetOpcodePacketFlag == SK_TRUE)

i wonder if this is supposed to be CHECKSUM_COMPLETE

if you have any suggestions, i'll appreciate it.

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
On Thu, 14 Dec 2006 15:21:00 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
>
> > Another useful bit of information is the statistics (ethtool -S eth0).
> > When there were flow control bugs, they would show up as count of 1.
>
> the driver locked up again, even with msi interrupts disabled and
> idle_timeout=10. the console message was pretty much as before:
>
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
> kernel: sky2 hardware hung? flushing
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
> kernel: sky2 status report lost?
>
> and this is the output from ethtool -S:
>
> NIC statistics:
>      tx_bytes: 3092123897
>      rx_bytes: 546577898
>      tx_broadcast: 20
>      rx_broadcast: 4376
>      tx_multicast: 0
>      rx_multicast: 459
>      tx_unicast: 2585993
>      rx_unicast: 1550758
>      tx_mac_pause: 1

If this is repeatable... and mac_pause is always one then the
problem is hardware flow control. I saw bugs before in the bus
interface where it would not resume on unaligned buffer, but
that was on receive.
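The check described above (look for a non-zero pause counter in the `ethtool -S` output) can be applied to saved output with a short script. This is only a sketch: `parse_ethtool_stats` and `pause_counters` are hypothetical helper names, and the sample is a subset of the counters quoted in this message.

```python
def parse_ethtool_stats(text):
    """Turn 'name: value' lines from `ethtool -S` into a dict of ints.

    Lines without a numeric value (such as the 'NIC statistics:'
    header) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def pause_counters(stats):
    """Return only the MAC pause counters that are non-zero."""
    return {name: count for name, count in stats.items()
            if name.endswith("_mac_pause") and count != 0}


# Subset of the counters reported after the lockup.
sample = """NIC statistics:
     tx_unicast: 2585993
     rx_unicast: 1550758
     tx_mac_pause: 1
     rx_mac_pause: 0
"""

print(pause_counters(parse_ethtool_stats(sample)))
```

Comparing the result from a snapshot taken while the card works with one taken after a lockup shows whether the pause counter changes only when the problem occurs.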
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

the driver locked up again, even with msi interrupts disabled and
idle_timeout=10. the console message was pretty much as before:

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
kernel: sky2 status report lost?

and this is the output from ethtool -S:

NIC statistics:
     tx_bytes: 3092123897
     rx_bytes: 546577898
     tx_broadcast: 20
     rx_broadcast: 4376
     tx_multicast: 0
     rx_multicast: 459
     tx_unicast: 2585993
     rx_unicast: 1550758
     tx_mac_pause: 1
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 850693
     rx_65_to_127_byte_packets: 297029
     rx_128_to_255_byte_packets: 62116
     rx_256_to_511_byte_packets: 28795
     rx_512_to_1023_byte_packets: 31357
     rx_1024_to_1518_byte_packets: 285603
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 194159
     tx_65_to_127_byte_packets: 239961
     tx_128_to_255_byte_packets: 48148
     tx_256_to_511_byte_packets: 27635
     tx_512_to_1023_byte_packets: 95557
     tx_1024_to_1518_byte_packets: 1980554
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

time to try the vendor driver and see if that provides any clues.

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S
> eth0). When there were flow control bugs, they would show up as
> count of 1.
>
> Are you doing jumbo frames (MTU > 1500)?

i just did 'ethtool -S eth0' (the card is still working fine) and i
don't think there are any jumbo frames. anyway, this is the output:

NIC statistics:
     tx_bytes: 2697577533
     rx_bytes: 503104106
     tx_broadcast: 18
     rx_broadcast: 4068
     tx_multicast: 0
     rx_multicast: 416
     tx_unicast: 2276028
     rx_unicast: 1359009
     tx_mac_pause: 0
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 713826
     rx_65_to_127_byte_packets: 271861
     rx_128_to_255_byte_packets: 57307
     rx_256_to_511_byte_packets: 25689
     rx_512_to_1023_byte_packets: 28502
     rx_1024_to_1518_byte_packets: 266308
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 174188
     tx_65_to_127_byte_packets: 225242
     tx_128_to_255_byte_packets: 44294
     tx_256_to_511_byte_packets: 24475
     tx_512_to_1023_byte_packets: 80147
     tx_1024_to_1518_byte_packets: 1727700
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

--alex--
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

we'll see if the machine locks up again.

> Are you doing jumbo frames (MTU > 1500)?

no (or at least i don't think so). how can i tell?

assuming the machine doesn't lock up with msi interrupts disabled, do
you want me to do anything to debug why the driver locks up when the
msi interrupts are enabled?

--alex--
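On the "how can i tell?" question: one way to check is to read the interface MTU and compare it with 1500. The sketch below assumes the kernel exposes the value in sysfs as `/sys/class/net/<interface>/mtu`; `read_mtu` and `uses_jumbo_frames` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
from pathlib import Path


def read_mtu(ifname, sysfs_root="/sys/class/net"):
    """Read the MTU of an interface from sysfs (path layout assumed)."""
    return int((Path(sysfs_root) / ifname / "mtu").read_text().strip())


def uses_jumbo_frames(mtu, standard_mtu=1500):
    """Jumbo frames in the sense asked above: MTU greater than 1500."""
    return mtu > standard_mtu
```

On the affected machine, `uses_jumbo_frames(read_mtu("eth0"))` would return False for a default Ethernet configuration. The all-zero `rx_1518_to_max_byte_packets` and `tx_1519_to_max_byte_packets` counters in the `ethtool -S` output elsewhere in the thread point the same way.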
Re: 2.6.20-rc1 sky2 problems (regression?)
On Thu, 14 Dec 2006 14:25:06 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger <[EMAIL PROTECTED]> writes:
>
> > 4) What is the IRQ routing?
> >    There are two issues here, first the driver will never work with edge
> >    trigger IRQ's, some motherboards also have busted BIOS and chipsets
> >    that don't do MSI properly. A couple of module parameters are available
> >    to help:
> >        disable_msi=1 avoids using MSI
> >        idle_timeout=10 polls for lost IRQ's every N ms (10)
>
> it didn't take long to lock up the machine again. i've rebooted back
> into stock 2.6.20-rc1 and added the two module parameters above. cat
> /proc/interrupts now gives me:
>
>  17:  203  IO-APIC-fasteoi  eth0, CMI8738
>
> so i guess the MSI interrupts are disabled. we'll see how this works.

probably won't do much but now the IRQ ends up shared.

> > 5) What are the messages in the console log when problem happens?
>
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
> kernel: sky2 status report lost?

The transmit timeout code tries to be smart, but doesn't really recover
properly if hardware is stuck.
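The transmit timeout messages quoted in this thread follow a visible pattern: when the first ring index equals `report`, the next line is "hardware hung? flushing", and when it differs, the next line is "status report lost?". The sketch below encodes that observation. It is inferred from the log lines in this thread only, not taken from the driver source, and `classify_tx_timeout` is a hypothetical helper.

```python
import re

# Matches e.g. "sky2 eth0: transmit ring 402 .. 361 report=406 done=406"
RING_RE = re.compile(
    r"transmit ring (\d+) \.\. (\d+) report=(\d+) done=(\d+)")


def classify_tx_timeout(line):
    """Guess which diagnostic follows a 'transmit ring' log line.

    Returns None when the line is not a 'transmit ring' message.
    """
    match = RING_RE.search(line)
    if match is None:
        return None
    first, _second, report, done = map(int, match.groups())
    if report != done:
        # This combination does not appear in the thread.
        return "report and done differ"
    if report != first:
        return "status report lost?"
    return "hardware hung? flushing"


for entry in (
    "kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406",
    "kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406",
):
    print(classify_tx_timeout(entry))
```

Run over a saved syslog, this makes it easy to see the two diagnostics alternating while `report` and `done` stay fixed, as they do in the lockups reported here.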
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> 4) What is the IRQ routing?
>    There are two issues here, first the driver will never work with edge
>    trigger IRQ's, some motherboards also have busted BIOS and chipsets
>    that don't do MSI properly. A couple of module parameters are available
>    to help:
>        disable_msi=1 avoids using MSI
>        idle_timeout=10 polls for lost IRQ's every N ms (10)

it didn't take long to lock up the machine again. i've rebooted back
into stock 2.6.20-rc1 and added the two module parameters above. cat
/proc/interrupts now gives me:

 17:  203  IO-APIC-fasteoi  eth0, CMI8738

so i guess the MSI interrupts are disabled. we'll see how this works.

> 5) What are the messages in the console log when problem happens?

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 361 .. 321 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 366 report=406 done=406
kernel: sky2 hardware hung? flushing

> 7) Please get a current version of ethtool from:
>    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
>    and run ethtool register dump after a problem occurs:
>        ethtool -d eth0

this is the output after it stopped working:

PCI config
----------
00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00
10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Control Registers
-----------------
Register Access Port             0x00
LED Control/Status               0xA603164A
Interrupt Source                 0x4000
Interrupt Mask                   0xC01D
Interrupt Hardware Error Source  0x
Interrupt Hardware Error Mask    0x2E003F3F

Bus Management Unit
-------------------
CSR Receive Queue 1              0x0001
CSR Sync Queue 1                 0x
CSR Async Queue 1                0x

MAC Addresses
-------------
Addr 1            00 11 09 DA 39 A3
Addr 2            00 11 09 DA 39 A3
Addr 3            00 00 00 00 00 00

Connector type               0x4A (J)
PMD type                     0x54 (T)
PHY type                     0x80
Chip Id                      0xB6 Yukon-2 EC (rev 0)
Ram Buffer                   0x0C

Status BMU:
-----------
Control                      0x0002220A
Last Index                   0x07FF
Put Index                    0x0601
List Address                 0x7FBF8000
Transmit 1 done index        0x0196
Transmit index threshold     0x000A

Status FIFO
  Write Pointer    0x16
  Read Pointer     0x16
  Level            0x00
  Watermark        0x10
  ISR Watermark    0x10
Status level
  Init 0x30D4 Value 0x0D00
  Test 0x04 Control 0x02
TX status
  Init 0x0001E848 Value 0x0001E848
  Test 0x04 Control 0x02
ISR
  Init 0x09C4 Value 0x09C4
  Test 0x04 Control 0x02

GMAC control             0x005A
GPHY control             0x2002
LINK control             0x02

GMAC 1
Status                   0xD000
Control                  0x1800
Transmit                 0x1000
Receive                  0xE000
Transmit flow control    0x
Transmit parameter       0xD7C4
Serial mode              0x221E
Source address: 00 11 09 DA 39 A3
Physical address: 00 11 09 DA 39 A3

Rx GMAC 1
End Address              0x007F
Almost Full Thresh       0x0070
Control/Test             0x0900228A
FIFO Flush Mask          0x18FB
FIFO Flush Threshold     0x000B
Truncation Threshold     0x017C
Upper Pause Threshold    0x
Lower Pause Threshold    0x0081
VLAN Tag                 0x0074
FIFO Write Pointer       0x
FIFO Write Level         0x007B
FIFO Read Pointer        0x
FIFO Read Level          0x0079

Tx GMAC 1
End Address              0x007F
Almost Full Thresh       0x0010
Control/Test             0x0102220A
FIFO Flush Mask          0x
FIF
Re: 2.6.20-rc1 sky2 problems (regression?)
Stephen Hemminger <[EMAIL PROTECTED]> writes:

> Please report these problems to netdev@vger.kernel.org, I rarely go
> looking in LKML.
>
> These are the things you need to debug a sky2 related problem.
>
> 1) What is exact kernel version in use? This is important because
>    problems get fixed but it can be a long while until the fix bubbles down
>    to the vendor kernels.

this is stock kernel.org kernel version 2.6.20-rc1 i downloaded this
morning. 2.6.19 and 2.6.19-rc6 i referred to in my original message
were also downloaded from kernel.org.

> 2) What is the chip version? The driver prints this out on boot up in
>    the console log. (dmesg | grep sky2)
>    This matters because each chip version has different
>    bugs to deal with.

sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:09:da:39:a3
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both

> 3) Does it work with the vendor driver?
>    The vendor driver does a number of things differently than
Re: 2.6.20-rc1 sky2 problems (regression?)
On Thu, 14 Dec 2006 12:47:05 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> under heavy network load the sky2 driver (compiled in the kernel)
> locks up and the only way i can get the network back is to reboot the
> machine (bringing the network down and back up again doesn't help).
> this happens on an amd64 machine (athlon 3500+ processor) and the card
> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
> Ethernet Controller (rev 15) (from lspci). this is what i see in the
> syslog:
>
> kernel: sky2 eth0: rx error, status 0x414a414a length 0
> kernel: eth0: hw csum failure.
> kernel:
> kernel: Call Trace:
> kernel:  [] __skb_checksum_complete+0x4d/0x66
> kernel:  [] tcp_v4_rcv+0x147/0x8ea
> kernel:  [] raw_rcv_skb+0x9/0x20
> kernel:  [] raw_rcv+0xbe/0xc4
> kernel:  [] ip_local_deliver+0x170/0x21b
> kernel:  [] ip_rcv+0x478/0x4ab
> kernel:  [] netif_receive_skb+0x184/0x20e
> kernel:  [] sky2_poll+0x68f/0x93c
> kernel:  [] scheduler_tick+0x23/0x2f9
> kernel:  [] net_rx_action+0x61/0xf0
> kernel:  [] __do_softirq+0x40/0x8a
> kernel:  [] call_softirq+0x1c/0x28
> kernel:  [] do_softirq+0x2c/0x7d
> kernel:  [] irq_exit+0x36/0x42
> kernel:  [] do_IRQ+0x8c/0x9e
> kernel:  [] default_idle+0x0/0x3a
> kernel:  [] ret_from_intr+0x0/0xa
> kernel:  [] default_idle+0x26/0x3a
> kernel:  [] cpu_idle+0x42/0x75
> kernel:  [] start_kernel+0x1ce/0x1d3
> kernel:  [] _sinittext+0x140/0x144
> kernel:
> kernel: eth0: hw csum failure.
> kernel:
> kernel: Call Trace:
> kernel:  [] __skb_checksum_complete+0x4d/0x66
> kernel:  [] tcp_v4_rcv+0x147/0x8ea
> kernel:  [] raw_rcv_skb+0x9/0x20
> kernel:  [] raw_rcv+0xbe/0xc4
> kernel:  [] ip_local_deliver+0x170/0x21b
> kernel:  [] ip_rcv+0x478/0x4ab
> kernel:  [] netif_receive_skb+0x184/0x20e
> kernel:  [] sky2_poll+0x68f/0x93c
> kernel:  [] tcp_delack_timer+0x0/0x1b5
> kernel:  [] net_rx_action+0x61/0xf0
> kernel:  [] __do_softirq+0x40/0x8a
> kernel:  [] call_softirq+0x1c/0x28
> kernel:  [] do_softirq+0x2c/0x7d
> kernel:  [] irq_exit+0x36/0x42
> kernel:  [] do_IRQ+0x8c/0x9e
> kernel:  [] ret_from_intr+0x0/0xa
> kernel:  [] inode2sd+0x104/0x117
> kernel:  [] search_by_key+0xa08/0xbfe
> kernel:  [] search_by_key+0x183/0xbfe
> kernel:  [] ll_rw_block+0x89/0x9e
> kernel:  [] search_by_key+0x183/0xbfe
> kernel:  [] __find_get_block_slow+0x101/0x10d
> kernel:  [] __find_get_block+0x197/0x1a5
> kernel:  [] inode_get_bytes+0x2a/0x52
> kernel:  [] reiserfs_update_sd_size+0x7e/0x284
> kernel:  [] kthread+0xed/0xfd
> kernel:  [] do_journal_end+0x34b/0xbdd
> kernel:  [] reiserfs_dirty_inode+0x56/0x76
> kernel:  [] block_prepare_write+0x1a/0x24
> kernel:  [] __mark_inode_dirty+0x29/0x197
> kernel:  [] reiserfs_commit_write+0x10d/0x19f
> kernel:  [] block_prepare_write+0x1a/0x24
> kernel:  [] generic_file_buffered_write+0x4ad/0x6c4
> kernel:  [] __pollwait+0x0/0xe0
> kernel:  [] current_fs_time+0x35/0x3b
> kernel:  [] __generic_file_aio_write_nolock+0x379/0x3ec
> kernel:  [] unix_dgram_recvmsg+0x1be/0x1d9
> kernel:  [] __mutex_lock_slowpath+0x205/0x210
> kernel:  [] generic_file_aio_write+0x61/0xc1
> kernel:  [] generic_file_aio_write+0x0/0xc1
> kernel:  [] do_sync_readv_writev+0xc0/0x107
> kernel:  [] autoremove_wake_function+0x0/0x2e
> kernel:  [] getnstimeofday+0x10/0x28
> kernel:  [] rw_copy_check_uvector+0x6c/0xdc
> kernel:  [] do_readv_writev+0xb2/0x18b
> kernel:  [] sys_writev+0x45/0x93
> kernel:  [] system_call+0x7e/0x83
>
> and so on. some times i don't get this trace but instead i get:
>
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
> kernel: sky2 status report lost?
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
> kernel: sky2 hardware hung? flushing
>
> but the end result is the same, the network card stops responding and
> i have to reboot the machine. i can reproduce this on a consistent
> basis so if there are any patches, i can try them out and see if they
> fix the problem.
>
> this is probably not a regression per se as i saw it as well with
> 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to
> 2.6.19-rc6. suggestions, patches welcome. thanks.

Please report these problems to netdev@vger.kernel.org, I rarely go
looking in LKML.

These are the things you need to debug a sky2 related problem.

1) What is exact kernel version in use? This is important because
   problems get fixed but it can be a long while until the fix bubbles down
   to the vendor kernels.

2) What is the chip version? The driver prints this out on boot up in
   the console log. (dmesg | grep sky2)
   This matters because each chip version has different
   bugs to deal with.

3) Does it work with the vendor driver?
   The vendor driver does a number of things differently than the sky2
   driver and can mask problems, but if it doesn't work as well that is
   a useful data point. If you want to know why the sky2