Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11/10/2018 13:23, Maciej S. Szmigiero wrote:
> On 11.10.2018 10:24, Chris Clayton wrote:
>> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>>> Now, knowing the "right" value you can experiment with what
>>>>> rtl_init_rxcfg() writes (under the "default:" label for your NIC model).
>>>>
>>>> This might be more interesting. Through a combination of viewing the
>>>> output from pr_notice() and the output from "ethtool -d", I can see
>>>> RxConfig with the following values:
>>>>
>>>> During boot:    0x00028700
>>>> Before suspend: 0x0002870e
>>>> During resume:  0x00024000
>>>> Post resume:    0x0002870e
>>>>
>>>> As I did with 4.18.10 early on in the process, I removed the call to
>>>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
>>>> Now I see the following values:
>>>>
>>>> During boot:    0x00028700
>>>> Before suspend: 0x0002870e
>>>> During resume:  0x00024000
>>>> Post resume:    0x0002400e
>>>
>>> Now we can finally see some difference...
>>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>>> is kind of expected - one can see that the working configuration
>>> post-resume has bit 14 (or 0x4000) set, too.
>>>
>>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>>
>>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>>> change:
>>> --- r8169.c
>>> +++ r8169.c
>>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>>      case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>>      case RTL_GIGA_MAC_VER_34:
>>>      case RTL_GIGA_MAC_VER_35:
>>> +    case RTL_GIGA_MAC_VER_38:
>>>          RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>>          break;
>>>      case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>>
>>> This will add RX_MULTI_EN also for your chip model (you need to add back
>>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>
>> That's done the trick. With the above change applied, my network is
>> running fine after a suspend/resume cycle and the ping times are back in
>> the 14-15ms range.
>
> Nice!
>
> I will submit a patch, it would be great if you could test it and then
> add a "Tested-by:" tag.

Will do, Maciej. Thanks for solving this.

>> Chris
>
> Maciej
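For readers who want to see the effect of the one-line change quoted above without building a kernel, here is a rough userspace sketch. It is not the driver code: the enum numbering, the `patched` flag and the function name `rxcfg_init_value()` are made up for illustration, and the "default" value is only what the thread implies (RX128_INT_EN plus RX_DMA_BURST, matching the 0x8700 seen in the low half of the boot-time register value). The driver's other case groups are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as quoted in the thread. */
#define RX128_INT_EN (UINT32_C(1) << 15) /* 0x8000 */
#define RX_MULTI_EN  (UINT32_C(1) << 14) /* 0x4000 */
#define RX_DMA_BURST (UINT32_C(7) << 8)  /* 0x0700, bits 8-10 */

/* Hypothetical numbering, only so the example compiles on its own. */
enum mac_version { MAC_VER_34 = 34, MAC_VER_35 = 35, MAC_VER_38 = 38 };

/*
 * Model of the value written to RxConfig for a given chip version.
 * patched == 0: VER_38 takes the "default:" branch, as before the change.
 * patched == 1: VER_38 joins the VER_34/VER_35 group, as in the proposed diff.
 */
static uint32_t rxcfg_init_value(enum mac_version ver, int patched)
{
	assert(patched == 0 || patched == 1);

	switch (ver) {
	case MAC_VER_34:
	case MAC_VER_35:
		return RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST;
	case MAC_VER_38:
		if (patched)
			return RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST;
		return RX128_INT_EN | RX_DMA_BURST;
	default:
		/* Assumed from the thread: no RX_MULTI_EN under "default:". */
		return RX128_INT_EN | RX_DMA_BURST;
	}
}
```

Calling `rxcfg_init_value(MAC_VER_38, 0)` and `rxcfg_init_value(MAC_VER_38, 1)` from a small `main()` shows that the only difference between the two cases is bit 14.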
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11.10.2018 10:24, Chris Clayton wrote:
> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>> Now, knowing the "right" value you can experiment with what
>>>> rtl_init_rxcfg() writes (under the "default:" label for your NIC model).
>>>
>>> This might be more interesting. Through a combination of viewing the
>>> output from pr_notice() and the output from "ethtool -d", I can see
>>> RxConfig with the following values:
>>>
>>> During boot:    0x00028700
>>> Before suspend: 0x0002870e
>>> During resume:  0x00024000
>>> Post resume:    0x0002870e
>>>
>>> As I did with 4.18.10 early on in the process, I removed the call to
>>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
>>> Now I see the following values:
>>>
>>> During boot:    0x00028700
>>> Before suspend: 0x0002870e
>>> During resume:  0x00024000
>>> Post resume:    0x0002400e
>>
>> Now we can finally see some difference...
>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>> is kind of expected - one can see that the working configuration
>> post-resume has bit 14 (or 0x4000) set, too.
>>
>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>
>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>> change:
>> --- r8169.c
>> +++ r8169.c
>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>      case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>      case RTL_GIGA_MAC_VER_34:
>>      case RTL_GIGA_MAC_VER_35:
>> +    case RTL_GIGA_MAC_VER_38:
>>          RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>          break;
>>      case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>
>> This will add RX_MULTI_EN also for your chip model (you need to add back
>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>
> That's done the trick. With the above change applied, my network is
> running fine after a suspend/resume cycle and the ping times are back in
> the 14-15ms range.

Nice!

I will submit a patch, it would be great if you could test it and then
add a "Tested-by:" tag.

> Chris

Maciej
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
> On 11.10.2018 00:49, Chris Clayton wrote:
>>> Now, knowing the "right" value you can experiment with what
>>> rtl_init_rxcfg() writes (under the "default:" label for your NIC model).
>>
>> This might be more interesting. Through a combination of viewing the
>> output from pr_notice() and the output from "ethtool -d", I can see
>> RxConfig with the following values:
>>
>> During boot:    0x00028700
>> Before suspend: 0x0002870e
>> During resume:  0x00024000
>> Post resume:    0x0002870e
>>
>> As I did with 4.18.10 early on in the process, I removed the call to
>> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
>> Now I see the following values:
>>
>> During boot:    0x00028700
>> Before suspend: 0x0002870e
>> During resume:  0x00024000
>> Post resume:    0x0002400e
>
> Now we can finally see some difference...
> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
> is kind of expected - one can see that the working configuration
> post-resume has bit 14 (or 0x4000) set, too.
>
> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>
> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
> change:
> --- r8169.c
> +++ r8169.c
> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>      case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>      case RTL_GIGA_MAC_VER_34:
>      case RTL_GIGA_MAC_VER_35:
> +    case RTL_GIGA_MAC_VER_38:
>          RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>          break;
>      case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>
> This will add RX_MULTI_EN also for your chip model (you need to add back
> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).

That's done the trick. With the above change applied, my network is running
fine after a suspend/resume cycle and the ping times are back in the
14-15ms range.

Chris

> If this does not help then I would try other values in the above write:
> 1) RTL_W32(tp, RxConfig, 0x00024000);
> 2) RTL_W32(tp, RxConfig, 0x4000);
> 3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
> 4) RTL_W32(tp, RxConfig, RX128_INT_EN);
>
>> Chris
>
> Maciej
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 11.10.2018 00:49, Chris Clayton wrote:
>> Now, knowing the "right" value you can experiment with what
>> rtl_init_rxcfg() writes (under the "default:" label for your NIC model).
>
> This might be more interesting. Through a combination of viewing the
> output from pr_notice() and the output from "ethtool -d", I can see
> RxConfig with the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002870e
>
> As I did with 4.18.10 early on in the process, I removed the call to
> rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
> Now I see the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002400e

Now we can finally see some difference...
Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
(bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
is kind of expected - one can see that the working configuration
post-resume has bit 14 (or 0x4000) set, too.

This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.

RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
family as your RTL_GIGA_MAC_VER_38, so can you please try the following
change:
--- r8169.c
+++ r8169.c
@@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
     case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
     case RTL_GIGA_MAC_VER_34:
     case RTL_GIGA_MAC_VER_35:
+    case RTL_GIGA_MAC_VER_38:
         RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
         break;
     case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:

This will add RX_MULTI_EN also for your chip model (you need to add back
the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).

If this does not help then I would try other values in the above write:
1) RTL_W32(tp, RxConfig, 0x00024000);
2) RTL_W32(tp, RxConfig, 0x4000);
3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
4) RTL_W32(tp, RxConfig, RX128_INT_EN);

> Chris

Maciej
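The bit arithmetic in this message can be checked with a few lines of userspace C. The sketch below is only an illustration: the three masks use the bit positions stated above, while the struct `rxcfg_fields` and the function `rxcfg_decode()` are hypothetical names that do not exist in the driver. Bits not discussed in the thread are passed through undecoded in `other`.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as stated in the message above. */
#define RX128_INT_EN (UINT32_C(1) << 15) /* 0x8000 */
#define RX_MULTI_EN  (UINT32_C(1) << 14) /* 0x4000 */
#define RX_DMA_BURST (UINT32_C(7) << 8)  /* 0x0700, bits 8-10 */
#define RXCFG_KNOWN  (RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST)

/* Hypothetical helper type: the three fields discussed in the thread. */
struct rxcfg_fields {
	int rx128_int_en;   /* 1 if bit 15 is set */
	int rx_multi_en;    /* 1 if bit 14 is set */
	unsigned dma_burst; /* value of bits 8-10, 0..7 */
	uint32_t other;     /* all remaining bits, left undecoded */
};

/* Split a raw RxConfig value into the fields named above. */
static struct rxcfg_fields rxcfg_decode(uint32_t v)
{
	struct rxcfg_fields f;

	f.rx128_int_en = (v & RX128_INT_EN) != 0;
	f.rx_multi_en = (v & RX_MULTI_EN) != 0;
	f.dma_burst = (unsigned)((v & RX_DMA_BURST) >> 8);
	f.other = v & ~RXCFG_KNOWN;
	assert((f.other & RXCFG_KNOWN) == 0);
	return f;
}
```

Decoding the two "Post resume" values this way gives RX128_INT_EN and a DMA burst field of 7 for 0x0002870e, and only RX_MULTI_EN for 0x0002400e; the remaining bits (0x0002000e) are the same in both.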
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>>>> but tried it anyway. I can confirm that the regression is still
>>>>> present and my network still fails when, after a resume from suspend
>>>>> (to ram or disk), I open my browser or my mail client. In both those
>>>>> cases the failure is almost immediate - e.g. my home page doesn't get
>>>>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>>>>> fail quite so quickly but the reported time increases from 14-15ms to
>>>>> more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet
>>> device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>>> pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that
>>> is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your
>> original report you state that removing the call to rtl_init_rxcfg() from
>> rtl_hw_start() fixes the issue. rtl_init_rxcfg() deals with the RxConfig
>> register only and register values seem to be the same before and after
>> resume. So how can the chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
>

This change made no difference. Networking still dies if I open a browser or
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>      /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>      RTL_R8(tp, IntrMask);
>      RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +    pr_notice("RxConfig before init was %.8x\n",
> +              (unsigned int)RTL_R32(tp, RxConfig));
>      rtl_init_rxcfg(tp);
>      rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
>

This might be more interesting. Through a combination of viewing the output
from pr_notice() and the output from "ethtool -d", I can see RxConfig with
the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

As I did with 4.18.10 early on in the process, I removed the call to
rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted.
Now I see the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002400e

As with 4.18.10, networking now appears to be stable after the resume.
Starting a browser results in my homepage being displayed and I've spent a
few minutes surfing with no interruptions. Similarly, ping runs without
stopping.

I simply don't know enough to know what might now be enabled or disabled by
this change in value, but hopefully it will provide a clue to someone as to
what is going on.

Chris

> Hope this helps,
> Maciej
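To see exactly which bits differ between the two "Post resume" values reported above, one can XOR them. The following is a minimal userspace sketch, not driver code; the helper names `reg_diff()`, `bit_is_set()` and `print_set_bits()` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bits that differ between two 32-bit register snapshots. */
static uint32_t reg_diff(uint32_t a, uint32_t b)
{
	return a ^ b;
}

/* Test a single bit position (0..31) in a register value. */
static int bit_is_set(uint32_t value, int bit)
{
	assert(bit >= 0 && bit < 32);
	return (value >> bit) & 1u;
}

/* Print every set bit in mask, highest first; returns how many were set. */
static int print_set_bits(uint32_t mask)
{
	int count = 0;

	for (int bit = 31; bit >= 0; bit--) {
		if (bit_is_set(mask, bit)) {
			printf("bit %2d (0x%08x)\n", bit,
			       (unsigned int)(UINT32_C(1) << bit));
			count++;
		}
	}
	return count;
}
```

Passing `reg_diff(0x0002870e, 0x0002400e)` to `print_set_bits()` lists bits 15, 14, 10, 9 and 8, that is a difference mask of 0x0000c700; every other bit is identical in the two values.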
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Too late at night to be doing this stuff. Clicked send instead of saving a
draft. Sorry, please ignore.

On 10/10/2018 23:30, Chris Clayton wrote:
> OK, right kernel/module used this time. Please see findings below.
>
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this
>>>>>> regression, but tried it anyway. I can confirm that the regression
>>>>>> is still present and my network still fails when, after a resume
>>>>>> from suspend (to ram or disk), I open my browser or my mail client.
>>>>>> In both those cases the failure is almost immediate - e.g. my home
>>>>>> page doesn't get displayed in the browser. Pinging one of my ISP's
>>>>>> name servers doesn't fail quite so quickly but the reported time
>>>>>> increases from 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be some obvious difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet
>>>> device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>>>> pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of
>>>> that is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your
>>> original report you state that removing the call to rtl_init_rxcfg()
>>> from rtl_hw_start() fixes the issue. rtl_init_rxcfg() deals with the
>>> RxConfig register only and register values seem to be the same before
>>> and after resume. So how can the chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes
>>> to register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't
>>> know whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seem to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip sometimes does not like RxConfig being written before
>> TxConfig.
>>
>
> This change made no difference. Networking still dies if I open a browser
> or leave ping running long enough.
>
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>>      /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>>      RTL_R8(tp, IntrMask);
>>      RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> +    pr_notice("RxConfig before init was %.8x\n",
>> +              (unsigned int)RTL_R32(tp, RxConfig));
>>      rtl_init_rxcfg(tp);
>>      rtl_set_tx_config_registers(tp);
>>
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what
>> rtl_init_rxcfg() writes (under the "default:" label for your NIC model).
>
> This might be more interesting. Through a combination of viewing the
> output from pr_notice() and the output from "ethtool -d", I can see
> RxConfig with the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002870e
>
> I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and
> rebuilt, installed and rebooted. Now I see the following values:
>
> During boot:    0x00028700
> Before suspend: 0x0002870e
> During resume:  0x00024000
> Post resume:    0x0002870e
>
>>
>> Hope this helps,
>> Maciej
>>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
OK, right kernel/module used this time. Please see findings below. On 10/10/2018 01:24, Maciej S. Szmigiero wrote: > On 09.10.2018 22:36, Heiner Kallweit wrote: >> On 09.10.2018 16:40, Chris Clayton wrote: >>> Thanks to Maciej and Heiner for their replies. >>> >>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: On 07.10.2018 21:36, Chris Clayton wrote: > Hi again, > > I didn't think there was anything in 4.19-rc7 to fix this regression, but > tried it anyway. I can confirm that the > regression is still present and my network still fails when, after a > resume from suspend (to ram or disk), I open my > browser or my mail client. In both those cases the failure is almost > immediate - e.g. my home page doesn't get displayed > in the browser. Pinging one of my ISPs name servers doesn't fail quite so > quickly but the reported time increases from > 14-15ms to more than 1000ms. You can try comparing chip registers (ethtool -d eth0) in the working state (before a suspend) and in the broken state (after a resume). Maybe there will be some obvious in the difference. The same goes for the PCI configuration (lspci -d :8168 -vv). >>> Maciej suggested comparing the output from lspci -vv for the ethernet >>> device. They are identical. >>> >>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre >>> and post suspend. Again, they are identical. >>> Heiner specifically suggested looking at the RxConfig. The value of that is >>> 0x0002870e both pre and post suspend. >>> >> Hmm, this is very weird, especially taking into account that in your original >> report you state that removing the call to rtl_init_rxcfg() from >> rtl_hw_start() >> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >> register values seem to be the same before and after resume. So how can the >> chip behave differently? 
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
>

This change made no difference. Networking still dies if I open a browser or
leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		  (unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).

This might be more interesting. Through a combination of viewing the output
from pr_notice() and the output from "ethtool -d", I can see RxConfig with the
following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
installed and rebooted. Now I see the following values:

During boot:    0x00028700
Before suspend: 0x0002870e
During resume:  0x00024000
Post resume:    0x0002870e

>
> Hope this helps,
> Maciej
>
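For anyone following along, the RxConfig values above are easier to compare once split into the fields named elsewhere in this thread (RX128_INT_EN is bit 15, RX_MULTI_EN is bit 14, RX_DMA_BURST is bits 8-10). Below is a minimal user-space sketch of such a decoder. It is not driver code: the struct, the function names and the 0x3f accept mask are assumptions made for illustration, and any bit not listed here is simply lumped into "other".

```c
#include <stdint.h>
#include <stdio.h>

/* Field positions as quoted in this thread. */
#define RX128_INT_EN   (1u << 15)   /* 0x8000 */
#define RX_MULTI_EN    (1u << 14)   /* 0x4000 */
#define RX_DMA_BURST   (7u << 8)    /* 0x0700, bits 8-10 */
/* Assumption: the packet-accept flags live in the low six bits. */
#define RX_ACCEPT_MASK 0x3fu

/* Hypothetical helper type, not part of the driver. */
struct rxcfg_fields {
	int rx128_int_en;     /* 1 if bit 15 is set */
	int rx_multi_en;      /* 1 if bit 14 is set */
	unsigned dma_burst;   /* value of bits 8-10 */
	unsigned accept;      /* low six bits */
	uint32_t other;       /* everything not decoded above */
};

/* Split a raw RxConfig value into the fields above. */
struct rxcfg_fields rxcfg_decode(uint32_t v)
{
	struct rxcfg_fields f;

	f.rx128_int_en = (v & RX128_INT_EN) != 0;
	f.rx_multi_en = (v & RX_MULTI_EN) != 0;
	f.dma_burst = (v & RX_DMA_BURST) >> 8;
	f.accept = v & RX_ACCEPT_MASK;
	f.other = v & ~(RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST | RX_ACCEPT_MASK);
	return f;
}

/* Bits that differ between two register snapshots. */
uint32_t rxcfg_changed_bits(uint32_t a, uint32_t b)
{
	return a ^ b;
}

/* Print one snapshot in a form that is easy to eyeball. */
void rxcfg_print(const char *label, uint32_t v)
{
	struct rxcfg_fields f = rxcfg_decode(v);

	printf("%-16s 0x%08x int128=%d multi=%d burst=%u accept=0x%02x other=0x%08x\n",
	       label, (unsigned)v, f.rx128_int_en, f.rx_multi_en,
	       f.dma_burst, f.accept, (unsigned)f.other);
}
```

Calling rxcfg_print("During resume:", 0x00024000) and rxcfg_print("Post resume:", 0x0002870e) side by side shows which fields differ between the two snapshots; rxcfg_changed_bits() gives the same information as a raw mask.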
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I
tested the wrong kernel/module to get the results I provided below. That,
however, may make the results more interesting because they happened with an
unmodified rc7 kernel/module.

I'll test your proposals properly later.

Chris

On 10/10/2018 09:09, Chris Clayton wrote:
>
>
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>>>>> but tried it anyway. I can confirm that the regression is still present
>>>>>> and my network still fails when, after a resume from suspend (to ram or
>>>>>> disk), I open my browser or my mail client. In both those cases the
>>>>>> failure is almost immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite
>>>>>> so quickly but the reported time increases from 14-15ms to more than
>>>>>> 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be some obvious difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet
>>>> device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>>>> pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that
>>>> is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your
>>> original report you state that removing the call to rtl_init_rxcfg() from
>>> rtl_hw_start() fixes the issue. rtl_init_rxcfg() deals with the RxConfig
>>> register only and register values seem to be the same before and after
>>> resume. So how can the chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seem to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip sometimes does not like RxConfig being written before
>> TxConfig.
>> > After testing your first proposal, which made no difference, I founf the > following in dmesg in the output from dmesg: > > [ 761.999468] [ cut here ] > [ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out > [ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 > dev_watchdog+0x1e9/0x1f0 > [ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep > iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE > nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc > videobuf2_memops snd_hda_codec_via > videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common > usbhid realtek coretemp snd_hda_intel hwmon > snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last > unloaded: btintel] > [ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328 > [ 761.999504] Hardware name: Notebook W65_67SZ > /W65_67SZ >, BIOS 1.03.05 02/26/2014 > [ 761.999508] Workqueue: events rtl_task [r8169] > [ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0 > [ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b > c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1 > 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 > c7 07 00 00 00 00 > [ 761.999513] RSP: 0018:88040f803e98 EFLAGS: 00010282 > [ 761.999514] RAX: RBX: RCX: > 0006 > [ 761.999516] RDX: 0007 RSI: 0096 RDI: > 88040f8153d0 > [ 761.999517] RBP: 88040ca9a3b8 R08: 813565f0 R09: > 034e > [ 761.999517] R10: 0007 R11: R12: > 88040ca9a39c > [ 761.999518] R13: 88040ca9a000 R14:
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 10/10/2018 01:24, Maciej S. Szmigiero wrote: > On 09.10.2018 22:36, Heiner Kallweit wrote: >> On 09.10.2018 16:40, Chris Clayton wrote: >>> Thanks to Maciej and Heiner for their replies. >>> >>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: On 07.10.2018 21:36, Chris Clayton wrote: > Hi again, > > I didn't think there was anything in 4.19-rc7 to fix this regression, but > tried it anyway. I can confirm that the > regression is still present and my network still fails when, after a > resume from suspend (to ram or disk), I open my > browser or my mail client. In both those cases the failure is almost > immediate - e.g. my home page doesn't get displayed > in the browser. Pinging one of my ISPs name servers doesn't fail quite so > quickly but the reported time increases from > 14-15ms to more than 1000ms. You can try comparing chip registers (ethtool -d eth0) in the working state (before a suspend) and in the broken state (after a resume). Maybe there will be some obvious in the difference. The same goes for the PCI configuration (lspci -d :8168 -vv). >>> Maciej suggested comparing the output from lspci -vv for the ethernet >>> device. They are identical. >>> >>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre >>> and post suspend. Again, they are identical. >>> Heiner specifically suggested looking at the RxConfig. The value of that is >>> 0x0002870e both pre and post suspend. >>> >> Hmm, this is very weird, especially taking into account that in your original >> report you state that removing the call to rtl_init_rxcfg() from >> rtl_hw_start() >> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >> register values seem to be the same before and after resume. So how can the >> chip behave differently? >> So far my best guess is that some chip quirk causes it to accept writes to >> register RxConfig, but to misinterpret or ignore the written value. 
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
>

After testing your first proposal, which made no difference, I found the
following in the output from dmesg:

[ 761.999468] ------------[ cut here ]------------
[ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
[ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
[ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
[ 761.999504] Hardware name: Notebook W65_67SZ /W65_67SZ , BIOS 1.03.05 02/26/2014
[ 761.999508] Workqueue: events rtl_task [r8169]
[ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
[ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
[ 761.999513] RSP: 0018:88040f803e98 EFLAGS: 00010282
[ 761.999514] RAX: RBX: RCX: 0006
[ 761.999516] RDX: 0007 RSI: 0096 RDI: 88040f8153d0
[ 761.999517] RBP: 88040ca9a3b8 R08: 813565f0 R09: 034e
[ 761.999517] R10: 0007 R11: R12: 88040ca9a39c
[ 761.999518] R13: 88040ca9a000 R14: 0001 R15: 8803ea17cc80
[ 761.999520] FS: () GS:88040f80() knlGS:
[ 761.999521] CS: 0010 DS: ES: CR0: 80050033
[ 761.999522] CR2: 7f67280206b8 CR3: 0200a002 CR4: 001606f0
[ 761.999523] Call Trace:
[ 761.999525]
[ 761.999527] ? qdisc_reset+0xe0/0xe0
[ 761.999529] ? qdisc_reset+0xe0/0xe0
[ 761.999532] call_timer_fn+0x11/0x70
[ 761.999534] expire_timers+0x8e/0xa0
[ 761.999535]
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 09.10.2018 22:36, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>>> but tried it anyway. I can confirm that the regression is still
>>>> present and my network still fails when, after a resume from suspend
>>>> (to ram or disk), I open my browser or my mail client. In both those
>>>> cases the failure is almost immediate - e.g. my home page doesn't get
>>>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>>>> fail quite so quickly but the reported time increases from 14-15ms to
>>>> more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be something obvious in the difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet
>> device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>> pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of
>> that is 0x0002870e both pre and post suspend.
>>
> Hmm, this is very weird, especially taking into account that in your
> original report you state that removing the call to rtl_init_rxcfg()
> from rtl_hw_start() fixes the issue. rtl_init_rxcfg() deals with the
> RxConfig register only and register values seem to be the same before
> and after resume. So how can the chip behave differently?
> So far my best guess is that some chip quirk causes it to accept writes
> to register RxConfig, but to misinterpret or ignore the written value.
> So far your report is the only one (affecting RTL8411), but we don't
> know whether other chip versions are affected too.

Also, it is interesting that even if one removes the call to
rtl_init_rxcfg() from rtl_hw_start(), the RxConfig register will still
get written to moments later by rtl_set_rx_mode().
The only chip accesses in the meantime seem to be a write to TxConfig by
rtl_set_tx_config_registers() and then a read of RxConfig plus two
writes to MAR0 earlier in rtl_set_rx_mode().

My proposals are:
1) Try swapping "rtl_init_rxcfg(tp);" and
   "rtl_set_tx_config_registers(tp);" in rtl_hw_start().
   Maybe the chip sometimes does not like RxConfig being written before
   TxConfig.

2) Check the original value of RxConfig (after a resume) before
   rtl_init_rxcfg() overwrites it (compile tested only):
--- r8169.c.ori
+++ r8169.c
@@ -5155,6 +5155,9 @@
         /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
         RTL_R8(tp, IntrMask);
         RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+
+        pr_notice("RxConfig before init was %.8x\n",
+                  (unsigned int)RTL_R32(tp, RxConfig));
         rtl_init_rxcfg(tp);
         rtl_set_tx_config_registers(tp);

   This should be the value that you got when you removed the call to
   rtl_init_rxcfg() for testing.

   Now, knowing the "right" value you can experiment with what
   rtl_init_rxcfg() writes (under the "default:" label for your NIC
   model).

Hope this helps,
Maciej
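Comparing two RxConfig snapshots by eye is error-prone, so here is a minimal userspace sketch of how the comparison could be done. It is not driver code. The mask names and values are the ones quoted elsewhere in this thread (RX128_INT_EN is bit 15, RX_MULTI_EN is bit 14, RX_DMA_BURST is bits 8-10); the helpers rxcfg_diff() and rxcfg_report() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* RxConfig masks as quoted in this thread for the r8169 driver. */
#define RX128_INT_EN  (1u << 15)  /* 0x8000 */
#define RX_MULTI_EN   (1u << 14)  /* 0x4000, described as "8111c only" */
#define RX_DMA_BURST  (7u << 8)   /* 0x0700, bits 8-10 */

/* Compile-time sanity check: the three fields together cover 0xc700. */
static_assert((RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST) == 0xc700u,
              "unexpected RxConfig mask layout");

/* Hypothetical helper: bits that differ between two RxConfig snapshots. */
static uint32_t rxcfg_diff(uint32_t a, uint32_t b)
{
    return a ^ b;
}

/* Hypothetical helper: name the known fields in which the snapshots differ. */
static void rxcfg_report(uint32_t a, uint32_t b)
{
    uint32_t d = rxcfg_diff(a, b);

    printf("%08x vs %08x: differing bits %08x\n",
           (unsigned)a, (unsigned)b, (unsigned)d);
    if (d & RX128_INT_EN)
        printf("  RX128_INT_EN differs\n");
    if (d & RX_MULTI_EN)
        printf("  RX_MULTI_EN differs\n");
    if (d & RX_DMA_BURST)
        printf("  RX_DMA_BURST differs\n");
}
```

Calling rxcfg_report(0x0002870e, 0x0002400e) from a small main() compares the value reported with rtl_init_rxcfg() in place against the post-resume value reported in this thread when that call is removed; all three named fields differ between them.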
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 09/10/2018 22:39, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>>> but tried it anyway. I can confirm that the regression is still
>>>> present and my network still fails when, after a resume from suspend
>>>> (to ram or disk), I open my browser or my mail client. In both those
>>>> cases the failure is almost immediate - e.g. my home page doesn't get
>>>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>>>> fail quite so quickly but the reported time increases from 14-15ms to
>>>> more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be something obvious in the difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet
>> device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
>> pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of
>> that is 0x0002870e both pre and post suspend.
>>
>> I've attached files I redirected the outputs to.
>>
>> Please don't hesitate to ask for any other information needed to solve
>> this problem. In the meantime, I've now got scripts that stop the
>> network during suspend and restart it during resume. (Those scripts
>> were removed whilst I gathered the diagnostics shown in the
>> attachments.)
>>
> I'd like to check whether it may be a timing issue. The following
> experimental patch adds a PCI commit after writing register ChipCmd.
> Could you please check whether it changes anything?
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 7d3f671e1..f3c359492 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
>          /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>          RTL_R8(tp, IntrMask);
>          RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +        RTL_R8(tp, ChipCmd);
>          rtl_init_rxcfg(tp);
>          rtl_set_tx_config_registers(tp);
>
>

Sorry, this patch doesn't make any difference - my network still fails.

After a suspend/resume my browsers (chromium and firefox) both fail to
open my home page (https://www.google.co.uk). The ping time for one of
my ISP's name servers increases from 14-15ms to more than 1000ms,
although after a few pings it does reduce. As the screen grab below
shows, the network does eventually fail:

$ ping NS1
PING ns1 (90.207.238.97): 56 data bytes
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
64 bytes from 90.207.238.97: icmp_seq=8 ttl=251 time=1015.612 ms
64 bytes from 90.207.238.97: icmp_seq=10 ttl=251 time=1015.551 ms
64 bytes from 90.207.238.97: icmp_seq=12 ttl=251 time=1015.446 ms
64 bytes from 90.207.238.97: icmp_seq=13 ttl=251 time=1015.657 ms
64 bytes from 90.207.238.97: icmp_seq=14 ttl=251 time=1015.614 ms
64 bytes from 90.207.238.97: icmp_seq=15 ttl=251 time=1015.651 ms
64 bytes from 90.207.238.97: icmp_seq=17 ttl=251 time=1015.459 ms
64 bytes from 90.207.238.97: icmp_seq=18 ttl=251 time=1015.443 ms
64 bytes from 90.207.238.97: icmp_seq=19 ttl=251 time=1015.936 ms
64 bytes from 90.207.238.97: icmp_seq=20 ttl=251 time=1015.681 ms
64 bytes from 90.207.238.97: icmp_seq=22 ttl=251 time=1015.410 ms
64 bytes from 90.207.238.97: icmp_seq=23 ttl=251 time=1015.487 ms
64 bytes from 90.207.238.97: icmp_seq=24 ttl=251 time=1016.169 ms
64 bytes from 90.207.238.97: icmp_seq=25 ttl=251 time=1015.659 ms
64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
64 bytes from 90.207.238.97: icmp_seq=30 ttl=251 time=32.765 ms
64 bytes from 90.207.238.97: icmp_seq=31 ttl=251 time=115.052 ms
64 bytes from 90.207.238.97: icmp_seq=33 ttl=251 time=757.115 ms
64 bytes from 90.207.238.97: icmp_seq=34 ttl=251 time=176.696 ms
64 bytes from 90.207.238.97: icmp_seq=35 ttl=251 time=1017.462 ms
64 bytes from 90.207.238.97: icmp_seq=36 ttl=251 time=16.394 ms
64 bytes from 90.207.238.97: icmp_seq=37 ttl=251 time=20.402 ms
64 bytes from 90.207.238.97: icmp_seq=38 ttl=251 time=37.795 ms
64 bytes from 90.207.238.97: icmp_seq=39 ttl=251 time=141.997 ms
92 bytes from laptop.local.lan
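Besides the roughly one-second round-trip times, the ping listing above has gaps in the icmp_seq numbers, which means some probes never got a reply at all. A small sketch of how such gaps could be counted follows; count_missing_seq() is a hypothetical helper written for this illustration, and it assumes the sequence numbers are given in increasing order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the icmp_seq values of the replies that did
 * arrive (in increasing order), count the sequence numbers in between
 * that never got a reply. Probes lost before the first or after the last
 * reply are not visible to this function.
 */
static size_t count_missing_seq(const unsigned *seqs, size_t n)
{
    size_t missing = 0;

    assert(seqs != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        /* A step larger than 1 means the numbers in between were lost. */
        if (seqs[i] > seqs[i - 1] + 1)
            missing += seqs[i] - seqs[i - 1] - 1;
    }
    return missing;
}
```

Fed with the 30 sequence numbers from the listing (0 through 39), this counts 10 unanswered probes.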
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
>
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>> but tried it anyway. I can confirm that the regression is still
>>> present and my network still fails when, after a resume from suspend
>>> (to ram or disk), I open my browser or my mail client. In both those
>>> cases the failure is almost immediate - e.g. my home page doesn't get
>>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>>> fail quite so quickly but the reported time increases from 14-15ms to
>>> more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be something obvious in the difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet
> device. They are identical.
>
> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
> pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of
> that is 0x0002870e both pre and post suspend.
>
> I've attached files I redirected the outputs to.
>
> Please don't hesitate to ask for any other information needed to solve
> this problem. In the meantime, I've now got scripts that stop the
> network during suspend and restart it during resume. (Those scripts
> were removed whilst I gathered the diagnostics shown in the
> attachments.)
>

I'd like to check whether it may be a timing issue. The following
experimental patch adds a PCI commit after writing register ChipCmd.
Could you please check whether it changes anything?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7d3f671e1..f3c359492 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
         /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
         RTL_R8(tp, IntrMask);
         RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+        RTL_R8(tp, ChipCmd);
         rtl_init_rxcfg(tp);
         rtl_set_tx_config_registers(tp);
--
2.19.1
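The "PCI commit" in the patch above relies on the fact that register writes over PCI may be posted, i.e. the CPU continues before the device has seen the value, while a read from the same device cannot complete until earlier writes have arrived. The following is only a toy userspace model of that ordering, not driver code: struct fake_nic and the fake_w8()/fake_r8() helpers are hypothetical stand-ins for the hardware and for RTL_W8()/RTL_R8(), and the register offset 0x37 and value 0x0c are taken from the "Command" line of the ethtool -d dump in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NREGS    256
#define CHIP_CMD 0x37  /* "Command" offset in the ethtool -d dump in this thread */

/* Toy model, not driver code: at most one posted write is in flight. */
struct fake_nic {
    uint8_t regs[NREGS];  /* state the device has actually latched */
    uint8_t pending_reg;  /* target of the write still in flight */
    uint8_t pending_val;
    int     has_pending;
};

/* Deliver the in-flight write, if any, to the device. */
static void fake_flush(struct fake_nic *nic)
{
    if (nic->has_pending) {
        nic->regs[nic->pending_reg] = nic->pending_val;
        nic->has_pending = 0;
    }
}

/* A write returns immediately; the value is only queued. */
static void fake_w8(struct fake_nic *nic, uint8_t reg, uint8_t val)
{
    assert(nic != NULL);
    fake_flush(nic);  /* earlier writes are delivered first, in order */
    nic->pending_reg = reg;
    nic->pending_val = val;
    nic->has_pending = 1;
}

/* A read cannot complete until earlier writes have reached the device. */
static uint8_t fake_r8(struct fake_nic *nic, uint8_t reg)
{
    assert(nic != NULL);
    fake_flush(nic);
    return nic->regs[reg];
}
```

In this model, fake_w8(&nic, CHIP_CMD, 0x0c) leaves the device state untouched until something forces delivery, and a following fake_r8(&nic, CHIP_CMD) is what guarantees the command has landed before the next step runs. Whether such a delay matters for this chip is exactly what the experimental patch is meant to find out.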
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
>
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>>> but tried it anyway. I can confirm that the regression is still
>>> present and my network still fails when, after a resume from suspend
>>> (to ram or disk), I open my browser or my mail client. In both those
>>> cases the failure is almost immediate - e.g. my home page doesn't get
>>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>>> fail quite so quickly but the reported time increases from 14-15ms to
>>> more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be something obvious in the difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet
> device. They are identical.
>
> Both Maciej and Heiner suggested comparing the output from "ethtool -d"
> pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of
> that is 0x0002870e both pre and post suspend.
>

Hmm, this is very weird, especially taking into account that in your
original report you state that removing the call to rtl_init_rxcfg()
from rtl_hw_start() fixes the issue. rtl_init_rxcfg() deals with the
RxConfig register only and register values seem to be the same before
and after resume. So how can the chip behave differently?

So far my best guess is that some chip quirk causes it to accept writes
to register RxConfig, but to misinterpret or ignore the written value.
So far your report is the only one (affecting RTL8411), but we don't
know whether other chip versions are affected too.

One option could be to call rtl_init_rxcfg() for chip versions <= 06
only, because for them we know that they need this call.

> I've attached files I redirected the outputs to.
>
> Please don't hesitate to ask for any other information needed to solve
> this problem. In the meantime, I've now got scripts that stop the
> network during suspend and restart it during resume. (Those scripts
> were removed whilst I gathered the diagnostics shown in the
> attachments.)
>
> Chris
>
>>> Chris
>>
>> Maciej
>>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Thanks to Maciej and Heiner for their replies.

On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
> On 07.10.2018 21:36, Chris Clayton wrote:
>> Hi again,
>>
>> I didn't think there was anything in 4.19-rc7 to fix this regression,
>> but tried it anyway. I can confirm that the regression is still
>> present and my network still fails when, after a resume from suspend
>> (to ram or disk), I open my browser or my mail client. In both those
>> cases the failure is almost immediate - e.g. my home page doesn't get
>> displayed in the browser. Pinging one of my ISP's name servers doesn't
>> fail quite so quickly but the reported time increases from 14-15ms to
>> more than 1000ms.
>
> You can try comparing chip registers (ethtool -d eth0) in the working
> state (before a suspend) and in the broken state (after a resume).
> Maybe there will be something obvious in the difference.
>
> The same goes for the PCI configuration (lspci -d :8168 -vv).
>

Maciej suggested comparing the output from lspci -vv for the ethernet
device. They are identical.

Both Maciej and Heiner suggested comparing the output from "ethtool -d"
pre and post suspend. Again, they are identical.
Heiner specifically suggested looking at the RxConfig. The value of that
is 0x0002870e both pre and post suspend.

I've attached files I redirected the outputs to.

Please don't hesitate to ask for any other information needed to solve
this problem. In the meantime, I've now got scripts that stop the
network during suspend and restart it during resume. (Those scripts were
removed whilst I gathered the diagnostics shown in the attachments.)

Chris

>> Chris
>
> Maciej
>

ethtool -d eth0
===
RealTek RTL8411 registers:

0x00: MAC Address 80:fa:5b:08:d0:3d
0x08: Multicast Address Filter 0x 0x0080
0x10: Dump Tally Counter Command 0x0c2ec000 0x0004
0x20: Tx Normal Priority Ring Addr 0x07a0a000 0x0004
0x28: Tx High Priority Ring Addr 0x 0x
0x30: Flash memory read/write 0x
0x34: Early Rx Byte Count 0
0x36: Early Rx Status 0x00
0x37: Command 0x0c
      Rx on, Tx on
0x3C: Interrupt Mask 0x803f
      SERR LinkChg RxNoBuf TxErr TxOK RxErr RxOK
0x3E: Interrupt Status 0x
0x40: Tx Configuration 0x4b800f80
0x44: Rx Configuration 0x0002870e
0x48: Timer count 0x
0x4C: Missed packet counter 0x00
0x50: EEPROM Command 0x10
0x51: Config 0 0x00
0x52: Config 1 0xcf
0x53: Config 2 0x3c
0x54: Config 3 0x60
0x55: Config 4 0x10
0x56: Config 5 0x02
0x58: Timer interrupt 0x
0x5C: Multiple Interrupt Select 0x
0x60: PHY access 0x80040de1
0x64: TBI control and status 0x2701
0x68: TBI Autonegotiation advertisement (ANAR) 0xf70c
0x6A: TBI Link partner ability (LPAR) 0x0002
0x6C: PHY status 0xeb
0x84: PM wakeup frame 0 0x 0x
0x8C: PM wakeup frame 1 0x 0x
0x94: PM wakeup frame 2 (low) 0x 0x
0x9C: PM wakeup frame 2 (high) 0x 0x
0xA4: PM wakeup frame 3 (low) 0x 0x
0xAC: PM wakeup frame 3 (high) 0x 0x
0xB4: PM wakeup frame 4 (low) 0x 0x
0xBC: PM wakeup frame 4 (high) 0x 0x
0xC4: Wakeup frame 0 CRC 0x
0xC6: Wakeup frame 1 CRC 0x
0xC8: Wakeup frame 2 CRC 0x
0xCA: Wakeup frame 3 CRC 0x
0xCC: Wakeup frame 4 CRC 0x
0xDA: RX packet maximum size 0x4000
0xE0: C+ Command 0x20e1
      VLAN de-tagging
      RX checksumming
0xE2: Interrupt Mitigation 0x5151
      TxTimer: 5
      TxPackets: 1
      RxTimer: 5
      RxPackets: 1
0xE4: Rx Ring Addr 0x07935000 0x0004
0xEC: Early Tx threshold 0x27
0xF0: Func Event 0x0040003f
0xF4: Func Event Mask 0x
0xF8: Func Preset State 0x00031eff
0xFC: Func Force Event 0x

lspci -d :8168 -vv
==
pcilib: sysfs_read_vpd: read failed: Input/output error
05:00.2
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 07.10.2018 21:36, Chris Clayton wrote:
> Hi again,
>
> I didn't think there was anything in 4.19-rc7 to fix this regression,
> but tried it anyway. I can confirm that the regression is still present
> and my network still fails when, after a resume from suspend (to ram or
> disk), I open my browser or my mail client. In both those cases the
> failure is almost immediate - e.g. my home page doesn't get displayed
> in the browser. Pinging one of my ISP's name servers doesn't fail quite
> so quickly but the reported time increases from 14-15ms to more than
> 1000ms.

You can try comparing chip registers (ethtool -d eth0) in the working
state (before a suspend) and in the broken state (after a resume).
Maybe there will be something obvious in the difference.

The same goes for the PCI configuration (lspci -d :8168 -vv).

> Chris

Maciej
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Hi again,

I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
14-15ms to more than 1000ms.

Chris

On 04/10/2018 09:41, Chris Clayton wrote:
> Hi Heiner,
>
> Here's the reply to your questions. Sorry for the delay.
>
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been
>>>>> experiencing network problems after resuming from a suspend to RAM or
>>>>> disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is
>>>>> fine. But, after resume from suspend I find that the time taken for a
>>>>> ping of one of my ISP's nameservers increases from 14-15ms to more than
>>>>> 1000ms. Moreover, when I open a browser (chromium or firefox), it fails
>>>>> to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often,
>>>>> I can revive the network by stopping it with /sbin/if{down,up} but
>>>>> sometimes it is necessary to also remove the r8169 module and load it
>>>>> again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the
>>> problem is not solved by it. Similarly, I applied Heiner's patch to the
>>> 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has
>> no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact
>> chip version.
>> Can you provide the dmesg part with the XID?
>
> $ dmesg | grep r8169
> [5.274938] libphy: r8169: probed
> [5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
> [9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
>
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>>
>
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99%
> sure that it used to be - I've no idea how it got dropped. If I'm not sure
> about an option, I start by taking the recommendation in the kconfig help.
> Help on MSI has a very clear "say Y". I've re-enabled it now.
>
> Chris
>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
>>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Hi Heiner,

Here's the reply to your questions. Sorry for the delay.

On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing
>>>> network problems after resuming from a suspend to RAM or disk. I
>>>> previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is
>>>> fine. But, after resume from suspend I find that the time taken for a
>>>> ping of one of my ISP's nameservers increases from 14-15ms to more than
>>>> 1000ms. Moreover, when I open a browser (chromium or firefox), it fails
>>>> to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I
>>>> can revive the network by stopping it with /sbin/if{down,up} but
>>>> sometimes it is necessary to also remove the r8169 module and load it
>>>> again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem
>> is not solved by it. Similarly, I applied Heiner's patch to the 4.19, but
>> again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has
> no link to suspend/resume.
>
> Chris, the lspci output doesn't provide enough detail to determine the exact
> chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep r8169
[5.274938] libphy: r8169: probed
[5.276563] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.278158] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[9.460876] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 11.005336] r8169 :05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
>

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y". I've re-enabled it now.

Chris

> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Sorry, sent by accident. Note to self - don't attempt email until after second cup of coffee.

On 29/09/2018 08:25, Chris Clayton wrote:
>
>
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been
>>>>> experiencing network problems after resuming from a suspend to RAM or
>>>>> disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is
>>>>> fine. But, after resume from suspend I find that the time taken for a
>>>>> ping of one of my ISP's nameservers increases from 14-15ms to more than
>>>>> 1000ms. Moreover, when I open a browser (chromium or firefox), it fails
>>>>> to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often,
>>>>> I can revive the network by stopping it with /sbin/if{down,up} but
>>>>> sometimes it is necessary to also remove the r8169 module and load it
>>>>> again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the
>>> problem is not solved by it. Similarly, I applied Heiner's patch to the
>>> 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has
>> no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact
>> chip version.
>> Can you provide the dmesg part with the XID?

I meant to say that I have now re-enabled MSI in 4.18.7 - the latest stable series kernel in which eth0 continues to
function reliably after a suspend/resume cycle. The second dmesg output below is taken from that kernel. The first one
was from an up-to-date 4.19 kernel.

>
> $ dmesg | grep -i r8169
> [5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
> [5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [ 10.232077] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 10.235218] r8169 :05:00.2 eth0: link down
> [ 11.717460] r8169 :05:00.2 eth0: link up
>
> $ dmesg | grep -i r8169
> [5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [ 10.456081] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 10.459217] r8169 :05:00.2 eth0: link down
> [ 10.459880] r8169 :05:00.2 eth0: link down
> [ 12.015158] r8169 :05:00.2 eth0: link up
>
>
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99%
> sure that it used to be - I've no idea how it got dropped. If I'm not sure
> about an option, I start by taking the recommendation in the kconfig help.
> Help on MSI has a very clear "say Y".

As I said above I have re-enabled MSI.

>
>>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing
>>>> network problems after resuming from a suspend to RAM or disk. I
>>>> previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is
>>>> fine. But, after resume from suspend I find that the time taken for a
>>>> ping of one of my ISP's nameservers increases from 14-15ms to more than
>>>> 1000ms. Moreover, when I open a browser (chromium or firefox), it fails
>>>> to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I
>>>> can revive the network by stopping it with /sbin/if{down,up} but
>>>> sometimes it is necessary to also remove the r8169 module and load it
>>>> again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem
>> is not solved by it. Similarly, I applied Heiner's patch to the 4.19, but
>> again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has
> no link to suspend/resume.
>
> Chris, the lspci output doesn't provide enough detail to determine the exact
> chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep -i r8169
[5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.321432] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.322892] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
[5.323786] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.232077] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.235218] r8169 :05:00.2 eth0: link down
[ 11.717460] r8169 :05:00.2 eth0: link up

$ dmesg | grep -i r8169
[5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[5.208677] r8169 :05:00.2: can't disable ASPM; OS doesn't have ASPM control
[5.210066] r8169 :05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[5.210676] r8169 :05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.456081] r8169 :05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.459217] r8169 :05:00.2 eth0: link down
[ 10.459880] r8169 :05:00.2 eth0: link down
[ 12.015158] r8169 :05:00.2 eth0: link up

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y".

>
> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
On 29.09.2018 00:00, Chris Clayton wrote:
> Thanks Maciej.
>
> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>> Hi,
>>
>>> Hi,
>>>
>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing
>>> network problems after resuming from a suspend to RAM or disk. I
>>> previously had 4.18.6 and that was OK.
>>>
>>> The pattern of the problem is that when I first boot, the network is fine.
>>> But, after resume from suspend I find that the time taken for a ping of
>>> one of my ISP's nameservers increases from 14-15ms to more than 1000ms.
>>> Moreover, when I open a browser (chromium or firefox), it fails to
>>> retrieve my home page (https://www.google.co.uk) and pings of the
>>> nameserver fail with the message "Destination Host Unreachable". Often, I
>>> can revive the network by stopping it with /sbin/if{down,up} but sometimes
>>> it is necessary to also remove the r8169 module and load it again.
>>
>> Please have a look at the following thread:
>> https://lkml.org/lkml/2018/9/25/1118
>>
>
> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem
> is not solved by it. Similarly, I applied Heiner's patch to the 4.19, but
> again the problem is not solved.
>
I think we talk about two different issues here. The one the fix is for has
no link to suspend/resume.

Chris, the lspci output doesn't provide enough detail to determine the exact
chip version.
Can you provide the dmesg part with the XID?

According to your lspci output neither MSI nor MSI-X is active.
Do you have to use nomsi for whatever reason?

Heiner

>> Maciej
>>
> Chris
>
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Thanks Maciej.

On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
> Hi,
>
>> Hi,
>>
>> I upgraded my kernel to 4.18.10 recently and have since been experiencing
>> network problems after resuming from a suspend to RAM or disk. I previously
>> had 4.18.6 and that was OK.
>>
>> The pattern of the problem is that when I first boot, the network is fine.
>> But, after resume from suspend I find that the time taken for a ping of one
>> of my ISP's nameservers increases from 14-15ms to more than 1000ms.
>> Moreover, when I open a browser (chromium or firefox), it fails to retrieve
>> my home page (https://www.google.co.uk) and pings of the nameserver fail
>> with the message "Destination Host Unreachable". Often, I can revive the
>> network by stopping it with /sbin/if{down,up} but sometimes it is necessary
>> to also remove the r8169 module and load it again.
>
> Please have a look at the following thread:
> https://lkml.org/lkml/2018/9/25/1118
>

I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
Heiner's patch to the 4.19, but again the problem is not solved.

> Maciej
>

Chris
Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
Hi,

> Hi,
>
> I upgraded my kernel to 4.18.10 recently and have since been experiencing
> network problems after resuming from a suspend to RAM or disk. I previously
> had 4.18.6 and that was OK.
>
> The pattern of the problem is that when I first boot, the network is fine.
> But, after resume from suspend I find that the time taken for a ping of one
> of my ISP's nameservers increases from 14-15ms to more than 1000ms.
> Moreover, when I open a browser (chromium or firefox), it fails to retrieve
> my home page (https://www.google.co.uk) and pings of the nameserver fail
> with the message "Destination Host Unreachable". Often, I can revive the
> network by stopping it with /sbin/if{down,up} but sometimes it is necessary
> to also remove the r8169 module and load it again.

Please have a look at the following thread:
https://lkml.org/lkml/2018/9/25/1118

Maciej