Re: [BUG] 4.10-rc8 - ping spinning?
On Thu, Feb 16, 2017 at 11:08 AM, wrote:
> On Thu, Feb 16, 2017 at 10:52:19AM -0500, Soheil Hassas Yeganeh wrote:
>> On Thu, Feb 16, 2017 at 10:50 AM, Soheil Hassas Yeganeh wrote:
>> > Thank you Vito for the report.
>> >
>> > The patch you cited actually resolves a similar backward compatibility
>> > problem for traceroute.
>> >
>> > I suspect the problem here is that there's a local error queued on the
>> > error queue after an ICMP message. ping apparently expects
>> > sk->sk_err to be set for local errors as well, and hence the
>> > error. Ideally, ping should read the error queue if there is an EPOLLERR,
>> > because local errors never set sk->sk_err on their own. That is, if we
>> > have
>>
>> [oops] That is, if we have only one local error on the error queue, we
>> cannot rely on having an error on recvmsg (i.e., sk->sk_err being set)
>> even in 4.9.
>>
>
> Hi Soheil,
>
> This doesn't appear to be trivially reproducible here by just running ping
> as it was originally discovered. I'll see if I can reliably cause the
> malfunction somehow, but until then I can't meaningfully test patches.
>
> Perhaps a form of fault injection would make more sense if there's a
> reasonable idea of what this is stemming from?

I tried to generate different ICMP errors as well as local errors, but
unfortunately haven't been able to reproduce the problem.

> I've opened an issue with iputils on github in the event that this is found
> to be a ping bug. Your input might be helpful there as well:
> https://github.com/iputils/iputils/issues/74

Sent a pull request. Although, we might want to at least confirm that the
userspace patch fixes the issue in ping.

Thanks!
Soheil

> Thanks,
> Vito Caputo
Re: [BUG] 4.10-rc8 - ping spinning?
On Thu, Feb 16, 2017 at 10:52:19AM -0500, Soheil Hassas Yeganeh wrote:
> On Thu, Feb 16, 2017 at 10:50 AM, Soheil Hassas Yeganeh wrote:
> > Thank you Vito for the report.
> >
> > The patch you cited actually resolves a similar backward compatibility
> > problem for traceroute.
> >
> > I suspect the problem here is that there's a local error queued on the
> > error queue after an ICMP message. ping apparently expects
> > sk->sk_err to be set for local errors as well, and hence the
> > error. Ideally, ping should read the error queue if there is an EPOLLERR,
> > because local errors never set sk->sk_err on their own. That is, if we
> > have
>
> [oops] That is, if we have only one local error on the error queue, we
> cannot rely on having an error on recvmsg (i.e., sk->sk_err being set)
> even in 4.9.
>

Hi Soheil,

This doesn't appear to be trivially reproducible here by just running ping
as it was originally discovered. I'll see if I can reliably cause the
malfunction somehow, but until then I can't meaningfully test patches.

Perhaps a form of fault injection would make more sense if there's a
reasonable idea of what this is stemming from?

I've opened an issue with iputils on github in the event that this is found
to be a ping bug. Your input might be helpful there as well:
https://github.com/iputils/iputils/issues/74

Thanks,
Vito Caputo
Re: [BUG] 4.10-rc8 - ping spinning?
On Thu, Feb 16, 2017 at 10:50 AM, Soheil Hassas Yeganeh <soh...@google.com> wrote:
> Thank you Vito for the report.
>
> The patch you cited actually resolves a similar backward compatibility
> problem for traceroute.
>
> I suspect the problem here is that there's a local error queued on the
> error queue after an ICMP message. ping apparently expects
> sk->sk_err to be set for local errors as well, and hence the
> error. Ideally, ping should read the error queue if there is an EPOLLERR,
> because local errors never set sk->sk_err on their own. That is, if we
> have

[oops] That is, if we have only one local error on the error queue, we
cannot rely on having an error on recvmsg (i.e., sk->sk_err being set)
even in 4.9.

Thanks,
Soheil

> But as a workaround, would you mind trying the following patch to see
> if it resolves the issue reported?
>
> From: Soheil Hassas Yeganeh <soh...@google.com>
> Date: Thu, 16 Feb 2017 10:48:24 -0500
> ---
>  net/core/skbuff.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 734c71468b01..2b774b564024 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3717,7 +3717,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
>  static bool is_icmp_err_skb(const struct sk_buff *skb)
>  {
>  	return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
> -		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
> +		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
> +		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
>  }
>
>  struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
> --
>
> Thanks!
> Soheil
>
> On Thu, Feb 16, 2017 at 7:05 AM, <l...@pengaru.com> wrote:
>> Hello netdev,
>>
>> Please see the forwarded message below. This was sent to linux-kernel but
>> after digging a little I suspect it's specific to the network stack.
>>
>> Perusing the net/ changes between 4.9 and 4.10-rc8 this sounded awfully
>> related to what I'm observing:
>>
>> commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
>> Author: Soheil Hassas Yeganeh <soh...@google.com>
>> Date:   Wed Nov 30 14:01:08 2016 -0500
>>
>>     sock: reset sk_err for ICMP packets read from error queue
>>
>>     Only when ICMP packets are enqueued onto the error queue,
>>     sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
>>     in sock_dequeue_err_skb), a subsequent error queue read
>>     would set sk_err to the next error on the queue, or 0 if empty.
>>     As no error types other than ICMP set this field, sk_err should
>>     not be modified upon dequeuing them.
>>
>>     Only for ICMP errors, reset the (racy) sk_err. Some applications,
>>     like traceroute, rely on it and go into a futile busy POLLERR
>>     loop otherwise.
>>
>>     In principle, sk_err has to be set while an ICMP error is queued.
>>     Testing is_icmp_err_skb(skb_next) approximates this without
>>     requiring a full queue walk. Applications that receive both ICMP
>>     and other errors cannot rely on this legacy behavior, as other
>>     errors do not set sk_err in the first place.
>>
>>     Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
>>     Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
>>     Signed-off-by: Willem de Bruijn <will...@google.com>
>>     Acked-by: Eric Dumazet <eduma...@google.com>
>>     Acked-by: Maciej Żenczykowski <m...@google.com>
>>     Signed-off-by: David S. Miller <da...@davemloft.net>
>>
>> Cheers,
>> Vito Caputo
>>
>>
>> - Forwarded message from l...@pengaru.com -
>>
>> Date: Thu, 16 Feb 2017 03:17:49 -0800
>> From: l...@pengaru.com
>> To: linux-kernel@vger.kernel.org
>> Subject: Re: [BUG] 4.10-rc8 - ping spinning?
>> User-Agent: Mutt/1.5.23 (2014-03-12)
>>
>> On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
>>> Hello list,
>>>
>>> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved
>>> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the
>>> rtl8192cu device from dying.
>>>
>>> This evening the system started getting very slow and to my surprise I found
>>> this in `top`:
>>>  5115 swivel    30  10   14772   1928   1756 R  90.9  0.0   1351:41 ping
>>>  9005 swivel    30  10   14772   1892   1724 R  90.9  0.0   1354:26 ping
>>>
>>> This is a dual core machine (X61s, core2
Re: [BUG] 4.10-rc8 - ping spinning?
Thank you Vito for the report.

The patch you cited actually resolves a similar backward compatibility
problem for traceroute.

I suspect the problem here is that there's a local error queued on the
error queue after an ICMP message. ping apparently expects
sk->sk_err to be set for local errors as well, and hence the
error. Ideally, ping should read the error queue if there is an EPOLLERR,
because local errors never set sk->sk_err on their own. That is, if we
have

But as a workaround, would you mind trying the following patch to see
if it resolves the issue reported?

From: Soheil Hassas Yeganeh <soh...@google.com>
Date: Thu, 16 Feb 2017 10:48:24 -0500
---
 net/core/skbuff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 734c71468b01..2b774b564024 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3717,7 +3717,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
 static bool is_icmp_err_skb(const struct sk_buff *skb)
 {
 	return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
-		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
+		       SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
 }

 struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
--

Thanks!
Soheil

On Thu, Feb 16, 2017 at 7:05 AM, <l...@pengaru.com> wrote:
> Hello netdev,
>
> Please see the forwarded message below. This was sent to linux-kernel but
> after digging a little I suspect it's specific to the network stack.
>
> Perusing the net/ changes between 4.9 and 4.10-rc8 this sounded awfully
> related to what I'm observing:
>
> commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
> Author: Soheil Hassas Yeganeh <soh...@google.com>
> Date:   Wed Nov 30 14:01:08 2016 -0500
>
>     sock: reset sk_err for ICMP packets read from error queue
>
>     Only when ICMP packets are enqueued onto the error queue,
>     sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
>     in sock_dequeue_err_skb), a subsequent error queue read
>     would set sk_err to the next error on the queue, or 0 if empty.
>     As no error types other than ICMP set this field, sk_err should
>     not be modified upon dequeuing them.
>
>     Only for ICMP errors, reset the (racy) sk_err. Some applications,
>     like traceroute, rely on it and go into a futile busy POLLERR
>     loop otherwise.
>
>     In principle, sk_err has to be set while an ICMP error is queued.
>     Testing is_icmp_err_skb(skb_next) approximates this without
>     requiring a full queue walk. Applications that receive both ICMP
>     and other errors cannot rely on this legacy behavior, as other
>     errors do not set sk_err in the first place.
>
>     Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
>     Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
>     Signed-off-by: Willem de Bruijn <will...@google.com>
>     Acked-by: Eric Dumazet <eduma...@google.com>
>     Acked-by: Maciej Żenczykowski <m...@google.com>
>     Signed-off-by: David S. Miller <da...@davemloft.net>
>
> Cheers,
> Vito Caputo
>
>
> - Forwarded message from l...@pengaru.com -
>
> Date: Thu, 16 Feb 2017 03:17:49 -0800
> From: l...@pengaru.com
> To: linux-kernel@vger.kernel.org
> Subject: Re: [BUG] 4.10-rc8 - ping spinning?
> User-Agent: Mutt/1.5.23 (2014-03-12)
>
> On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
>> Hello list,
>>
>> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved
>> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the
>> rtl8192cu device from dying.
>>
>> This evening the system started getting very slow and to my surprise I found
>> this in `top`:
>>  5115 swivel    30  10   14772   1928   1756 R  90.9  0.0   1351:41 ping
>>  9005 swivel    30  10   14772   1892   1724 R  90.9  0.0   1354:26 ping
>>
>> This is a dual core machine (X61s, core2duo 1.8Ghz), those processes are
>> burning all the free CPU in the system context. They're identical commands,
>> just plain `ping domain.com`, to the same host. It appears I accidentally
>> (fortuitously?) had two running, which made this event more interesting.
>>
>> I can assert that these did not begin spinning simultaneously - as you can
>> see by the cumulative time in `top` there's a small delta. I also use a
>> window manager with builtin continuous process monitoring, and when I
>> noticed this was happening I was able to see that one of the processes had
>> only recently begun spinning, the other was spinning long eno
Re: [BUG] 4.10-rc8 - ping spinning?
Hello netdev,

Please see the forwarded message below. This was sent to linux-kernel but
after digging a little I suspect it's specific to the network stack.

Perusing the net/ changes between 4.9 and 4.10-rc8 this sounded awfully
related to what I'm observing:

commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
Author: Soheil Hassas Yeganeh <soh...@google.com>
Date:   Wed Nov 30 14:01:08 2016 -0500

    sock: reset sk_err for ICMP packets read from error queue

    Only when ICMP packets are enqueued onto the error queue,
    sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
    in sock_dequeue_err_skb), a subsequent error queue read
    would set sk_err to the next error on the queue, or 0 if empty.
    As no error types other than ICMP set this field, sk_err should
    not be modified upon dequeuing them.

    Only for ICMP errors, reset the (racy) sk_err. Some applications,
    like traceroute, rely on it and go into a futile busy POLLERR
    loop otherwise.

    In principle, sk_err has to be set while an ICMP error is queued.
    Testing is_icmp_err_skb(skb_next) approximates this without
    requiring a full queue walk. Applications that receive both ICMP
    and other errors cannot rely on this legacy behavior, as other
    errors do not set sk_err in the first place.

    Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
    Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
    Signed-off-by: Willem de Bruijn <will...@google.com>
    Acked-by: Eric Dumazet <eduma...@google.com>
    Acked-by: Maciej Żenczykowski <m...@google.com>
    Signed-off-by: David S. Miller <da...@davemloft.net>

Cheers,
Vito Caputo


- Forwarded message from l...@pengaru.com -

Date: Thu, 16 Feb 2017 03:17:49 -0800
From: l...@pengaru.com
To: linux-kernel@vger.kernel.org
Subject: Re: [BUG] 4.10-rc8 - ping spinning?
User-Agent: Mutt/1.5.23 (2014-03-12)

On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
> Hello list,
>
> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved
> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the
> rtl8192cu device from dying.
>
> This evening the system started getting very slow and to my surprise I found
> this in `top`:
>  5115 swivel    30  10   14772   1928   1756 R  90.9  0.0   1351:41 ping
>  9005 swivel    30  10   14772   1892   1724 R  90.9  0.0   1354:26 ping
>
> This is a dual core machine (X61s, core2duo 1.8Ghz), those processes are
> burning all the free CPU in the system context. They're identical commands,
> just plain `ping domain.com`, to the same host. It appears I accidentally
> (fortuitously?) had two running, which made this event more interesting.
>
> I can assert that these did not begin spinning simultaneously - as you can
> see by the cumulative time in `top` there's a small delta. I also use a
> window manager with builtin continuous process monitoring, and when I
> noticed this was happening I was able to see that one of the processes had
> only recently begun spinning, the other was spinning long enough to have
> its start fall off the chart (at least ~17 minutes ago).
>
> This hasn't occurred before AFAIK, but I haven't spent much time in 4.10
> yet. I'm pretty confident this didn't happen in 4.9 which I ran for quite
> a while.
>
> `strace` of one of the aforementioned processes:
>
> 1487241315.073568 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.22>
> 1487241315.073665 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.073747 gettimeofday({1487241315, 73774}, NULL) = 0 <0.21>
> 1487241315.073829 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.25>
> 1487241315.073927 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.074024 gettimeofday({1487241315, 74050}, NULL) = 0 <0.000256>
> 1487241315.074352 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.26>
> 1487241315.076241 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.22>
> 1487241315.076337 gettimeofday({1487241315, 76366}, NULL) = 0 <0.20>
> 1487241315.076422 poll([{fd=3, events=POLLIN|POLLERR}], 1, 924) = 1 ([{fd=3, revents=POLLERR}]) <0.25>
> 1487241315.076523 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.25>
> 1487241315.079770 gettimeofday({1487241315, 79799}, NULL) = 0 <0.19>
> 1487241315.079855 poll([{fd=3, events=POLLIN|POLLERR}], 1, 921) = 1 ([{fd
Re: [BUG] 4.10-rc8 - ping spinning?
On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
> Hello list,
>
> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved
> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the
> rtl8192cu device from dying.
>
> This evening the system started getting very slow and to my surprise I found
> this in `top`:
>  5115 swivel    30  10  14772  1928  1756 R 90.9  0.0 1351:41 ping
>  9005 swivel    30  10  14772  1892  1724 R 90.9  0.0 1354:26 ping
>
> This is a dual core machine (X61s, core2duo 1.8GHz); those processes are
> burning all the free CPU in the system context. They're identical commands,
> just plain `ping domain.com`, to the same host. It appears I accidentally
> (fortuitously?) had two running, which made this event more interesting.
>
> I can assert that these did not begin spinning simultaneously - as you can
> see by the cumulative time in `top` there's a small delta. I also use a
> window manager with builtin continuous process monitoring, and when I
> noticed this was happening I was able to see that one of the processes had
> only recently begun spinning; the other was spinning long enough to have
> its start fall off the chart (at least ~17 minutes ago).
>
> This hasn't occurred before AFAIK, but I haven't spent much time in 4.10
> yet. I'm pretty confident this didn't happen in 4.9, which I ran for quite
> a while.
>
> `strace` of one of the aforementioned processes:
>
> 1487241315.073568 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.22>
> 1487241315.073665 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.073747 gettimeofday({1487241315, 73774}, NULL) = 0 <0.21>
> 1487241315.073829 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.25>
> 1487241315.073927 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.074024 gettimeofday({1487241315, 74050}, NULL) = 0 <0.000256>
> 1487241315.074352 poll([{fd=3, events=POLLIN|POLLERR}], 1, 927) = 1 ([{fd=3, revents=POLLERR}]) <0.26>
> 1487241315.076241 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.22>
> 1487241315.076337 gettimeofday({1487241315, 76366}, NULL) = 0 <0.20>
> 1487241315.076422 poll([{fd=3, events=POLLIN|POLLERR}], 1, 924) = 1 ([{fd=3, revents=POLLERR}]) <0.25>
> 1487241315.076523 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.25>
> 1487241315.079770 gettimeofday({1487241315, 79799}, NULL) = 0 <0.19>
> 1487241315.079855 poll([{fd=3, events=POLLIN|POLLERR}], 1, 921) = 1 ([{fd=3, revents=POLLERR}]) <0.24>
> 1487241315.079956 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.21>
> 1487241315.080057 gettimeofday({1487241315, 80084}, NULL) = 0 <0.20>
> 1487241315.080140 poll([{fd=3, events=POLLIN|POLLERR}], 1, 921) = 1 ([{fd=3, revents=POLLERR}]) <0.24>
> 1487241315.080238 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.21>
> 1487241315.080322 gettimeofday({1487241315, 80350}, NULL) = 0 <0.20>
> 1487241315.080406 poll([{fd=3, events=POLLIN|POLLERR}], 1, 920) = 1 ([{fd=3, revents=POLLERR}]) <0.23>
> 1487241315.080502 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.19>
> 1487241315.080583 gettimeofday({1487241315, 80610}, NULL) = 0 <0.18>
> 1487241315.080663 poll([{fd=3, events=POLLIN|POLLERR}], 1, 920) = 1 ([{fd=3, revents=POLLERR}]) <0.24>
> 1487241315.080761 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.080843 gettimeofday({1487241315, 80870}, NULL) = 0 <0.20>
> 1487241315.080925 poll([{fd=3, events=POLLIN|POLLERR}], 1, 920) = 1 ([{fd=3, revents=POLLERR}]) <0.37>
> 1487241315.081037 recvmsg(3, 0x7ffc8e05e260, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
> 1487241315.081119 gettimeofday({1487241315, 81147}, NULL) = 0 <0.20>
>

It's worth noting that ping is still otherwise functioning correctly, despite
the POLLERR:

1487242826.169502 sendmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("xx.xxx.xxx.xxx")}, msg_iov(1)=[{"\10\0\245G\23\373\243uJ\206\245X\0\0\0\0\352\225\2\0\0\0\0\0\20\21\22\23\24\25\26\27"..., 64}], msg_controllen=0, msg_flags=0}, MSG_CONFIRM) = 64 <0.000133>
1487242826.169757 recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("66.240.222.126")}, msg_iov(1)=[{"E \0T\345\364\0\0002\1w\23B\360\336~\n\0\0\23\0\0\255G\23\373\243uJ\206\245X"..., 192}], msg_controllen=32, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...},