Re: [PATCH net] net: account for current skb length when deciding about UFO
On 06/19/2017 07:03 AM, Michal Kubecek wrote: > Our customer encountered stuck NFS writes for blocks starting at specific > offsets w.r.t. page boundary caused by networking stack sending packets via > UFO enabled device with wrong checksum. The problem can be reproduced by > composing a long UDP datagram from multiple parts using MSG_MORE flag: > > sendto(sd, buff, 1000, MSG_MORE, ...); > sendto(sd, buff, 1000, MSG_MORE, ...); > sendto(sd, buff, 3000, 0, ...); > > Assume this packet is to be routed via a device with MTU 1500 and > NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(), > this condition is tested (among others) to decide whether to call > ip_ufo_append_data(): > > ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb)) > > At the moment, we already have skb with 1028 bytes of data which is not > marked for GSO so that the test is false (fragheaderlen is usually 20). > Thus we append second 1000 bytes to this skb without invoking UFO. Third > sendto(), however, has sufficient length to trigger the UFO path so that we > end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb() > uses udp_csum() to calculate the checksum but that assumes all fragments > have correct checksum in skb->csum which is not true for UFO fragments. > > When checking against MTU, we need to add skb->len to length of new segment > if we already have a partially filled skb and fragheaderlen only if there > isn't one. > > In the IPv6 case, skb can only be null if this is the first segment so that > we have to use headersize (length of the first IPv6 header) rather than > fragheaderlen (length of IPv6 header of further fragments) for skb == NULL. > > Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") > Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for > ip6 fragment between __ip6_append_data and ip6_finish_output") > Signed-off-by: Michal Kubecek <mkube...@suse.cz> LGTM. Acked-by: Vlad Yasevich <vyase...@redhat.com> -vlad > --- > net/ipv4/ip_output.c | 3 ++- > net/ipv6/ip6_output.c | 2 +- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index 7a3fd25e8913..532b36e9ce2a 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -964,7 +964,8 @@ static int __ip_append_data(struct sock *sk, > csummode = CHECKSUM_PARTIAL; > > cork->length += length; > - if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) && > + if length + (skb ? skb->len : fragheaderlen)) > mtu) || > + (skb && skb_is_gso(skb))) && > (sk->sk_protocol == IPPROTO_UDP) && > (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) && > (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) { > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index bf8a58a1c32d..1699acb2fa2c 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1390,7 +1390,7 @@ static int __ip6_append_data(struct sock *sk, >*/ > > cork->length += length; > - if length + fragheaderlen) > mtu) || > + if length + (skb ? skb->len : headersize)) > mtu) || >(skb && skb_is_gso(skb))) && > (sk->sk_protocol == IPPROTO_UDP) && > (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) && >
Re: [PATCH net] net: account for current skb length when deciding about UFO
On 06/19/2017 07:03 AM, Michal Kubecek wrote: > Our customer encountered stuck NFS writes for blocks starting at specific > offsets w.r.t. page boundary caused by networking stack sending packets via > UFO enabled device with wrong checksum. The problem can be reproduced by > composing a long UDP datagram from multiple parts using MSG_MORE flag: > > sendto(sd, buff, 1000, MSG_MORE, ...); > sendto(sd, buff, 1000, MSG_MORE, ...); > sendto(sd, buff, 3000, 0, ...); > > Assume this packet is to be routed via a device with MTU 1500 and > NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(), > this condition is tested (among others) to decide whether to call > ip_ufo_append_data(): > > ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb)) > > At the moment, we already have skb with 1028 bytes of data which is not > marked for GSO so that the test is false (fragheaderlen is usually 20). > Thus we append second 1000 bytes to this skb without invoking UFO. Third > sendto(), however, has sufficient length to trigger the UFO path so that we > end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb() > uses udp_csum() to calculate the checksum but that assumes all fragments > have correct checksum in skb->csum which is not true for UFO fragments. > > When checking against MTU, we need to add skb->len to length of new segment > if we already have a partially filled skb and fragheaderlen only if there > isn't one. > > In the IPv6 case, skb can only be null if this is the first segment so that > we have to use headersize (length of the first IPv6 header) rather than > fragheaderlen (length of IPv6 header of further fragments) for skb == NULL. > > Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") > Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for > ip6 fragment between __ip6_append_data and ip6_finish_output") > Signed-off-by: Michal Kubecek LGTM. Acked-by: Vlad Yasevich -vlad > --- > net/ipv4/ip_output.c | 3 ++- > net/ipv6/ip6_output.c | 2 +- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index 7a3fd25e8913..532b36e9ce2a 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -964,7 +964,8 @@ static int __ip_append_data(struct sock *sk, > csummode = CHECKSUM_PARTIAL; > > cork->length += length; > - if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) && > + if length + (skb ? skb->len : fragheaderlen)) > mtu) || > + (skb && skb_is_gso(skb))) && > (sk->sk_protocol == IPPROTO_UDP) && > (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) && > (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) { > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index bf8a58a1c32d..1699acb2fa2c 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1390,7 +1390,7 @@ static int __ip6_append_data(struct sock *sk, >*/ > > cork->length += length; > - if length + fragheaderlen) > mtu) || > + if length + (skb ? skb->len : headersize)) > mtu) || >(skb && skb_is_gso(skb))) && > (sk->sk_protocol == IPPROTO_UDP) && > (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) && >
Re: [PATCH 5/5] sctp: Adjust one function call together with a variable assignment
On 05/22/2017 12:41 PM, SF Markus Elfring wrote: > From: Markus Elfring <elfr...@users.sourceforge.net> > Date: Mon, 22 May 2017 18:15:12 +0200 > > The script "checkpatch.pl" pointed information out like the following. > > ERROR: do not use assignment in if condition > > Thus fix the affected source code place. > > Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> > --- > net/sctp/protocol.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 057479b7bd72..be2fe3ebae78 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -141,7 +141,8 @@ static void sctp_v4_copy_addrlist(struct list_head > *addrlist, > struct sctp_sockaddr_entry *addr; > > rcu_read_lock(); > - if ((in_dev = __in_dev_get_rcu(dev)) == NULL) { > + in_dev = __in_dev_get_rcu(dev); > + if (!in_dev) { > rcu_read_unlock(); > return; > } > Acked-by: Vlad Yasevich <vyasev...@gmail.com> -vlad
Re: [PATCH 5/5] sctp: Adjust one function call together with a variable assignment
On 05/22/2017 12:41 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Mon, 22 May 2017 18:15:12 +0200 > > The script "checkpatch.pl" pointed information out like the following. > > ERROR: do not use assignment in if condition > > Thus fix the affected source code place. > > Signed-off-by: Markus Elfring > --- > net/sctp/protocol.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 057479b7bd72..be2fe3ebae78 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -141,7 +141,8 @@ static void sctp_v4_copy_addrlist(struct list_head > *addrlist, > struct sctp_sockaddr_entry *addr; > > rcu_read_lock(); > - if ((in_dev = __in_dev_get_rcu(dev)) == NULL) { > + in_dev = __in_dev_get_rcu(dev); > + if (!in_dev) { > rcu_read_unlock(); > return; > } > Acked-by: Vlad Yasevich -vlad
Re: [PATCH 4/5] sctp: Improve a size determination in sctp_inetaddr_event()
On 05/22/2017 12:40 PM, SF Markus Elfring wrote: > From: Markus Elfring <elfr...@users.sourceforge.net> > Date: Mon, 22 May 2017 18:08:24 +0200 > > Replace the specification of a data structure by a pointer dereference > as the parameter for the operator "sizeof" to make the corresponding size > determination a bit safer according to the Linux coding style convention. > > Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> > --- > net/sctp/protocol.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 64756c42cec9..057479b7bd72 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -784,7 +784,7 @@ static int sctp_inetaddr_event(struct notifier_block > *this, unsigned long ev, > > switch (ev) { > case NETDEV_UP: > - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC); > + addr = kmalloc(sizeof(*addr), GFP_ATOMIC); > if (addr) { > addr->a.v4.sin_family = AF_INET; > addr->a.v4.sin_port = 0; > Acked-by: Vlad Yasevich <vyasev...@gmail.com> -vlad
Re: [PATCH 4/5] sctp: Improve a size determination in sctp_inetaddr_event()
On 05/22/2017 12:40 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Mon, 22 May 2017 18:08:24 +0200 > > Replace the specification of a data structure by a pointer dereference > as the parameter for the operator "sizeof" to make the corresponding size > determination a bit safer according to the Linux coding style convention. > > Signed-off-by: Markus Elfring > --- > net/sctp/protocol.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 64756c42cec9..057479b7bd72 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -784,7 +784,7 @@ static int sctp_inetaddr_event(struct notifier_block > *this, unsigned long ev, > > switch (ev) { > case NETDEV_UP: > - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC); > + addr = kmalloc(sizeof(*addr), GFP_ATOMIC); > if (addr) { > addr->a.v4.sin_family = AF_INET; > addr->a.v4.sin_port = 0; > Acked-by: Vlad Yasevich -vlad
Re: [PATCH 3/5] sctp: Fix a typo in a comment line in sctp_init()
On 05/22/2017 12:39 PM, SF Markus Elfring wrote: > From: Markus Elfring <elfr...@users.sourceforge.net> > Date: Mon, 22 May 2017 17:43:44 +0200 > > Add a missing character in this description. > > Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> > --- > net/sctp/protocol.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 5e7c8a344770..64756c42cec9 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1454,7 +1454,7 @@ static __init int sctp_init(void) > } > > /* Allocate and initialize the SCTP port hash table. > - * Note that order is initalized to start at the max sized > + * Note that order is initialized to start at the max sized >* table we want to support. If we can't get that many pages > * reduce the order and try again >*/ > Acked-by: Vlad Yasevich <vyasev...@gmail.com. -vlad
Re: [PATCH 3/5] sctp: Fix a typo in a comment line in sctp_init()
On 05/22/2017 12:39 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Mon, 22 May 2017 17:43:44 +0200 > > Add a missing character in this description. > > Signed-off-by: Markus Elfring > --- > net/sctp/protocol.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 5e7c8a344770..64756c42cec9 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1454,7 +1454,7 @@ static __init int sctp_init(void) > } > > /* Allocate and initialize the SCTP port hash table. > - * Note that order is initalized to start at the max sized > + * Note that order is initialized to start at the max sized >* table we want to support. If we can't get that many pages > * reduce the order and try again >*/ > Acked-by: Vlad Yasevich
Re: [PATCH 2/5] sctp: Delete an error message for a failed memory allocation in sctp_init()
On 05/22/2017 12:38 PM, SF Markus Elfring wrote: > From: Markus Elfring <elfr...@users.sourceforge.net> > Date: Mon, 22 May 2017 17:28:14 +0200 > > Omit an extra message for a memory allocation failure in this function. > > This issue was detected by using the Coccinelle software. > > Link: > http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf > Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> > --- > net/sctp/protocol.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 2b1a6215bd2f..5e7c8a344770 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1447,5 +1447,4 @@ static __init int sctp_init(void) > if (!sctp_ep_hashtable) { > - pr_err("Failed endpoint_hash alloc\n"); > status = -ENOMEM; > goto err_ehash_alloc; > } > Acked-by: Vlad Yasevich <vyasev...@gmail.com> At the time this was written, it was patterned after TCP. Since then TCP changed significantly. We can surely clean-up the pr_err() here and possibly update the code as well later. -vlad
Re: [PATCH 2/5] sctp: Delete an error message for a failed memory allocation in sctp_init()
On 05/22/2017 12:38 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Mon, 22 May 2017 17:28:14 +0200 > > Omit an extra message for a memory allocation failure in this function. > > This issue was detected by using the Coccinelle software. > > Link: > http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf > Signed-off-by: Markus Elfring > --- > net/sctp/protocol.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 2b1a6215bd2f..5e7c8a344770 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1447,5 +1447,4 @@ static __init int sctp_init(void) > if (!sctp_ep_hashtable) { > - pr_err("Failed endpoint_hash alloc\n"); > status = -ENOMEM; > goto err_ehash_alloc; > } > Acked-by: Vlad Yasevich At the time this was written, it was patterned after TCP. Since then TCP changed significantly. We can surely clean-up the pr_err() here and possibly update the code as well later. -vlad
Re: [PATCH 1/5] sctp: Use kmalloc_array() in sctp_init()
On 05/22/2017 12:37 PM, SF Markus Elfring wrote: > From: Markus Elfring <elfr...@users.sourceforge.net> > Date: Mon, 22 May 2017 17:20:11 +0200 > > * A multiplication for the size determination of a memory allocation > indicated that an array data structure should be processed. > Thus use the corresponding function "kmalloc_array". > > This issue was detected by using the Coccinelle software. > > * Replace the specification of a data structure by a pointer dereference > to make the corresponding size determination a bit safer according to > the Linux coding style convention. > > Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> Acked-by: Vlad Yasevich <vyasev...@gmail.com> -vlad > --- > net/sctp/protocol.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 989a900383b5..2b1a6215bd2f 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1442,6 +1442,6 @@ static __init int sctp_init(void) > > /* Allocate and initialize the endpoint hash table. */ > sctp_ep_hashsize = 64; > - sctp_ep_hashtable = > - kmalloc(64 * sizeof(struct sctp_hashbucket), GFP_KERNEL); > + sctp_ep_hashtable = kmalloc_array(64, sizeof(*sctp_ep_hashtable), > + GFP_KERNEL); > if (!sctp_ep_hashtable) { >
Re: [PATCH 1/5] sctp: Use kmalloc_array() in sctp_init()
On 05/22/2017 12:37 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Mon, 22 May 2017 17:20:11 +0200 > > * A multiplication for the size determination of a memory allocation > indicated that an array data structure should be processed. > Thus use the corresponding function "kmalloc_array". > > This issue was detected by using the Coccinelle software. > > * Replace the specification of a data structure by a pointer dereference > to make the corresponding size determination a bit safer according to > the Linux coding style convention. > > Signed-off-by: Markus Elfring Acked-by: Vlad Yasevich -vlad > --- > net/sctp/protocol.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c > index 989a900383b5..2b1a6215bd2f 100644 > --- a/net/sctp/protocol.c > +++ b/net/sctp/protocol.c > @@ -1442,6 +1442,6 @@ static __init int sctp_init(void) > > /* Allocate and initialize the endpoint hash table. */ > sctp_ep_hashsize = 64; > - sctp_ep_hashtable = > - kmalloc(64 * sizeof(struct sctp_hashbucket), GFP_KERNEL); > + sctp_ep_hashtable = kmalloc_array(64, sizeof(*sctp_ep_hashtable), > + GFP_KERNEL); > if (!sctp_ep_hashtable) { >
Re: [PATCH net] ipv6: sctp: clone options to avoid use after free
On 12/09/2015 10:25 AM, Eric Dumazet wrote: > From: Eric Dumazet > > SCTP is lacking proper np->opt cloning at accept() time. > > TCP and DCCP use ipv6_dup_options() helper, do the same > in SCTP. > > We might later factorize this code in a common helper to avoid > future mistakes. > > Reported-by: Dmitry Vyukov > Signed-off-by: Eric Dumazet Acked-by: Vlad Yasevich This is sufficient for accept() processing, but looks like peeloff is missing a bunch of ipv6 support. I'll see if I can cook something up to fix that part. -vlad > --- > net/sctp/ipv6.c |8 > 1 file changed, 8 insertions(+) > > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c > index d28c0b4c9128..ec529121f38a 100644 > --- a/net/sctp/ipv6.c > +++ b/net/sctp/ipv6.c > @@ -641,6 +641,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock > *sk, > struct sock *newsk; > struct ipv6_pinfo *newnp, *np = inet6_sk(sk); > struct sctp6_sock *newsctp6sk; > + struct ipv6_txoptions *opt; > > newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0); > if (!newsk) > @@ -660,6 +661,13 @@ static struct sock *sctp_v6_create_accept_sk(struct sock > *sk, > > memcpy(newnp, np, sizeof(struct ipv6_pinfo)); > > + rcu_read_lock(); > + opt = rcu_dereference(np->opt); > + if (opt) > + opt = ipv6_dup_options(newsk, opt); > + RCU_INIT_POINTER(newnp->opt, opt); > + rcu_read_unlock(); > + > /* Initialize sk's sport, dport, rcv_saddr and daddr for getsockname() >* and getpeername(). >*/ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net] ipv6: sctp: clone options to avoid use after free
On 12/09/2015 10:25 AM, Eric Dumazet wrote: > From: Eric Dumazet <eduma...@google.com> > > SCTP is lacking proper np->opt cloning at accept() time. > > TCP and DCCP use ipv6_dup_options() helper, do the same > in SCTP. > > We might later factorize this code in a common helper to avoid > future mistakes. > > Reported-by: Dmitry Vyukov <dvyu...@google.com> > Signed-off-by: Eric Dumazet <eduma...@google.com> Acked-by: Vlad Yasevich <vyasev...@gmail.com> This is sufficient for accept() processing, but looks like peeloff is missing a bunch of ipv6 support. I'll see if I can cook something up to fix that part. -vlad > --- > net/sctp/ipv6.c |8 > 1 file changed, 8 insertions(+) > > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c > index d28c0b4c9128..ec529121f38a 100644 > --- a/net/sctp/ipv6.c > +++ b/net/sctp/ipv6.c > @@ -641,6 +641,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock > *sk, > struct sock *newsk; > struct ipv6_pinfo *newnp, *np = inet6_sk(sk); > struct sctp6_sock *newsctp6sk; > + struct ipv6_txoptions *opt; > > newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0); > if (!newsk) > @@ -660,6 +661,13 @@ static struct sock *sctp_v6_create_accept_sk(struct sock > *sk, > > memcpy(newnp, np, sizeof(struct ipv6_pinfo)); > > + rcu_read_lock(); > + opt = rcu_dereference(np->opt); > + if (opt) > + opt = ipv6_dup_options(newsk, opt); > + RCU_INIT_POINTER(newnp->opt, opt); > + rcu_read_unlock(); > + > /* Initialize sk's sport, dport, rcv_saddr and daddr for getsockname() >* and getpeername(). >*/ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: use-after-free in sctp_do_sm
On 12/04/2015 07:55 AM, Marcelo Ricardo Leitner wrote: > On Fri, Dec 04, 2015 at 11:40:02AM +0100, Dmitry Vyukov wrote: >> On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches wrote: >>> (adding lkml as this is likely better discussed there) >>> >>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote: On 12/03/2015 03:24 PM, Joe Perches wrote: > On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote: >> On 12/03/2015 03:03 PM, Joe Perches wrote: >>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote: On 12/03/2015 01:52 PM, Aaron Conole wrote: > I think that as a minimum, the following patch should be evaluted, > but am unsure to whom I should submit it (after I test): >>> [] Agreed - the intention here is certainly to have no side effects. It looks like 'no_printk()' is used in quite a few other places that would benefit from this change. So we probably want a generic 'really_no_printk()' macro. >>> >>> https://lkml.org/lkml/2012/6/17/231 >> >> I don't see this in the tree. > > It never got applied. > >> Also maybe we should just convert >> no_printk() to do what your 'eliminated_printk()'. > > Some of them at least. > >> So we can convert all users with this change? > > I don't think so, I think there are some > function evaluation/side effects that are > required. I believe some do hardware I/O. > > It'd be good to at least isolate them. > > I'm not sure how to find them via some > automated tool/mechanism though. > > I asked Julia Lawall about it once in this > thread: https://lkml.org/lkml/2014/12/3/696 > Seems rather fragile to have side effects that we rely upon hidden in a printk(). >>> >>> Yup. >>> Just convert them and see what breaks :) >>> >>> I appreciate your optimism. It's very 1995. >>> Try it and see what happens. >> >> >> Whatever is the resolution for pr_debug, we still need to fix this >> particular use-after-free. It affects stability of debug builds, gives >> invalid debug output, prevents us from finding more bugs in SCTP. And >> maybe somebody uses CONFIG_DYNAMIC_DEBUG in production. > > Agreed. I'm already working on a fix for this particular use-after-free. > > Another interesting thing about this is that sctp_do_sm() is called for > nearly every movement that happens on a sctp socket. Said that, that > always-running IDR search hidden on that debug statement do have some > nasty performance impact, specially because it's serialized on a > spinlock. YUCK! I didn't really pay much attention to those debug macros before, but debug_post_sfx() is truly awful. This wasn't such a bad thing where these macros depended on CONFIG_SCTP_DEBUG, but now that they are always built, we need fix them. -vlad > This wouldn't be happening if it was fully ellided and would > be ok if that pr_debug() was really being printed, but not as it is. > Kudos to this report that I could notice this. I'm trying to fix this on > SCTP-side as well. > > Marcelo > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: use-after-free in sctp_do_sm
On 12/04/2015 07:55 AM, Marcelo Ricardo Leitner wrote: > On Fri, Dec 04, 2015 at 11:40:02AM +0100, Dmitry Vyukov wrote: >> On Thu, Dec 3, 2015 at 9:51 PM, Joe Percheswrote: >>> (adding lkml as this is likely better discussed there) >>> >>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote: On 12/03/2015 03:24 PM, Joe Perches wrote: > On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote: >> On 12/03/2015 03:03 PM, Joe Perches wrote: >>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote: On 12/03/2015 01:52 PM, Aaron Conole wrote: > I think that as a minimum, the following patch should be evaluted, > but am unsure to whom I should submit it (after I test): >>> [] Agreed - the intention here is certainly to have no side effects. It looks like 'no_printk()' is used in quite a few other places that would benefit from this change. So we probably want a generic 'really_no_printk()' macro. >>> >>> https://lkml.org/lkml/2012/6/17/231 >> >> I don't see this in the tree. > > It never got applied. > >> Also maybe we should just convert >> no_printk() to do what your 'eliminated_printk()'. > > Some of them at least. > >> So we can convert all users with this change? > > I don't think so, I think there are some > function evaluation/side effects that are > required. I believe some do hardware I/O. > > It'd be good to at least isolate them. > > I'm not sure how to find them via some > automated tool/mechanism though. > > I asked Julia Lawall about it once in this > thread: https://lkml.org/lkml/2014/12/3/696 > Seems rather fragile to have side effects that we rely upon hidden in a printk(). >>> >>> Yup. >>> Just convert them and see what breaks :) >>> >>> I appreciate your optimism. It's very 1995. >>> Try it and see what happens. >> >> >> Whatever is the resolution for pr_debug, we still need to fix this >> particular use-after-free. It affects stability of debug builds, gives >> invalid debug output, prevents us from finding more bugs in SCTP. And >> maybe somebody uses CONFIG_DYNAMIC_DEBUG in production. > > Agreed. I'm already working on a fix for this particular use-after-free. > > Another interesting thing about this is that sctp_do_sm() is called for > nearly every movement that happens on a sctp socket. Said that, that > always-running IDR search hidden on that debug statement do have some > nasty performance impact, specially because it's serialized on a > spinlock. YUCK! I didn't really pay much attention to those debug macros before, but debug_post_sfx() is truly awful. This wasn't such a bad thing where these macros depended on CONFIG_SCTP_DEBUG, but now that they are always built, we need fix them. -vlad > This wouldn't be happening if it was fully ellided and would > be ok if that pr_debug() was really being printed, but not as it is. > Kudos to this report that I could notice this. I'm trying to fix this on > SCTP-side as well. > > Marcelo > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE
On 08/27/2015 10:42 PM, Jason Wang wrote: > > > On 08/27/2015 06:43 PM, Michael S. Tsirkin wrote: >> On Wed, Aug 26, 2015 at 01:45:30PM +0800, Jason Wang wrote: >>> >>> On 08/26/2015 12:32 AM, Vlad Yasevich wrote: >>>> On 08/25/2015 07:30 AM, Jason Wang wrote: >>>>> On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote: >>>>>> On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote: >>>>>>>> For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0. >>>>>>>> >>>>>>>> For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 >>>>>>>> ("macvtap: Add support of packet capture on macvtap >>>>>>>> device."). Multiqueue macvtap suffers from single qdisc lock >>>>>>>> contention. This is because macvtap claims a non zero tx_queue_len and >>>>>>>> it reuses this value as it socket receive queue size.Thanks to >>>>>>>> IFF_NO_QUEUE, we can remove the lock contention without breaking >>>>>>>> existing socket receive queue length logic. >>>>>>>> >>>>>>>> Cc: Patrick McHardy >>>>>>>> Cc: Vladislav Yasevich >>>>>>>> Cc: Michael S. Tsirkin >>>>>>>> Signed-off-by: Jason Wang >>>>>> Seems to make sense. Give me a day or two to get over the jet lag >>>>>> (and get out from under the pile of mail accumulated while I was >>>>>> traveling), >>>>>> I'll review properly and ack. >>>>>> >>>>> A note on this patch: only default qdisc were removed but we don't lose >>>>> the ability to attach a qdisc to macvtap (though it may cause lock >>>>> contention on multiqueue case). >>>>> >>>> Wouldn't that lock contention be solved if we really had multiple queues >>>> for multi-queue macvtaps? >>>> >>>> -vlad >>> Yes, but this introduce another layer of txq locks contention? >> I don't follow - why does it? Could you clarify please? > > I believe Vlad wants to remove NETIF_F_LLTX. If yes, core will do an > extra tx lock at macvlan layer. No, I don't want to remove it. In a sense, it would function similar to how it works when fwd_priv is populated. I am still testing the code as it's showing some strange artifacts... could be due to keeping LLTX. -vlad > >> >>> And it >>> also needs macvlan multiqueue support. We used to do something like this >>> but switch to NETIF_F_LLTX finally. You may refer: >>> >>> 2c11455321f37da6fe6cc36353149f9ac9183334 macvlan: add multiqueue capability >>> 8ffab51b3dfc54876f145f15b351c41f3f703195 macvlan: lockless tx path >> My concern is that the moment someone configures a non-standard qdisc >> scalability suddenly disappears. That would also be tricky to debug in the >> field as not a lot of developers use non-standard qdiscs. >> What do you think? > > Probably not an issue. Non-standard qdisc may need be attached manually > after device creation, and we don't lose this ability with this patch. > (Unless somebody changes default_qdisc). Actually, user before > 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 does not expect any qdisc work > for macvtap like other stacked devices. This patch also restore this. > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE
On 08/27/2015 10:42 PM, Jason Wang wrote: On 08/27/2015 06:43 PM, Michael S. Tsirkin wrote: On Wed, Aug 26, 2015 at 01:45:30PM +0800, Jason Wang wrote: On 08/26/2015 12:32 AM, Vlad Yasevich wrote: On 08/25/2015 07:30 AM, Jason Wang wrote: On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote: On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote: For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0. For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 (macvtap: Add support of packet capture on macvtap device.). Multiqueue macvtap suffers from single qdisc lock contention. This is because macvtap claims a non zero tx_queue_len and it reuses this value as it socket receive queue size.Thanks to IFF_NO_QUEUE, we can remove the lock contention without breaking existing socket receive queue length logic. Cc: Patrick McHardy ka...@trash.net Cc: Vladislav Yasevich vyase...@redhat.com Cc: Michael S. Tsirkin m...@redhat.com Signed-off-by: Jason Wang jasow...@redhat.com Seems to make sense. Give me a day or two to get over the jet lag (and get out from under the pile of mail accumulated while I was traveling), I'll review properly and ack. A note on this patch: only default qdisc were removed but we don't lose the ability to attach a qdisc to macvtap (though it may cause lock contention on multiqueue case). Wouldn't that lock contention be solved if we really had multiple queues for multi-queue macvtaps? -vlad Yes, but this introduce another layer of txq locks contention? I don't follow - why does it? Could you clarify please? I believe Vlad wants to remove NETIF_F_LLTX. If yes, core will do an extra tx lock at macvlan layer. No, I don't want to remove it. In a sense, it would function similar to how it works when fwd_priv is populated. I am still testing the code as it's showing some strange artifacts... could be due to keeping LLTX. -vlad And it also needs macvlan multiqueue support. We used to do something like this but switch to NETIF_F_LLTX finally. You may refer: 2c11455321f37da6fe6cc36353149f9ac9183334 macvlan: add multiqueue capability 8ffab51b3dfc54876f145f15b351c41f3f703195 macvlan: lockless tx path My concern is that the moment someone configures a non-standard qdisc scalability suddenly disappears. That would also be tricky to debug in the field as not a lot of developers use non-standard qdiscs. What do you think? Probably not an issue. Non-standard qdisc may need be attached manually after device creation, and we don't lose this ability with this patch. (Unless somebody changes default_qdisc). Actually, user before 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 does not expect any qdisc work for macvtap like other stacked devices. This patch also restore this. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE
On 08/25/2015 07:30 AM, Jason Wang wrote: > > > On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote: >> On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote: For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0. For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 ("macvtap: Add support of packet capture on macvtap device."). Multiqueue macvtap suffers from single qdisc lock contention. This is because macvtap claims a non zero tx_queue_len and it reuses this value as it socket receive queue size.Thanks to IFF_NO_QUEUE, we can remove the lock contention without breaking existing socket receive queue length logic. Cc: Patrick McHardy Cc: Vladislav Yasevich Cc: Michael S. Tsirkin Signed-off-by: Jason Wang >> Seems to make sense. Give me a day or two to get over the jet lag >> (and get out from under the pile of mail accumulated while I was traveling), >> I'll review properly and ack. >> > > A note on this patch: only default qdisc were removed but we don't lose > the ability to attach a qdisc to macvtap (though it may cause lock > contention on multiqueue case). > Wouldn't that lock contention be solved if we really had multiple queues for multi-queue macvtaps? -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE
On 08/25/2015 07:30 AM, Jason Wang wrote: On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote: On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote: For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0. For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 (macvtap: Add support of packet capture on macvtap device.). Multiqueue macvtap suffers from single qdisc lock contention. This is because macvtap claims a non zero tx_queue_len and it reuses this value as it socket receive queue size.Thanks to IFF_NO_QUEUE, we can remove the lock contention without breaking existing socket receive queue length logic. Cc: Patrick McHardy ka...@trash.net Cc: Vladislav Yasevich vyase...@redhat.com Cc: Michael S. Tsirkin m...@redhat.com Signed-off-by: Jason Wang jasow...@redhat.com Seems to make sense. Give me a day or two to get over the jet lag (and get out from under the pile of mail accumulated while I was traveling), I'll review properly and ack. A note on this patch: only default qdisc were removed but we don't lose the ability to attach a qdisc to macvtap (though it may cause lock contention on multiqueue case). Wouldn't that lock contention be solved if we really had multiple queues for multi-queue macvtaps? -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Add counters for out data chunk discards
On 07/06/2015 01:37 PM, Vitaly Andrianov wrote: > This commit adds a MIB entry for out data chunk discards. > Number of outgoing SCTP DATA chunks for a SCTP association for which no > problems were encountered to prevent their transmission but were discarded. > Data chunks are discarded due to ungraceful closing of the SCTP > association. > > Signed-off-by: Vitaly Andrianov > --- > include/net/sctp/sctp.h | 1 + > net/sctp/outqueue.c | 1 + > net/sctp/proc.c | 2 ++ > 3 files changed, 4 insertions(+) > > diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h > index ce13cf2..fd806ea 100644 > --- a/include/net/sctp/sctp.h > +++ b/include/net/sctp/sctp.h > @@ -216,6 +216,7 @@ enum { > SCTP_MIB_IN_PKT_BACKLOG, > SCTP_MIB_IN_PKT_DISCARDS, > SCTP_MIB_IN_DATA_CHUNK_DISCARDS, > + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS, > __SCTP_MIB_MAX > }; > > diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c > index 7e8f0a1..dc1a40f 100644 > --- a/net/sctp/outqueue.c > +++ b/net/sctp/outqueue.c > @@ -315,6 +315,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk > *chunk) > case SCTP_STATE_SHUTDOWN_RECEIVED: > case SCTP_STATE_SHUTDOWN_ACK_SENT: > /* Cannot send after transport endpoint shutdown */ > + SCTP_INC_STATS(net, SCTP_MIB_OUT_DATA_CHUNK_DISCARDS); > error = -ESHUTDOWN; > break; First, this is not the only case where a chunk is discarded on output. If you are going to add an OUT_DATA_DISCARDS counter, you need to catch all such discards. Second, in this particular case, you will count discarded messages, not chunks since sctp_outq_tail is guaranteed to queue the whole message or none of it. -vlad > > diff --git a/net/sctp/proc.c b/net/sctp/proc.c > index 0697eda..1a10b81 100644 > --- a/net/sctp/proc.c > +++ b/net/sctp/proc.c > @@ -67,6 +67,8 @@ static const struct snmp_mib sctp_snmp_list[] = { > SNMP_MIB_ITEM("SctpInPktBacklog", SCTP_MIB_IN_PKT_BACKLOG), > SNMP_MIB_ITEM("SctpInPktDiscards", SCTP_MIB_IN_PKT_DISCARDS), > SNMP_MIB_ITEM("SctpInDataChunkDiscards", > SCTP_MIB_IN_DATA_CHUNK_DISCARDS), > + SNMP_MIB_ITEM("SctpOutDataChunkDiscards", > + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS), > SNMP_MIB_SENTINEL > }; > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Add counters for out data chunk discards
On 07/06/2015 01:37 PM, Vitaly Andrianov wrote: This commit adds a MIB entry for out data chunk discards. Number of outgoing SCTP DATA chunks for a SCTP association for which no problems were encountered to prevent their transmission but were discarded. Data chunks are discarded due to ungraceful closing of the SCTP association. Signed-off-by: Vitaly Andrianov vita...@ti.com --- include/net/sctp/sctp.h | 1 + net/sctp/outqueue.c | 1 + net/sctp/proc.c | 2 ++ 3 files changed, 4 insertions(+) diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h index ce13cf2..fd806ea 100644 --- a/include/net/sctp/sctp.h +++ b/include/net/sctp/sctp.h @@ -216,6 +216,7 @@ enum { SCTP_MIB_IN_PKT_BACKLOG, SCTP_MIB_IN_PKT_DISCARDS, SCTP_MIB_IN_DATA_CHUNK_DISCARDS, + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS, __SCTP_MIB_MAX }; diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c index 7e8f0a1..dc1a40f 100644 --- a/net/sctp/outqueue.c +++ b/net/sctp/outqueue.c @@ -315,6 +315,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk) case SCTP_STATE_SHUTDOWN_RECEIVED: case SCTP_STATE_SHUTDOWN_ACK_SENT: /* Cannot send after transport endpoint shutdown */ + SCTP_INC_STATS(net, SCTP_MIB_OUT_DATA_CHUNK_DISCARDS); error = -ESHUTDOWN; break; First, this is not the only case where a chunk is discarded on output. If you are going to add an OUT_DATA_DISCARDS counter, you need to catch all such discards. Second, in this particular case, you will count discarded messages, not chunks since sctp_outq_tail is guaranteed to queue the whole message or none of it. -vlad diff --git a/net/sctp/proc.c b/net/sctp/proc.c index 0697eda..1a10b81 100644 --- a/net/sctp/proc.c +++ b/net/sctp/proc.c @@ -67,6 +67,8 @@ static const struct snmp_mib sctp_snmp_list[] = { SNMP_MIB_ITEM(SctpInPktBacklog, SCTP_MIB_IN_PKT_BACKLOG), SNMP_MIB_ITEM(SctpInPktDiscards, SCTP_MIB_IN_PKT_DISCARDS), SNMP_MIB_ITEM(SctpInDataChunkDiscards, SCTP_MIB_IN_DATA_CHUNK_DISCARDS), + SNMP_MIB_ITEM(SctpOutDataChunkDiscards, + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS), SNMP_MIB_SENTINEL }; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel >= 4.0: crashes when using traceroute6 with isatap
On 05/12/2015 04:36 PM, Eric Dumazet wrote: > On Tue, 2015-05-12 at 16:18 -0400, Vlad Yasevich wrote: >> On 05/06/2015 06:11 PM, Wolfgang Walter wrote: >>> Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie: >>>> (Cc'ing netdev.) >>>> >>>> On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter wrote: >>>>> Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: >>>>>> Hello, >>>>>> >>>>>> kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an >>>>>> isatap-tunnel. >>>>> >>>>> I did some further tests. To trigger the crash you need >>>>> >>>>> * isatap-tunnel (probably any sit-tunnel will do it) >>>>> * raw-socket >>>>> * udp >>>>> >>>>> Using icmpv6 or tcp i.e. does not trigger it. >>>> >>>> Do you have a script to reproduce it? >>>> >>>> >>>> Thanks for the bug report! >>> >>> You need a isatap-server with say ipv4-address $X >>> >>> Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X >>> >>> then do >>> >>> traceroute6 www.google.de >>> >>> Regards, >>> >> >> Hi Walter >> >> Could you try this patch. Looks like raw passes transhdrlen >> of 0 on the first packet and that makes IPv4 behave correctly, >> but not IPv6. >> >> >> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c >> index 7fde1f2..fd9c079 100644 >> --- a/net/ipv6/ip6_output.c >> +++ b/net/ipv6/ip6_output.c >> @@ -1278,7 +1278,7 @@ emsgsize: >> /* If this is the first and only packet and device >> * supports checksum offloading, let's use it. >> */ >> -if (!skb && sk->sk_protocol == IPPROTO_UDP && >> +if (transhdrlen && sk->sk_protocol == IPPROTO_UDP && >> length + fragheaderlen < mtu && >> rt->dst.dev->features & NETIF_F_V6_CSUM && >> !exthdrlen) > > And make sure the checksum is correct ;) > > Vlad, can you tell where skb->cum_start and skb->csum_offset are set ? > > For udp, udp6_hwcsum_outgoing(), but with the above patch, the check above will return false, and we'll fallback to using CHECKSUM_NONE. Before, the !skb was true since there there was no skb on the queue. Now, that we are checking transhdr and raw passing in 0, the check will be false. This is what's making IPv4 work correctly in this case. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel >= 4.0: crashes when using traceroute6 with isatap
On 05/06/2015 06:11 PM, Wolfgang Walter wrote: > Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie: >> (Cc'ing netdev.) >> >> On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter wrote: >>> Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: Hello, kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an isatap-tunnel. >>> >>> I did some further tests. To trigger the crash you need >>> >>> * isatap-tunnel (probably any sit-tunnel will do it) >>> * raw-socket >>> * udp >>> >>> Using icmpv6 or tcp i.e. does not trigger it. >> >> Do you have a script to reproduce it? >> >> >> Thanks for the bug report! > > You need a isatap-server with say ipv4-address $X > > Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X > > then do > > traceroute6 www.google.de > > Regards, > Hi Walter Could you try this patch. Looks like raw passes transhdrlen of 0 on the first packet and that makes IPv4 behave correctly, but not IPv6. diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7fde1f2..fd9c079 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1278,7 +1278,7 @@ emsgsize: /* If this is the first and only packet and device * supports checksum offloading, let's use it. */ - if (!skb && sk->sk_protocol == IPPROTO_UDP && + if (transhdrlen && sk->sk_protocol == IPPROTO_UDP && length + fragheaderlen < mtu && rt->dst.dev->features & NETIF_F_V6_CSUM && !exthdrlen) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel >= 4.0: crashes when using traceroute6 with isatap
On 05/06/2015 06:42 PM, Eric Dumazet wrote: > On Thu, 2015-05-07 at 00:04 +0200, Wolfgang Walter wrote: >> Am Mittwoch, 6. Mai 2015, 12:10:00 schrieb Eric Dumazet: >>> On Wed, 2015-05-06 at 11:15 -0700, Cong Wang wrote: (Cc'ing netdev.) On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter wrote: > Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: >> Hello, >> >> kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with >> an >> isatap-tunnel. > > I did some further tests. To trigger the crash you need > > * isatap-tunnel (probably any sit-tunnel will do it) > * raw-socket > * udp > > Using icmpv6 or tcp i.e. does not trigger it. Do you have a script to reproduce it? Thanks for the bug report! -- >>> >>> Please Wolfgang try to revert 32dce968dd987adfb0c00946d78dad9154f64759 >>> ("ipv6: Allow for partial checksums on non-ufo packets") >> >> Indeed, that fixes the problem. > > Yes, setting skb->csum to 0 is clearly wrong for CHECKSUM_PARTIAL > > Would you try : > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index > 7fde1f265c90e90f16291e6c861b6e242111c25b..694ae630e1ca67e25ab1e5f6dd0b3597db3669b0 > 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1416,6 +1416,10 @@ alloc_new_skb: > data += fragheaderlen; > skb->transport_header = (skb->network_header + >fragheaderlen); > + if (csummode == CHECKSUM_PARTIAL) { > + skb->csum_start = skb_transport_header(skb) - > skb->head; > + skb->csum_offset = offsetof(struct udphdr, > check); > + } > if (fraggap) { > skb->csum = skb_copy_and_csum_bits( > skb_prev, maxfraglen, > > > So why is this not an issue in __ip_append_data()? -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel = 4.0: crashes when using traceroute6 with isatap
On 05/06/2015 06:42 PM, Eric Dumazet wrote: On Thu, 2015-05-07 at 00:04 +0200, Wolfgang Walter wrote: Am Mittwoch, 6. Mai 2015, 12:10:00 schrieb Eric Dumazet: On Wed, 2015-05-06 at 11:15 -0700, Cong Wang wrote: (Cc'ing netdev.) On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote: Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: Hello, kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an isatap-tunnel. I did some further tests. To trigger the crash you need * isatap-tunnel (probably any sit-tunnel will do it) * raw-socket * udp Using icmpv6 or tcp i.e. does not trigger it. Do you have a script to reproduce it? Thanks for the bug report! -- Please Wolfgang try to revert 32dce968dd987adfb0c00946d78dad9154f64759 (ipv6: Allow for partial checksums on non-ufo packets) Indeed, that fixes the problem. Yes, setting skb-csum to 0 is clearly wrong for CHECKSUM_PARTIAL Would you try : diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7fde1f265c90e90f16291e6c861b6e242111c25b..694ae630e1ca67e25ab1e5f6dd0b3597db3669b0 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1416,6 +1416,10 @@ alloc_new_skb: data += fragheaderlen; skb-transport_header = (skb-network_header + fragheaderlen); + if (csummode == CHECKSUM_PARTIAL) { + skb-csum_start = skb_transport_header(skb) - skb-head; + skb-csum_offset = offsetof(struct udphdr, check); + } if (fraggap) { skb-csum = skb_copy_and_csum_bits( skb_prev, maxfraglen, So why is this not an issue in __ip_append_data()? -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel = 4.0: crashes when using traceroute6 with isatap
On 05/06/2015 06:11 PM, Wolfgang Walter wrote: Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie: (Cc'ing netdev.) On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote: Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: Hello, kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an isatap-tunnel. I did some further tests. To trigger the crash you need * isatap-tunnel (probably any sit-tunnel will do it) * raw-socket * udp Using icmpv6 or tcp i.e. does not trigger it. Do you have a script to reproduce it? Thanks for the bug report! You need a isatap-server with say ipv4-address $X Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X then do traceroute6 www.google.de Regards, Hi Walter Could you try this patch. Looks like raw passes transhdrlen of 0 on the first packet and that makes IPv4 behave correctly, but not IPv6. diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7fde1f2..fd9c079 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1278,7 +1278,7 @@ emsgsize: /* If this is the first and only packet and device * supports checksum offloading, let's use it. */ - if (!skb sk-sk_protocol == IPPROTO_UDP + if (transhdrlen sk-sk_protocol == IPPROTO_UDP length + fragheaderlen mtu rt-dst.dev-features NETIF_F_V6_CSUM !exthdrlen) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel = 4.0: crashes when using traceroute6 with isatap
On 05/12/2015 04:36 PM, Eric Dumazet wrote: On Tue, 2015-05-12 at 16:18 -0400, Vlad Yasevich wrote: On 05/06/2015 06:11 PM, Wolfgang Walter wrote: Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie: (Cc'ing netdev.) On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote: Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter: Hello, kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an isatap-tunnel. I did some further tests. To trigger the crash you need * isatap-tunnel (probably any sit-tunnel will do it) * raw-socket * udp Using icmpv6 or tcp i.e. does not trigger it. Do you have a script to reproduce it? Thanks for the bug report! You need a isatap-server with say ipv4-address $X Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X then do traceroute6 www.google.de Regards, Hi Walter Could you try this patch. Looks like raw passes transhdrlen of 0 on the first packet and that makes IPv4 behave correctly, but not IPv6. diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7fde1f2..fd9c079 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1278,7 +1278,7 @@ emsgsize: /* If this is the first and only packet and device * supports checksum offloading, let's use it. */ -if (!skb sk-sk_protocol == IPPROTO_UDP +if (transhdrlen sk-sk_protocol == IPPROTO_UDP length + fragheaderlen mtu rt-dst.dev-features NETIF_F_V6_CSUM !exthdrlen) And make sure the checksum is correct ;) Vlad, can you tell where skb-cum_start and skb-csum_offset are set ? For udp, udp6_hwcsum_outgoing(), but with the above patch, the check above will return false, and we'll fallback to using CHECKSUM_NONE. Before, the !skb was true since there there was no skb on the queue. Now, that we are checking transhdr and raw passing in 0, the check will be false. This is what's making IPv4 work correctly in this case. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
On 02/15/2015 11:54 PM, Hiroshi Shimamoto wrote: >>>>>>>>>>> Can you please fix up your patches based on my tree: >>>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que >>>>>>>>>>> ue.git >>>>>>>>>> >>>>>>>>>> Yes. I haven't noticed your tree. >>>>>>>>>> Will resend patches against it. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I encountered an issue with your tree, the commit id is below. >>>>>>>>> >>>>>>>>> $ git log | head >>>>>>>>> commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e >>>>>>>>> Author: Rasmus Villemoes >>>>>>>>> Date: Fri Jan 23 20:43:14 2015 -0800 >>>>>>>>> >>>>>>>>> ethernet: fm10k: Actually drop 4 bits >>>>>>>>> >>>>>>>>> The comment explains the intention, but vid has type u16. Before >>>> the >>>>>>>>> inner shift, it is promoted to int, which has plenty of space for >>>>>>>>> all >>>>>>>>> vid's bits, so nothing is dropped. Use a simple mask instead. >>>>>>>>> >>>>>>>>> >>>>>>>>> I use the kernel from your tree in both host and guest. >>>>>>>>> >>>>>>>>> Assign an IPv6 for VF in guest. >>>>>>>>> # ip -6 addr add 2001:db8::18:1/64 dev ens0 >>>>>>>>> >>>>>>>>> Send ping packet from other server to the VM. >>>>>>>>> # ping6 2001:db8::18:1 -I eth0 >>>>>>>>> >>>>>>>>> The following message was shown. >>>>>>>>> ixgbevf :00:08.0: partial checksum but l4 proto=3a! >>>>>>>>> >>>>>>>>> If I did the same operation in the host, I saw the same error >>>>>>>>> message in >>>>>>> host too. >>>>>>>>> ixgbe :2d:00.0: partial checksum but l4 proto=3a! >>>>>>>>> >>>>>>>>> Do you have any idea about that? >>>>>>>> >>>>>>>> Ah, sorry about that, try this tree again: >>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git >>>>>>>> >>>>>>>> That patch was dropped for favor of a patch that Matthew Vick >>>>>>>> put together (and recently got pushed upstream). So my queue no >>>>>>>> longer has that patch in the queue, since it got dropped. >>>>>>> >>>>>>> I still see the same error, the head id is the below >>>>>>> >>>>>>> $ git log | head >>>>>>> commit a072afb0b45904022b76deef3b770ee9a93cb13a >>>>>>> Author: Nicholas Krause >>>>>>> Date: Mon Feb 9 00:27:00 2015 -0800 >>>>>>> >>>>>>> igb: Remove outdated fix me comment in the >>>>>>> function,gb_acquire_swfw_sync_i210 >>>>>>> >>>>>>> >>>>>>> thanks, >>>>>>> Hiroshi >>>>>> >>>>>> I'm having our validation see if they can recreate the same issue >>>>>> internally. When they get back to me I'll let you >>>>> know >>>>>> what we found. >>>>> >>>>> We did bisect, and the below looks the culprit; >>>>> >>>>> 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit >>>>> commit 32dce968dd987adfb0c00946d78dad9154f64759 >>>>> Author: Vlad Yasevich >>>>> Date: Sat Jan 31 10:40:18 2015 -0500 >>>>> >>>>> ipv6: Allow for partial checksums on non-ufo packets >>>>> >>>>> Currntly, if we are not doing UFO on the packet, all UDP >>>>> packets will start with CHECKSUM_NONE and thus perform full >>>>> checksum computations in software even if device support >>>>> IPv6 checksum offloading. >>>>> >>>>>
Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
On 02/15/2015 11:54 PM, Hiroshi Shimamoto wrote: Can you please fix up your patches based on my tree: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que ue.git Yes. I haven't noticed your tree. Will resend patches against it. I encountered an issue with your tree, the commit id is below. $ git log | head commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e Author: Rasmus Villemoes li...@rasmusvillemoes.dk Date: Fri Jan 23 20:43:14 2015 -0800 ethernet: fm10k: Actually drop 4 bits The comment explains the intention, but vid has type u16. Before the inner shift, it is promoted to int, which has plenty of space for all vid's bits, so nothing is dropped. Use a simple mask instead. I use the kernel from your tree in both host and guest. Assign an IPv6 for VF in guest. # ip -6 addr add 2001:db8::18:1/64 dev ens0 Send ping packet from other server to the VM. # ping6 2001:db8::18:1 -I eth0 The following message was shown. ixgbevf :00:08.0: partial checksum but l4 proto=3a! If I did the same operation in the host, I saw the same error message in host too. ixgbe :2d:00.0: partial checksum but l4 proto=3a! Do you have any idea about that? Ah, sorry about that, try this tree again: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git That patch was dropped for favor of a patch that Matthew Vick put together (and recently got pushed upstream). So my queue no longer has that patch in the queue, since it got dropped. I still see the same error, the head id is the below $ git log | head commit a072afb0b45904022b76deef3b770ee9a93cb13a Author: Nicholas Krause xerofo...@gmail.com Date: Mon Feb 9 00:27:00 2015 -0800 igb: Remove outdated fix me comment in the function,gb_acquire_swfw_sync_i210 thanks, Hiroshi I'm having our validation see if they can recreate the same issue internally. When they get back to me I'll let you know what we found. We did bisect, and the below looks the culprit; 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit commit 32dce968dd987adfb0c00946d78dad9154f64759 Author: Vlad Yasevich vyasev...@gmail.com Date: Sat Jan 31 10:40:18 2015 -0500 ipv6: Allow for partial checksums on non-ufo packets Currntly, if we are not doing UFO on the packet, all UDP packets will start with CHECKSUM_NONE and thus perform full checksum computations in software even if device support IPv6 checksum offloading. Let's start start with CHECKSUM_PARTIAL if the device supports it and we are sending only a single packet at or below mtu size. Signed-off-by: Vladislav Yasevich vyase...@redhat.com Signed-off-by: David S. Miller da...@davemloft.net :04 04 4437eaf7e944f5a6136ebf668a256fee688fda3d fade8da998d35c8da97a15f0556949ad371e5347 M net When I reverted the commit, the issue was solved. thanks, Hiroshi I believe the issue is that this patch (32dce968dd98 - ipv6: Allow for partial checksums on non-ufo packets) is that it now sets CHECKSUM_PARTIAL on all IPv6 packets including ICMPv6 ones. Our HW (82599) only supports checksum offload on TCP/UDP (NETIF_F_IPV6_CSUM) so we get hung up on the skb's protocol and the fact that it is CHECKSUM_PARTIAL. Another thing that confuses me is the feature test in this patch. It checks (rt-dst.dev-features NETIF_F_V6_CSUM) but NETIF_F_V6_CSUM is a two bit field? #define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM) So the test would succeed if either bit was high, that doesn't seem right. I cc'd the author so maybe he could clue us in. This has been addressed by: commit bf250a1fa769f2eb8fc7a4e28b3b523e9cb67eef Author: Vlad Yasevich vyasev...@gmail.com Date: Tue Feb 10 11:37:29 2015 -0500 ipv6: Partial checksum only UDP packets As far the 2 bit issue, GEN_CSUM (HW_SUM) and IPV6_CSUM can not coexist at the same time. See netdev_fix_features(). thanks for pointing it. I will test with that commit. Jeff's tree hasn't included that commit yet, right? Which branch has the commit? This is in DaveM's net and net-next trees. -vlad thanks, Hiroshi -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
On 02/13/2015 12:26 PM, Skidmore, Donald C wrote: > > >> -Original Message- >> From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com] >> Sent: Thursday, February 12, 2015 8:45 PM >> To: Skidmore, Donald C; Kirsher, Jeffrey T >> Cc: Alexander Duyck; Bjørn Mork; e1000-de...@lists.sourceforge.net; >> net...@vger.kernel.org; Choi, Sy Jong; linux-kernel@vger.kernel.org; David >> Laight; Hayato Momma >> Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to >> enable MC promiscuous mode >> >>>>> -Original Message- >>>>> From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com] >>>>> Sent: Monday, February 09, 2015 6:29 PM >>>>> To: Kirsher, Jeffrey T >>>>> Cc: Alexander Duyck; Skidmore, Donald C; Bjørn Mork; e1000- >>>>> de...@lists.sourceforge.net; net...@vger.kernel.org; Choi, Sy >>>>> Jong; linux- ker...@vger.kernel.org; David Laight; Hayato Momma >>>>> Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new >>>>> mbox API to enable MC promiscuous mode >>>>> >>>>>>>>> Can you please fix up your patches based on my tree: >>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que >>>>>>>>> ue.git >>>>>>>> >>>>>>>> Yes. I haven't noticed your tree. >>>>>>>> Will resend patches against it. >>>>>>>> >>>>>>> >>>>>>> I encountered an issue with your tree, the commit id is below. >>>>>>> >>>>>>> $ git log | head >>>>>>> commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e >>>>>>> Author: Rasmus Villemoes >>>>>>> Date: Fri Jan 23 20:43:14 2015 -0800 >>>>>>> >>>>>>> ethernet: fm10k: Actually drop 4 bits >>>>>>> >>>>>>> The comment explains the intention, but vid has type u16. Before >> the >>>>>>> inner shift, it is promoted to int, which has plenty of space for >>>>>>> all >>>>>>> vid's bits, so nothing is dropped. Use a simple mask instead. >>>>>>> >>>>>>> >>>>>>> I use the kernel from your tree in both host and guest. >>>>>>> >>>>>>> Assign an IPv6 for VF in guest. >>>>>>> # ip -6 addr add 2001:db8::18:1/64 dev ens0 >>>>>>> >>>>>>> Send ping packet from other server to the VM. >>>>>>> # ping6 2001:db8::18:1 -I eth0 >>>>>>> >>>>>>> The following message was shown. >>>>>>> ixgbevf :00:08.0: partial checksum but l4 proto=3a! >>>>>>> >>>>>>> If I did the same operation in the host, I saw the same error >>>>>>> message in >>>>> host too. >>>>>>> ixgbe :2d:00.0: partial checksum but l4 proto=3a! >>>>>>> >>>>>>> Do you have any idea about that? >>>>>> >>>>>> Ah, sorry about that, try this tree again: >>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git >>>>>> >>>>>> That patch was dropped for favor of a patch that Matthew Vick >>>>>> put together (and recently got pushed upstream). So my queue no >>>>>> longer has that patch in the queue, since it got dropped. >>>>> >>>>> I still see the same error, the head id is the below >>>>> >>>>> $ git log | head >>>>> commit a072afb0b45904022b76deef3b770ee9a93cb13a >>>>> Author: Nicholas Krause >>>>> Date: Mon Feb 9 00:27:00 2015 -0800 >>>>> >>>>> igb: Remove outdated fix me comment in the >>>>> function,gb_acquire_swfw_sync_i210 >>>>> >>>>> >>>>> thanks, >>>>> Hiroshi >>>> >>>> I'm having our validation see if they can recreate the same issue >>>> internally. When they get back to me I'll let you >>> know >>>> what we found. >>> >>> We did bisect, and the below looks the culprit; >>> >>> 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit >>> commit 32dce968dd98
Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
On 02/13/2015 12:26 PM, Skidmore, Donald C wrote: -Original Message- From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com] Sent: Thursday, February 12, 2015 8:45 PM To: Skidmore, Donald C; Kirsher, Jeffrey T Cc: Alexander Duyck; Bjørn Mork; e1000-de...@lists.sourceforge.net; net...@vger.kernel.org; Choi, Sy Jong; linux-kernel@vger.kernel.org; David Laight; Hayato Momma Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode -Original Message- From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com] Sent: Monday, February 09, 2015 6:29 PM To: Kirsher, Jeffrey T Cc: Alexander Duyck; Skidmore, Donald C; Bjørn Mork; e1000- de...@lists.sourceforge.net; net...@vger.kernel.org; Choi, Sy Jong; linux- ker...@vger.kernel.org; David Laight; Hayato Momma Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode Can you please fix up your patches based on my tree: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que ue.git Yes. I haven't noticed your tree. Will resend patches against it. I encountered an issue with your tree, the commit id is below. $ git log | head commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e Author: Rasmus Villemoes li...@rasmusvillemoes.dk Date: Fri Jan 23 20:43:14 2015 -0800 ethernet: fm10k: Actually drop 4 bits The comment explains the intention, but vid has type u16. Before the inner shift, it is promoted to int, which has plenty of space for all vid's bits, so nothing is dropped. Use a simple mask instead. I use the kernel from your tree in both host and guest. Assign an IPv6 for VF in guest. # ip -6 addr add 2001:db8::18:1/64 dev ens0 Send ping packet from other server to the VM. # ping6 2001:db8::18:1 -I eth0 The following message was shown. ixgbevf :00:08.0: partial checksum but l4 proto=3a! If I did the same operation in the host, I saw the same error message in host too. ixgbe :2d:00.0: partial checksum but l4 proto=3a! Do you have any idea about that? Ah, sorry about that, try this tree again: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git That patch was dropped for favor of a patch that Matthew Vick put together (and recently got pushed upstream). So my queue no longer has that patch in the queue, since it got dropped. I still see the same error, the head id is the below $ git log | head commit a072afb0b45904022b76deef3b770ee9a93cb13a Author: Nicholas Krause xerofo...@gmail.com Date: Mon Feb 9 00:27:00 2015 -0800 igb: Remove outdated fix me comment in the function,gb_acquire_swfw_sync_i210 thanks, Hiroshi I'm having our validation see if they can recreate the same issue internally. When they get back to me I'll let you know what we found. We did bisect, and the below looks the culprit; 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit commit 32dce968dd987adfb0c00946d78dad9154f64759 Author: Vlad Yasevich vyasev...@gmail.com Date: Sat Jan 31 10:40:18 2015 -0500 ipv6: Allow for partial checksums on non-ufo packets Currntly, if we are not doing UFO on the packet, all UDP packets will start with CHECKSUM_NONE and thus perform full checksum computations in software even if device support IPv6 checksum offloading. Let's start start with CHECKSUM_PARTIAL if the device supports it and we are sending only a single packet at or below mtu size. Signed-off-by: Vladislav Yasevich vyase...@redhat.com Signed-off-by: David S. Miller da...@davemloft.net :04 04 4437eaf7e944f5a6136ebf668a256fee688fda3d fade8da998d35c8da97a15f0556949ad371e5347 M net When I reverted the commit, the issue was solved. thanks, Hiroshi I believe the issue is that this patch (32dce968dd98 - ipv6: Allow for partial checksums on non-ufo packets) is that it now sets CHECKSUM_PARTIAL on all IPv6 packets including ICMPv6 ones. Our HW (82599) only supports checksum offload on TCP/UDP (NETIF_F_IPV6_CSUM) so we get hung up on the skb's protocol and the fact that it is CHECKSUM_PARTIAL. Another thing that confuses me is the feature test in this patch. It checks (rt-dst.dev-features NETIF_F_V6_CSUM) but NETIF_F_V6_CSUM is a two bit field? #define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM) So the test would succeed if either bit was high, that doesn't seem right. I cc'd the author so maybe he could clue us in. This has been addressed by: commit bf250a1fa769f2eb8fc7a4e28b3b523e9cb67eef Author: Vlad Yasevich vyasev...@gmail.com Date: Tue Feb 10 11:37:29 2015 -0500 ipv6: Partial checksum only UDP packets As far the 2 bit issue, GEN_CSUM (HW_SUM) and IPV6_CSUM can not coexist at the same time. See netdev_fix_features(). -vlad Thanks, -Don Skidmore donald.c.skidm...@intel.com -- To unsubscribe from this list: send
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 12:10 PM, Daniel Borkmann wrote: > On 01/23/2015 05:05 PM, Vlad Yasevich wrote: >> On 01/23/2015 06:50 AM, Daniel Borkmann wrote: >>> On 01/23/2015 11:25 AM, Sun Paul wrote: >>> ... >>>> I would like to check the behave in LKSCTP. >>>> >>>> we are running DIAMETER message over SCTP, and we have set the >>>> parameter "net.sctp.association_max_retrans = 4" in the LinuxOS. >>>> >>>> We noticed that when remote peer have retry to send the same request >>>> for 4 times, the LKSCTP will initiate an ABORT chunk with reason >>>> "association exceeded its max_retrans count". >>>> >>>> We would like to know whether this is the correct behavior? is there >>>> any other option that we can alter in order to avoid the ABORT chunk >>>> being sent? >>> >>> I don't recall the RFC saying to send an ABORT, but let me double >>> check in the mean time. >> >> The RFC is silent on the matter. The abort got added in 3.8 so >> it's been there for a while. > > I see, commit de4594a51c90 ("sctp: send abort chunk when max_retrans > exceeded") added the behaviour. > >>> Hmm, untested, but could you try something like that? >>> >>> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c >>> index fef2acd..5ce198d 100644 >>> --- a/net/sctp/sm_sideeffect.c >>> +++ b/net/sctp/sm_sideeffect.c >>> @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t >>> *commands, >>> sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, >>> SCTP_ULPEVENT(event)); >>> >>> -if (asoc->overall_error_count >= asoc->max_retrans) { >>> +if (asoc->overall_error_count >= asoc->max_retrans && >>> +error != SCTP_ERROR_NO_ERROR) { >>> abort = sctp_make_violation_max_retrans(asoc, chunk); >>> if (abort) >>> sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, >> >> This would pretty much stop all ABORTs due to excessive rtx. Might >> as well take the code out :). >> >> I was a bit concerned about this ABORT when it went in. > > So effectively, if I understand the argument from the commit, the > assumption is that the ABORT would never reach the peer anyway, but > is a way for tcpdump users to see on the wire that rtx limit has > been exceeded and since there's not mentioned anything in the RFC > about this, it doesn't break it. Hm. > Additionally I seem to recall BSD sending this type of ABORT for pretty much the same reason. -vlad > Sun Paul, what exactly broke in your scenario? Can you be more explicit? > > Thanks, > Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 05:25 AM, Sun Paul wrote: > Hi > > I would like to check the behave in LKSCTP. > > we are running DIAMETER message over SCTP, and we have set the > parameter "net.sctp.association_max_retrans = 4" in the LinuxOS. > > We noticed that when remote peer have retry to send the same request > for 4 times, the LKSCTP will initiate an ABORT chunk with reason > "association exceeded its max_retrans count". > > We would like to know whether this is the correct behavior? is there > any other option that we can alter in order to avoid the ABORT chunk > being sent? > Why do you not want ABORT to be sent? SCTP has attempted to retransmit the data maximum allows times, and at this point it will terminate the association. It sends an ABORT notifying the peer of this, but most likely the peer is unreachable anyway. Any message that a peer sends at this point will most likely result in an ABORT to be send back or an association restart. Might as well start fresh. -vlad > Thanks > > PS > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 06:50 AM, Daniel Borkmann wrote: > Hi, > > On 01/23/2015 11:25 AM, Sun Paul wrote: > ... >> I would like to check the behave in LKSCTP. >> >> we are running DIAMETER message over SCTP, and we have set the >> parameter "net.sctp.association_max_retrans = 4" in the LinuxOS. >> >> We noticed that when remote peer have retry to send the same request >> for 4 times, the LKSCTP will initiate an ABORT chunk with reason >> "association exceeded its max_retrans count". >> >> We would like to know whether this is the correct behavior? is there >> any other option that we can alter in order to avoid the ABORT chunk >> being sent? > > I don't recall the RFC saying to send an ABORT, but let me double > check in the mean time. The RFC is silent on the matter. The abort got added in 3.8 so it's been there for a while. > > Hmm, untested, but could you try something like that? > > diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c > index fef2acd..5ce198d 100644 > --- a/net/sctp/sm_sideeffect.c > +++ b/net/sctp/sm_sideeffect.c > @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t > *commands, > sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, > SCTP_ULPEVENT(event)); > > -if (asoc->overall_error_count >= asoc->max_retrans) { > +if (asoc->overall_error_count >= asoc->max_retrans && > +error != SCTP_ERROR_NO_ERROR) { > abort = sctp_make_violation_max_retrans(asoc, chunk); > if (abort) > sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, This would pretty much stop all ABORTs due to excessive rtx. Might as well take the code out :). I was a bit concerned about this ABORT when it went in. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 12:10 PM, Daniel Borkmann wrote: On 01/23/2015 05:05 PM, Vlad Yasevich wrote: On 01/23/2015 06:50 AM, Daniel Borkmann wrote: On 01/23/2015 11:25 AM, Sun Paul wrote: ... I would like to check the behave in LKSCTP. we are running DIAMETER message over SCTP, and we have set the parameter net.sctp.association_max_retrans = 4 in the LinuxOS. We noticed that when remote peer have retry to send the same request for 4 times, the LKSCTP will initiate an ABORT chunk with reason association exceeded its max_retrans count. We would like to know whether this is the correct behavior? is there any other option that we can alter in order to avoid the ABORT chunk being sent? I don't recall the RFC saying to send an ABORT, but let me double check in the mean time. The RFC is silent on the matter. The abort got added in 3.8 so it's been there for a while. I see, commit de4594a51c90 (sctp: send abort chunk when max_retrans exceeded) added the behaviour. Hmm, untested, but could you try something like that? diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c index fef2acd..5ce198d 100644 --- a/net/sctp/sm_sideeffect.c +++ b/net/sctp/sm_sideeffect.c @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t *commands, sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, SCTP_ULPEVENT(event)); -if (asoc-overall_error_count = asoc-max_retrans) { +if (asoc-overall_error_count = asoc-max_retrans +error != SCTP_ERROR_NO_ERROR) { abort = sctp_make_violation_max_retrans(asoc, chunk); if (abort) sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, This would pretty much stop all ABORTs due to excessive rtx. Might as well take the code out :). I was a bit concerned about this ABORT when it went in. So effectively, if I understand the argument from the commit, the assumption is that the ABORT would never reach the peer anyway, but is a way for tcpdump users to see on the wire that rtx limit has been exceeded and since there's not mentioned anything in the RFC about this, it doesn't break it. Hm. Additionally I seem to recall BSD sending this type of ABORT for pretty much the same reason. -vlad Sun Paul, what exactly broke in your scenario? Can you be more explicit? Thanks, Daniel -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 06:50 AM, Daniel Borkmann wrote: Hi, On 01/23/2015 11:25 AM, Sun Paul wrote: ... I would like to check the behave in LKSCTP. we are running DIAMETER message over SCTP, and we have set the parameter net.sctp.association_max_retrans = 4 in the LinuxOS. We noticed that when remote peer have retry to send the same request for 4 times, the LKSCTP will initiate an ABORT chunk with reason association exceeded its max_retrans count. We would like to know whether this is the correct behavior? is there any other option that we can alter in order to avoid the ABORT chunk being sent? I don't recall the RFC saying to send an ABORT, but let me double check in the mean time. The RFC is silent on the matter. The abort got added in 3.8 so it's been there for a while. Hmm, untested, but could you try something like that? diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c index fef2acd..5ce198d 100644 --- a/net/sctp/sm_sideeffect.c +++ b/net/sctp/sm_sideeffect.c @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t *commands, sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, SCTP_ULPEVENT(event)); -if (asoc-overall_error_count = asoc-max_retrans) { +if (asoc-overall_error_count = asoc-max_retrans +error != SCTP_ERROR_NO_ERROR) { abort = sctp_make_violation_max_retrans(asoc, chunk); if (abort) sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, This would pretty much stop all ABORTs due to excessive rtx. Might as well take the code out :). I was a bit concerned about this ABORT when it went in. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached
On 01/23/2015 05:25 AM, Sun Paul wrote: Hi I would like to check the behave in LKSCTP. we are running DIAMETER message over SCTP, and we have set the parameter net.sctp.association_max_retrans = 4 in the LinuxOS. We noticed that when remote peer have retry to send the same request for 4 times, the LKSCTP will initiate an ABORT chunk with reason association exceeded its max_retrans count. We would like to know whether this is the correct behavior? is there any other option that we can alter in order to avoid the ABORT chunk being sent? Why do you not want ABORT to be sent? SCTP has attempted to retransmit the data maximum allows times, and at this point it will terminate the association. It sends an ABORT notifying the peer of this, but most likely the peer is unreachable anyway. Any message that a peer sends at this point will most likely result in an ABORT to be send back or an association restart. Might as well start fresh. -vlad Thanks PS -- To unsubscribe from this list: send the line unsubscribe linux-sctp in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.2 105/131] net: Correctly set segment mac_len in skb_segment().
On 09/11/2014 08:32 AM, Ben Hutchings wrote: > 3.2.63-rc1 review patch. If anyone has any objections, please let me know. > > -- > > From: Vlad Yasevich > > [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ] > > When performing segmentation, the mac_len value is copied right > out of the original skb. However, this value is not always set correctly > (like when the packet is VLAN-tagged) and we'll end up copying a bad > value. > > One way to demonstrate this is to configure a VM which tags > packets internally and turn off VLAN acceleration on the forwarding > bridge port. The packets show up corrupt like this: > 16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q > (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0, > 0x: 8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.d. > 0x0010: 0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..dA.^d..0.. > 0x0020: 29e3 01c9 f871 0101 080a 000a e833)q.3 > 0x0030: 000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp > 0x0040: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp > 0x0050: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp > 0x0060: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp > ... > > This also leads to awful throughput as GSO packets are dropped and > cause retransmissions. > > The solution is to set the mac_len using the values already available > in then new skb. We've already adjusted all of the header offset, so we > might as well correctly figure out the mac_len using skb_reset_mac_len(). > After this change, packets are segmented correctly and performance > is restored. > > CC: Eric Dumazet > Signed-off-by: Vlad Yasevich > Signed-off-by: David S. Miller > Signed-off-by: Ben Hutchings > --- > net/core/skbuff.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 7121d9b..0ccfb53 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -2669,7 +2669,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 > features) > tail = nskb; > > __copy_skb_header(nskb, skb); > - nskb->mac_len = skb->mac_len; > > /* nskb and skb might have different headroom */ > if (nskb->ip_summed == CHECKSUM_PARTIAL) > @@ -2679,6 +2678,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 > features) > skb_set_network_header(nskb, skb->mac_len); > nskb->transport_header = (nskb->network_header + > skb_network_header_len(skb)); > + skb_reset_mac_len(nskb); This will not fix the problem here because the network header above will already be set incorrectly based on the old mac_len. This patch depends on commit 030737bcc3c404e273e97dbe06fe9561699a411b Author: Eric Dumazet Date: Sat Oct 19 11:42:54 2013 -0700 net: generalize skb_segment() that correctly populates the header offsets in the new segment. -vlad > skb_copy_from_linear_data(skb, nskb->data, doffset); > > if (fskb != skb_shinfo(skb)->frag_list) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.2 105/131] net: Correctly set segment mac_len in skb_segment().
On 09/11/2014 08:32 AM, Ben Hutchings wrote: 3.2.63-rc1 review patch. If anyone has any objections, please let me know. -- From: Vlad Yasevich vyase...@redhat.com [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ] When performing segmentation, the mac_len value is copied right out of the original skb. However, this value is not always set correctly (like when the packet is VLAN-tagged) and we'll end up copying a bad value. One way to demonstrate this is to configure a VM which tags packets internally and turn off VLAN acceleration on the forwarding bridge port. The packets show up corrupt like this: 16:18:24.985548 52:54:00:ab:be:25 52:54:00:26:ce:a3, ethertype 802.1Q (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0, 0x: 8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.d. 0x0010: 0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..dA.^d..0.. 0x0020: 29e3 01c9 f871 0101 080a 000a e833)q.3 0x0030: 000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp 0x0040: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp 0x0050: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp 0x0060: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp ... This also leads to awful throughput as GSO packets are dropped and cause retransmissions. The solution is to set the mac_len using the values already available in then new skb. We've already adjusted all of the header offset, so we might as well correctly figure out the mac_len using skb_reset_mac_len(). After this change, packets are segmented correctly and performance is restored. CC: Eric Dumazet eduma...@google.com Signed-off-by: Vlad Yasevich vyase...@redhat.com Signed-off-by: David S. Miller da...@davemloft.net Signed-off-by: Ben Hutchings b...@decadent.org.uk --- net/core/skbuff.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7121d9b..0ccfb53 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2669,7 +2669,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 features) tail = nskb; __copy_skb_header(nskb, skb); - nskb-mac_len = skb-mac_len; /* nskb and skb might have different headroom */ if (nskb-ip_summed == CHECKSUM_PARTIAL) @@ -2679,6 +2678,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 features) skb_set_network_header(nskb, skb-mac_len); nskb-transport_header = (nskb-network_header + skb_network_header_len(skb)); + skb_reset_mac_len(nskb); This will not fix the problem here because the network header above will already be set incorrectly based on the old mac_len. This patch depends on commit 030737bcc3c404e273e97dbe06fe9561699a411b Author: Eric Dumazet eduma...@google.com Date: Sat Oct 19 11:42:54 2013 -0700 net: generalize skb_segment() that correctly populates the header offsets in the new segment. -vlad skb_copy_from_linear_data(skb, nskb-data, doffset); if (fskb != skb_shinfo(skb)-frag_list) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)
On 09/02/2014 02:15 PM, Cong Wang wrote: > On Tue, Sep 2, 2014 at 11:11 AM, Eric Dumazet wrote: >> On Tue, 2014-09-02 at 11:04 -0700, Cong Wang wrote: >>> On Tue, Sep 2, 2014 at 10:58 AM, Hannes Frederic Sowa >> I definitely don't have a problem cleaning this up in net-next. I wanted a minimal patch for stable because I didn't check history where and when additional users of dev_get_by_flags_rcu were removed. >>> >>> `git grep` should show you we only have one caller. Apparently we don't >>> care about any out-of-tree module. >> >> Point is : you did not check if some stable versions had more callers. >> >> Its very nice you checked current version, but it is not enough for a >> stable candidate. > > That is what we do when backporting patches, I can do that if David asks > me to backport it, but you know for netdev that is David's work. > > (I am not saying I don't want to help him, I just want to point out the fact. > I am very pleased to help David for stable backports as long as he asks) Instead of helping after the fact, why not arrange the patches so that it's not such a big issue. Leave the _rcu variant alone. Add an _rtnl variant of the function and use that in the patch. Have a follow-on patch that removes the _rcu variant all by itself. This way backports become easier, and if anyone wants the _rcu variant back, all they have to do is revert a very simple commit. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)
On 09/02/2014 02:15 PM, Cong Wang wrote: On Tue, Sep 2, 2014 at 11:11 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Tue, 2014-09-02 at 11:04 -0700, Cong Wang wrote: On Tue, Sep 2, 2014 at 10:58 AM, Hannes Frederic Sowa I definitely don't have a problem cleaning this up in net-next. I wanted a minimal patch for stable because I didn't check history where and when additional users of dev_get_by_flags_rcu were removed. `git grep` should show you we only have one caller. Apparently we don't care about any out-of-tree module. Point is : you did not check if some stable versions had more callers. Its very nice you checked current version, but it is not enough for a stable candidate. That is what we do when backporting patches, I can do that if David asks me to backport it, but you know for netdev that is David's work. (I am not saying I don't want to help him, I just want to point out the fact. I am very pleased to help David for stable backports as long as he asks) Instead of helping after the fact, why not arrange the patches so that it's not such a big issue. Leave the _rcu variant alone. Add an _rtnl variant of the function and use that in the patch. Have a follow-on patch that removes the _rcu variant all by itself. This way backports become easier, and if anyone wants the _rcu variant back, all they have to do is revert a very simple commit. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)
On 08/29/2014 11:26 AM, Tommi Rantala wrote: > Hi, > > Was fuzzing Linus v3.17-rc2-89-g59753a8 with Trinity as the root user > in qemu, when I hit the following assertion failures. > > Tommi > > > [init] Started watchdog process, PID is 4841 > [main] Main thread is alive. > [ 77.229699] sctp: [Deprecated]: trinity-main (pid 4842) Use of int > in max_burst socket option deprecated. > [ 77.229699] Use struct sctp_assoc_value instead > [ 77.297196] RTNL: assertion failed at net/ipv6/addrconf.c (1699) > [ 77.298080] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30 > [ 77.299039] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 77.299789] 88003d76a618 880026133c50 8238ba79 > 880037c84520 > [ 77.300829] 880026133c90 820bd52b > 82d86c40 > [ 77.301869] f76fd1e1 8800382d8000 > 8800382d8220 > [ 77.302906] Call Trace: > [ 77.303246] [] dump_stack+0x4d/0x66 > [ 77.303928] [] addrconf_join_solict+0x4b/0xb0 > [ 77.304731] [] ipv6_dev_ac_inc+0x2bb/0x330 > [ 77.305498] [] ? ac6_seq_start+0x260/0x260 > [ 77.306257] [] ipv6_sock_ac_join+0x26e/0x360 > [ 77.307046] [] ? ipv6_sock_ac_join+0x99/0x360 > [ 77.307798] [] do_ipv6_setsockopt.isra.5+0xa70/0xf20 > [ 77.308570] [] ? sched_clock_local+0x1d/0x80 > [ 77.309260] [] ? kvm_clock_read+0x27/0x40 > [ 77.309915] [] ? sched_clock+0x9/0x10 > [ 77.310537] [] ? sock_has_perm+0x168/0x1e0 > [ 77.311204] [] ? sched_clock_cpu+0xa8/0xf0 > [ 77.311866] [] ? local_clock+0x1b/0x30 > [ 77.312501] [] ? lock_release_holdtime+0x1d/0x170 > [ 77.313241] [] ? sock_has_perm+0x180/0x1e0 > [ 77.313905] [] ? > selinux_msg_queue_alloc_security+0xa0/0xa0 > [ 77.314746] [] ipv6_setsockopt+0x53/0xb0 > [ 77.315397] [] udpv6_setsockopt+0x25/0x30 > [ 77.316058] [] sock_common_setsockopt+0xf/0x20 > [ 77.316764] [] SyS_setsockopt+0x8e/0xd0 > [ 77.317406] [] system_call_fastpath+0x16/0x1b > [main] 375 sockets created based on info from socket cachefile. > [main] Generating file descriptors > [main] Added 129 filenames from /dev > [main] Added 44048 filenames from /proc > [main] Added 18192 filenames from /sys > [main] Enabled 9 fd providers. > [watchdog] Watchdog is alive. (pid:4841) > [child3:4846] finit_module (313) returned ENOSYS, marking as inactive. > [child1:4844] kcmp (312) returned ENOSYS, marking as inactive. > [child2:4845] uselib (134) returned ENOSYS, marking as inactive. > [child1:4844] nfsservctl (180) returned ENOSYS, marking as inactive. > [child2:4845] delete_module (129:[32BIT]) returned ENOSYS, marking as > inactive. > [child2:4845] init_module (175) returned ENOSYS, marking as inactive. > [ 84.126609] trinity-c7: vm86 mode not supported on 64 bit kernel > [child7:4850] vm86 (166:[32BIT]) returned ENOSYS, marking as inactive. > [main] Bailing main loop because ctrl-c. > [ 84.345840] RTNL: assertion failed at net/ipv6/addrconf.c (1712) > [ 84.346615] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30 > [ 84.347426] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 84.348102] 88003d76a618 880026133d10 8238ba79 > 8800382d8000 > [ 84.349018] 880026133d50 820bd5db 81141555 > 8800382d8220 > [ 84.349935] 8800382d8000 f76fd1e1 88003d76a618 > 8800382d8000 > [ 84.350848] Call Trace: > [ 84.351149] [] dump_stack+0x4d/0x66 > [ 84.351751] [] addrconf_leave_solict+0x4b/0xb0 > [ 84.352574] [] ? __local_bh_enable_ip+0xa5/0xf0 > [ 84.353315] [] __ipv6_dev_ac_dec+0xc3/0x140 > [ 84.354019] [] ipv6_dev_ac_dec+0x98/0xb0 > [ 84.354687] [] ipv6_sock_ac_close+0x10d/0x1a0 > [ 84.355410] [] ? ipv6_sock_ac_close+0x2e/0x1a0 > [ 84.356147] [] inet6_release+0x23/0x40 > [ 84.356789] [] sock_release+0x14/0x80 > [ 84.357410] [] sock_close+0xd/0x20 > [ 84.358042] [] __fput+0x111/0x1e0 > [ 84.358622] [] fput+0x9/0x10 > [ 84.359196] [] task_work_run+0x9e/0xd0 > [ 84.359825] [] do_exit+0x456/0xb30 > [ 84.360419] [] ? retint_swapgs+0x13/0x1b > [ 84.361075] [] do_group_exit+0x84/0xd0 > [ 84.361705] [] SyS_exit_group+0xf/0x10 > [ 84.362338] [] system_call_fastpath+0x16/0x1b > [watchdog] [4841] Watchdog exiting because ctrl-c. > [init] Ran 775 syscalls. Successes: 179 Failures: 596 > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Yep, looks like ipv6_dev_ac_inc() and __ipv6_dev_ac_dec() are called without RNTL in the socket option path and with RTNL in the address configuration path. So it look like this this can actually trigger list corruptions. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)
On 08/29/2014 11:26 AM, Tommi Rantala wrote: Hi, Was fuzzing Linus v3.17-rc2-89-g59753a8 with Trinity as the root user in qemu, when I hit the following assertion failures. Tommi [init] Started watchdog process, PID is 4841 [main] Main thread is alive. [ 77.229699] sctp: [Deprecated]: trinity-main (pid 4842) Use of int in max_burst socket option deprecated. [ 77.229699] Use struct sctp_assoc_value instead [ 77.297196] RTNL: assertion failed at net/ipv6/addrconf.c (1699) [ 77.298080] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30 [ 77.299039] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 77.299789] 88003d76a618 880026133c50 8238ba79 880037c84520 [ 77.300829] 880026133c90 820bd52b 82d86c40 [ 77.301869] f76fd1e1 8800382d8000 8800382d8220 [ 77.302906] Call Trace: [ 77.303246] [8238ba79] dump_stack+0x4d/0x66 [ 77.303928] [820bd52b] addrconf_join_solict+0x4b/0xb0 [ 77.304731] [820b031b] ipv6_dev_ac_inc+0x2bb/0x330 [ 77.305498] [820b0060] ? ac6_seq_start+0x260/0x260 [ 77.306257] [820b05fe] ipv6_sock_ac_join+0x26e/0x360 [ 77.307046] [820b0429] ? ipv6_sock_ac_join+0x99/0x360 [ 77.307798] [820cdd60] do_ipv6_setsockopt.isra.5+0xa70/0xf20 [ 77.308570] [8117097d] ? sched_clock_local+0x1d/0x80 [ 77.309260] [810a8a27] ? kvm_clock_read+0x27/0x40 [ 77.309915] [810736d9] ? sched_clock+0x9/0x10 [ 77.310537] [815afff8] ? sock_has_perm+0x168/0x1e0 [ 77.311204] [81170bb8] ? sched_clock_cpu+0xa8/0xf0 [ 77.311866] [81170d1b] ? local_clock+0x1b/0x30 [ 77.312501] [811872cd] ? lock_release_holdtime+0x1d/0x170 [ 77.313241] [815b0010] ? sock_has_perm+0x180/0x1e0 [ 77.313905] [815afe90] ? selinux_msg_queue_alloc_security+0xa0/0xa0 [ 77.314746] [820ce263] ipv6_setsockopt+0x53/0xb0 [ 77.315397] [820d3135] udpv6_setsockopt+0x25/0x30 [ 77.316058] [81f9930f] sock_common_setsockopt+0xf/0x20 [ 77.316764] [81f9305e] SyS_setsockopt+0x8e/0xd0 [ 77.317406] [823a47e9] system_call_fastpath+0x16/0x1b [main] 375 sockets created based on info from socket cachefile. [main] Generating file descriptors [main] Added 129 filenames from /dev [main] Added 44048 filenames from /proc [main] Added 18192 filenames from /sys [main] Enabled 9 fd providers. [watchdog] Watchdog is alive. (pid:4841) [child3:4846] finit_module (313) returned ENOSYS, marking as inactive. [child1:4844] kcmp (312) returned ENOSYS, marking as inactive. [child2:4845] uselib (134) returned ENOSYS, marking as inactive. [child1:4844] nfsservctl (180) returned ENOSYS, marking as inactive. [child2:4845] delete_module (129:[32BIT]) returned ENOSYS, marking as inactive. [child2:4845] init_module (175) returned ENOSYS, marking as inactive. [ 84.126609] trinity-c7: vm86 mode not supported on 64 bit kernel [child7:4850] vm86 (166:[32BIT]) returned ENOSYS, marking as inactive. [main] Bailing main loop because ctrl-c. [ 84.345840] RTNL: assertion failed at net/ipv6/addrconf.c (1712) [ 84.346615] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30 [ 84.347426] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 84.348102] 88003d76a618 880026133d10 8238ba79 8800382d8000 [ 84.349018] 880026133d50 820bd5db 81141555 8800382d8220 [ 84.349935] 8800382d8000 f76fd1e1 88003d76a618 8800382d8000 [ 84.350848] Call Trace: [ 84.351149] [8238ba79] dump_stack+0x4d/0x66 [ 84.351751] [820bd5db] addrconf_leave_solict+0x4b/0xb0 [ 84.352574] [81141555] ? __local_bh_enable_ip+0xa5/0xf0 [ 84.353315] [820b07b3] __ipv6_dev_ac_dec+0xc3/0x140 [ 84.354019] [820b08c8] ipv6_dev_ac_dec+0x98/0xb0 [ 84.354687] [820b0bcd] ipv6_sock_ac_close+0x10d/0x1a0 [ 84.355410] [820b0aee] ? ipv6_sock_ac_close+0x2e/0x1a0 [ 84.356147] [820ae9d3] inet6_release+0x23/0x40 [ 84.356789] [81f91834] sock_release+0x14/0x80 [ 84.357410] [81f918ad] sock_close+0xd/0x20 [ 84.358042] [8127fa91] __fput+0x111/0x1e0 [ 84.358622] [8127fba9] fput+0x9/0x10 [ 84.359196] [8115e3ee] task_work_run+0x9e/0xd0 [ 84.359825] [8113f4b6] do_exit+0x456/0xb30 [ 84.360419] [823a541c] ? retint_swapgs+0x13/0x1b [ 84.361075] [8113fc54] do_group_exit+0x84/0xd0 [ 84.361705] [8113fcaf] SyS_exit_group+0xf/0x10 [ 84.362338] [823a47e9] system_call_fastpath+0x16/0x1b [watchdog] [4841] Watchdog exiting because ctrl-c. [init] Ran 775 syscalls. Successes: 179 Failures: 596 -- To unsubscribe from this list: send the line unsubscribe netdev in the
Re: [PATCH 1/1] sctp: not send SCTP_PEER_ADDR_CHANGE notifications with failed probe
On 08/20/2014 05:31 AM, Zhu Yanjun wrote: > Since the transport has always been in state SCTP_UNCONFIRMED, it > therefore wasn't active before and hasn't been used before, and it > always has been, so it is unnecessary to bug the user with a > notification. > > Reported-by: Deepak Khandelwal > Suggested-by: Vlad Yasevich > Suggested-by: Michael Tuexen > Suggested-by: Daniel Borkmann > Signed-off-by: Zhu Yanjun Acked-by: Vlad Yasevich Thanks -vlad > --- > net/sctp/associola.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/sctp/associola.c b/net/sctp/associola.c > index 9de23a2..2e23f6b 100644 > --- a/net/sctp/associola.c > +++ b/net/sctp/associola.c > @@ -813,6 +813,7 @@ void sctp_assoc_control_transport(struct sctp_association > *asoc, > else { > dst_release(transport->dst); > transport->dst = NULL; > + ulp_notify = false; > } > > spc_state = SCTP_ADDR_UNREACHABLE; > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] sctp: not send SCTP_PEER_ADDR_CHANGE notifications with failed probe
On 08/20/2014 05:31 AM, Zhu Yanjun wrote: Since the transport has always been in state SCTP_UNCONFIRMED, it therefore wasn't active before and hasn't been used before, and it always has been, so it is unnecessary to bug the user with a notification. Reported-by: Deepak Khandelwal khandelwal.deepak.1...@gmail.com Suggested-by: Vlad Yasevich vyasev...@gmail.com Suggested-by: Michael Tuexen tue...@fh-muenster.de Suggested-by: Daniel Borkmann dbork...@redhat.com Signed-off-by: Zhu Yanjun yanjun@windriver.com Acked-by: Vlad Yasevich vyasev...@gmail.com Thanks -vlad --- net/sctp/associola.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/sctp/associola.c b/net/sctp/associola.c index 9de23a2..2e23f6b 100644 --- a/net/sctp/associola.c +++ b/net/sctp/associola.c @@ -813,6 +813,7 @@ void sctp_assoc_control_transport(struct sctp_association *asoc, else { dst_release(transport-dst); transport-dst = NULL; + ulp_notify = false; } spc_state = SCTP_ADDR_UNREACHABLE; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
3.16-rc1 kernel BUG triggered at mutex.c:586
Trying to boot 3.16.0-rc1+ (the + is a small vlan related patch) on a tests system triggers the following BUG: Jun 16 13:54:50 scratch kernel: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 Jun 16 13:54:50 scratch kernel: in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/3 Jun 16 13:54:50 scratch kernel: 1 lock held by swapper/3/0: Jun 16 13:54:50 scratch kernel: #0: (&(>counts.lock)->rlock){..-...}, at: [] bitmap_endwrite+0x58/0x230 Jun 16 13:54:50 scratch kernel: irq event stamp: 211959 Jun 16 13:54:50 scratch kernel: hardirqs last enabled at (211958): [] _raw_spin_unlock_irqrestore+0x36/0x70 Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959): [] _raw_spin_lock_irqsave+0x29/0x70 Jun 16 13:54:50 scratch kernel: softirqs last enabled at (211952): [] _local_bh_enable+0x22/0x50 Jun 16 13:54:50 scratch kernel: softirqs last disabled at (211953): [] irq_exit+0xc5/0xd0 Jun 16 13:54:50 scratch kernel: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.16.0-rc1+ #35 Jun 16 13:54:50 scratch kernel: Hardware name: Hewlett-Packard HP xw8400 Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010 Jun 16 13:54:50 scratch kernel: 88047fcc3bd0 81706ef2 81c857e0 Jun 16 13:54:50 scratch kernel: 88047fcc3be0 810b4dba 88047fcc3c50 8170c79c Jun 16 13:54:50 scratch kernel: 81c85938 0046 0092 Jun 16 13:54:50 scratch kernel: Call Trace: Jun 16 13:54:50 scratch kernel: [] dump_stack+0x4d/0x66 Jun 16 13:54:50 scratch kernel: [] __might_sleep+0xfa/0x130 Jun 16 13:54:50 scratch kernel: [] mutex_lock_nested+0x3c/0x3a0 Jun 16 13:54:50 scratch kernel: [] kernfs_notify+0x8a/0x140 Jun 16 13:54:50 scratch kernel: [] bitmap_endwrite+0xb9/0x230 Jun 16 13:54:50 scratch kernel: [] close_write+0x93/0xb0 [raid1] Jun 16 13:54:50 scratch kernel: [] r1_bio_write_done+0x29/0x50 [raid1] Jun 16 13:54:50 scratch kernel: [] raid1_end_write_request+0xcf/0x220 [raid1] Jun 16 13:54:50 scratch kernel: [] bio_endio+0x5b/0xa0 Jun 16 13:54:50 scratch kernel: [] blk_update_request+0x90/0x330 Jun 16 13:54:50 scratch kernel: [] blk_update_bidi_request+0x1c/0x80 Jun 16 13:54:50 scratch kernel: [] blk_end_bidi_request+0x1f/0x60 Jun 16 13:54:50 scratch kernel: [] blk_end_request+0x10/0x20 Jun 16 13:54:50 scratch kernel: [] scsi_io_completion+0xf4/0x6e0 Jun 16 13:54:50 scratch kernel: [] scsi_finish_command+0xb3/0x110 Jun 16 13:54:50 scratch kernel: [] scsi_softirq_done+0x137/0x160 Jun 16 13:54:50 scratch kernel: [] blk_done_softirq+0x90/0xb0 Jun 16 13:54:50 scratch kernel: [] __do_softirq+0x125/0x300 Jun 16 13:54:50 scratch kernel: [] irq_exit+0xc5/0xd0 Jun 16 13:54:50 scratch kernel: [] smp_call_function_single_interrupt+0x35/0x40 Jun 16 13:54:50 scratch kernel: [] call_function_single_interrupt+0x72/0x80 Jun 16 13:54:50 scratch kernel: [] ? native_safe_halt+0x6/0x10 Jun 16 13:54:50 scratch kernel: [] ? trace_hardirqs_on+0xd/0x10 Jun 16 13:54:50 scratch kernel: [] default_idle+0x24/0xe0 Jun 16 13:54:50 scratch kernel: [] arch_cpu_idle+0xf/0x20 Jun 16 13:54:50 scratch kernel: [] cpu_startup_entry+0x2cb/0x450 Jun 16 13:54:50 scratch kernel: [] ? clockevents_register_device+0xbc/0x120 Jun 16 13:54:50 scratch kernel: [] start_secondary+0x1de/0x290 Jun 16 13:54:50 scratch kernel: Jun 16 13:54:50 scratch kernel: = Jun 16 13:54:50 scratch kernel: [ INFO: inconsistent lock state ] Jun 16 13:54:50 scratch kernel: 3.16.0-rc1+ #35 Not tainted Jun 16 13:54:50 scratch kernel: - Jun 16 13:54:50 scratch kernel: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. Jun 16 13:54:50 scratch kernel: swapper/3/0 [HC0[0]:SC1[1]:HE0:SE0] takes: Jun 16 13:54:50 scratch kernel: (kernfs_mutex){+.?.+.}, at: [] kernfs_notify+0x8a/0x140 Jun 16 13:54:50 scratch kernel: {SOFTIRQ-ON-W} state was registered at: Jun 16 13:54:50 scratch kernel: [] __lock_acquire+0x316/0x1a60 Jun 16 13:54:50 scratch kernel: [] lock_acquire+0xa2/0x130 Jun 16 13:54:50 scratch kernel: [] mutex_lock_nested+0x6a/0x3a0 Jun 16 13:54:50 scratch kernel: [] kernfs_activate+0x1f/0xf0 Jun 16 13:54:50 scratch kernel: [] kernfs_create_root+0xe8/0x110 Jun 16 13:54:50 scratch kernel: [] sysfs_init+0x13/0x51 Jun 16 13:54:50 scratch kernel: [] mnt_init+0x118/0x225 Jun 16 13:54:50 scratch kernel: [] vfs_caches_init+0x99/0x115 Jun 16 13:54:50 scratch kernel: [] start_kernel+0x3ec/0x443 Jun 16 13:54:50 scratch kernel: [] x86_64_start_reservations+0x2a/0x2c Jun 16 13:54:50 scratch kernel: [] x86_64_start_kernel+0x13e/0x14d Jun 16 13:54:50 scratch kernel: irq event stamp: 211959 Jun 16 13:54:50 scratch kernel: hardirqs last enabled at (211958): [] _raw_spin_unlock_irqrestore+0x36/0x70 Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959): [] _raw_spin_lock_irqsave+0x29/0x70 Jun 16 13:54:50 scratch kernel: softirqs last enabled at (211952): [] _local_bh_enable+0x22/0x50 Jun 16 13:54:50
3.16-rc1 kernel BUG triggered at mutex.c:586
Trying to boot 3.16.0-rc1+ (the + is a small vlan related patch) on a tests system triggers the following BUG: Jun 16 13:54:50 scratch kernel: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 Jun 16 13:54:50 scratch kernel: in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/3 Jun 16 13:54:50 scratch kernel: 1 lock held by swapper/3/0: Jun 16 13:54:50 scratch kernel: #0: ((bitmap-counts.lock)-rlock){..-...}, at: [815729e8] bitmap_endwrite+0x58/0x230 Jun 16 13:54:50 scratch kernel: irq event stamp: 211959 Jun 16 13:54:50 scratch kernel: hardirqs last enabled at (211958): [8170f4d6] _raw_spin_unlock_irqrestore+0x36/0x70 Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959): [8170fba9] _raw_spin_lock_irqsave+0x29/0x70 Jun 16 13:54:50 scratch kernel: softirqs last enabled at (211952): [8108bab2] _local_bh_enable+0x22/0x50 Jun 16 13:54:50 scratch kernel: softirqs last disabled at (211953): [8108cc15] irq_exit+0xc5/0xd0 Jun 16 13:54:50 scratch kernel: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.16.0-rc1+ #35 Jun 16 13:54:50 scratch kernel: Hardware name: Hewlett-Packard HP xw8400 Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010 Jun 16 13:54:50 scratch kernel: 88047fcc3bd0 81706ef2 81c857e0 Jun 16 13:54:50 scratch kernel: 88047fcc3be0 810b4dba 88047fcc3c50 8170c79c Jun 16 13:54:50 scratch kernel: 81c85938 0046 0092 Jun 16 13:54:50 scratch kernel: Call Trace: Jun 16 13:54:50 scratch kernel: IRQ [81706ef2] dump_stack+0x4d/0x66 Jun 16 13:54:50 scratch kernel: [810b4dba] __might_sleep+0xfa/0x130 Jun 16 13:54:50 scratch kernel: [8170c79c] mutex_lock_nested+0x3c/0x3a0 Jun 16 13:54:50 scratch kernel: [81270f5a] kernfs_notify+0x8a/0x140 Jun 16 13:54:50 scratch kernel: [81572a49] bitmap_endwrite+0xb9/0x230 Jun 16 13:54:50 scratch kernel: [a027a823] close_write+0x93/0xb0 [raid1] Jun 16 13:54:50 scratch kernel: [a027af89] r1_bio_write_done+0x29/0x50 [raid1] Jun 16 13:54:50 scratch kernel: [a027c98f] raid1_end_write_request+0xcf/0x220 [raid1] Jun 16 13:54:50 scratch kernel: [8132f0bb] bio_endio+0x5b/0xa0 Jun 16 13:54:50 scratch kernel: [81335f60] blk_update_request+0x90/0x330 Jun 16 13:54:50 scratch kernel: [8133621c] blk_update_bidi_request+0x1c/0x80 Jun 16 13:54:50 scratch kernel: [8133656f] blk_end_bidi_request+0x1f/0x60 Jun 16 13:54:50 scratch kernel: [813365c0] blk_end_request+0x10/0x20 Jun 16 13:54:50 scratch kernel: [81497194] scsi_io_completion+0xf4/0x6e0 Jun 16 13:54:50 scratch kernel: [8148c613] scsi_finish_command+0xb3/0x110 Jun 16 13:54:50 scratch kernel: [81496fb7] scsi_softirq_done+0x137/0x160 Jun 16 13:54:50 scratch kernel: [8133cfe0] blk_done_softirq+0x90/0xb0 Jun 16 13:54:50 scratch kernel: [8108c705] __do_softirq+0x125/0x300 Jun 16 13:54:50 scratch kernel: [8108cc15] irq_exit+0xc5/0xd0 Jun 16 13:54:50 scratch kernel: [810439d5] smp_call_function_single_interrupt+0x35/0x40 Jun 16 13:54:50 scratch kernel: [81711552] call_function_single_interrupt+0x72/0x80 Jun 16 13:54:50 scratch kernel: EOI [81053fa6] ? native_safe_halt+0x6/0x10 Jun 16 13:54:50 scratch kernel: [810d891d] ? trace_hardirqs_on+0xd/0x10 Jun 16 13:54:50 scratch kernel: [8101e0c4] default_idle+0x24/0xe0 Jun 16 13:54:50 scratch kernel: [8101ea6f] arch_cpu_idle+0xf/0x20 Jun 16 13:54:50 scratch kernel: [810d0d1b] cpu_startup_entry+0x2cb/0x450 Jun 16 13:54:50 scratch kernel: [810fed4c] ? clockevents_register_device+0xbc/0x120 Jun 16 13:54:50 scratch kernel: [810442ee] start_secondary+0x1de/0x290 Jun 16 13:54:50 scratch kernel: Jun 16 13:54:50 scratch kernel: = Jun 16 13:54:50 scratch kernel: [ INFO: inconsistent lock state ] Jun 16 13:54:50 scratch kernel: 3.16.0-rc1+ #35 Not tainted Jun 16 13:54:50 scratch kernel: - Jun 16 13:54:50 scratch kernel: inconsistent {SOFTIRQ-ON-W} - {IN-SOFTIRQ-W} usage. Jun 16 13:54:50 scratch kernel: swapper/3/0 [HC0[0]:SC1[1]:HE0:SE0] takes: Jun 16 13:54:50 scratch kernel: (kernfs_mutex){+.?.+.}, at: [81270f5a] kernfs_notify+0x8a/0x140 Jun 16 13:54:50 scratch kernel: {SOFTIRQ-ON-W} state was registered at: Jun 16 13:54:50 scratch kernel: [810d8f66] __lock_acquire+0x316/0x1a60 Jun 16 13:54:50 scratch kernel: [810dae72] lock_acquire+0xa2/0x130 Jun 16 13:54:50 scratch kernel: [8170c7ca] mutex_lock_nested+0x6a/0x3a0 Jun 16 13:54:50 scratch kernel: [8127066f] kernfs_activate+0x1f/0xf0 Jun 16 13:54:50 scratch kernel: [81270a08] kernfs_create_root+0xe8/0x110 Jun 16 13:54:50 scratch kernel: [81d73567] sysfs_init+0x13/0x51 Jun 16 13:54:50 scratch kernel: [81d70d06]
Re: [PATCH v2] sctp: Fix sk_ack_backlog wrap-around problem
On 06/11/2014 10:53 PM, Xufeng Zhang wrote: > Consider the scenario: > For a TCP-style socket, while processing the COOKIE_ECHO chunk in > sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, > a new association would be created in sctp_unpack_cookie(), but afterwards, > some processing maybe failed, and sctp_association_free() will be called to > free the previously allocated association, in sctp_association_free(), > sk_ack_backlog value is decremented for this socket, since the initial > value for sk_ack_backlog is 0, after the decrement, it will be 65535, > a wrap-around problem happens, and if we want to establish new associations > afterward in the same socket, ABORT would be triggered since sctp deem the > accept queue as full. > Fix this issue by only decrementing sk_ack_backlog for associations in > the endpoint's list. > > Fix-suggested-by: Neil Horman > Signed-off-by: Xufeng Zhang Acked-by: Vlad Yasevich Thanks -vlad > --- > Change for v2: > Drop the redundant test for temp suggested by Vlad Yasevich. > > net/sctp/associola.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/net/sctp/associola.c b/net/sctp/associola.c > index 39579c3..0b8 100644 > --- a/net/sctp/associola.c > +++ b/net/sctp/associola.c > @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) > /* Only real associations count against the endpoint, so >* don't bother for if this is a temporary association. >*/ > - if (!asoc->temp) { > + if (!list_empty(>asocs)) { > list_del(>asocs); > > /* Decrement the backlog value for a TCP-style listening > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] sctp: Fix sk_ack_backlog wrap-around problem
On 06/11/2014 10:53 PM, Xufeng Zhang wrote: Consider the scenario: For a TCP-style socket, while processing the COOKIE_ECHO chunk in sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, a new association would be created in sctp_unpack_cookie(), but afterwards, some processing maybe failed, and sctp_association_free() will be called to free the previously allocated association, in sctp_association_free(), sk_ack_backlog value is decremented for this socket, since the initial value for sk_ack_backlog is 0, after the decrement, it will be 65535, a wrap-around problem happens, and if we want to establish new associations afterward in the same socket, ABORT would be triggered since sctp deem the accept queue as full. Fix this issue by only decrementing sk_ack_backlog for associations in the endpoint's list. Fix-suggested-by: Neil Horman nhor...@tuxdriver.com Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com Acked-by: Vlad Yasevich vyasev...@gmail.com Thanks -vlad --- Change for v2: Drop the redundant test for temp suggested by Vlad Yasevich. net/sctp/associola.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/sctp/associola.c b/net/sctp/associola.c index 39579c3..0b8 100644 --- a/net/sctp/associola.c +++ b/net/sctp/associola.c @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) /* Only real associations count against the endpoint, so * don't bother for if this is a temporary association. */ - if (!asoc-temp) { + if (!list_empty(asoc-asocs)) { list_del(asoc-asocs); /* Decrement the backlog value for a TCP-style listening -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem
On 06/11/2014 08:55 AM, Vlad Yasevich wrote: > On 06/10/2014 10:37 PM, Xufeng Zhang wrote: >> Consider the scenario: >> For a TCP-style socket, while processing the COOKIE_ECHO chunk in >> sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, >> a new association would be created in sctp_unpack_cookie(), but afterwards, >> some processing maybe failed, and sctp_association_free() will be called to >> free the previously allocated association, in sctp_association_free(), >> sk_ack_backlog value is decremented for this socket, since the initial >> value for sk_ack_backlog is 0, after the decrement, it will be 65535, >> a wrap-around problem happens, and if we want to establish new associations >> afterward in the same socket, ABORT would be triggered since sctp deem the >> accept queue as full. >> Fix this issue by only decrementing sk_ack_backlog for associations in >> the endpoint's list. >> >> Fix-suggested-by: Neil Horman >> Signed-off-by: Xufeng Zhang >> --- >> net/sctp/associola.c |2 +- >> 1 files changed, 1 insertions(+), 1 deletions(-) >> >> diff --git a/net/sctp/associola.c b/net/sctp/associola.c >> index 39579c3..60564f2 100644 >> --- a/net/sctp/associola.c >> +++ b/net/sctp/associola.c >> @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) >> /* Only real associations count against the endpoint, so >> * don't bother for if this is a temporary association. >> */ >> -if (!asoc->temp) { >> +if (!asoc->temp && !list_empty(>asocs)) { >> list_del(>asocs); >> >> /* Decrement the backlog value for a TCP-style listening >> > > I am not crazy about this patch. It's been suggested before that may > be duplicate cookie processing should really be creating a temporary > association since that's is how that association is being used. I had another look at the description for triggering this issue and realized that I was thinking about something else when looking at this solution. There is however no need to test both the list and temp value. We can simply always test the that list is not empty before doing list_del(). -vlad > It > might be nice at that approach. It actually benefits us as the > association destruction would happen immediately instead of being delayed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem
On 06/10/2014 10:37 PM, Xufeng Zhang wrote: > Consider the scenario: > For a TCP-style socket, while processing the COOKIE_ECHO chunk in > sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, > a new association would be created in sctp_unpack_cookie(), but afterwards, > some processing maybe failed, and sctp_association_free() will be called to > free the previously allocated association, in sctp_association_free(), > sk_ack_backlog value is decremented for this socket, since the initial > value for sk_ack_backlog is 0, after the decrement, it will be 65535, > a wrap-around problem happens, and if we want to establish new associations > afterward in the same socket, ABORT would be triggered since sctp deem the > accept queue as full. > Fix this issue by only decrementing sk_ack_backlog for associations in > the endpoint's list. > > Fix-suggested-by: Neil Horman > Signed-off-by: Xufeng Zhang > --- > net/sctp/associola.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/net/sctp/associola.c b/net/sctp/associola.c > index 39579c3..60564f2 100644 > --- a/net/sctp/associola.c > +++ b/net/sctp/associola.c > @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) > /* Only real associations count against the endpoint, so >* don't bother for if this is a temporary association. >*/ > - if (!asoc->temp) { > + if (!asoc->temp && !list_empty(>asocs)) { > list_del(>asocs); > > /* Decrement the backlog value for a TCP-style listening > I am not crazy about this patch. It's been suggested before that may be duplicate cookie processing should really be creating a temporary association since that's is how that association is being used. It might be nice at that approach. It actually benefits us as the association destruction would happen immediately instead of being delayed. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem
On 06/10/2014 10:37 PM, Xufeng Zhang wrote: Consider the scenario: For a TCP-style socket, while processing the COOKIE_ECHO chunk in sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, a new association would be created in sctp_unpack_cookie(), but afterwards, some processing maybe failed, and sctp_association_free() will be called to free the previously allocated association, in sctp_association_free(), sk_ack_backlog value is decremented for this socket, since the initial value for sk_ack_backlog is 0, after the decrement, it will be 65535, a wrap-around problem happens, and if we want to establish new associations afterward in the same socket, ABORT would be triggered since sctp deem the accept queue as full. Fix this issue by only decrementing sk_ack_backlog for associations in the endpoint's list. Fix-suggested-by: Neil Horman nhor...@tuxdriver.com Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com --- net/sctp/associola.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/sctp/associola.c b/net/sctp/associola.c index 39579c3..60564f2 100644 --- a/net/sctp/associola.c +++ b/net/sctp/associola.c @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) /* Only real associations count against the endpoint, so * don't bother for if this is a temporary association. */ - if (!asoc-temp) { + if (!asoc-temp !list_empty(asoc-asocs)) { list_del(asoc-asocs); /* Decrement the backlog value for a TCP-style listening I am not crazy about this patch. It's been suggested before that may be duplicate cookie processing should really be creating a temporary association since that's is how that association is being used. It might be nice at that approach. It actually benefits us as the association destruction would happen immediately instead of being delayed. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem
On 06/11/2014 08:55 AM, Vlad Yasevich wrote: On 06/10/2014 10:37 PM, Xufeng Zhang wrote: Consider the scenario: For a TCP-style socket, while processing the COOKIE_ECHO chunk in sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check, a new association would be created in sctp_unpack_cookie(), but afterwards, some processing maybe failed, and sctp_association_free() will be called to free the previously allocated association, in sctp_association_free(), sk_ack_backlog value is decremented for this socket, since the initial value for sk_ack_backlog is 0, after the decrement, it will be 65535, a wrap-around problem happens, and if we want to establish new associations afterward in the same socket, ABORT would be triggered since sctp deem the accept queue as full. Fix this issue by only decrementing sk_ack_backlog for associations in the endpoint's list. Fix-suggested-by: Neil Horman nhor...@tuxdriver.com Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com --- net/sctp/associola.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/sctp/associola.c b/net/sctp/associola.c index 39579c3..60564f2 100644 --- a/net/sctp/associola.c +++ b/net/sctp/associola.c @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc) /* Only real associations count against the endpoint, so * don't bother for if this is a temporary association. */ -if (!asoc-temp) { +if (!asoc-temp !list_empty(asoc-asocs)) { list_del(asoc-asocs); /* Decrement the backlog value for a TCP-style listening I am not crazy about this patch. It's been suggested before that may be duplicate cookie processing should really be creating a temporary association since that's is how that association is being used. I had another look at the description for triggering this issue and realized that I was thinking about something else when looking at this solution. There is however no need to test both the list and temp value. We can simply always test the that list is not empty before doing list_del(). -vlad It might be nice at that approach. It actually benefits us as the association destruction would happen immediately instead of being delayed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PULL 2/2] vhost: replace rcu with mutex
On 06/03/2014 08:48 AM, Paolo Bonzini wrote: > Il 02/06/2014 23:58, Eric Dumazet ha scritto: >> This looks dubious >> >> What about using kfree_rcu() instead ? > > It would lead to unbound allocation from userspace. > >> translate_desc() still uses rcu_read_lock(), its not clear if the mutex >> is really held. > > Yes, vhost_get_vq_desc must be called with the vq mutex held. > > The rcu_read_lock/unlock in translate_desc is unnecessary. > If that's true, then does dev->memory really needs to be rcu protected? It appears to always be read under mutex. -vlad > Paolo > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PULL 2/2] vhost: replace rcu with mutex
On 06/03/2014 08:48 AM, Paolo Bonzini wrote: Il 02/06/2014 23:58, Eric Dumazet ha scritto: This looks dubious What about using kfree_rcu() instead ? It would lead to unbound allocation from userspace. translate_desc() still uses rcu_read_lock(), its not clear if the mutex is really held. Yes, vhost_get_vq_desc must be called with the vq mutex held. The rcu_read_lock/unlock in translate_desc is unnecessary. If that's true, then does dev-memory really needs to be rcu protected? It appears to always be read under mutex. -vlad Paolo -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net] bridge: notify user space after fdb update
On 05/27/2014 06:45 PM, Jon Maxwell wrote: > There has been a number incidents recently where customers running KVM have > reported that VM hosts on different Hypervisors are unreachable. Based on > pcap traces we found that the bridge was broadcasting the ARP request out > onto the network. However some NICs have an inbuilt switch which on occasions > were broadcasting the VMs ARP request back through the physical NIC on the > Hypervisor. This resulted in the bridge changing ports and incorrectly > learning > that the VMs mac address was external. As a result the ARP reply was directed > back onto the external network and VM never updated it's ARP cache. This patch > will notify the bridge command, after a fdb has been updated to identify such > port toggling. > > Signed-off-by: Jon Maxwell > --- > net/bridge/br_fdb.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c > index 9203d5a..f3ee2da 100644 > --- a/net/bridge/br_fdb.c > +++ b/net/bridge/br_fdb.c > @@ -487,6 +487,7 @@ void br_fdb_update(struct net_bridge *br, struct > net_bridge_port *source, > { > struct hlist_head *head = >hash[br_mac_hash(addr, vid)]; > struct net_bridge_fdb_entry *fdb; > + bool fdb_modified = 0; > > /* some users want to always flood. */ > if (hold_time(br) == 0) > @@ -507,10 +508,15 @@ void br_fdb_update(struct net_bridge *br, struct > net_bridge_port *source, > source->dev->name); > } else { > /* fastpath: update of existing entry */ > - fdb->dst = source; > + if (unlikely(source != fdb->dst)) { > + fdb->dst = source; > + fdb_modified = 1; > + } This looks over-indented. -vlad > fdb->updated = jiffies; > if (unlikely(added_by_user)) > fdb->added_by_user = 1; > + if (unlikely(fdb_modified)) > + fdb_notify(br, fdb, RTM_NEWNEIGH); > } > } else { > spin_lock(>hash_lock); > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net] bridge: notify user space after fdb update
On 05/27/2014 06:45 PM, Jon Maxwell wrote: There has been a number incidents recently where customers running KVM have reported that VM hosts on different Hypervisors are unreachable. Based on pcap traces we found that the bridge was broadcasting the ARP request out onto the network. However some NICs have an inbuilt switch which on occasions were broadcasting the VMs ARP request back through the physical NIC on the Hypervisor. This resulted in the bridge changing ports and incorrectly learning that the VMs mac address was external. As a result the ARP reply was directed back onto the external network and VM never updated it's ARP cache. This patch will notify the bridge command, after a fdb has been updated to identify such port toggling. Signed-off-by: Jon Maxwell jmaxwel...@gmail.com --- net/bridge/br_fdb.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c index 9203d5a..f3ee2da 100644 --- a/net/bridge/br_fdb.c +++ b/net/bridge/br_fdb.c @@ -487,6 +487,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, { struct hlist_head *head = br-hash[br_mac_hash(addr, vid)]; struct net_bridge_fdb_entry *fdb; + bool fdb_modified = 0; /* some users want to always flood. */ if (hold_time(br) == 0) @@ -507,10 +508,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, source-dev-name); } else { /* fastpath: update of existing entry */ - fdb-dst = source; + if (unlikely(source != fdb-dst)) { + fdb-dst = source; + fdb_modified = 1; + } This looks over-indented. -vlad fdb-updated = jiffies; if (unlikely(added_by_user)) fdb-added_by_user = 1; + if (unlikely(fdb_modified)) + fdb_notify(br, fdb, RTM_NEWNEIGH); } } else { spin_lock(br-hash_lock); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: skbuff truesize incorrect.
On 05/22/2014 03:07 PM, Jim Baxter wrote: > Hi, I was hoping you can help me with some questions. > > I have been investigating a network issue with bursts of network traffic > over USB CDC-NCM, the issue is that the kernel is dropping packets > because sk_rcvqueues_full() returns true due to skb2->truesize is always > 32960 instead of SKB_TRUESIZE(skb2->len) which is about 1800. > > The code I am trying to fix is this code below, it is splitting a set of > multiple network packets compressed into a single 16k packet into > individual skb's and sending them up the network stack. > > skb2 = skb_clone(skb, GFP_ATOMIC); > if (skb2 == NULL) > goto err; > > if (!skb_pull(skb2, index)) { > ret = -EOVERFLOW; > goto err; > } This assumes that you original 16K packet is linear. Is that always the case? > > skb_trim(skb2, dg_len - crc_len); > > My questions are: > > 1) Which buffer size does truesize represent, is it the total buffer or > just the data related to the relevant skb? Total buffer size (including the struct sk_buff). > > 2) If truesize is for the skb it is contained within should it be > updated during the call to skb_trim? No, because the the buffer/memory is still there. skb_trim just sets the tail pointer and adjusts skb->len. > > 3) Why does the truesize default to 32960? Probably because that's how much buffer space was actually allocated for the original skb. When you cloned it, you inherited the truesize since a clone is just the struct sk_buff that points into the data of the original. This is the very same problem that I ran into with SCTP since it has similar code in it. You can play games with truesize manually, but you have to be very careful here. -vlad > > > Thank you for any help, > Jim Baxter. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: skbuff truesize incorrect.
On 05/22/2014 03:07 PM, Jim Baxter wrote: Hi, I was hoping you can help me with some questions. I have been investigating a network issue with bursts of network traffic over USB CDC-NCM, the issue is that the kernel is dropping packets because sk_rcvqueues_full() returns true due to skb2-truesize is always 32960 instead of SKB_TRUESIZE(skb2-len) which is about 1800. The code I am trying to fix is this code below, it is splitting a set of multiple network packets compressed into a single 16k packet into individual skb's and sending them up the network stack. skb2 = skb_clone(skb, GFP_ATOMIC); if (skb2 == NULL) goto err; if (!skb_pull(skb2, index)) { ret = -EOVERFLOW; goto err; } This assumes that you original 16K packet is linear. Is that always the case? skb_trim(skb2, dg_len - crc_len); My questions are: 1) Which buffer size does truesize represent, is it the total buffer or just the data related to the relevant skb? Total buffer size (including the struct sk_buff). 2) If truesize is for the skb it is contained within should it be updated during the call to skb_trim? No, because the the buffer/memory is still there. skb_trim just sets the tail pointer and adjusts skb-len. 3) Why does the truesize default to 32960? Probably because that's how much buffer space was actually allocated for the original skb. When you cloned it, you inherited the truesize since a clone is just the struct sk_buff that points into the data of the original. This is the very same problem that I ran into with SCTP since it has similar code in it. You can play games with truesize manually, but you have to be very careful here. -vlad Thank you for any help, Jim Baxter. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
On 05/20/2014 12:55 AM, valdis.kletni...@vt.edu wrote: > On Mon, 19 May 2014 23:49:22 +0930, David Newall said: > >> How does a packet get fragmented in this case? Does it only happen when >> bridging to a device with smaller MTU? That scenario sounds quite >> un-bridge-like. It also sounds like something that can be handled by >> real routing. > > Which doesn't change the fact that you *will* get clowns who take a box that > has a 10G card on a jumbogram-enabled subnet that's running with an MTU of > 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route. > (Did you know that you can actually mount an NFS filesystem across that? And > that ls and cat and friends will work *just fine*? Until you hit a file that's > more than 1.5 in size, that is. And when you do a traceroute to the wedged > client, it tells you it's on the 10G network, so you have no idea why you're > seeing an MTU issue. Don't ask how I know this - let's just say that > supporting HPC users is never boring. :) > > So yes, we *do* need to do something sensible there - either frag the packet > on the way out, or something. It *would* be nice if we could drop the > packet and send an ICMP Frag Needed back - except it's unclear what IP > you use as the source address for the ICMP > If there is no netfilter, then the bridge will just drop the packet (see br_dev_queue_push_xmit). It should probably also do that with netfilter. On the question of ICMP, I've also debated about sending ICMP Frag Needed, but that's really beyond the scope of the bridge device. Recording a stat might be sufficient to help troubleshoot these types of issues. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the net tree
On 05/19/2014 08:59 PM, Stephen Rothwell wrote: > Hi all, > > After merging the net tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > > In file included from arch/powerpc/net/bpf_jit_comp.c:16:0: > include/linux/if_vlan.h: In function 'vlan_get_encap_level': > include/linux/if_vlan.h:491:2: error: implicit declaration of function > 'vlan_dev_priv' [-Werror=implicit-function-declaration] > return vlan_dev_priv(dev)->nest_level; > ^ > include/linux/if_vlan.h:491:27: error: invalid type argument of '->' (have > 'int') > return vlan_dev_priv(dev)->nest_level; >^ > > Caused by commit 44a4085538c8 ("bonding: Fix stacked device detection > in arp monitoring"). > > vlan_dev_priv() is protected by: > > #if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE) > > but the newly added call in vlan_get_encap_level() is not. > > I have used the net tree from next-20140519 for today. > Right. Apologies. I missed the above fact. Fix forthcoming. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build failure after merge of the net tree
On 05/19/2014 08:59 PM, Stephen Rothwell wrote: Hi all, After merging the net tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: In file included from arch/powerpc/net/bpf_jit_comp.c:16:0: include/linux/if_vlan.h: In function 'vlan_get_encap_level': include/linux/if_vlan.h:491:2: error: implicit declaration of function 'vlan_dev_priv' [-Werror=implicit-function-declaration] return vlan_dev_priv(dev)-nest_level; ^ include/linux/if_vlan.h:491:27: error: invalid type argument of '-' (have 'int') return vlan_dev_priv(dev)-nest_level; ^ Caused by commit 44a4085538c8 (bonding: Fix stacked device detection in arp monitoring). vlan_dev_priv() is protected by: #if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE) but the newly added call in vlan_get_encap_level() is not. I have used the net tree from next-20140519 for today. Right. Apologies. I missed the above fact. Fix forthcoming. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
On 05/20/2014 12:55 AM, valdis.kletni...@vt.edu wrote: On Mon, 19 May 2014 23:49:22 +0930, David Newall said: How does a packet get fragmented in this case? Does it only happen when bridging to a device with smaller MTU? That scenario sounds quite un-bridge-like. It also sounds like something that can be handled by real routing. Which doesn't change the fact that you *will* get clowns who take a box that has a 10G card on a jumbogram-enabled subnet that's running with an MTU of 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route. (Did you know that you can actually mount an NFS filesystem across that? And that ls and cat and friends will work *just fine*? Until you hit a file that's more than 1.5 in size, that is. And when you do a traceroute to the wedged client, it tells you it's on the 10G network, so you have no idea why you're seeing an MTU issue. Don't ask how I know this - let's just say that supporting HPC users is never boring. :) So yes, we *do* need to do something sensible there - either frag the packet on the way out, or something. It *would* be nice if we could drop the packet and send an ICMP Frag Needed back - except it's unclear what IP you use as the source address for the ICMP If there is no netfilter, then the bridge will just drop the packet (see br_dev_queue_push_xmit). It should probably also do that with netfilter. On the question of ICMP, I've also debated about sending ICMP Frag Needed, but that's really beyond the scope of the bridge device. Recording a stat might be sufficient to help troubleshoot these types of issues. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] macvlan: Fix checksum errors when ip_summed is CHECKSUM_PARTIAL
On 05/19/2014 01:24 PM, Michael Spang wrote: > Changing ip_summed from CHECKSUM_PARTIAL to CHECKSUM_UNNECESSARY > will result in an incorrect checksum if the packet is sent off the box. > > Cc: sta...@vger.kernel.org > Signed-off-by: Michael Spang > --- > drivers/net/macvlan.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c > index 753a8c2..806b56a 100644 > --- a/drivers/net/macvlan.c > +++ b/drivers/net/macvlan.c > @@ -267,7 +267,9 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct > net_device *dev) > > if (vlan->mode == MACVLAN_MODE_BRIDGE) { > const struct ethhdr *eth = (void *)skb->data; > - skb->ip_summed = CHECKSUM_UNNECESSARY; > + > + if (skb->ip_summed == CHECKSUM_NONE) > + skb->ip_summed = CHECKSUM_UNNECESSARY; > > /* send to other bridge ports directly */ > if (is_multicast_ether_addr(eth->h_dest)) { > Already done. See commit f114890cdf84d753f6b41cd0cc44ba51d16313da -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] macvlan: Fix checksum errors when ip_summed is CHECKSUM_PARTIAL
On 05/19/2014 01:24 PM, Michael Spang wrote: Changing ip_summed from CHECKSUM_PARTIAL to CHECKSUM_UNNECESSARY will result in an incorrect checksum if the packet is sent off the box. Cc: sta...@vger.kernel.org Signed-off-by: Michael Spang sp...@chromium.org --- drivers/net/macvlan.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 753a8c2..806b56a 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -267,7 +267,9 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) if (vlan-mode == MACVLAN_MODE_BRIDGE) { const struct ethhdr *eth = (void *)skb-data; - skb-ip_summed = CHECKSUM_UNNECESSARY; + + if (skb-ip_summed == CHECKSUM_NONE) + skb-ip_summed = CHECKSUM_UNNECESSARY; /* send to other bridge ports directly */ if (is_multicast_ether_addr(eth-h_dest)) { Already done. See commit f114890cdf84d753f6b41cd0cc44ba51d16313da -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes
On 04/30/2014 06:59 PM, Luis R. Rodriguez wrote: > On Wed, Apr 30, 2014 at 04:04:34PM -0400, Vlad Yasevich wrote: >> On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: >>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >>> index 54d207d..dcd9378 100644 >>> --- a/net/bridge/br_if.c >>> +++ b/net/bridge/br_if.c >>> @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct >>> net_bridge *br, >>> features &= ~NETIF_F_ONE_FOR_ALL; >>> >>> list_for_each_entry(p, >port_list, list) { >>> + if (p->flags & BR_ROOT_BLOCK) >>> + continue; >>> features = netdev_increment_features(features, >>> p->dev->features, mask); >>> } >>> >> Hi Luis >> >> The hunk above isn't right. Just because you set ROOT_BLOCK on the port >> doesn't mean that you should ignore it's device features. If the device >> you just added happens to disable or enable some device offload feature, >> you should take that into account. > > OK thanks, how about this part: > > On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: >> On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote: >> On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote: >>> On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez wrote: >>>> On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote: >>>>> On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez >>>>> wrote: >>>>>> spin_unlock_bh(>br->lock); >>>>>> + if (changed) >>>>>> + >>>>>> call_netdevice_notifiers(NETDEV_CHANGEADDR, >>>>>> +p->br->dev); >>>>>> + netdev_update_features(p->br->dev); >>>>> This is actually just a part of it. You also need to handle the sysfs changing the flag. Look at the first 2 patches in this series: http://www.spinics.net/lists/netdev/msg280863.html You might need that functionality. -vlad >>>>> I think this is supposed to be in netdev event handler of br->dev >>>>> instead of here. >>>> >>>> Do you mean netdev_update_features() ? I mimic'd what was being done on >>>> br_del_if() given that root blocking is doing something similar. If >>>> we need to change something for the above then I suppose it means we need >>>> to change br_del_if() too. Let me know if you see any reason for something >>>> else. >>>> >>> >>> Yeah, for me it looks like it's better to call netdev_update_features() >>> in the event handler of br->dev, rather than where calling >>> call_netdevice_notifiers(..., br->dev);. >> >> I still don't see why, in fact trying to verify this I am wondering now >> if instead we should actually fix br_features_recompute() to take into >> consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features() >> is called above even if the MAC address did not change, just as is done >> on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more >> appropriate we just call >> >> call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p->br->dev) >> >> for both the above then and also br_del_if()? > > Luis > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes
On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: > On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote: >> On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote: >>> On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez wrote: On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote: > On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez > wrote: >> spin_unlock_bh(>br->lock); >> + if (changed) >> + >> call_netdevice_notifiers(NETDEV_CHANGEADDR, >> +p->br->dev); >> + netdev_update_features(p->br->dev); > > I think this is supposed to be in netdev event handler of br->dev > instead of here. Do you mean netdev_update_features() ? I mimic'd what was being done on br_del_if() given that root blocking is doing something similar. If we need to change something for the above then I suppose it means we need to change br_del_if() too. Let me know if you see any reason for something else. >>> >>> Yeah, for me it looks like it's better to call netdev_update_features() >>> in the event handler of br->dev, rather than where calling >>> call_netdevice_notifiers(..., br->dev);. >> >> I still don't see why, in fact trying to verify this I am wondering now >> if instead we should actually fix br_features_recompute() to take into >> consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features() >> is called above even if the MAC address did not change, just as is done >> on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more >> appropriate we just call >> >> call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p->br->dev) >> >> for both the above then and also br_del_if()? How about the below >> change? >> >> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >> index 54d207d..dcd9378 100644 >> --- a/net/bridge/br_if.c >> +++ b/net/bridge/br_if.c >> @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct >> net_bridge *br, >> features &= ~NETIF_F_ONE_FOR_ALL; >> >> list_for_each_entry(p, >port_list, list) { >> +if (p->flags & BR_ROOT_BLOCK) >> +continue; >> features = netdev_increment_features(features, >> p->dev->features, mask); >> } > > Cong, can you provide feedback on this? I tried to grow confidence on the > hunk above but its not clear but the other points still hold and I'd love > your feedback on those. > Hi Luis The hunk above isn't right. Just because you set ROOT_BLOCK on the port doesn't mean that you should ignore it's device features. If the device you just added happens to disable or enable some device offload feature, you should take that into account. Thanks -vlad > Luis > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] bridge: preserve random init MAC address
On 04/22/2014 03:41 PM, Luis R. Rodriguez wrote: > On Wed, Mar 19, 2014 at 7:05 PM, Luis R. Rodriguez wrote: >> On Tue, Mar 18, 2014 at 08:10:56PM -0700, Stephen Hemminger wrote: >>> On Wed, 12 Mar 2014 20:15:25 -0700 >>> "Luis R. Rodriguez" wrote: >>> As it is now if you add create a bridge it gets started with a random MAC address and if you then add a net_device as a slave but later kick it out you end up with a zero MAC address. Instead preserve the original random MAC address and use it. >>> >>> What is supposed to happen is that the recalculate chooses >>> the lowest MAC address of the slaves. If there are no slaves >>> it might as well just calculate a new random value. There is >>> not great merit in preserving the original defunct address. >>> >>> Or something like this >>> --- a/net/bridge/br_stp_if.c 2014-02-12 08:21:56.733857356 -0800 >>> +++ b/net/bridge/br_stp_if.c 2014-03-18 20:09:09.334388826 -0700 >>> @@ -235,6 +235,9 @@ bool br_stp_recalculate_bridge_id(struct >>> addr = p->dev->dev_addr; >>> >>> } >>> + >>> + if (addr == br_mac_zero) >>> + return false; /* keep original address */ >>> >>> if (ether_addr_equal(br->bridge_id.addr, addr)) >>> return false; /* no change */ >>> >>> that just keeps the old value. >> >> The old value could be a port which got root blocked, I think >> it can be confusing to see that happen. Either way feel free to >> make the call, I'll provide more details below on perhaps one reason >> to keep the original MAC address. > > Stephen, I'd like to respin this series to address all pending > feedback, I'd still like your feedback / call / judgement on this > part. I'm fine either way, just wanted to ensure I highlight the > reasoning of why I kept the original random MAC address. Please keep > in mind that at this point I'm convinced bridging is the *wrong* > solution to networking with guests but it is being used in a lot of > current topologies, this would just help with smoothing out corner > cases. > Hi Luis I took at this series again, I think it might make to generate a new random address instead of storing the original. There is not much point in storing the original since it's been gone, and most likely not advertised anywhere anyway. As for virtualization configuration you described, a lot of times there is a single port in the down state that remains. At least that's how libvirt manages it's bridged networks. Thanks -vlad >>> The bridge is in a meaningless state when there are no ports, >> >> Some virtualization topologies may want a backend with no link (or >> perhaps one which is dynamic, with the option to have none) to the >> internet but just a bridge so guests sharing the bridge can talk to >> each other. In this case bridging can be used only to link the >> batch of guests. >> >> In this case the bridge simply acts as a switch, but also can be used as the >> interface for DHCP, for example. In such a case the guests will be doing >> ARP to get to the DHCP server. There's a flurry of ways one can try to get >> all this meshed together including enablign an ARP proxy but I'm looking >> at ways to optimize this further -- but I'd like to address the current >> usage cases first. >> >>> and when the first port is added back it will be used as the >>> new bridge id. >> >> Sure. Let me know how you think we should proceed with this patch based >> on the above. > > Thanks in advance. > > Luis > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] bridge: preserve random init MAC address
On 04/22/2014 03:41 PM, Luis R. Rodriguez wrote: On Wed, Mar 19, 2014 at 7:05 PM, Luis R. Rodriguez mcg...@suse.com wrote: On Tue, Mar 18, 2014 at 08:10:56PM -0700, Stephen Hemminger wrote: On Wed, 12 Mar 2014 20:15:25 -0700 Luis R. Rodriguez mcg...@do-not-panic.com wrote: As it is now if you add create a bridge it gets started with a random MAC address and if you then add a net_device as a slave but later kick it out you end up with a zero MAC address. Instead preserve the original random MAC address and use it. What is supposed to happen is that the recalculate chooses the lowest MAC address of the slaves. If there are no slaves it might as well just calculate a new random value. There is not great merit in preserving the original defunct address. Or something like this --- a/net/bridge/br_stp_if.c 2014-02-12 08:21:56.733857356 -0800 +++ b/net/bridge/br_stp_if.c 2014-03-18 20:09:09.334388826 -0700 @@ -235,6 +235,9 @@ bool br_stp_recalculate_bridge_id(struct addr = p-dev-dev_addr; } + + if (addr == br_mac_zero) + return false; /* keep original address */ if (ether_addr_equal(br-bridge_id.addr, addr)) return false; /* no change */ that just keeps the old value. The old value could be a port which got root blocked, I think it can be confusing to see that happen. Either way feel free to make the call, I'll provide more details below on perhaps one reason to keep the original MAC address. Stephen, I'd like to respin this series to address all pending feedback, I'd still like your feedback / call / judgement on this part. I'm fine either way, just wanted to ensure I highlight the reasoning of why I kept the original random MAC address. Please keep in mind that at this point I'm convinced bridging is the *wrong* solution to networking with guests but it is being used in a lot of current topologies, this would just help with smoothing out corner cases. Hi Luis I took at this series again, I think it might make to generate a new random address instead of storing the original. There is not much point in storing the original since it's been gone, and most likely not advertised anywhere anyway. As for virtualization configuration you described, a lot of times there is a single port in the down state that remains. At least that's how libvirt manages it's bridged networks. Thanks -vlad The bridge is in a meaningless state when there are no ports, Some virtualization topologies may want a backend with no link (or perhaps one which is dynamic, with the option to have none) to the internet but just a bridge so guests sharing the bridge can talk to each other. In this case bridging can be used only to link the batch of guests. In this case the bridge simply acts as a switch, but also can be used as the interface for DHCP, for example. In such a case the guests will be doing ARP to get to the DHCP server. There's a flurry of ways one can try to get all this meshed together including enablign an ARP proxy but I'm looking at ways to optimize this further -- but I'd like to address the current usage cases first. and when the first port is added back it will be used as the new bridge id. Sure. Let me know how you think we should proceed with this patch based on the above. Thanks in advance. Luis -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes
On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote: On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote: On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez mcg...@suse.com wrote: On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote: On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez mcg...@do-not-panic.com wrote: spin_unlock_bh(p-br-lock); + if (changed) + call_netdevice_notifiers(NETDEV_CHANGEADDR, +p-br-dev); + netdev_update_features(p-br-dev); I think this is supposed to be in netdev event handler of br-dev instead of here. Do you mean netdev_update_features() ? I mimic'd what was being done on br_del_if() given that root blocking is doing something similar. If we need to change something for the above then I suppose it means we need to change br_del_if() too. Let me know if you see any reason for something else. Yeah, for me it looks like it's better to call netdev_update_features() in the event handler of br-dev, rather than where calling call_netdevice_notifiers(..., br-dev);. I still don't see why, in fact trying to verify this I am wondering now if instead we should actually fix br_features_recompute() to take into consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features() is called above even if the MAC address did not change, just as is done on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more appropriate we just call call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p-br-dev) for both the above then and also br_del_if()? How about the below change? diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index 54d207d..dcd9378 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct net_bridge *br, features = ~NETIF_F_ONE_FOR_ALL; list_for_each_entry(p, br-port_list, list) { +if (p-flags BR_ROOT_BLOCK) +continue; features = netdev_increment_features(features, p-dev-features, mask); } Cong, can you provide feedback on this? I tried to grow confidence on the hunk above but its not clear but the other points still hold and I'd love your feedback on those. Hi Luis The hunk above isn't right. Just because you set ROOT_BLOCK on the port doesn't mean that you should ignore it's device features. If the device you just added happens to disable or enable some device offload feature, you should take that into account. Thanks -vlad Luis -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes
On 04/30/2014 06:59 PM, Luis R. Rodriguez wrote: On Wed, Apr 30, 2014 at 04:04:34PM -0400, Vlad Yasevich wrote: On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index 54d207d..dcd9378 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct net_bridge *br, features = ~NETIF_F_ONE_FOR_ALL; list_for_each_entry(p, br-port_list, list) { + if (p-flags BR_ROOT_BLOCK) + continue; features = netdev_increment_features(features, p-dev-features, mask); } Hi Luis The hunk above isn't right. Just because you set ROOT_BLOCK on the port doesn't mean that you should ignore it's device features. If the device you just added happens to disable or enable some device offload feature, you should take that into account. OK thanks, how about this part: On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote: On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote: On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote: On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez mcg...@suse.com wrote: On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote: On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez mcg...@do-not-panic.com wrote: spin_unlock_bh(p-br-lock); + if (changed) + call_netdevice_notifiers(NETDEV_CHANGEADDR, +p-br-dev); + netdev_update_features(p-br-dev); This is actually just a part of it. You also need to handle the sysfs changing the flag. Look at the first 2 patches in this series: http://www.spinics.net/lists/netdev/msg280863.html You might need that functionality. -vlad I think this is supposed to be in netdev event handler of br-dev instead of here. Do you mean netdev_update_features() ? I mimic'd what was being done on br_del_if() given that root blocking is doing something similar. If we need to change something for the above then I suppose it means we need to change br_del_if() too. Let me know if you see any reason for something else. Yeah, for me it looks like it's better to call netdev_update_features() in the event handler of br-dev, rather than where calling call_netdevice_notifiers(..., br-dev);. I still don't see why, in fact trying to verify this I am wondering now if instead we should actually fix br_features_recompute() to take into consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features() is called above even if the MAC address did not change, just as is done on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more appropriate we just call call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p-br-dev) for both the above then and also br_del_if()? Luis -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] sctp: reset flowi4_oif parameter on route lookup
On 04/25/2014 04:55 AM, Xufeng Zhang wrote: > commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is > in output route lookups.) introduces another regression which > is very similar to the problem of commit e6b45241c (ipv4: reset > flowi parameters on route connect) wants to fix: > Before we call ip_route_output_key() in sctp_v4_get_dst() to > get a dst that matches a bind address as the source address, > we have already called this function previously and the flowi > parameters have been initialized including flowi4_oif, so when > we call this function again, the process in __ip_route_output_key() > will be different because of the setting of flowi4_oif, and we'll > get a networking device which corresponds to the inputted flowi4_oif > as the output device, this is wrong because we'll never hit this > place if the previously returned source address of dst match one > of the bound addresses. > > To reproduce this problem, a vlan setting is enough: > # ifconfig eth0 up > # route del default > # vconfig add eth0 2 > # vconfig add eth0 3 > # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0 > # route add default gw 10.0.1.254 dev eth0.2 > # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0 > # ip rule add from 10.0.0.14 table 4 > # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3 > # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I > You'll detect that all the flow are routed to eth0.2(10.0.1.254). > > Signed-off-by: Xufeng Zhang > Signed-off-by: Julian Anastasov Acked-by: Vlad Yasevich -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] sctp: reset flowi4_oif parameter on route lookup
On 04/25/2014 04:55 AM, Xufeng Zhang wrote: commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is in output route lookups.) introduces another regression which is very similar to the problem of commit e6b45241c (ipv4: reset flowi parameters on route connect) wants to fix: Before we call ip_route_output_key() in sctp_v4_get_dst() to get a dst that matches a bind address as the source address, we have already called this function previously and the flowi parameters have been initialized including flowi4_oif, so when we call this function again, the process in __ip_route_output_key() will be different because of the setting of flowi4_oif, and we'll get a networking device which corresponds to the inputted flowi4_oif as the output device, this is wrong because we'll never hit this place if the previously returned source address of dst match one of the bound addresses. To reproduce this problem, a vlan setting is enough: # ifconfig eth0 up # route del default # vconfig add eth0 2 # vconfig add eth0 3 # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0 # route add default gw 10.0.1.254 dev eth0.2 # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0 # ip rule add from 10.0.0.14 table 4 # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3 # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I You'll detect that all the flow are routed to eth0.2(10.0.1.254). Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com Signed-off-by: Julian Anastasov j...@ssi.bg Acked-by: Vlad Yasevich vyasev...@gmail.com -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/03/2014 04:13 AM, Christian Borntraeger wrote: > On 02/03/14 02:21, Vlad Yasevich wrote: >> On 03/01/2014 02:27 PM, Vlad Yasevich wrote: >>> On 03/01/2014 06:15 AM, Christian Borntraeger wrote: >>>> On 28/02/14 23:14, Vlad Yasevich wrote: >>>>> On 02/27/2014 03:52 PM, Christian Borntraeger wrote: >>>>>> Vlad, >>>>>> >>>>>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 >>>>>> macvtap: Add support of packet capture on macvtap device. >>>>>> >>>>>> causes a performance regression for iperf traffic between two KVM guests >>>>>> on my s390 system. Both guests are connected via two macvtaps on the >>>>>> same OSA >>>>>> network card. >>>>>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get >>>>>> ~4Gbit/sec >>>>>> >>>>>> Latency seems to be unchanges (uperf 1byte ping pong). >>>>>> >>>>>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with >>>>>> this >>>>>> patch and ~ 4 bytes without. So for some reason this patch causes >>>>>> the >>>>>> network stack to do segmentation. (the guest kernel stays the same, only >>>>>> host >>>>>> kernel is changed). >>>>>> >>>>>> Any ideas? >>>>> >>>>> I am looking. It shouldn't cause addition segmentations and when I ran >>>>> netperf on the code I didn't see any difference in the throughput. >>>> >>>> Dont know if the different bytes/packets ratio is really the reason or >>>> just a side effect. As a hint: the underlying network device does not >>>> support >>>> segmentation, but this should not matter for traffic between to guests. >>> >>> Could you post 'ethtool -k' output for both lower-level device and the >>> macvtap device? >>> >>> Thanks >>> -vlad >>> >> >> Ok. I think I see what's happening. Since you turn off offloads on >> lower device, that's propagated to macvlan device. As a result, when >> when we call dev_queue_xmit on the vlan->dev, we end up segmenting since >> lower level says it does support segmentation. >> >> One way to fix this is to never disable offloads on macvlan. macvlan >> will always try to use __dev_queue_xmit() with it's lower device, so any >> segmentation can happen there. > > If you have anything that I should test, let me know. Hi Christian Just sent out a patch to fix this. I tried it with namespaces and kvm guests and it seems to restore performance for me. Please give it a try. Thanks -vlad > > Christian > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/03/2014 04:13 AM, Christian Borntraeger wrote: On 02/03/14 02:21, Vlad Yasevich wrote: On 03/01/2014 02:27 PM, Vlad Yasevich wrote: On 03/01/2014 06:15 AM, Christian Borntraeger wrote: On 28/02/14 23:14, Vlad Yasevich wrote: On 02/27/2014 03:52 PM, Christian Borntraeger wrote: Vlad, commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 macvtap: Add support of packet capture on macvtap device. causes a performance regression for iperf traffic between two KVM guests on my s390 system. Both guests are connected via two macvtaps on the same OSA network card. Before that patch I get ~20 Gbit/sec between two guests, afterwards I get ~4Gbit/sec Latency seems to be unchanges (uperf 1byte ping pong). According to ifconfig in the guest, I have ~ 1500 bytes per packet with this patch and ~ 4 bytes without. So for some reason this patch causes the network stack to do segmentation. (the guest kernel stays the same, only host kernel is changed). Any ideas? I am looking. It shouldn't cause addition segmentations and when I ran netperf on the code I didn't see any difference in the throughput. Dont know if the different bytes/packets ratio is really the reason or just a side effect. As a hint: the underlying network device does not support segmentation, but this should not matter for traffic between to guests. Could you post 'ethtool -k' output for both lower-level device and the macvtap device? Thanks -vlad Ok. I think I see what's happening. Since you turn off offloads on lower device, that's propagated to macvlan device. As a result, when when we call dev_queue_xmit on the vlan-dev, we end up segmenting since lower level says it does support segmentation. One way to fix this is to never disable offloads on macvlan. macvlan will always try to use __dev_queue_xmit() with it's lower device, so any segmentation can happen there. If you have anything that I should test, let me know. Hi Christian Just sent out a patch to fix this. I tried it with namespaces and kvm guests and it seems to restore performance for me. Please give it a try. Thanks -vlad Christian -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/01/2014 02:27 PM, Vlad Yasevich wrote: > On 03/01/2014 06:15 AM, Christian Borntraeger wrote: >> On 28/02/14 23:14, Vlad Yasevich wrote: >>> On 02/27/2014 03:52 PM, Christian Borntraeger wrote: >>>> Vlad, >>>> >>>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 >>>> macvtap: Add support of packet capture on macvtap device. >>>> >>>> causes a performance regression for iperf traffic between two KVM guests >>>> on my s390 system. Both guests are connected via two macvtaps on the same >>>> OSA >>>> network card. >>>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get >>>> ~4Gbit/sec >>>> >>>> Latency seems to be unchanges (uperf 1byte ping pong). >>>> >>>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with >>>> this >>>> patch and ~ 4 bytes without. So for some reason this patch causes the >>>> network stack to do segmentation. (the guest kernel stays the same, only >>>> host >>>> kernel is changed). >>>> >>>> Any ideas? >>> >>> I am looking. It shouldn't cause addition segmentations and when I ran >>> netperf on the code I didn't see any difference in the throughput. >> >> Dont know if the different bytes/packets ratio is really the reason or >> just a side effect. As a hint: the underlying network device does not support >> segmentation, but this should not matter for traffic between to guests. > > Could you post 'ethtool -k' output for both lower-level device and the > macvtap device? > > Thanks > -vlad > Ok. I think I see what's happening. Since you turn off offloads on lower device, that's propagated to macvlan device. As a result, when when we call dev_queue_xmit on the vlan->dev, we end up segmenting since lower level says it does support segmentation. One way to fix this is to never disable offloads on macvlan. macvlan will always try to use __dev_queue_xmit() with it's lower device, so any segmentation can happen there. -vlad >> >> Maybe you remember, we had a similar situation with commit >> 3e4f8b787370978733ca6cae452720a4f0c296b8 >> (macvtap: Perform GSO on forwarding path), the setup is basically the same. >> >> >> Christian >> >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/01/2014 06:15 AM, Christian Borntraeger wrote: > On 28/02/14 23:14, Vlad Yasevich wrote: >> On 02/27/2014 03:52 PM, Christian Borntraeger wrote: >>> Vlad, >>> >>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 >>> macvtap: Add support of packet capture on macvtap device. >>> >>> causes a performance regression for iperf traffic between two KVM guests >>> on my s390 system. Both guests are connected via two macvtaps on the same >>> OSA >>> network card. >>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get >>> ~4Gbit/sec >>> >>> Latency seems to be unchanges (uperf 1byte ping pong). >>> >>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with this >>> patch and ~ 4 bytes without. So for some reason this patch causes the >>> network stack to do segmentation. (the guest kernel stays the same, only >>> host >>> kernel is changed). >>> >>> Any ideas? >> >> I am looking. It shouldn't cause addition segmentations and when I ran >> netperf on the code I didn't see any difference in the throughput. > > Dont know if the different bytes/packets ratio is really the reason or > just a side effect. As a hint: the underlying network device does not support > segmentation, but this should not matter for traffic between to guests. Could you post 'ethtool -k' output for both lower-level device and the macvtap device? Thanks -vlad > > Maybe you remember, we had a similar situation with commit > 3e4f8b787370978733ca6cae452720a4f0c296b8 > (macvtap: Perform GSO on forwarding path), the setup is basically the same. > > > Christian > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/01/2014 06:15 AM, Christian Borntraeger wrote: On 28/02/14 23:14, Vlad Yasevich wrote: On 02/27/2014 03:52 PM, Christian Borntraeger wrote: Vlad, commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 macvtap: Add support of packet capture on macvtap device. causes a performance regression for iperf traffic between two KVM guests on my s390 system. Both guests are connected via two macvtaps on the same OSA network card. Before that patch I get ~20 Gbit/sec between two guests, afterwards I get ~4Gbit/sec Latency seems to be unchanges (uperf 1byte ping pong). According to ifconfig in the guest, I have ~ 1500 bytes per packet with this patch and ~ 4 bytes without. So for some reason this patch causes the network stack to do segmentation. (the guest kernel stays the same, only host kernel is changed). Any ideas? I am looking. It shouldn't cause addition segmentations and when I ran netperf on the code I didn't see any difference in the throughput. Dont know if the different bytes/packets ratio is really the reason or just a side effect. As a hint: the underlying network device does not support segmentation, but this should not matter for traffic between to guests. Could you post 'ethtool -k' output for both lower-level device and the macvtap device? Thanks -vlad Maybe you remember, we had a similar situation with commit 3e4f8b787370978733ca6cae452720a4f0c296b8 (macvtap: Perform GSO on forwarding path), the setup is basically the same. Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 03/01/2014 02:27 PM, Vlad Yasevich wrote: On 03/01/2014 06:15 AM, Christian Borntraeger wrote: On 28/02/14 23:14, Vlad Yasevich wrote: On 02/27/2014 03:52 PM, Christian Borntraeger wrote: Vlad, commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 macvtap: Add support of packet capture on macvtap device. causes a performance regression for iperf traffic between two KVM guests on my s390 system. Both guests are connected via two macvtaps on the same OSA network card. Before that patch I get ~20 Gbit/sec between two guests, afterwards I get ~4Gbit/sec Latency seems to be unchanges (uperf 1byte ping pong). According to ifconfig in the guest, I have ~ 1500 bytes per packet with this patch and ~ 4 bytes without. So for some reason this patch causes the network stack to do segmentation. (the guest kernel stays the same, only host kernel is changed). Any ideas? I am looking. It shouldn't cause addition segmentations and when I ran netperf on the code I didn't see any difference in the throughput. Dont know if the different bytes/packets ratio is really the reason or just a side effect. As a hint: the underlying network device does not support segmentation, but this should not matter for traffic between to guests. Could you post 'ethtool -k' output for both lower-level device and the macvtap device? Thanks -vlad Ok. I think I see what's happening. Since you turn off offloads on lower device, that's propagated to macvlan device. As a result, when when we call dev_queue_xmit on the vlan-dev, we end up segmenting since lower level says it does support segmentation. One way to fix this is to never disable offloads on macvlan. macvlan will always try to use __dev_queue_xmit() with it's lower device, so any segmentation can happen there. -vlad Maybe you remember, we had a similar situation with commit 3e4f8b787370978733ca6cae452720a4f0c296b8 (macvtap: Perform GSO on forwarding path), the setup is basically the same. Christian -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 02/27/2014 03:52 PM, Christian Borntraeger wrote: > Vlad, > > commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 > macvtap: Add support of packet capture on macvtap device. > > causes a performance regression for iperf traffic between two KVM guests > on my s390 system. Both guests are connected via two macvtaps on the same OSA > network card. > Before that patch I get ~20 Gbit/sec between two guests, afterwards I get > ~4Gbit/sec > > Latency seems to be unchanges (uperf 1byte ping pong). > > According to ifconfig in the guest, I have ~ 1500 bytes per packet with this > patch and ~ 4 bytes without. So for some reason this patch causes the > network stack to do segmentation. (the guest kernel stays the same, only host > kernel is changed). > > Any ideas? I am looking. It shouldn't cause addition segmentations and when I ran netperf on the code I didn't see any difference in the throughput. -vlad > > > Christian > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1
On 02/27/2014 03:52 PM, Christian Borntraeger wrote: Vlad, commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 macvtap: Add support of packet capture on macvtap device. causes a performance regression for iperf traffic between two KVM guests on my s390 system. Both guests are connected via two macvtaps on the same OSA network card. Before that patch I get ~20 Gbit/sec between two guests, afterwards I get ~4Gbit/sec Latency seems to be unchanges (uperf 1byte ping pong). According to ifconfig in the guest, I have ~ 1500 bytes per packet with this patch and ~ 4 bytes without. So for some reason this patch causes the network stack to do segmentation. (the guest kernel stays the same, only host kernel is changed). Any ideas? I am looking. It shouldn't cause addition segmentations and when I ran netperf on the code I didn't see any difference in the throughput. -vlad Christian -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval
On 02/18/2014 10:39 AM, David Laight wrote: >>> + >>> + /* Update the heartbeat timer immediately. */ >>> + if (!mod_timer(>hb_timer, >>> + sctp_transport_timeout(trans))) >>> + sctp_transport_hold(trans); >> >> This is causes a rather inconsistent behavior in that it doesn't >> account of the amount of time left on the hb timer. Thus it doesn't >> always do what commit log implies. If the remaining time is shorter >> then the new timeout value, it will take longer till the next HB >> the application expects. > > Being able to stop heartbeats being sent by repeatedly configuring > the interval is certainly not a good idea! > >> If the application has very strict timing requirement on the HBs, it >> should trigger the HB manually. >> >> We could rework the code to allow the combination of >> SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows: >> 1) update the timer to the new value >> 2) Send an immediate HB >> a) Update the hb timeout. >> >> 2a should probably be done regardless since it can cause 2 HB to be >> outstanding at the same time. > > Sending a heartbeat 'on demand' might be problematic if one > has also just been sent (from the timer). Yes, you are right. This will trigger an rto doubling since we'll be in a state where we have an outstanding HB without an ack. > > I'd have thought that you wouldn't want to send a heartbeat while > still expecting a response from the previous one (this might require > splitting the time interval in two). This one is interesting. Technically, a user can request an immediate HB any time, and if the user requests it, we should honor the request. Also, the user may not know that there is an outstanding HB. I'll need to think about this a bit. Thanks for bringing this up. -vlad > > David > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval
On 02/18/2014 12:56 AM, Xufeng Zhang wrote: > For an established association, if user has updated the HB.interval > parameter by setsockopt(), this new heartbeat interval will not > take effect until: > - the expiry of the heartbeat timer and new hearbeat is sent. > - DATA chunk has been sent and the transport resets the timer. > This could not meet the requirement of the user who need to > get HEARTBEAT sent at the specified time. > > Thus, we need to update the heartbeat timer immediately after > user has changed HB.interval. > > Signed-off-by: Xufeng Zhang > --- > net/sctp/socket.c |5 + > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c > index 9e91d6e..699ae1e 100644 > --- a/net/sctp/socket.c > +++ b/net/sctp/socket.c > @@ -2344,6 +2344,11 @@ static int sctp_apply_peer_addr_params(struct > sctp_paddrparams *params, > if (trans) { > trans->hbinterval = > msecs_to_jiffies(params->spp_hbinterval); > + > + /* Update the heartbeat timer immediately. */ > + if (!mod_timer(>hb_timer, > + sctp_transport_timeout(trans))) > + sctp_transport_hold(trans); This is causes a rather inconsistent behavior in that it doesn't account of the amount of time left on the hb timer. Thus it doesn't always do what commit log implies. If the remaining time is shorter then the new timeout value, it will take longer till the next HB the application expects. If the application has very strict timing requirement on the HBs, it should trigger the HB manually. We could rework the code to allow the combination of SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows: 1) update the timer to the new value 2) Send an immediate HB a) Update the hb timeout. 2a should probably be done regardless since it can cause 2 HB to be outstanding at the same time. -vlad > } else if (asoc) { > asoc->hbinterval = > msecs_to_jiffies(params->spp_hbinterval); > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval
On 02/18/2014 12:56 AM, Xufeng Zhang wrote: For an established association, if user has updated the HB.interval parameter by setsockopt(), this new heartbeat interval will not take effect until: - the expiry of the heartbeat timer and new hearbeat is sent. - DATA chunk has been sent and the transport resets the timer. This could not meet the requirement of the user who need to get HEARTBEAT sent at the specified time. Thus, we need to update the heartbeat timer immediately after user has changed HB.interval. Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com --- net/sctp/socket.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index 9e91d6e..699ae1e 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -2344,6 +2344,11 @@ static int sctp_apply_peer_addr_params(struct sctp_paddrparams *params, if (trans) { trans-hbinterval = msecs_to_jiffies(params-spp_hbinterval); + + /* Update the heartbeat timer immediately. */ + if (!mod_timer(trans-hb_timer, + sctp_transport_timeout(trans))) + sctp_transport_hold(trans); This is causes a rather inconsistent behavior in that it doesn't account of the amount of time left on the hb timer. Thus it doesn't always do what commit log implies. If the remaining time is shorter then the new timeout value, it will take longer till the next HB the application expects. If the application has very strict timing requirement on the HBs, it should trigger the HB manually. We could rework the code to allow the combination of SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows: 1) update the timer to the new value 2) Send an immediate HB a) Update the hb timeout. 2a should probably be done regardless since it can cause 2 HB to be outstanding at the same time. -vlad } else if (asoc) { asoc-hbinterval = msecs_to_jiffies(params-spp_hbinterval); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval
On 02/18/2014 10:39 AM, David Laight wrote: + + /* Update the heartbeat timer immediately. */ + if (!mod_timer(trans-hb_timer, + sctp_transport_timeout(trans))) + sctp_transport_hold(trans); This is causes a rather inconsistent behavior in that it doesn't account of the amount of time left on the hb timer. Thus it doesn't always do what commit log implies. If the remaining time is shorter then the new timeout value, it will take longer till the next HB the application expects. Being able to stop heartbeats being sent by repeatedly configuring the interval is certainly not a good idea! If the application has very strict timing requirement on the HBs, it should trigger the HB manually. We could rework the code to allow the combination of SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows: 1) update the timer to the new value 2) Send an immediate HB a) Update the hb timeout. 2a should probably be done regardless since it can cause 2 HB to be outstanding at the same time. Sending a heartbeat 'on demand' might be problematic if one has also just been sent (from the timer). Yes, you are right. This will trigger an rto doubling since we'll be in a state where we have an outstanding HB without an ack. I'd have thought that you wouldn't want to send a heartbeat while still expecting a response from the previous one (this might require splitting the time interval in two). This one is interesting. Technically, a user can request an immediate HB any time, and if the user requests it, we should honor the request. Also, the user may not know that there is an outstanding HB. I'll need to think about this a bit. Thanks for bringing this up. -vlad David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap
On 01/10/2014 02:06 AM, Jason Wang wrote: > On 01/10/2014 05:39 AM, Stephen Hemminger wrote: >> On Thu, 09 Jan 2014 16:55:07 +0800 >> Jason Wang wrote: >> >>> What if use do want a qdisc and want to change the its queue length for >>> tun/macvlan? And the the name tx_queue_length is misleading. For tun it >>> may make sense since it was used in transmission path. For macvtap it >>> was not. So maybe what we need is just a new ioctl for both tun/macvtap >>> and a new feature flag. If user create the device with new feature flag, >>> the socket receive queue length could be changed by ioctl instead of >>> dev->tx_queue_length. If not, the old behaviour could be kept. >> The overloading of tx_queue_len in macvtap was the original design mistake. >> Can't this just be undone and expose rx_queue_len as sysfs attribute? > > That works. But we current allow user to change the socket sndbuf > through TUNSNDBUF. Maybe we need a similar one for receive. > That would make sense. Since the user interacts with tun fd almost as a socket and there is actually a socket hiding in the kernel, it almost begs for actual SO_SNDBUF/SO_RCVBUF support :) -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap
On 01/10/2014 02:06 AM, Jason Wang wrote: On 01/10/2014 05:39 AM, Stephen Hemminger wrote: On Thu, 09 Jan 2014 16:55:07 +0800 Jason Wang jasow...@redhat.com wrote: What if use do want a qdisc and want to change the its queue length for tun/macvlan? And the the name tx_queue_length is misleading. For tun it may make sense since it was used in transmission path. For macvtap it was not. So maybe what we need is just a new ioctl for both tun/macvtap and a new feature flag. If user create the device with new feature flag, the socket receive queue length could be changed by ioctl instead of dev-tx_queue_length. If not, the old behaviour could be kept. The overloading of tx_queue_len in macvtap was the original design mistake. Can't this just be undone and expose rx_queue_len as sysfs attribute? That works. But we current allow user to change the socket sndbuf through TUNSNDBUF. Maybe we need a similar one for receive. That would make sense. Since the user interacts with tun fd almost as a socket and there is actually a socket hiding in the kernel, it almost begs for actual SO_SNDBUF/SO_RCVBUF support :) -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch net-next v3] ipv6: log autoconfiguration failures
On 12/13/2013 10:45 AM, Denys Vlasenko wrote: > If ipv6 auto-configuration does not work, currently it's hard > to track what's going on. This change adds log messages > (at debug level) on every code path where ipv6 autoconf fails. > > v3: changed pr_debug's to pr_warn's. > > Signed-off-by: Denys Vlasenko > --- > net/ipv6/addrconf.c | 45 - > 1 file changed, 36 insertions(+), 9 deletions(-) > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index 3c3425e..0b354f0 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -1694,8 +1694,11 @@ static void addrconf_leave_anycast(struct inet6_ifaddr > *ifp) > > static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev) > { > - if (dev->addr_len != ETH_ALEN) > + if (dev->addr_len != ETH_ALEN) { > + pr_warn("IPv6 addrconf: %s: address length %d != %s\n", > + dev->name, dev->addr_len, "ETH_ALEN"); > return -1; > + } > memcpy(eui, dev->dev_addr, 3); > memcpy(eui + 5, dev->dev_addr + 3, 3); > > @@ -1725,8 +1728,11 @@ static int addrconf_ifid_eui48(u8 *eui, struct > net_device *dev) > > static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev) > { > - if (dev->addr_len != IEEE802154_ADDR_LEN) > + if (dev->addr_len != IEEE802154_ADDR_LEN) { > + pr_warn("IPv6 addrconf: %s: address length %d != %s\n", > + dev->name, dev->addr_len, "IEEE802154_ADDR_LEN"); > return -1; > + } > memcpy(eui, dev->dev_addr, 8); > eui[0] ^= 2; > return 0; > @@ -1736,8 +1742,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct > net_device *dev) > { > union fwnet_hwaddr *ha; > > - if (dev->addr_len != FWNET_ALEN) > + if (dev->addr_len != FWNET_ALEN) { > + pr_warn("IPv6 addrconf: %s: address length %d != %s\n", > + dev->name, dev->addr_len, "FWNET_ALEN"); > return -1; > + } > > ha = (union fwnet_hwaddr *)dev->dev_addr; > > @@ -1749,8 +1758,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct > net_device *dev) > static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev) > { > /* XXX: inherit EUI-64 from other interface -- yoshfuji */ > - if (dev->addr_len != ARCNET_ALEN) > + if (dev->addr_len != ARCNET_ALEN) { > + pr_warn("IPv6 addrconf: %s: address length %d != %s\n", > + dev->name, dev->addr_len, "ARCNET_ALEN"); > return -1; > + } > memset(eui, 0, 7); > eui[7] = *(u8 *)dev->dev_addr; > return 0; > @@ -1758,17 +1770,25 @@ static int addrconf_ifid_arcnet(u8 *eui, struct > net_device *dev) > > static int addrconf_ifid_infiniband(u8 *eui, struct net_device *dev) > { > - if (dev->addr_len != INFINIBAND_ALEN) > + if (dev->addr_len != INFINIBAND_ALEN) { > + pr_warn("IPv6 addrconf: %s: address length %d != %s\n", > + dev->name, dev->addr_len, "INFINIBAND_ALEN"); > return -1; > + } > memcpy(eui, dev->dev_addr + 12, 8); > eui[0] |= 2; > return 0; > } > > -static int __ipv6_isatap_ifid(u8 *eui, __be32 addr) > +static int __ipv6_isatap_ifid(u8 *eui, struct net_device *dev) > { > - if (addr == 0) > + __be32 addr = *(__be32 *)dev->dev_addr; > + > + if (addr == 0) { > + pr_warn("IPv6 addrconf: %s: bad dev_addr %pM\n", > + dev->name, dev->dev_addr); > return -1; > + } > eui[0] = (ipv4_is_zeronet(addr) || ipv4_is_private_10(addr) || > ipv4_is_loopback(addr) || ipv4_is_linklocal_169(addr) || > ipv4_is_private_172(addr) || ipv4_is_test_192(addr) || > @@ -1785,13 +1805,14 @@ static int __ipv6_isatap_ifid(u8 *eui, __be32 addr) > static int addrconf_ifid_sit(u8 *eui, struct net_device *dev) > { > if (dev->priv_flags & IFF_ISATAP) > - return __ipv6_isatap_ifid(eui, *(__be32 *)dev->dev_addr); > + return __ipv6_isatap_ifid(eui, dev); > + pr_warn("IPv6 addrconf: %s: IFF_ISATAP is unset\n", dev->name); > return -1; > } > > static int addrconf_ifid_gre(u8 *eui, struct net_device *dev) > { > - return __ipv6_isatap_ifid(eui, *(__be32 *)dev->dev_addr); > + return __ipv6_isatap_ifid(eui, dev); > } > > static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev) > @@ -1825,6 +1846,8 @@ static int ipv6_generate_eui64(u8 *eui, struct > net_device *dev) > case ARPHRD_TUNNEL6: > return addrconf_ifid_ip6tnl(eui, dev); > } > + pr_warn("IPv6 addrconf: %s: dev->type %d is not supported\n", > + dev->name, dev->type); > return -1; > } This one should probably be a pr_debug or counter. This is not a critical issue and it makes no sense to spam the log if IPv6 is not supported on a particular interface. > > @@ -1842,6 +1865,10 @@
Re: [patch net-next v3] ipv6: log autoconfiguration failures
On 12/13/2013 10:45 AM, Denys Vlasenko wrote: If ipv6 auto-configuration does not work, currently it's hard to track what's going on. This change adds log messages (at debug level) on every code path where ipv6 autoconf fails. v3: changed pr_debug's to pr_warn's. Signed-off-by: Denys Vlasenko dvlas...@redhat.com --- net/ipv6/addrconf.c | 45 - 1 file changed, 36 insertions(+), 9 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 3c3425e..0b354f0 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1694,8 +1694,11 @@ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp) static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev) { - if (dev-addr_len != ETH_ALEN) + if (dev-addr_len != ETH_ALEN) { + pr_warn(IPv6 addrconf: %s: address length %d != %s\n, + dev-name, dev-addr_len, ETH_ALEN); return -1; + } memcpy(eui, dev-dev_addr, 3); memcpy(eui + 5, dev-dev_addr + 3, 3); @@ -1725,8 +1728,11 @@ static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev) static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev) { - if (dev-addr_len != IEEE802154_ADDR_LEN) + if (dev-addr_len != IEEE802154_ADDR_LEN) { + pr_warn(IPv6 addrconf: %s: address length %d != %s\n, + dev-name, dev-addr_len, IEEE802154_ADDR_LEN); return -1; + } memcpy(eui, dev-dev_addr, 8); eui[0] ^= 2; return 0; @@ -1736,8 +1742,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct net_device *dev) { union fwnet_hwaddr *ha; - if (dev-addr_len != FWNET_ALEN) + if (dev-addr_len != FWNET_ALEN) { + pr_warn(IPv6 addrconf: %s: address length %d != %s\n, + dev-name, dev-addr_len, FWNET_ALEN); return -1; + } ha = (union fwnet_hwaddr *)dev-dev_addr; @@ -1749,8 +1758,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct net_device *dev) static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev) { /* XXX: inherit EUI-64 from other interface -- yoshfuji */ - if (dev-addr_len != ARCNET_ALEN) + if (dev-addr_len != ARCNET_ALEN) { + pr_warn(IPv6 addrconf: %s: address length %d != %s\n, + dev-name, dev-addr_len, ARCNET_ALEN); return -1; + } memset(eui, 0, 7); eui[7] = *(u8 *)dev-dev_addr; return 0; @@ -1758,17 +1770,25 @@ static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev) static int addrconf_ifid_infiniband(u8 *eui, struct net_device *dev) { - if (dev-addr_len != INFINIBAND_ALEN) + if (dev-addr_len != INFINIBAND_ALEN) { + pr_warn(IPv6 addrconf: %s: address length %d != %s\n, + dev-name, dev-addr_len, INFINIBAND_ALEN); return -1; + } memcpy(eui, dev-dev_addr + 12, 8); eui[0] |= 2; return 0; } -static int __ipv6_isatap_ifid(u8 *eui, __be32 addr) +static int __ipv6_isatap_ifid(u8 *eui, struct net_device *dev) { - if (addr == 0) + __be32 addr = *(__be32 *)dev-dev_addr; + + if (addr == 0) { + pr_warn(IPv6 addrconf: %s: bad dev_addr %pM\n, + dev-name, dev-dev_addr); return -1; + } eui[0] = (ipv4_is_zeronet(addr) || ipv4_is_private_10(addr) || ipv4_is_loopback(addr) || ipv4_is_linklocal_169(addr) || ipv4_is_private_172(addr) || ipv4_is_test_192(addr) || @@ -1785,13 +1805,14 @@ static int __ipv6_isatap_ifid(u8 *eui, __be32 addr) static int addrconf_ifid_sit(u8 *eui, struct net_device *dev) { if (dev-priv_flags IFF_ISATAP) - return __ipv6_isatap_ifid(eui, *(__be32 *)dev-dev_addr); + return __ipv6_isatap_ifid(eui, dev); + pr_warn(IPv6 addrconf: %s: IFF_ISATAP is unset\n, dev-name); return -1; } static int addrconf_ifid_gre(u8 *eui, struct net_device *dev) { - return __ipv6_isatap_ifid(eui, *(__be32 *)dev-dev_addr); + return __ipv6_isatap_ifid(eui, dev); } static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev) @@ -1825,6 +1846,8 @@ static int ipv6_generate_eui64(u8 *eui, struct net_device *dev) case ARPHRD_TUNNEL6: return addrconf_ifid_ip6tnl(eui, dev); } + pr_warn(IPv6 addrconf: %s: dev-type %d is not supported\n, + dev-name, dev-type); return -1; } This one should probably be a pr_debug or counter. This is not a critical issue and it makes no sense to spam the log if IPv6 is not supported on a particular interface. @@ -1842,6 +1865,10 @@ static int ipv6_inherit_eui64(u8 *eui, struct inet6_dev *idev) } } read_unlock_bh(idev-lock); + if (err) + pr_warn(IPv6
Re: [patch net-next v2] ipv6: log autoconfiguration failures
On 12/12/2013 12:22 PM, David Miller wrote: > From: Denys Vlasenko > Date: Thu, 12 Dec 2013 12:17:42 +0100 > >> I can easily imagine their frustration. Kernel _knows_ why >> it didn't work, and it's not expected to normally pappen, >> why didn't it tell anything about it? > > Packets are dropped silently, ARP fails and entries go stale silently, > none of this is logged with kernel messages, why is ipv6 autoconf so > unique and important to justify different behavior? But most of the changes from V2 report an actual error condition in the kernel: addr_len is not set correctly for the device type. These changes are worth keeping and not just as pr_debug. Having a counter of these failures will not provide much useful data if you happen to have multiple such device and where one happens to work and the other doesn't (very unlikely, I know). > > Give it statistics just like we have for every other kind of similar > event. The fact that link-local address could not be generated for a specific device type does not lend itself well to counting. There is no way from to tell from the counter which device does not support IPv6 autoconfig. -vlad -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch net-next v2] ipv6: log autoconfiguration failures
On 12/12/2013 12:22 PM, David Miller wrote: From: Denys Vlasenko dvlas...@redhat.com Date: Thu, 12 Dec 2013 12:17:42 +0100 I can easily imagine their frustration. Kernel _knows_ why it didn't work, and it's not expected to normally pappen, why didn't it tell anything about it? Packets are dropped silently, ARP fails and entries go stale silently, none of this is logged with kernel messages, why is ipv6 autoconf so unique and important to justify different behavior? But most of the changes from V2 report an actual error condition in the kernel: addr_len is not set correctly for the device type. These changes are worth keeping and not just as pr_debug. Having a counter of these failures will not provide much useful data if you happen to have multiple such device and where one happens to work and the other doesn't (very unlikely, I know). Give it statistics just like we have for every other kind of similar event. The fact that link-local address could not be generated for a specific device type does not lend itself well to counting. There is no way from to tell from the counter which device does not support IPv6 autoconfig. -vlad -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net V3 2/2] macvtap: signal truncated packets
On 12/11/2013 12:08 AM, Jason Wang wrote: > macvtap_put_user() never return a value grater than iov length, this in fact > bypasses the truncated checking in macvtap_recvmsg(). Fix this by always > returning the size of packet plus the possible vlan header to let the trunca > checking work. > > Cc: Vlad Yasevich > Cc: Zhi Yong Wu > Cc: Michael S. Tsirkin > Signed-off-by: Jason Wang Acked-by: Vlad Yasevich -vlad > --- > Changes from V2: > - use total as the bytes of packet returned > Changes from V1: > - increase total unconditionally > - do not move the structure veth out of the vlan handling block > --- > drivers/net/macvtap.c | 11 ++- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c > index 957cc5c..2a89da0 100644 > --- a/drivers/net/macvtap.c > +++ b/drivers/net/macvtap.c > @@ -770,7 +770,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, > int ret; > int vnet_hdr_len = 0; > int vlan_offset = 0; > - int copied; > + int copied, total; > > if (q->flags & IFF_VNET_HDR) { > struct virtio_net_hdr vnet_hdr; > @@ -785,7 +785,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, > if (memcpy_toiovecend(iv, (void *)_hdr, 0, > sizeof(vnet_hdr))) > return -EFAULT; > } > - copied = vnet_hdr_len; > + total = copied = vnet_hdr_len; > + total += skb->len; > > if (!vlan_tx_tag_present(skb)) > len = min_t(int, skb->len, len); > @@ -800,6 +801,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, > > vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto); > len = min_t(int, skb->len + VLAN_HLEN, len); > + total += VLAN_HLEN; > > copy = min_t(int, vlan_offset, len); > ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy); > @@ -817,10 +819,9 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, > } > > ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len); > - copied += len; > > done: > - return ret ? ret : copied; > + return ret ? ret : total; > } > > static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb, > @@ -875,7 +876,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const > struct iovec *iv, > } > > ret = macvtap_do_read(q, iocb, iv, len, file->f_flags & O_NONBLOCK); > - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */ > + ret = min_t(ssize_t, ret, len); > if (ret > 0) > iocb->ki_pos = ret; > out: > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net V3 2/2] macvtap: signal truncated packets
On 12/11/2013 12:08 AM, Jason Wang wrote: macvtap_put_user() never return a value grater than iov length, this in fact bypasses the truncated checking in macvtap_recvmsg(). Fix this by always returning the size of packet plus the possible vlan header to let the trunca checking work. Cc: Vlad Yasevich vyasev...@gmail.com Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com Cc: Michael S. Tsirkin m...@redhat.com Signed-off-by: Jason Wang jasow...@redhat.com Acked-by: Vlad Yasevich vyasev...@gmail.com -vlad --- Changes from V2: - use total as the bytes of packet returned Changes from V1: - increase total unconditionally - do not move the structure veth out of the vlan handling block --- drivers/net/macvtap.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 957cc5c..2a89da0 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -770,7 +770,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, int ret; int vnet_hdr_len = 0; int vlan_offset = 0; - int copied; + int copied, total; if (q-flags IFF_VNET_HDR) { struct virtio_net_hdr vnet_hdr; @@ -785,7 +785,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, if (memcpy_toiovecend(iv, (void *)vnet_hdr, 0, sizeof(vnet_hdr))) return -EFAULT; } - copied = vnet_hdr_len; + total = copied = vnet_hdr_len; + total += skb-len; if (!vlan_tx_tag_present(skb)) len = min_t(int, skb-len, len); @@ -800,6 +801,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto); len = min_t(int, skb-len + VLAN_HLEN, len); + total += VLAN_HLEN; copy = min_t(int, vlan_offset, len); ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy); @@ -817,10 +819,9 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, } ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len); - copied += len; done: - return ret ? ret : copied; + return ret ? ret : total; } static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb, @@ -875,7 +876,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, } ret = macvtap_do_read(q, iocb, iv, len, file-f_flags O_NONBLOCK); - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */ + ret = min_t(ssize_t, ret, len); if (ret 0) iocb-ki_pos = ret; out: -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()
On 12/09/2013 08:36 PM, David Miller wrote: > From: Zhi Yong Wu > Date: Sat, 7 Dec 2013 04:55:00 +0800 > >> From: Zhi Yong Wu >> >> By checking related codes, it is impossible that ret > len or total_len, >> so we should remove some useless codes in both above functions. >> >> Signed-off-by: Zhi Yong Wu > > Applied. Wait a sec. We want to be able to return a value bigger then len to trigger a MSG_TRUNC. Jason has patches for to fix this. If you apply this, we'll have to re-introduce this code back in. Same goes for patch 1/2. -vlad > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()
On 12/09/2013 08:36 PM, David Miller wrote: From: Zhi Yong Wu zwu.ker...@gmail.com Date: Sat, 7 Dec 2013 04:55:00 +0800 From: Zhi Yong Wu wu...@linux.vnet.ibm.com By checking related codes, it is impossible that ret len or total_len, so we should remove some useless codes in both above functions. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com Applied. Wait a sec. We want to be able to return a value bigger then len to trigger a MSG_TRUNC. Jason has patches for to fix this. If you apply this, we'll have to re-introduce this code back in. Same goes for patch 1/2. -vlad -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net 1/2] tun: unbreak truncated packet signalling
On 12/09/2013 05:55 AM, Michael S. Tsirkin wrote: > On Mon, Dec 09, 2013 at 06:25:16PM +0800, Jason Wang wrote: >> Commit 6680ec68eff47d36f67b4351bc9836fd6cba9532 >> (tuntap: hardware vlan tx support) breaks the truncated packet signal by never >> return a length greater than iov length in tun_put_user(). This patch fixes this >> by always return the length of packet plus possible vlan header. Caller can >> detect the truncated packet by comparing the return value and the size of iov >> length. >> >> Reported-by: Vlad Yasevich >> Cc: Vlad Yasevich >> Cc: Zhi Yong Wu >> Signed-off-by: Jason Wang > > So writer gets back a value greater than what was written? > >> --- >> The patch is needed for stable. >> --- >> drivers/net/tun.c | 23 --- >> 1 file changed, 12 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c >> index e26cbea..dd1bd7a 100644 >> --- a/drivers/net/tun.c >> +++ b/drivers/net/tun.c >> @@ -1183,7 +1183,11 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> const struct iovec *iv, int len) >> { >> struct tun_pi pi = { 0, skb->protocol }; >> -ssize_t total = 0; >> +struct { >> +__be16 h_vlan_proto; >> +__be16 h_vlan_TCI; >> +} veth; >> +ssize_t total = 0, off = 0; > > Why off = 0 here? > We initialize it to total unconditionally, don't we? > >> int vlan_offset = 0; >> >> if (!(tun->flags & TUN_NO_PI)) { >> @@ -1248,14 +1252,11 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> total += tun->vnet_hdr_sz; >> } >> >> +off = total; >> if (!vlan_tx_tag_present(skb)) { >> len = min_t(int, skb->len, len); >> } else { >> int copy, ret; >> -struct { >> -__be16 h_vlan_proto; >> -__be16 h_vlan_TCI; >> -} veth; >> >> veth.h_vlan_proto = skb->vlan_proto; >> veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb)); >> @@ -1264,22 +1265,22 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> len = min_t(int, skb->len + VLAN_HLEN, len); >> >> copy = min_t(int, vlan_offset, len); >> -ret = skb_copy_datagram_const_iovec(skb, 0, iv, total, copy); >> +ret = skb_copy_datagram_const_iovec(skb, 0, iv, off, copy); >> len -= copy; >> -total += copy; >> +off += copy; >> if (ret || !len) >> goto done; >> >> copy = min_t(int, sizeof(veth), len); >> -ret = memcpy_toiovecend(iv, (void *), total, copy); >> +ret = memcpy_toiovecend(iv, (void *), off, copy); >> len -= copy; >> -total += copy; >> +off += copy; >> if (ret || !len) >> goto done; > > This seems wrong: if one of the branches above is taken, total is > never incremented. > >> } >> >> -skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len); >> -total += len; >> +skb_copy_datagram_const_iovec(skb, vlan_offset, iv, off, len); >> +total += skb->len + (vlan_offset ? sizeof(veth) : 0); >> >> done: >> tun->dev->stats.tx_packets++; > > I also think it's inelegant that the veth struct is now in the > outside scope, and the extra ? is also ugly. > > Here's a smaller patch to fix all these problems - what do you think? > > > > Signed-off-by: Michael S. Tsirkin > > --- > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 782e38b..3297e41 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -1183,7 +1183,7 @@ static ssize_t tun_put_user(struct tun_struct *tun, > const struct iovec *iv, int len) > { > struct tun_pi pi = { 0, skb->protocol }; > - ssize_t total = 0; > + ssize_t total = 0, offset; > int vlan_offset = 0; > > if (!(tun->flags & TUN_NO_PI)) { > @@ -1248,6 +1248,8 @@ static ssize_t tun_put_user(struct tun_struct *tun, > total += tun->vnet_hdr_sz; > } > > + offset = total; > + total += skb->len; > if (!vlan_tx_tag_present(skb)) { > len = min_t(int, skb->len, len); > } else { > @@ -1257,6 +1259,8 @@ static ssize_t tun_put_user(struct tun_s
Re: [PATCH net 1/2] tun: unbreak truncated packet signalling
On 12/09/2013 05:55 AM, Michael S. Tsirkin wrote: On Mon, Dec 09, 2013 at 06:25:16PM +0800, Jason Wang wrote: Commit 6680ec68eff47d36f67b4351bc9836fd6cba9532 (tuntap: hardware vlan tx support) breaks the truncated packet signal by never return a length greater than iov length in tun_put_user(). This patch fixes this by always return the length of packet plus possible vlan header. Caller can detect the truncated packet by comparing the return value and the size of iov length. Reported-by: Vlad Yasevich vyasev...@gmail.com Cc: Vlad Yasevich vyasev...@gmail.com Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com Signed-off-by: Jason Wang jasow...@redhat.com So writer gets back a value greater than what was written? --- The patch is needed for stable. --- drivers/net/tun.c | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index e26cbea..dd1bd7a 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1183,7 +1183,11 @@ static ssize_t tun_put_user(struct tun_struct *tun, const struct iovec *iv, int len) { struct tun_pi pi = { 0, skb-protocol }; -ssize_t total = 0; +struct { +__be16 h_vlan_proto; +__be16 h_vlan_TCI; +} veth; +ssize_t total = 0, off = 0; Why off = 0 here? We initialize it to total unconditionally, don't we? int vlan_offset = 0; if (!(tun-flags TUN_NO_PI)) { @@ -1248,14 +1252,11 @@ static ssize_t tun_put_user(struct tun_struct *tun, total += tun-vnet_hdr_sz; } +off = total; if (!vlan_tx_tag_present(skb)) { len = min_t(int, skb-len, len); } else { int copy, ret; -struct { -__be16 h_vlan_proto; -__be16 h_vlan_TCI; -} veth; veth.h_vlan_proto = skb-vlan_proto; veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb)); @@ -1264,22 +1265,22 @@ static ssize_t tun_put_user(struct tun_struct *tun, len = min_t(int, skb-len + VLAN_HLEN, len); copy = min_t(int, vlan_offset, len); -ret = skb_copy_datagram_const_iovec(skb, 0, iv, total, copy); +ret = skb_copy_datagram_const_iovec(skb, 0, iv, off, copy); len -= copy; -total += copy; +off += copy; if (ret || !len) goto done; copy = min_t(int, sizeof(veth), len); -ret = memcpy_toiovecend(iv, (void *)veth, total, copy); +ret = memcpy_toiovecend(iv, (void *)veth, off, copy); len -= copy; -total += copy; +off += copy; if (ret || !len) goto done; This seems wrong: if one of the branches above is taken, total is never incremented. } -skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len); -total += len; +skb_copy_datagram_const_iovec(skb, vlan_offset, iv, off, len); +total += skb-len + (vlan_offset ? sizeof(veth) : 0); done: tun-dev-stats.tx_packets++; I also think it's inelegant that the veth struct is now in the outside scope, and the extra ? is also ugly. Here's a smaller patch to fix all these problems - what do you think? Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 782e38b..3297e41 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1183,7 +1183,7 @@ static ssize_t tun_put_user(struct tun_struct *tun, const struct iovec *iv, int len) { struct tun_pi pi = { 0, skb-protocol }; - ssize_t total = 0; + ssize_t total = 0, offset; int vlan_offset = 0; if (!(tun-flags TUN_NO_PI)) { @@ -1248,6 +1248,8 @@ static ssize_t tun_put_user(struct tun_struct *tun, total += tun-vnet_hdr_sz; } + offset = total; + total += skb-len; if (!vlan_tx_tag_present(skb)) { len = min_t(int, skb-len, len); } else { @@ -1257,6 +1259,8 @@ static ssize_t tun_put_user(struct tun_struct *tun, __be16 h_vlan_TCI; } veth; + total += sizeof(veth); + veth.h_vlan_proto = skb-vlan_proto; veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb)); @@ -1279,7 +1283,6 @@ static ssize_t tun_put_user(struct tun_struct *tun, } skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len); - total += len; done: tun-dev-stats.tx_packets++; You have to use 'offset' instead of 'total' when doing skb_copy and adjust offset as you write the vlan header. I think something like this will fix it: diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 782e38b..d71c393 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1183,7 +1183,7 @@ static ssize_t
Re: [PATCH 1/2] macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()
On 12/06/2013 03:54 PM, Zhi Yong Wu wrote: > From: Zhi Yong Wu > > By checking related codes, it is impossible that ret > len or total_len, > so we should remove some useless coeds in both above functions. Looks like commit 6680ec68eff47d36f67b4351bc9836fd6cba9532 Author: Jason Wang Date: Thu Jul 25 13:00:33 2013 +0800 tuntap: hardware vlan tx support Introduced a change in tun_put_user() where we can never return a length longer then len or total_len. This has an effect that is now impossible to signal truncated status. It seems like a potential loss of functionality and it might make sense to restore it. -vlad > > Signed-off-by: Zhi Yong Wu > --- > drivers/net/macvtap.c |5 - > 1 files changed, 0 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c > index 4c6f84c..7f4ccdd 100644 > --- a/drivers/net/macvtap.c > +++ b/drivers/net/macvtap.c > @@ -871,7 +871,6 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const > struct iovec *iv, > } > > ret = macvtap_do_read(q, iv, len, file->f_flags & O_NONBLOCK); > - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */ > out: > return ret; > } > @@ -1104,10 +1103,6 @@ static int macvtap_recvmsg(struct kiocb *iocb, struct > socket *sock, > return -EINVAL; > ret = macvtap_do_read(q, m->msg_iov, total_len, > flags & MSG_DONTWAIT); > - if (ret > total_len) { > - m->msg_flags |= MSG_TRUNC; > - ret = flags & MSG_TRUNC ? ret : total_len; > - } > return ret; > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/2] tun: update file current position
On 12/06/2013 12:45 PM, David Miller wrote: > From: Zhi Yong Wu > Date: Fri, 6 Dec 2013 17:08:50 +0800 > >> From: Zhi Yong Wu >> >> Signed-off-by: Zhi Yong Wu > > Also applied and queued up for -stable, thanks. > > I noticed in these two cases that that min_t() adjustment of 'ret' > seems strange. I can't understand why it's needed. > > If, for example, tun_do_read() really did read more than 'len' > bytes: > > 1) That would write past the end of the buffer. > > 2) Writing a different value to the ->ki_pos would mean >that ->ki_pos is now inaccurate. > > Unless someone can explain why the min_t() is needed, we should remove > it. So, back when that code was added, it was actually possible for the tun_do_read to return a value larger then user specified length, but the copy would only be done the length bytes. This was used to signal MSG_TRUNC. Specifically we had the following code in tun_put_user() len = min_t(int, skb->len, len); skb_copy_datagram_const_iovec(skb, 0, iv, total, len); total += skb->len; This has since changed to: len = min_t(int, skb->len, len); ... skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len); total += len; So, no it seems impossible to return a value larger then user specified length, so MSG_TRUNC will never be set. It probably makes sense to signal message truncation, but that seems broken right now. -vlad > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/