from:"Vlad Yasevich"

Re: [PATCH net] net: account for current skb length when deciding about UFO

2017-06-19 Thread Vlad Yasevich

On 06/19/2017 07:03 AM, Michal Kubecek wrote:
> Our customer encountered stuck NFS writes for blocks starting at specific
> offsets w.r.t. page boundary caused by networking stack sending packets via
> UFO enabled device with wrong checksum. The problem can be reproduced by
> composing a long UDP datagram from multiple parts using MSG_MORE flag:
> 
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 3000, 0, ...);
> 
> Assume this packet is to be routed via a device with MTU 1500 and
> NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
> this condition is tested (among others) to decide whether to call
> ip_ufo_append_data():
> 
>   ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
> 
> At the moment, we already have skb with 1028 bytes of data which is not
> marked for GSO so that the test is false (fragheaderlen is usually 20).
> Thus we append second 1000 bytes to this skb without invoking UFO. Third
> sendto(), however, has sufficient length to trigger the UFO path so that we
> end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
> uses udp_csum() to calculate the checksum but that assumes all fragments
> have correct checksum in skb->csum which is not true for UFO fragments.
> 
> When checking against MTU, we need to add skb->len to length of new segment
> if we already have a partially filled skb and fragheaderlen only if there
> isn't one.
> 
> In the IPv6 case, skb can only be null if this is the first segment so that
> we have to use headersize (length of the first IPv6 header) rather than
> fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
> 
> Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
> Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
>   ip6 fragment between __ip6_append_data and ip6_finish_output")
> Signed-off-by: Michal Kubecek <mkube...@suse.cz>

LGTM.

Acked-by: Vlad Yasevich <vyase...@redhat.com>

-vlad

> ---
>  net/ipv4/ip_output.c  | 3 ++-
>  net/ipv6/ip6_output.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7a3fd25e8913..532b36e9ce2a 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -964,7 +964,8 @@ static int __ip_append_data(struct sock *sk,
>   csummode = CHECKSUM_PARTIAL;
>  
>   cork->length += length;
> - if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
> + if length + (skb ? skb->len : fragheaderlen)) > mtu) ||
> +  (skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
>   (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index bf8a58a1c32d..1699acb2fa2c 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1390,7 +1390,7 @@ static int __ip6_append_data(struct sock *sk,
>*/
>  
>   cork->length += length;
> - if length + fragheaderlen) > mtu) ||
> + if length + (skb ? skb->len : headersize)) > mtu) ||
>(skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
>

Re: [PATCH net] net: account for current skb length when deciding about UFO

2017-06-19 Thread Vlad Yasevich

On 06/19/2017 07:03 AM, Michal Kubecek wrote:
> Our customer encountered stuck NFS writes for blocks starting at specific
> offsets w.r.t. page boundary caused by networking stack sending packets via
> UFO enabled device with wrong checksum. The problem can be reproduced by
> composing a long UDP datagram from multiple parts using MSG_MORE flag:
> 
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 3000, 0, ...);
> 
> Assume this packet is to be routed via a device with MTU 1500 and
> NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
> this condition is tested (among others) to decide whether to call
> ip_ufo_append_data():
> 
>   ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
> 
> At the moment, we already have skb with 1028 bytes of data which is not
> marked for GSO so that the test is false (fragheaderlen is usually 20).
> Thus we append second 1000 bytes to this skb without invoking UFO. Third
> sendto(), however, has sufficient length to trigger the UFO path so that we
> end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
> uses udp_csum() to calculate the checksum but that assumes all fragments
> have correct checksum in skb->csum which is not true for UFO fragments.
> 
> When checking against MTU, we need to add skb->len to length of new segment
> if we already have a partially filled skb and fragheaderlen only if there
> isn't one.
> 
> In the IPv6 case, skb can only be null if this is the first segment so that
> we have to use headersize (length of the first IPv6 header) rather than
> fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
> 
> Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
> Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
>   ip6 fragment between __ip6_append_data and ip6_finish_output")
> Signed-off-by: Michal Kubecek 

LGTM.

Acked-by: Vlad Yasevich 

-vlad

> ---
>  net/ipv4/ip_output.c  | 3 ++-
>  net/ipv6/ip6_output.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7a3fd25e8913..532b36e9ce2a 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -964,7 +964,8 @@ static int __ip_append_data(struct sock *sk,
>   csummode = CHECKSUM_PARTIAL;
>  
>   cork->length += length;
> - if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
> + if length + (skb ? skb->len : fragheaderlen)) > mtu) ||
> +  (skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
>   (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index bf8a58a1c32d..1699acb2fa2c 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1390,7 +1390,7 @@ static int __ip6_append_data(struct sock *sk,
>*/
>  
>   cork->length += length;
> - if length + fragheaderlen) > mtu) ||
> + if length + (skb ? skb->len : headersize)) > mtu) ||
>(skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
>

Re: [PATCH 5/5] sctp: Adjust one function call together with a variable assignment

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:41 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 18:15:12 +0200
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> ERROR: do not use assignment in if condition
> 
> Thus fix the affected source code place.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 057479b7bd72..be2fe3ebae78 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -141,7 +141,8 @@ static void sctp_v4_copy_addrlist(struct list_head 
> *addrlist,
>   struct sctp_sockaddr_entry *addr;
>  
>   rcu_read_lock();
> - if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
> + in_dev = __in_dev_get_rcu(dev);
> +     if (!in_dev) {
>   rcu_read_unlock();
>   return;
>   }
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

Re: [PATCH 5/5] sctp: Adjust one function call together with a variable assignment

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:41 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 22 May 2017 18:15:12 +0200
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> ERROR: do not use assignment in if condition
> 
> Thus fix the affected source code place.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/sctp/protocol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 057479b7bd72..be2fe3ebae78 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -141,7 +141,8 @@ static void sctp_v4_copy_addrlist(struct list_head 
> *addrlist,
>   struct sctp_sockaddr_entry *addr;
>  
>   rcu_read_lock();
> - if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
> + in_dev = __in_dev_get_rcu(dev);
> + if (!in_dev) {
>   rcu_read_unlock();
>   return;
>   }
> 

Acked-by: Vlad Yasevich 

-vlad

Re: [PATCH 4/5] sctp: Improve a size determination in sctp_inetaddr_event()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:40 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 18:08:24 +0200
> 
> Replace the specification of a data structure by a pointer dereference
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 64756c42cec9..057479b7bd72 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -784,7 +784,7 @@ static int sctp_inetaddr_event(struct notifier_block 
> *this, unsigned long ev,
>  
>   switch (ev) {
>   case NETDEV_UP:
> - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> + addr = kmalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v4.sin_family = AF_INET;
>   addr->a.v4.sin_port = 0;
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

Re: [PATCH 4/5] sctp: Improve a size determination in sctp_inetaddr_event()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:40 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 22 May 2017 18:08:24 +0200
> 
> Replace the specification of a data structure by a pointer dereference
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 64756c42cec9..057479b7bd72 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -784,7 +784,7 @@ static int sctp_inetaddr_event(struct notifier_block 
> *this, unsigned long ev,
>  
>   switch (ev) {
>   case NETDEV_UP:
> - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> + addr = kmalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v4.sin_family = AF_INET;
>   addr->a.v4.sin_port = 0;
> 

Acked-by: Vlad Yasevich 

-vlad

Re: [PATCH 3/5] sctp: Fix a typo in a comment line in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:39 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:43:44 +0200
> 
> Add a missing character in this description.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 5e7c8a344770..64756c42cec9 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1454,7 +1454,7 @@ static __init int sctp_init(void)
>   }
>  
>   /* Allocate and initialize the SCTP port hash table.
> -  * Note that order is initalized to start at the max sized
> +  * Note that order is initialized to start at the max sized
>* table we want to support.  If we can't get that many pages
>    * reduce the order and try again
>*/
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com.

-vlad

Re: [PATCH 3/5] sctp: Fix a typo in a comment line in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:39 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 22 May 2017 17:43:44 +0200
> 
> Add a missing character in this description.
> 
> Signed-off-by: Markus Elfring 
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 5e7c8a344770..64756c42cec9 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1454,7 +1454,7 @@ static __init int sctp_init(void)
>   }
>  
>   /* Allocate and initialize the SCTP port hash table.
> -  * Note that order is initalized to start at the max sized
> +  * Note that order is initialized to start at the max sized
>* table we want to support.  If we can't get that many pages
>    * reduce the order and try again
>*/
> 

Acked-by: Vlad Yasevich

Re: [PATCH 2/5] sctp: Delete an error message for a failed memory allocation in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:38 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:28:14 +0200
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.
> 
> Link: 
> http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 2b1a6215bd2f..5e7c8a344770 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1447,5 +1447,4 @@ static __init int sctp_init(void)
>   if (!sctp_ep_hashtable) {
> - pr_err("Failed endpoint_hash alloc\n");
>       status = -ENOMEM;
>   goto err_ehash_alloc;
>   }
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

At the time this was written, it was patterned after TCP.  Since then TCP 
changed
significantly.  We can surely clean-up the pr_err() here and possibly update the
code as well later.

-vlad

Re: [PATCH 2/5] sctp: Delete an error message for a failed memory allocation in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:38 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 22 May 2017 17:28:14 +0200
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.
> 
> Link: 
> http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> Signed-off-by: Markus Elfring 
> ---
>  net/sctp/protocol.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 2b1a6215bd2f..5e7c8a344770 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1447,5 +1447,4 @@ static __init int sctp_init(void)
>   if (!sctp_ep_hashtable) {
> - pr_err("Failed endpoint_hash alloc\n");
>   status = -ENOMEM;
>   goto err_ehash_alloc;
>   }
> 

Acked-by: Vlad Yasevich 

At the time this was written, it was patterned after TCP.  Since then TCP 
changed
significantly.  We can surely clean-up the pr_err() here and possibly update the
code as well later.

-vlad

Re: [PATCH 1/5] sctp: Use kmalloc_array() in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:37 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:20:11 +0200
> 
> * A multiplication for the size determination of a memory allocation
>   indicated that an array data structure should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of a data structure by a pointer dereference
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/protocol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 989a900383b5..2b1a6215bd2f 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1442,6 +1442,6 @@ static __init int sctp_init(void)
>  
>   /* Allocate and initialize the endpoint hash table.  */
>   sctp_ep_hashsize = 64;
> - sctp_ep_hashtable =
> - kmalloc(64 * sizeof(struct sctp_hashbucket), GFP_KERNEL);
> + sctp_ep_hashtable = kmalloc_array(64, sizeof(*sctp_ep_hashtable),
> +   GFP_KERNEL);
>   if (!sctp_ep_hashtable) {
>

Re: [PATCH 1/5] sctp: Use kmalloc_array() in sctp_init()

2017-05-22 Thread Vlad Yasevich

On 05/22/2017 12:37 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Mon, 22 May 2017 17:20:11 +0200
> 
> * A multiplication for the size determination of a memory allocation
>   indicated that an array data structure should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of a data structure by a pointer dereference
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 

Acked-by: Vlad Yasevich 

-vlad

> ---
>  net/sctp/protocol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 989a900383b5..2b1a6215bd2f 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1442,6 +1442,6 @@ static __init int sctp_init(void)
>  
>   /* Allocate and initialize the endpoint hash table.  */
>   sctp_ep_hashsize = 64;
> - sctp_ep_hashtable =
> - kmalloc(64 * sizeof(struct sctp_hashbucket), GFP_KERNEL);
> + sctp_ep_hashtable = kmalloc_array(64, sizeof(*sctp_ep_hashtable),
> +   GFP_KERNEL);
>   if (!sctp_ep_hashtable) {
>

Re: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-09 Thread Vlad Yasevich

On 12/09/2015 10:25 AM, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> SCTP is lacking proper np->opt cloning at accept() time.
> 
> TCP and DCCP use ipv6_dup_options() helper, do the same
> in SCTP.
> 
> We might later factorize this code in a common helper to avoid
> future mistakes.
> 
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 

Acked-by: Vlad Yasevich 

This is sufficient for accept() processing, but looks like peeloff is missing
a bunch of ipv6 support.  I'll see if I can cook something up to fix that part.

-vlad

> ---
>  net/sctp/ipv6.c |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index d28c0b4c9128..ec529121f38a 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -641,6 +641,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>   struct sock *newsk;
>   struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
>   struct sctp6_sock *newsctp6sk;
> + struct ipv6_txoptions *opt;
>  
>   newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0);
>   if (!newsk)
> @@ -660,6 +661,13 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>  
>   memcpy(newnp, np, sizeof(struct ipv6_pinfo));
>  
> + rcu_read_lock();
> + opt = rcu_dereference(np->opt);
> + if (opt)
> + opt = ipv6_dup_options(newsk, opt);
> + RCU_INIT_POINTER(newnp->opt, opt);
> + rcu_read_unlock();
> +
>   /* Initialize sk's sport, dport, rcv_saddr and daddr for getsockname()
>* and getpeername().
>*/
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-09 Thread Vlad Yasevich

On 12/09/2015 10:25 AM, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
> 
> SCTP is lacking proper np->opt cloning at accept() time.
> 
> TCP and DCCP use ipv6_dup_options() helper, do the same
> in SCTP.
> 
> We might later factorize this code in a common helper to avoid
> future mistakes.
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

This is sufficient for accept() processing, but looks like peeloff is missing
a bunch of ipv6 support.  I'll see if I can cook something up to fix that part.

-vlad

> ---
>  net/sctp/ipv6.c |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index d28c0b4c9128..ec529121f38a 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -641,6 +641,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>   struct sock *newsk;
>   struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
>   struct sctp6_sock *newsctp6sk;
> + struct ipv6_txoptions *opt;
>  
>   newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0);
>   if (!newsk)
> @@ -660,6 +661,13 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>  
>   memcpy(newnp, np, sizeof(struct ipv6_pinfo));
>  
> + rcu_read_lock();
> + opt = rcu_dereference(np->opt);
> + if (opt)
> + opt = ipv6_dup_options(newsk, opt);
> + RCU_INIT_POINTER(newnp->opt, opt);
> + rcu_read_unlock();
> +
>   /* Initialize sk's sport, dport, rcv_saddr and daddr for getsockname()
>* and getpeername().
>*/
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: use-after-free in sctp_do_sm

2015-12-04 Thread Vlad Yasevich

On 12/04/2015 07:55 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 04, 2015 at 11:40:02AM +0100, Dmitry Vyukov wrote:
>> On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches  wrote:
>>> (adding lkml as this is likely better discussed there)
>>>
>>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote:
 On 12/03/2015 03:24 PM, Joe Perches wrote:
> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote:
>> On 12/03/2015 03:03 PM, Joe Perches wrote:
>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
 On 12/03/2015 01:52 PM, Aaron Conole wrote:
> I think that as a minimum, the following patch should be evaluted,
> but am unsure to whom I should submit it (after I test):
>>> []
 Agreed - the intention here is certainly to have no side effects. It
 looks like 'no_printk()' is used in quite a few other places that would
 benefit from this change. So we probably want a generic
 'really_no_printk()' macro.
>>>
>>> https://lkml.org/lkml/2012/6/17/231
>>
>> I don't see this in the tree.
>
> It never got applied.
>
>> Also maybe we should just convert
>> no_printk() to do what your 'eliminated_printk()'.
>
> Some of them at least.
>
>> So we can convert all users with this change?
>
> I don't think so, I think there are some
> function evaluation/side effects that are
> required.  I believe some do hardware I/O.
>
> It'd be good to at least isolate them.
>
> I'm not sure how to find them via some
> automated tool/mechanism though.
>
> I asked Julia Lawall about it once in this
> thread:  https://lkml.org/lkml/2014/12/3/696
>

 Seems rather fragile to have side effects that we rely
 upon hidden in a printk().
>>>
>>> Yup.
>>>
 Just convert them and see what breaks :)
>>>
>>> I appreciate your optimism.  It's very 1995.
>>> Try it and see what happens.
>>
>>
>> Whatever is the resolution for pr_debug, we still need to fix this
>> particular use-after-free. It affects stability of debug builds, gives
>> invalid debug output, prevents us from finding more bugs in SCTP. And
>> maybe somebody uses CONFIG_DYNAMIC_DEBUG in production.
> 
> Agreed. I'm already working on a fix for this particular use-after-free.
> 
> Another interesting thing about this is that sctp_do_sm() is called for
> nearly every movement that happens on a sctp socket. Said that, that
> always-running IDR search hidden on that debug statement do have some
> nasty performance impact, specially because it's serialized on a
> spinlock.

YUCK!  I didn't really pay much attention to those debug macros before, but
debug_post_sfx() is truly awful.

This wasn't such a bad thing where these macros depended on CONFIG_SCTP_DEBUG,
but now that they are always built, we need fix them.

-vlad



> This wouldn't be happening if it was fully ellided and would
> be ok if that pr_debug() was really being printed, but not as it is.
> Kudos to this report that I could notice this. I'm trying to fix this on
> SCTP-side as well.
> 
>   Marcelo
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: use-after-free in sctp_do_sm

2015-12-04 Thread Vlad Yasevich

On 12/04/2015 07:55 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 04, 2015 at 11:40:02AM +0100, Dmitry Vyukov wrote:
>> On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches  wrote:
>>> (adding lkml as this is likely better discussed there)
>>>
>>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote:
 On 12/03/2015 03:24 PM, Joe Perches wrote:
> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote:
>> On 12/03/2015 03:03 PM, Joe Perches wrote:
>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
 On 12/03/2015 01:52 PM, Aaron Conole wrote:
> I think that as a minimum, the following patch should be evaluted,
> but am unsure to whom I should submit it (after I test):
>>> []
 Agreed - the intention here is certainly to have no side effects. It
 looks like 'no_printk()' is used in quite a few other places that would
 benefit from this change. So we probably want a generic
 'really_no_printk()' macro.
>>>
>>> https://lkml.org/lkml/2012/6/17/231
>>
>> I don't see this in the tree.
>
> It never got applied.
>
>> Also maybe we should just convert
>> no_printk() to do what your 'eliminated_printk()'.
>
> Some of them at least.
>
>> So we can convert all users with this change?
>
> I don't think so, I think there are some
> function evaluation/side effects that are
> required.  I believe some do hardware I/O.
>
> It'd be good to at least isolate them.
>
> I'm not sure how to find them via some
> automated tool/mechanism though.
>
> I asked Julia Lawall about it once in this
> thread:  https://lkml.org/lkml/2014/12/3/696
>

 Seems rather fragile to have side effects that we rely
 upon hidden in a printk().
>>>
>>> Yup.
>>>
 Just convert them and see what breaks :)
>>>
>>> I appreciate your optimism.  It's very 1995.
>>> Try it and see what happens.
>>
>>
>> Whatever is the resolution for pr_debug, we still need to fix this
>> particular use-after-free. It affects stability of debug builds, gives
>> invalid debug output, prevents us from finding more bugs in SCTP. And
>> maybe somebody uses CONFIG_DYNAMIC_DEBUG in production.
> 
> Agreed. I'm already working on a fix for this particular use-after-free.
> 
> Another interesting thing about this is that sctp_do_sm() is called for
> nearly every movement that happens on a sctp socket. Said that, that
> always-running IDR search hidden on that debug statement do have some
> nasty performance impact, specially because it's serialized on a
> spinlock.

YUCK!  I didn't really pay much attention to those debug macros before, but
debug_post_sfx() is truly awful.

This wasn't such a bad thing where these macros depended on CONFIG_SCTP_DEBUG,
but now that they are always built, we need fix them.

-vlad



> This wouldn't be happening if it was fully ellided and would
> be ok if that pr_debug() was really being printed, but not as it is.
> Kudos to this report that I could notice this. I'm trying to fix this on
> SCTP-side as well.
> 
>   Marcelo
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE

2015-08-28 Thread Vlad Yasevich

On 08/27/2015 10:42 PM, Jason Wang wrote:
> 
> 
> On 08/27/2015 06:43 PM, Michael S. Tsirkin wrote:
>> On Wed, Aug 26, 2015 at 01:45:30PM +0800, Jason Wang wrote:
>>>
>>> On 08/26/2015 12:32 AM, Vlad Yasevich wrote:
>>>> On 08/25/2015 07:30 AM, Jason Wang wrote:
>>>>> On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote:
>>>>>> On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote:
>>>>>>>> For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0.
>>>>>>>>
>>>>>>>> For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
>>>>>>>> ("macvtap: Add support of packet capture on macvtap
>>>>>>>> device."). Multiqueue macvtap suffers from single qdisc lock
>>>>>>>> contention. This is because macvtap claims a non zero tx_queue_len and
>>>>>>>> it reuses this value as it socket receive queue size.Thanks to
>>>>>>>> IFF_NO_QUEUE, we can remove the lock contention without breaking
>>>>>>>> existing socket receive queue length logic.
>>>>>>>>
>>>>>>>> Cc: Patrick McHardy 
>>>>>>>> Cc: Vladislav Yasevich 
>>>>>>>> Cc: Michael S. Tsirkin 
>>>>>>>> Signed-off-by: Jason Wang 
>>>>>> Seems to make sense. Give me a day or two to get over the jet lag
>>>>>> (and get out from under the pile of mail accumulated while I was 
>>>>>> traveling),
>>>>>> I'll review properly and ack.
>>>>>>
>>>>> A note on this patch: only default qdisc were removed but we don't lose
>>>>> the ability to attach a qdisc to macvtap (though it may cause lock
>>>>> contention on multiqueue case).
>>>>>
>>>> Wouldn't that lock contention be solved if we really had multiple queues
>>>> for multi-queue macvtaps?
>>>>
>>>> -vlad
>>> Yes, but this introduce another layer of txq locks contention?
>> I don't follow - why does it? Could you clarify please?
> 
> I believe Vlad wants to remove NETIF_F_LLTX. If yes, core will do an
> extra tx lock at macvlan layer.

No, I don't want to remove it.  In a sense, it would function similar to
how it works when fwd_priv is populated.  I am still testing the code
as it's showing some strange artifacts...  could be due to keeping LLTX.

-vlad

> 
>>
>>> And it
>>> also needs macvlan multiqueue support. We used to do something like this
>>> but switch to NETIF_F_LLTX finally. You may refer:
>>>
>>> 2c11455321f37da6fe6cc36353149f9ac9183334 macvlan: add multiqueue capability
>>> 8ffab51b3dfc54876f145f15b351c41f3f703195 macvlan: lockless tx path
>> My concern is that the moment someone configures a non-standard qdisc
>> scalability suddenly disappears. That would also be tricky to debug in the
>> field as not a lot of developers use non-standard qdiscs.
>> What do you think?
> 
> Probably not an issue. Non-standard qdisc may need be attached manually
> after device creation, and we don't lose this ability with this patch.
> (Unless somebody changes default_qdisc). Actually, user before
> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 does not expect any qdisc work
> for macvtap like other stacked devices. This patch also restore this.
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE

2015-08-28 Thread Vlad Yasevich

On 08/27/2015 10:42 PM, Jason Wang wrote:
 
 
 On 08/27/2015 06:43 PM, Michael S. Tsirkin wrote:
 On Wed, Aug 26, 2015 at 01:45:30PM +0800, Jason Wang wrote:

 On 08/26/2015 12:32 AM, Vlad Yasevich wrote:
 On 08/25/2015 07:30 AM, Jason Wang wrote:
 On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote:
 On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote:
 For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0.

 For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 (macvtap: Add support of packet capture on macvtap
 device.). Multiqueue macvtap suffers from single qdisc lock
 contention. This is because macvtap claims a non zero tx_queue_len and
 it reuses this value as it socket receive queue size.Thanks to
 IFF_NO_QUEUE, we can remove the lock contention without breaking
 existing socket receive queue length logic.

 Cc: Patrick McHardy ka...@trash.net
 Cc: Vladislav Yasevich vyase...@redhat.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 Seems to make sense. Give me a day or two to get over the jet lag
 (and get out from under the pile of mail accumulated while I was 
 traveling),
 I'll review properly and ack.

 A note on this patch: only default qdisc were removed but we don't lose
 the ability to attach a qdisc to macvtap (though it may cause lock
 contention on multiqueue case).

 Wouldn't that lock contention be solved if we really had multiple queues
 for multi-queue macvtaps?

 -vlad
 Yes, but this introduce another layer of txq locks contention?
 I don't follow - why does it? Could you clarify please?
 
 I believe Vlad wants to remove NETIF_F_LLTX. If yes, core will do an
 extra tx lock at macvlan layer.

No, I don't want to remove it.  In a sense, it would function similar to
how it works when fwd_priv is populated.  I am still testing the code
as it's showing some strange artifacts...  could be due to keeping LLTX.

-vlad

 

 And it
 also needs macvlan multiqueue support. We used to do something like this
 but switch to NETIF_F_LLTX finally. You may refer:

 2c11455321f37da6fe6cc36353149f9ac9183334 macvlan: add multiqueue capability
 8ffab51b3dfc54876f145f15b351c41f3f703195 macvlan: lockless tx path
 My concern is that the moment someone configures a non-standard qdisc
 scalability suddenly disappears. That would also be tricky to debug in the
 field as not a lot of developers use non-standard qdiscs.
 What do you think?
 
 Probably not an issue. Non-standard qdisc may need be attached manually
 after device creation, and we don't lose this ability with this patch.
 (Unless somebody changes default_qdisc). Actually, user before
 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 does not expect any qdisc work
 for macvtap like other stacked devices. This patch also restore this.
 
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE

2015-08-25 Thread Vlad Yasevich

On 08/25/2015 07:30 AM, Jason Wang wrote:
> 
> 
> On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote:
>> On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote:
 For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0.

 For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 ("macvtap: Add support of packet capture on macvtap
 device."). Multiqueue macvtap suffers from single qdisc lock
 contention. This is because macvtap claims a non zero tx_queue_len and
 it reuses this value as it socket receive queue size.Thanks to
 IFF_NO_QUEUE, we can remove the lock contention without breaking
 existing socket receive queue length logic.

 Cc: Patrick McHardy 
 Cc: Vladislav Yasevich 
 Cc: Michael S. Tsirkin 
 Signed-off-by: Jason Wang 
>> Seems to make sense. Give me a day or two to get over the jet lag
>> (and get out from under the pile of mail accumulated while I was traveling),
>> I'll review properly and ack.
>>
> 
> A note on this patch: only default qdisc were removed but we don't lose
> the ability to attach a qdisc to macvtap (though it may cause lock
> contention on multiqueue case).
> 

Wouldn't that lock contention be solved if we really had multiple queues
for multi-queue macvtaps?

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE

2015-08-25 Thread Vlad Yasevich

On 08/25/2015 07:30 AM, Jason Wang wrote:
 
 
 On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote:
 On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote:
 For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0.

 For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 (macvtap: Add support of packet capture on macvtap
 device.). Multiqueue macvtap suffers from single qdisc lock
 contention. This is because macvtap claims a non zero tx_queue_len and
 it reuses this value as it socket receive queue size.Thanks to
 IFF_NO_QUEUE, we can remove the lock contention without breaking
 existing socket receive queue length logic.

 Cc: Patrick McHardy ka...@trash.net
 Cc: Vladislav Yasevich vyase...@redhat.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 Seems to make sense. Give me a day or two to get over the jet lag
 (and get out from under the pile of mail accumulated while I was traveling),
 I'll review properly and ack.

 
 A note on this patch: only default qdisc were removed but we don't lose
 the ability to attach a qdisc to macvtap (though it may cause lock
 contention on multiqueue case).
 

Wouldn't that lock contention be solved if we really had multiple queues
for multi-queue macvtaps?

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Add counters for out data chunk discards

2015-07-06 Thread Vlad Yasevich

On 07/06/2015 01:37 PM, Vitaly Andrianov wrote:
> This commit adds a MIB entry for out data chunk discards.
> Number of outgoing SCTP DATA chunks for a SCTP association for which no
> problems were encountered to prevent their transmission but were discarded.
> Data chunks are discarded due to ungraceful closing of the SCTP
> association.
> 
> Signed-off-by: Vitaly Andrianov 
> ---
>  include/net/sctp/sctp.h | 1 +
>  net/sctp/outqueue.c | 1 +
>  net/sctp/proc.c | 2 ++
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index ce13cf2..fd806ea 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -216,6 +216,7 @@ enum {
>   SCTP_MIB_IN_PKT_BACKLOG,
>   SCTP_MIB_IN_PKT_DISCARDS,
>   SCTP_MIB_IN_DATA_CHUNK_DISCARDS,
> + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS,
>   __SCTP_MIB_MAX
>  };
>  
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index 7e8f0a1..dc1a40f 100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -315,6 +315,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk 
> *chunk)
>   case SCTP_STATE_SHUTDOWN_RECEIVED:
>   case SCTP_STATE_SHUTDOWN_ACK_SENT:
>   /* Cannot send after transport endpoint shutdown */
> + SCTP_INC_STATS(net, SCTP_MIB_OUT_DATA_CHUNK_DISCARDS);
>   error = -ESHUTDOWN;
>   break;

First, this is not the only case where a chunk is discarded on output.  If you
are going to add an OUT_DATA_DISCARDS counter, you need to catch all such 
discards.
Second,  in this particular case, you will count discarded messages, not chunks 
since
sctp_outq_tail is guaranteed to queue the whole message or none of it.

-vlad

>  
> diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> index 0697eda..1a10b81 100644
> --- a/net/sctp/proc.c
> +++ b/net/sctp/proc.c
> @@ -67,6 +67,8 @@ static const struct snmp_mib sctp_snmp_list[] = {
>   SNMP_MIB_ITEM("SctpInPktBacklog", SCTP_MIB_IN_PKT_BACKLOG),
>   SNMP_MIB_ITEM("SctpInPktDiscards", SCTP_MIB_IN_PKT_DISCARDS),
>   SNMP_MIB_ITEM("SctpInDataChunkDiscards", 
> SCTP_MIB_IN_DATA_CHUNK_DISCARDS),
> + SNMP_MIB_ITEM("SctpOutDataChunkDiscards",
> +   SCTP_MIB_OUT_DATA_CHUNK_DISCARDS),
>   SNMP_MIB_SENTINEL
>  };
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Add counters for out data chunk discards

2015-07-06 Thread Vlad Yasevich

On 07/06/2015 01:37 PM, Vitaly Andrianov wrote:
 This commit adds a MIB entry for out data chunk discards.
 Number of outgoing SCTP DATA chunks for a SCTP association for which no
 problems were encountered to prevent their transmission but were discarded.
 Data chunks are discarded due to ungraceful closing of the SCTP
 association.
 
 Signed-off-by: Vitaly Andrianov vita...@ti.com
 ---
  include/net/sctp/sctp.h | 1 +
  net/sctp/outqueue.c | 1 +
  net/sctp/proc.c | 2 ++
  3 files changed, 4 insertions(+)
 
 diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
 index ce13cf2..fd806ea 100644
 --- a/include/net/sctp/sctp.h
 +++ b/include/net/sctp/sctp.h
 @@ -216,6 +216,7 @@ enum {
   SCTP_MIB_IN_PKT_BACKLOG,
   SCTP_MIB_IN_PKT_DISCARDS,
   SCTP_MIB_IN_DATA_CHUNK_DISCARDS,
 + SCTP_MIB_OUT_DATA_CHUNK_DISCARDS,
   __SCTP_MIB_MAX
  };
  
 diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
 index 7e8f0a1..dc1a40f 100644
 --- a/net/sctp/outqueue.c
 +++ b/net/sctp/outqueue.c
 @@ -315,6 +315,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk 
 *chunk)
   case SCTP_STATE_SHUTDOWN_RECEIVED:
   case SCTP_STATE_SHUTDOWN_ACK_SENT:
   /* Cannot send after transport endpoint shutdown */
 + SCTP_INC_STATS(net, SCTP_MIB_OUT_DATA_CHUNK_DISCARDS);
   error = -ESHUTDOWN;
   break;

First, this is not the only case where a chunk is discarded on output.  If you
are going to add an OUT_DATA_DISCARDS counter, you need to catch all such 
discards.
Second,  in this particular case, you will count discarded messages, not chunks 
since
sctp_outq_tail is guaranteed to queue the whole message or none of it.

-vlad

  
 diff --git a/net/sctp/proc.c b/net/sctp/proc.c
 index 0697eda..1a10b81 100644
 --- a/net/sctp/proc.c
 +++ b/net/sctp/proc.c
 @@ -67,6 +67,8 @@ static const struct snmp_mib sctp_snmp_list[] = {
   SNMP_MIB_ITEM(SctpInPktBacklog, SCTP_MIB_IN_PKT_BACKLOG),
   SNMP_MIB_ITEM(SctpInPktDiscards, SCTP_MIB_IN_PKT_DISCARDS),
   SNMP_MIB_ITEM(SctpInDataChunkDiscards, 
 SCTP_MIB_IN_DATA_CHUNK_DISCARDS),
 + SNMP_MIB_ITEM(SctpOutDataChunkDiscards,
 +   SCTP_MIB_OUT_DATA_CHUNK_DISCARDS),
   SNMP_MIB_SENTINEL
  };
  
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel >= 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/12/2015 04:36 PM, Eric Dumazet wrote:
> On Tue, 2015-05-12 at 16:18 -0400, Vlad Yasevich wrote:
>> On 05/06/2015 06:11 PM, Wolfgang Walter wrote:
>>> Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie:
>>>> (Cc'ing netdev.)
>>>>
>>>> On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter  wrote:
>>>>> Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
>>>>>> Hello,
>>>>>>
>>>>>> kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an
>>>>>> isatap-tunnel.
>>>>>
>>>>> I did some further tests. To trigger the crash you need
>>>>>
>>>>> * isatap-tunnel (probably any sit-tunnel will do it)
>>>>> * raw-socket
>>>>> * udp
>>>>>
>>>>> Using icmpv6 or tcp i.e. does not trigger it.
>>>>
>>>> Do you have a script to reproduce it?
>>>>
>>>>
>>>> Thanks for the bug report!
>>>
>>> You need a isatap-server with say ipv4-address $X
>>>
>>> Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X
>>>
>>> then do
>>>
>>> traceroute6 www.google.de
>>>
>>> Regards,
>>>
>>
>> Hi Walter
>>
>> Could you try this patch.  Looks like raw passes transhdrlen
>> of 0 on the first packet and that makes IPv4 behave correctly,
>> but not IPv6.
>>
>>
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 7fde1f2..fd9c079 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1278,7 +1278,7 @@ emsgsize:
>>  /* If this is the first and only packet and device
>>   * supports checksum offloading, let's use it.
>>   */
>> -if (!skb && sk->sk_protocol == IPPROTO_UDP &&
>> +if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
>>  length + fragheaderlen < mtu &&
>>  rt->dst.dev->features & NETIF_F_V6_CSUM &&
>>  !exthdrlen)
> 
> And make sure the checksum is correct ;)
> 
> Vlad, can you tell where skb->cum_start and skb->csum_offset are set ?
> 
> 

For udp, udp6_hwcsum_outgoing(), but with the above patch, the check above will 
return
false, and we'll fallback to using CHECKSUM_NONE.

Before, the !skb was true since there there was no skb on the queue.  Now, that
we are checking transhdr and raw passing in 0, the check will be false.  This
is what's making IPv4 work correctly in this case.

-vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel >= 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/06/2015 06:11 PM, Wolfgang Walter wrote:
> Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie:
>> (Cc'ing netdev.)
>>
>> On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter  wrote:
>>> Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
 Hello,

 kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an
 isatap-tunnel.
>>>
>>> I did some further tests. To trigger the crash you need
>>>
>>> * isatap-tunnel (probably any sit-tunnel will do it)
>>> * raw-socket
>>> * udp
>>>
>>> Using icmpv6 or tcp i.e. does not trigger it.
>>
>> Do you have a script to reproduce it?
>>
>>
>> Thanks for the bug report!
>
> You need a isatap-server with say ipv4-address $X
>
> Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X
>
> then do
>
>   traceroute6 www.google.de
>
> Regards,
>

Hi Walter

Could you try this patch.  Looks like raw passes transhdrlen
of 0 on the first packet and that makes IPv4 behave correctly,
but not IPv6.


diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..fd9c079 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1278,7 +1278,7 @@ emsgsize:
/* If this is the first and only packet and device
 * supports checksum offloading, let's use it.
 */
-   if (!skb && sk->sk_protocol == IPPROTO_UDP &&
+   if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
length + fragheaderlen < mtu &&
rt->dst.dev->features & NETIF_F_V6_CSUM &&
!exthdrlen)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel >= 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/06/2015 06:42 PM, Eric Dumazet wrote:
> On Thu, 2015-05-07 at 00:04 +0200, Wolfgang Walter wrote:
>> Am Mittwoch, 6. Mai 2015, 12:10:00 schrieb Eric Dumazet:
>>> On Wed, 2015-05-06 at 11:15 -0700, Cong Wang wrote:
 (Cc'ing netdev.)

 On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter  wrote:
> Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
>> Hello,
>>
>> kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with
>> an
>> isatap-tunnel.
>
> I did some further tests. To trigger the crash you need
>
> * isatap-tunnel (probably any sit-tunnel will do it)
> * raw-socket
> * udp
>
> Using icmpv6 or tcp i.e. does not trigger it.

 Do you have a script to reproduce it?


 Thanks for the bug report!
 --
>>>
>>> Please Wolfgang try to revert 32dce968dd987adfb0c00946d78dad9154f64759
>>> ("ipv6: Allow for partial checksums on non-ufo packets")
>>
>> Indeed, that fixes the problem.
> 
> Yes, setting skb->csum to 0 is clearly wrong for CHECKSUM_PARTIAL
> 
> Would you try :
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 
> 7fde1f265c90e90f16291e6c861b6e242111c25b..694ae630e1ca67e25ab1e5f6dd0b3597db3669b0
>  100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1416,6 +1416,10 @@ alloc_new_skb:
>   data += fragheaderlen;
>   skb->transport_header = (skb->network_header +
>fragheaderlen);
> + if (csummode == CHECKSUM_PARTIAL) {
> + skb->csum_start = skb_transport_header(skb) - 
> skb->head;
> + skb->csum_offset = offsetof(struct udphdr, 
> check);
> + }
>   if (fraggap) {
>   skb->csum = skb_copy_and_csum_bits(
>   skb_prev, maxfraglen,
> 
> 
> 

So why is this not an issue in __ip_append_data()?

-vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel = 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/06/2015 06:42 PM, Eric Dumazet wrote:
 On Thu, 2015-05-07 at 00:04 +0200, Wolfgang Walter wrote:
 Am Mittwoch, 6. Mai 2015, 12:10:00 schrieb Eric Dumazet:
 On Wed, 2015-05-06 at 11:15 -0700, Cong Wang wrote:
 (Cc'ing netdev.)

 On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote:
 Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
 Hello,

 kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with
 an
 isatap-tunnel.

 I did some further tests. To trigger the crash you need

 * isatap-tunnel (probably any sit-tunnel will do it)
 * raw-socket
 * udp

 Using icmpv6 or tcp i.e. does not trigger it.

 Do you have a script to reproduce it?


 Thanks for the bug report!
 --

 Please Wolfgang try to revert 32dce968dd987adfb0c00946d78dad9154f64759
 (ipv6: Allow for partial checksums on non-ufo packets)

 Indeed, that fixes the problem.
 
 Yes, setting skb-csum to 0 is clearly wrong for CHECKSUM_PARTIAL
 
 Would you try :
 
 diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
 index 
 7fde1f265c90e90f16291e6c861b6e242111c25b..694ae630e1ca67e25ab1e5f6dd0b3597db3669b0
  100644
 --- a/net/ipv6/ip6_output.c
 +++ b/net/ipv6/ip6_output.c
 @@ -1416,6 +1416,10 @@ alloc_new_skb:
   data += fragheaderlen;
   skb-transport_header = (skb-network_header +
fragheaderlen);
 + if (csummode == CHECKSUM_PARTIAL) {
 + skb-csum_start = skb_transport_header(skb) - 
 skb-head;
 + skb-csum_offset = offsetof(struct udphdr, 
 check);
 + }
   if (fraggap) {
   skb-csum = skb_copy_and_csum_bits(
   skb_prev, maxfraglen,
 
 
 

So why is this not an issue in __ip_append_data()?

-vlad

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel = 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/06/2015 06:11 PM, Wolfgang Walter wrote:
 Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie:
 (Cc'ing netdev.)

 On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote:
 Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
 Hello,

 kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an
 isatap-tunnel.

 I did some further tests. To trigger the crash you need

 * isatap-tunnel (probably any sit-tunnel will do it)
 * raw-socket
 * udp

 Using icmpv6 or tcp i.e. does not trigger it.

 Do you have a script to reproduce it?


 Thanks for the bug report!

 You need a isatap-server with say ipv4-address $X

 Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X

 then do

   traceroute6 www.google.de

 Regards,


Hi Walter

Could you try this patch.  Looks like raw passes transhdrlen
of 0 on the first packet and that makes IPv4 behave correctly,
but not IPv6.


diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..fd9c079 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1278,7 +1278,7 @@ emsgsize:
/* If this is the first and only packet and device
 * supports checksum offloading, let's use it.
 */
-   if (!skb  sk-sk_protocol == IPPROTO_UDP 
+   if (transhdrlen  sk-sk_protocol == IPPROTO_UDP 
length + fragheaderlen  mtu 
rt-dst.dev-features  NETIF_F_V6_CSUM 
!exthdrlen)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: kernel = 4.0: crashes when using traceroute6 with isatap

2015-05-12 Thread Vlad Yasevich

On 05/12/2015 04:36 PM, Eric Dumazet wrote:
 On Tue, 2015-05-12 at 16:18 -0400, Vlad Yasevich wrote:
 On 05/06/2015 06:11 PM, Wolfgang Walter wrote:
 Am Mittwoch, 6. Mai 2015, 11:15:18 schrieben Sie:
 (Cc'ing netdev.)

 On Sat, May 2, 2015 at 5:29 AM, Wolfgang Walter li...@stwm.de wrote:
 Am Samstag, 2. Mai 2015, 02:16:36 schrieb Wolfgang Walter:
 Hello,

 kernel 4.0 (and 4.0.1) crashes immediately when I use traceroute6 with an
 isatap-tunnel.

 I did some further tests. To trigger the crash you need

 * isatap-tunnel (probably any sit-tunnel will do it)
 * raw-socket
 * udp

 Using icmpv6 or tcp i.e. does not trigger it.

 Do you have a script to reproduce it?


 Thanks for the bug report!

 You need a isatap-server with say ipv4-address $X

 Then, on host with 4.0, start isatapd: isatapd --mtu 1280 $X

 then do

 traceroute6 www.google.de

 Regards,


 Hi Walter

 Could you try this patch.  Looks like raw passes transhdrlen
 of 0 on the first packet and that makes IPv4 behave correctly,
 but not IPv6.


 diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
 index 7fde1f2..fd9c079 100644
 --- a/net/ipv6/ip6_output.c
 +++ b/net/ipv6/ip6_output.c
 @@ -1278,7 +1278,7 @@ emsgsize:
  /* If this is the first and only packet and device
   * supports checksum offloading, let's use it.
   */
 -if (!skb  sk-sk_protocol == IPPROTO_UDP 
 +if (transhdrlen  sk-sk_protocol == IPPROTO_UDP 
  length + fragheaderlen  mtu 
  rt-dst.dev-features  NETIF_F_V6_CSUM 
  !exthdrlen)
 
 And make sure the checksum is correct ;)
 
 Vlad, can you tell where skb-cum_start and skb-csum_offset are set ?
 
 

For udp, udp6_hwcsum_outgoing(), but with the above patch, the check above will 
return
false, and we'll fallback to using CHECKSUM_NONE.

Before, the !skb was true since there there was no skb on the queue.  Now, that
we are checking transhdr and raw passing in 0, the check will be false.  This
is what's making IPv4 work correctly in this case.

-vlad

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode

2015-02-17 Thread Vlad Yasevich

On 02/15/2015 11:54 PM, Hiroshi Shimamoto wrote:
>>>>>>>>>>> Can you please fix up your patches based on my tree:
>>>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que
>>>>>>>>>>> ue.git
>>>>>>>>>>
>>>>>>>>>> Yes. I haven't noticed your tree.
>>>>>>>>>> Will resend patches against it.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I encountered an issue with your tree, the commit id is below.
>>>>>>>>>
>>>>>>>>> $ git log | head
>>>>>>>>> commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e
>>>>>>>>> Author: Rasmus Villemoes 
>>>>>>>>> Date:   Fri Jan 23 20:43:14 2015 -0800
>>>>>>>>>
>>>>>>>>> ethernet: fm10k: Actually drop 4 bits
>>>>>>>>>
>>>>>>>>> The comment explains the intention, but vid has type u16. Before
>>>> the
>>>>>>>>> inner shift, it is promoted to int, which has plenty of space for 
>>>>>>>>> all
>>>>>>>>> vid's bits, so nothing is dropped. Use a simple mask instead.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I use the kernel from your tree in both host and guest.
>>>>>>>>>
>>>>>>>>> Assign an IPv6 for VF in guest.
>>>>>>>>> # ip -6 addr add 2001:db8::18:1/64 dev ens0
>>>>>>>>>
>>>>>>>>> Send ping packet from other server to the VM.
>>>>>>>>> # ping6  2001:db8::18:1 -I eth0
>>>>>>>>>
>>>>>>>>> The following message was shown.
>>>>>>>>> ixgbevf :00:08.0: partial checksum but l4 proto=3a!
>>>>>>>>>
>>>>>>>>> If I did the same operation in the host, I saw the same error
>>>>>>>>> message in
>>>>>>> host too.
>>>>>>>>> ixgbe :2d:00.0: partial checksum but l4 proto=3a!
>>>>>>>>>
>>>>>>>>> Do you have any idea about that?
>>>>>>>>
>>>>>>>> Ah, sorry about that, try this tree again:
>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git
>>>>>>>>
>>>>>>>> That patch was dropped for favor of a patch that Matthew Vick
>>>>>>>> put together (and recently got pushed upstream).  So my queue no
>>>>>>>> longer has that patch in the queue, since it got dropped.
>>>>>>>
>>>>>>> I still see the same error, the head id is the below
>>>>>>>
>>>>>>> $ git log | head
>>>>>>> commit a072afb0b45904022b76deef3b770ee9a93cb13a
>>>>>>> Author: Nicholas Krause 
>>>>>>> Date:   Mon Feb 9 00:27:00 2015 -0800
>>>>>>>
>>>>>>> igb: Remove outdated fix me comment in the
>>>>>>> function,gb_acquire_swfw_sync_i210
>>>>>>>
>>>>>>>
>>>>>>> thanks,
>>>>>>> Hiroshi
>>>>>>
>>>>>> I'm having our validation see if they can recreate the same issue
>>>>>> internally.  When they get back to me I'll let you
>>>>> know
>>>>>> what we found.
>>>>>
>>>>> We did bisect, and the below looks the culprit;
>>>>>
>>>>> 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit
>>>>> commit 32dce968dd987adfb0c00946d78dad9154f64759
>>>>> Author: Vlad Yasevich 
>>>>> Date:   Sat Jan 31 10:40:18 2015 -0500
>>>>>
>>>>> ipv6: Allow for partial checksums on non-ufo packets
>>>>>
>>>>> Currntly, if we are not doing UFO on the packet, all UDP
>>>>> packets will start with CHECKSUM_NONE and thus perform full
>>>>> checksum computations in software even if device support
>>>>> IPv6 checksum offloading.
>>>>>
>>>>>

Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode

2015-02-17 Thread Vlad Yasevich

On 02/15/2015 11:54 PM, Hiroshi Shimamoto wrote:
 Can you please fix up your patches based on my tree:
 git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que
 ue.git

 Yes. I haven't noticed your tree.
 Will resend patches against it.


 I encountered an issue with your tree, the commit id is below.

 $ git log | head
 commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e
 Author: Rasmus Villemoes li...@rasmusvillemoes.dk
 Date:   Fri Jan 23 20:43:14 2015 -0800

 ethernet: fm10k: Actually drop 4 bits

 The comment explains the intention, but vid has type u16. Before
 the
 inner shift, it is promoted to int, which has plenty of space for 
 all
 vid's bits, so nothing is dropped. Use a simple mask instead.


 I use the kernel from your tree in both host and guest.

 Assign an IPv6 for VF in guest.
 # ip -6 addr add 2001:db8::18:1/64 dev ens0

 Send ping packet from other server to the VM.
 # ping6  2001:db8::18:1 -I eth0

 The following message was shown.
 ixgbevf :00:08.0: partial checksum but l4 proto=3a!

 If I did the same operation in the host, I saw the same error
 message in
 host too.
 ixgbe :2d:00.0: partial checksum but l4 proto=3a!

 Do you have any idea about that?

 Ah, sorry about that, try this tree again:
 git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git

 That patch was dropped for favor of a patch that Matthew Vick
 put together (and recently got pushed upstream).  So my queue no
 longer has that patch in the queue, since it got dropped.

 I still see the same error, the head id is the below

 $ git log | head
 commit a072afb0b45904022b76deef3b770ee9a93cb13a
 Author: Nicholas Krause xerofo...@gmail.com
 Date:   Mon Feb 9 00:27:00 2015 -0800

 igb: Remove outdated fix me comment in the
 function,gb_acquire_swfw_sync_i210


 thanks,
 Hiroshi

 I'm having our validation see if they can recreate the same issue
 internally.  When they get back to me I'll let you
 know
 what we found.

 We did bisect, and the below looks the culprit;

 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit
 commit 32dce968dd987adfb0c00946d78dad9154f64759
 Author: Vlad Yasevich vyasev...@gmail.com
 Date:   Sat Jan 31 10:40:18 2015 -0500

 ipv6: Allow for partial checksums on non-ufo packets

 Currntly, if we are not doing UFO on the packet, all UDP
 packets will start with CHECKSUM_NONE and thus perform full
 checksum computations in software even if device support
 IPv6 checksum offloading.

 Let's start start with CHECKSUM_PARTIAL if the device
 supports it and we are sending only a single packet at
 or below mtu size.

 Signed-off-by: Vladislav Yasevich vyase...@redhat.com
 Signed-off-by: David S. Miller da...@davemloft.net

 :04 04 4437eaf7e944f5a6136ebf668a256fee688fda3d
 fade8da998d35c8da97a15f0556949ad371e5347 M  net

 When I reverted the commit, the issue was solved.

 thanks,
 Hiroshi

 I believe the issue is that this patch (32dce968dd98 - ipv6: Allow for 
 partial checksums on non-ufo packets) is that
 it now sets CHECKSUM_PARTIAL on all IPv6 packets including ICMPv6 ones.  Our 
 HW (82599) only supports checksum offload
 on TCP/UDP (NETIF_F_IPV6_CSUM) so we get hung up on the skb's protocol and 
 the fact that it is CHECKSUM_PARTIAL.

 Another thing that confuses me is the feature test in this patch.  It 
 checks (rt-dst.dev-features  NETIF_F_V6_CSUM)
 but NETIF_F_V6_CSUM is a two bit field?

 #define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)

 So the test would succeed if either bit was high, that doesn't seem right.  
 I cc'd the author so maybe he could clue
 us in.

 This has been addressed by:
 commit bf250a1fa769f2eb8fc7a4e28b3b523e9cb67eef
 Author: Vlad Yasevich vyasev...@gmail.com
 Date:   Tue Feb 10 11:37:29 2015 -0500

 ipv6: Partial checksum only UDP packets


 As far the 2 bit issue, GEN_CSUM (HW_SUM) and IPV6_CSUM can not coexist at 
 the same time.
 See netdev_fix_features().

 
 thanks for pointing it. I will test with that commit.
 
 Jeff's tree hasn't included that commit yet, right?
 Which branch has the commit?

This is in DaveM's net and net-next trees.

-vlad

 
 thanks,
 Hiroshi
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode

2015-02-13 Thread Vlad Yasevich

On 02/13/2015 12:26 PM, Skidmore, Donald C wrote:
> 
> 
>> -Original Message-
>> From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
>> Sent: Thursday, February 12, 2015 8:45 PM
>> To: Skidmore, Donald C; Kirsher, Jeffrey T
>> Cc: Alexander Duyck; Bjørn Mork; e1000-de...@lists.sourceforge.net;
>> net...@vger.kernel.org; Choi, Sy Jong; linux-kernel@vger.kernel.org; David
>> Laight; Hayato Momma
>> Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to
>> enable MC promiscuous mode
>>
>>>>> -Original Message-
>>>>> From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
>>>>> Sent: Monday, February 09, 2015 6:29 PM
>>>>> To: Kirsher, Jeffrey T
>>>>> Cc: Alexander Duyck; Skidmore, Donald C; Bjørn Mork; e1000-
>>>>> de...@lists.sourceforge.net; net...@vger.kernel.org; Choi, Sy
>>>>> Jong; linux- ker...@vger.kernel.org; David Laight; Hayato Momma
>>>>> Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new
>>>>> mbox API to enable MC promiscuous mode
>>>>>
>>>>>>>>> Can you please fix up your patches based on my tree:
>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que
>>>>>>>>> ue.git
>>>>>>>>
>>>>>>>> Yes. I haven't noticed your tree.
>>>>>>>> Will resend patches against it.
>>>>>>>>
>>>>>>>
>>>>>>> I encountered an issue with your tree, the commit id is below.
>>>>>>>
>>>>>>> $ git log | head
>>>>>>> commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e
>>>>>>> Author: Rasmus Villemoes 
>>>>>>> Date:   Fri Jan 23 20:43:14 2015 -0800
>>>>>>>
>>>>>>> ethernet: fm10k: Actually drop 4 bits
>>>>>>>
>>>>>>> The comment explains the intention, but vid has type u16. Before
>> the
>>>>>>> inner shift, it is promoted to int, which has plenty of space for 
>>>>>>> all
>>>>>>> vid's bits, so nothing is dropped. Use a simple mask instead.
>>>>>>>
>>>>>>>
>>>>>>> I use the kernel from your tree in both host and guest.
>>>>>>>
>>>>>>> Assign an IPv6 for VF in guest.
>>>>>>> # ip -6 addr add 2001:db8::18:1/64 dev ens0
>>>>>>>
>>>>>>> Send ping packet from other server to the VM.
>>>>>>> # ping6  2001:db8::18:1 -I eth0
>>>>>>>
>>>>>>> The following message was shown.
>>>>>>> ixgbevf :00:08.0: partial checksum but l4 proto=3a!
>>>>>>>
>>>>>>> If I did the same operation in the host, I saw the same error
>>>>>>> message in
>>>>> host too.
>>>>>>> ixgbe :2d:00.0: partial checksum but l4 proto=3a!
>>>>>>>
>>>>>>> Do you have any idea about that?
>>>>>>
>>>>>> Ah, sorry about that, try this tree again:
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git
>>>>>>
>>>>>> That patch was dropped for favor of a patch that Matthew Vick
>>>>>> put together (and recently got pushed upstream).  So my queue no
>>>>>> longer has that patch in the queue, since it got dropped.
>>>>>
>>>>> I still see the same error, the head id is the below
>>>>>
>>>>> $ git log | head
>>>>> commit a072afb0b45904022b76deef3b770ee9a93cb13a
>>>>> Author: Nicholas Krause 
>>>>> Date:   Mon Feb 9 00:27:00 2015 -0800
>>>>>
>>>>> igb: Remove outdated fix me comment in the
>>>>> function,gb_acquire_swfw_sync_i210
>>>>>
>>>>>
>>>>> thanks,
>>>>> Hiroshi
>>>>
>>>> I'm having our validation see if they can recreate the same issue
>>>> internally.  When they get back to me I'll let you
>>> know
>>>> what we found.
>>>
>>> We did bisect, and the below looks the culprit;
>>>
>>> 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit
>>> commit 32dce968dd98

Re: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode

2015-02-13 Thread Vlad Yasevich

On 02/13/2015 12:26 PM, Skidmore, Donald C wrote:

 -Original Message-
 From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
 Sent: Thursday, February 12, 2015 8:45 PM
 To: Skidmore, Donald C; Kirsher, Jeffrey T
 Cc: Alexander Duyck; Bjørn Mork; e1000-de...@lists.sourceforge.net;
 net...@vger.kernel.org; Choi, Sy Jong; linux-kernel@vger.kernel.org; David
 Laight; Hayato Momma
 Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new mbox API to
 enable MC promiscuous mode

 -Original Message-
 From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
 Sent: Monday, February 09, 2015 6:29 PM
 To: Kirsher, Jeffrey T
 Cc: Alexander Duyck; Skidmore, Donald C; Bjørn Mork; e1000-
 de...@lists.sourceforge.net; net...@vger.kernel.org; Choi, Sy
 Jong; linux- ker...@vger.kernel.org; David Laight; Hayato Momma
 Subject: RE: [E1000-devel] [PATCH 1/3] ixgbe, ixgbevf: Add new
 mbox API to enable MC promiscuous mode

 Can you please fix up your patches based on my tree:
 git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/que
 ue.git

 Yes. I haven't noticed your tree.
 Will resend patches against it.

 I encountered an issue with your tree, the commit id is below.

 $ git log | head
 commit e6f1649780f8f5a87299bf6af04453f93d1e3d5e
 Author: Rasmus Villemoes li...@rasmusvillemoes.dk
 Date:   Fri Jan 23 20:43:14 2015 -0800

 ethernet: fm10k: Actually drop 4 bits

 The comment explains the intention, but vid has type u16. Before
 the
 inner shift, it is promoted to int, which has plenty of space for 
 all
 vid's bits, so nothing is dropped. Use a simple mask instead.

 I use the kernel from your tree in both host and guest.

 Assign an IPv6 for VF in guest.
 # ip -6 addr add 2001:db8::18:1/64 dev ens0

 Send ping packet from other server to the VM.
 # ping6  2001:db8::18:1 -I eth0

 The following message was shown.
 ixgbevf :00:08.0: partial checksum but l4 proto=3a!

 If I did the same operation in the host, I saw the same error
 message in
 host too.
 ixgbe :2d:00.0: partial checksum but l4 proto=3a!

 Do you have any idea about that?

 Ah, sorry about that, try this tree again:
 git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/queue.git

 That patch was dropped for favor of a patch that Matthew Vick
 put together (and recently got pushed upstream).  So my queue no
 longer has that patch in the queue, since it got dropped.

 I still see the same error, the head id is the below

 $ git log | head
 commit a072afb0b45904022b76deef3b770ee9a93cb13a
 Author: Nicholas Krause xerofo...@gmail.com
 Date:   Mon Feb 9 00:27:00 2015 -0800

 igb: Remove outdated fix me comment in the
 function,gb_acquire_swfw_sync_i210

 thanks,
 Hiroshi

 I'm having our validation see if they can recreate the same issue
 internally.  When they get back to me I'll let you
 know
 what we found.

 We did bisect, and the below looks the culprit;

 32dce968dd987adfb0c00946d78dad9154f64759 is the first bad commit
 commit 32dce968dd987adfb0c00946d78dad9154f64759
 Author: Vlad Yasevich vyasev...@gmail.com
 Date:   Sat Jan 31 10:40:18 2015 -0500

 ipv6: Allow for partial checksums on non-ufo packets

 Currntly, if we are not doing UFO on the packet, all UDP
 packets will start with CHECKSUM_NONE and thus perform full
 checksum computations in software even if device support
 IPv6 checksum offloading.

 Let's start start with CHECKSUM_PARTIAL if the device
 supports it and we are sending only a single packet at
 or below mtu size.

 Signed-off-by: Vladislav Yasevich vyase...@redhat.com
 Signed-off-by: David S. Miller da...@davemloft.net

 :04 04 4437eaf7e944f5a6136ebf668a256fee688fda3d
 fade8da998d35c8da97a15f0556949ad371e5347 M  net

 When I reverted the commit, the issue was solved.

 thanks,
 Hiroshi

 I believe the issue is that this patch (32dce968dd98 - ipv6: Allow for 
 partial checksums on non-ufo packets) is that it now sets CHECKSUM_PARTIAL on 
 all IPv6 packets including ICMPv6 ones.  Our HW (82599) only supports 
 checksum offload on TCP/UDP (NETIF_F_IPV6_CSUM) so we get hung up on the 
 skb's protocol and the fact that it is CHECKSUM_PARTIAL.

 Another thing that confuses me is the feature test in this patch.  It checks 
 (rt-dst.dev-features  NETIF_F_V6_CSUM) but NETIF_F_V6_CSUM is a two bit 
 field?  

 #define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)

 So the test would succeed if either bit was high, that doesn't seem right.  I 
 cc'd the author so maybe he could clue us in.

This has been addressed by:
commit bf250a1fa769f2eb8fc7a4e28b3b523e9cb67eef
Author: Vlad Yasevich vyasev...@gmail.com
Date:   Tue Feb 10 11:37:29 2015 -0500

ipv6: Partial checksum only UDP packets

As far the 2 bit issue, GEN_CSUM (HW_SUM) and IPV6_CSUM can not coexist at the 
same time.
See netdev_fix_features().

-vlad

 Thanks,
 -Don Skidmore donald.c.skidm...@intel.com

--
To unsubscribe from this list: send

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 12:10 PM, Daniel Borkmann wrote:
> On 01/23/2015 05:05 PM, Vlad Yasevich wrote:
>> On 01/23/2015 06:50 AM, Daniel Borkmann wrote:
>>> On 01/23/2015 11:25 AM, Sun Paul wrote:
>>> ...
>>>> I would like to check the behave in LKSCTP.
>>>>
>>>> we are running DIAMETER message over SCTP, and we have set the
>>>> parameter "net.sctp.association_max_retrans = 4" in the LinuxOS.
>>>>
>>>> We noticed that when remote peer have retry to send the same request
>>>> for 4 times, the LKSCTP will initiate an ABORT chunk with reason
>>>> "association exceeded its max_retrans count".
>>>>
>>>> We would like to know whether this is the correct behavior? is there
>>>> any other option that we can alter in order to avoid the ABORT chunk
>>>> being sent?
>>>
>>> I don't recall the RFC saying to send an ABORT, but let me double
>>> check in the mean time.
>>
>> The RFC is silent on the matter.  The abort got added in 3.8 so
>> it's been there for a while.
> 
> I see, commit de4594a51c90 ("sctp: send abort chunk when max_retrans
> exceeded") added the behaviour.
> 
>>> Hmm, untested, but could you try something like that?
>>>
>>> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
>>> index fef2acd..5ce198d 100644
>>> --- a/net/sctp/sm_sideeffect.c
>>> +++ b/net/sctp/sm_sideeffect.c
>>> @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t 
>>> *commands,
>>>   sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP,
>>>   SCTP_ULPEVENT(event));
>>>
>>> -if (asoc->overall_error_count >= asoc->max_retrans) {
>>> +if (asoc->overall_error_count >= asoc->max_retrans &&
>>> +error != SCTP_ERROR_NO_ERROR) {
>>>   abort = sctp_make_violation_max_retrans(asoc, chunk);
>>>   if (abort)
>>>   sctp_add_cmd_sf(commands, SCTP_CMD_REPLY,
>>
>> This would pretty much stop all ABORTs due to excessive rtx.  Might
>> as well take the code out :).
>>
>> I was a bit concerned about this ABORT when it went in.
> 
> So effectively, if I understand the argument from the commit, the
> assumption is that the ABORT would never reach the peer anyway, but
> is a way for tcpdump users to see on the wire that rtx limit has
> been exceeded and since there's not mentioned anything in the RFC
> about this, it doesn't break it. Hm.
> 

Additionally I seem to recall BSD sending this type of ABORT for pretty
much the same reason.

-vlad

> Sun Paul, what exactly broke in your scenario? Can you be more explicit?
> 
> Thanks,
> Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 05:25 AM, Sun Paul wrote:
> Hi
> 
> I would like to check the behave in LKSCTP.
> 
> we are running DIAMETER message over SCTP, and we have set the
> parameter "net.sctp.association_max_retrans = 4" in the LinuxOS.
> 
> We noticed that when remote peer have retry to send the same request
> for 4 times, the LKSCTP will initiate an ABORT chunk with reason
> "association exceeded its max_retrans count".
> 
> We would like to know whether this is the correct behavior? is there
> any other option that we can alter in order to avoid the ABORT chunk
> being sent?
> 

Why do you not want ABORT to be sent? SCTP has attempted to retransmit
the data maximum allows times, and at this point it will terminate
the association.  It sends an ABORT notifying the peer of this, but
most likely the peer is unreachable anyway.

Any message that a peer sends at this point will most likely result
in an ABORT to be send back or an association restart.  Might
as well start fresh.

-vlad

> Thanks
> 
> PS
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 06:50 AM, Daniel Borkmann wrote:
> Hi,
> 
> On 01/23/2015 11:25 AM, Sun Paul wrote:
> ...
>> I would like to check the behave in LKSCTP.
>>
>> we are running DIAMETER message over SCTP, and we have set the
>> parameter "net.sctp.association_max_retrans = 4" in the LinuxOS.
>>
>> We noticed that when remote peer have retry to send the same request
>> for 4 times, the LKSCTP will initiate an ABORT chunk with reason
>> "association exceeded its max_retrans count".
>>
>> We would like to know whether this is the correct behavior? is there
>> any other option that we can alter in order to avoid the ABORT chunk
>> being sent?
> 
> I don't recall the RFC saying to send an ABORT, but let me double
> check in the mean time.

The RFC is silent on the matter.  The abort got added in 3.8 so
it's been there for a while.


> 
> Hmm, untested, but could you try something like that?
> 
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index fef2acd..5ce198d 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t 
> *commands,
>  sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP,
>  SCTP_ULPEVENT(event));
> 
> -if (asoc->overall_error_count >= asoc->max_retrans) {
> +if (asoc->overall_error_count >= asoc->max_retrans &&
> +error != SCTP_ERROR_NO_ERROR) {
>  abort = sctp_make_violation_max_retrans(asoc, chunk);
>  if (abort)
>  sctp_add_cmd_sf(commands, SCTP_CMD_REPLY,


This would pretty much stop all ABORTs due to excessive rtx.  Might
as well take the code out :).

I was a bit concerned about this ABORT when it went in.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 12:10 PM, Daniel Borkmann wrote:
 On 01/23/2015 05:05 PM, Vlad Yasevich wrote:
 On 01/23/2015 06:50 AM, Daniel Borkmann wrote:
 On 01/23/2015 11:25 AM, Sun Paul wrote:
 ...
 I would like to check the behave in LKSCTP.

 we are running DIAMETER message over SCTP, and we have set the
 parameter net.sctp.association_max_retrans = 4 in the LinuxOS.

 We noticed that when remote peer have retry to send the same request
 for 4 times, the LKSCTP will initiate an ABORT chunk with reason
 association exceeded its max_retrans count.

 We would like to know whether this is the correct behavior? is there
 any other option that we can alter in order to avoid the ABORT chunk
 being sent?

 I don't recall the RFC saying to send an ABORT, but let me double
 check in the mean time.

 The RFC is silent on the matter.  The abort got added in 3.8 so
 it's been there for a while.
 
 I see, commit de4594a51c90 (sctp: send abort chunk when max_retrans
 exceeded) added the behaviour.
 
 Hmm, untested, but could you try something like that?

 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index fef2acd..5ce198d 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t 
 *commands,
   sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP,
   SCTP_ULPEVENT(event));

 -if (asoc-overall_error_count = asoc-max_retrans) {
 +if (asoc-overall_error_count = asoc-max_retrans 
 +error != SCTP_ERROR_NO_ERROR) {
   abort = sctp_make_violation_max_retrans(asoc, chunk);
   if (abort)
   sctp_add_cmd_sf(commands, SCTP_CMD_REPLY,

 This would pretty much stop all ABORTs due to excessive rtx.  Might
 as well take the code out :).

 I was a bit concerned about this ABORT when it went in.
 
 So effectively, if I understand the argument from the commit, the
 assumption is that the ABORT would never reach the peer anyway, but
 is a way for tcpdump users to see on the wire that rtx limit has
 been exceeded and since there's not mentioned anything in the RFC
 about this, it doesn't break it. Hm.
 

Additionally I seem to recall BSD sending this type of ABORT for pretty
much the same reason.

-vlad

 Sun Paul, what exactly broke in your scenario? Can you be more explicit?
 
 Thanks,
 Daniel

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 06:50 AM, Daniel Borkmann wrote:
 Hi,
 
 On 01/23/2015 11:25 AM, Sun Paul wrote:
 ...
 I would like to check the behave in LKSCTP.

 we are running DIAMETER message over SCTP, and we have set the
 parameter net.sctp.association_max_retrans = 4 in the LinuxOS.

 We noticed that when remote peer have retry to send the same request
 for 4 times, the LKSCTP will initiate an ABORT chunk with reason
 association exceeded its max_retrans count.

 We would like to know whether this is the correct behavior? is there
 any other option that we can alter in order to avoid the ABORT chunk
 being sent?
 
 I don't recall the RFC saying to send an ABORT, but let me double
 check in the mean time.

The RFC is silent on the matter.  The abort got added in 3.8 so
it's been there for a while.


 
 Hmm, untested, but could you try something like that?
 
 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index fef2acd..5ce198d 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -584,7 +584,8 @@ static void sctp_cmd_assoc_failed(sctp_cmd_seq_t 
 *commands,
  sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP,
  SCTP_ULPEVENT(event));
 
 -if (asoc-overall_error_count = asoc-max_retrans) {
 +if (asoc-overall_error_count = asoc-max_retrans 
 +error != SCTP_ERROR_NO_ERROR) {
  abort = sctp_make_violation_max_retrans(asoc, chunk);
  if (abort)
  sctp_add_cmd_sf(commands, SCTP_CMD_REPLY,


This would pretty much stop all ABORTs due to excessive rtx.  Might
as well take the code out :).

I was a bit concerned about this ABORT when it went in.

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Fwd: Question on SCTP ABORT chunk is generated when the association_max_retrans is reached

2015-01-23 Thread Vlad Yasevich

On 01/23/2015 05:25 AM, Sun Paul wrote:
 Hi
 
 I would like to check the behave in LKSCTP.
 
 we are running DIAMETER message over SCTP, and we have set the
 parameter net.sctp.association_max_retrans = 4 in the LinuxOS.
 
 We noticed that when remote peer have retry to send the same request
 for 4 times, the LKSCTP will initiate an ABORT chunk with reason
 association exceeded its max_retrans count.
 
 We would like to know whether this is the correct behavior? is there
 any other option that we can alter in order to avoid the ABORT chunk
 being sent?
 

Why do you not want ABORT to be sent? SCTP has attempted to retransmit
the data maximum allows times, and at this point it will terminate
the association.  It sends an ABORT notifying the peer of this, but
most likely the peer is unreachable anyway.

Any message that a peer sends at this point will most likely result
in an ABORT to be send back or an association restart.  Might
as well start fresh.

-vlad

 Thanks
 
 PS
 --
 To unsubscribe from this list: send the line unsubscribe linux-sctp in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.2 105/131] net: Correctly set segment mac_len in skb_segment().

2014-09-11 Thread Vlad Yasevich

On 09/11/2014 08:32 AM, Ben Hutchings wrote:
> 3.2.63-rc1 review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Vlad Yasevich 
> 
> [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ]
> 
> When performing segmentation, the mac_len value is copied right
> out of the original skb.  However, this value is not always set correctly
> (like when the packet is VLAN-tagged) and we'll end up copying a bad
> value.
> 
> One way to demonstrate this is to configure a VM which tags
> packets internally and turn off VLAN acceleration on the forwarding
> bridge port.  The packets show up corrupt like this:
> 16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
> (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
> 0x:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.d.
> 0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..dA.^d..0..
> 0x0020:  29e3 01c9 f871  0101 080a 000a e833)q.3
> 0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
> 0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
> 0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
> 0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
> ...
> 
> This also leads to awful throughput as GSO packets are dropped and
> cause retransmissions.
> 
> The solution is to set the mac_len using the values already available
> in then new skb.  We've already adjusted all of the header offset, so we
> might as well correctly figure out the mac_len using skb_reset_mac_len().
> After this change, packets are segmented correctly and performance
> is restored.
> 
> CC: Eric Dumazet 
> Signed-off-by: Vlad Yasevich 
> Signed-off-by: David S. Miller 
> Signed-off-by: Ben Hutchings 
> ---
>  net/core/skbuff.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7121d9b..0ccfb53 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2669,7 +2669,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 
> features)
>   tail = nskb;
>  
>   __copy_skb_header(nskb, skb);
> - nskb->mac_len = skb->mac_len;
>  
>   /* nskb and skb might have different headroom */
>   if (nskb->ip_summed == CHECKSUM_PARTIAL)
> @@ -2679,6 +2678,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 
> features)
>   skb_set_network_header(nskb, skb->mac_len);
>   nskb->transport_header = (nskb->network_header +
> skb_network_header_len(skb));
> + skb_reset_mac_len(nskb);

This will not fix the problem here because the network header above will
already be set incorrectly based on the old mac_len.

This patch depends on
commit 030737bcc3c404e273e97dbe06fe9561699a411b
Author: Eric Dumazet 
Date:   Sat Oct 19 11:42:54 2013 -0700

net: generalize skb_segment()

that correctly populates the header offsets in the new segment.

-vlad

>   skb_copy_from_linear_data(skb, nskb->data, doffset);
>  
>   if (fskb != skb_shinfo(skb)->frag_list)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.2 105/131] net: Correctly set segment mac_len in skb_segment().

2014-09-11 Thread Vlad Yasevich

On 09/11/2014 08:32 AM, Ben Hutchings wrote:
 3.2.63-rc1 review patch.  If anyone has any objections, please let me know.
 
 --
 
 From: Vlad Yasevich vyase...@redhat.com
 
 [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ]
 
 When performing segmentation, the mac_len value is copied right
 out of the original skb.  However, this value is not always set correctly
 (like when the packet is VLAN-tagged) and we'll end up copying a bad
 value.
 
 One way to demonstrate this is to configure a VM which tags
 packets internally and turn off VLAN acceleration on the forwarding
 bridge port.  The packets show up corrupt like this:
 16:18:24.985548 52:54:00:ab:be:25  52:54:00:26:ce:a3, ethertype 802.1Q
 (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
 0x:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.d.
 0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..dA.^d..0..
 0x0020:  29e3 01c9 f871  0101 080a 000a e833)q.3
 0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
 0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
 0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
 0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
 ...
 
 This also leads to awful throughput as GSO packets are dropped and
 cause retransmissions.
 
 The solution is to set the mac_len using the values already available
 in then new skb.  We've already adjusted all of the header offset, so we
 might as well correctly figure out the mac_len using skb_reset_mac_len().
 After this change, packets are segmented correctly and performance
 is restored.
 
 CC: Eric Dumazet eduma...@google.com
 Signed-off-by: Vlad Yasevich vyase...@redhat.com
 Signed-off-by: David S. Miller da...@davemloft.net
 Signed-off-by: Ben Hutchings b...@decadent.org.uk
 ---
  net/core/skbuff.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index 7121d9b..0ccfb53 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -2669,7 +2669,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 
 features)
   tail = nskb;
  
   __copy_skb_header(nskb, skb);
 - nskb-mac_len = skb-mac_len;
  
   /* nskb and skb might have different headroom */
   if (nskb-ip_summed == CHECKSUM_PARTIAL)
 @@ -2679,6 +2678,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 
 features)
   skb_set_network_header(nskb, skb-mac_len);
   nskb-transport_header = (nskb-network_header +
 skb_network_header_len(skb));
 + skb_reset_mac_len(nskb);

This will not fix the problem here because the network header above will
already be set incorrectly based on the old mac_len.

This patch depends on
commit 030737bcc3c404e273e97dbe06fe9561699a411b
Author: Eric Dumazet eduma...@google.com
Date:   Sat Oct 19 11:42:54 2013 -0700

net: generalize skb_segment()

that correctly populates the header offsets in the new segment.

-vlad

   skb_copy_from_linear_data(skb, nskb-data, doffset);
  
   if (fskb != skb_shinfo(skb)-frag_list)
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)

2014-09-02 Thread Vlad Yasevich

On 09/02/2014 02:15 PM, Cong Wang wrote:
> On Tue, Sep 2, 2014 at 11:11 AM, Eric Dumazet  wrote:
>> On Tue, 2014-09-02 at 11:04 -0700, Cong Wang wrote:
>>> On Tue, Sep 2, 2014 at 10:58 AM, Hannes Frederic Sowa
>>
 I definitely don't have a problem cleaning this up in net-next. I wanted
 a minimal patch for stable because I didn't check history where and when
 additional users of dev_get_by_flags_rcu were removed.
>>>
>>> `git grep` should show you we only have one caller. Apparently we don't
>>> care about any out-of-tree module.
>>
>> Point is : you did not check if some stable versions had more callers.
>>
>> Its very nice you checked current version, but it is not enough for a
>> stable candidate.
> 
> That is what we do when backporting patches, I can do that if David asks
> me to backport it, but you know for netdev that is David's work.
> 
> (I am not saying I don't want to help him, I just want to point out the fact.
> I am very pleased to help David for stable backports as long as he asks)

Instead of helping after the fact, why not arrange the patches so that it's
not such a big issue.  Leave the _rcu variant alone.  Add an _rtnl variant
of the function and use that in the patch.  Have a follow-on patch that
removes the _rcu variant all by itself.  This way backports become easier,
and if anyone wants the _rcu variant back, all they have to do is revert
a very simple commit.

-vlad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)

2014-09-02 Thread Vlad Yasevich

On 09/02/2014 02:15 PM, Cong Wang wrote:
 On Tue, Sep 2, 2014 at 11:11 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Tue, 2014-09-02 at 11:04 -0700, Cong Wang wrote:
 On Tue, Sep 2, 2014 at 10:58 AM, Hannes Frederic Sowa

 I definitely don't have a problem cleaning this up in net-next. I wanted
 a minimal patch for stable because I didn't check history where and when
 additional users of dev_get_by_flags_rcu were removed.

 `git grep` should show you we only have one caller. Apparently we don't
 care about any out-of-tree module.

 Point is : you did not check if some stable versions had more callers.

 Its very nice you checked current version, but it is not enough for a
 stable candidate.
 
 That is what we do when backporting patches, I can do that if David asks
 me to backport it, but you know for netdev that is David's work.
 
 (I am not saying I don't want to help him, I just want to point out the fact.
 I am very pleased to help David for stable backports as long as he asks)

Instead of helping after the fact, why not arrange the patches so that it's
not such a big issue.  Leave the _rcu variant alone.  Add an _rtnl variant
of the function and use that in the patch.  Have a follow-on patch that
removes the _rcu variant all by itself.  This way backports become easier,
and if anyone wants the _rcu variant back, all they have to do is revert
a very simple commit.

-vlad

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)

2014-08-29 Thread Vlad Yasevich

On 08/29/2014 11:26 AM, Tommi Rantala wrote:
> Hi,
> 
> Was fuzzing Linus v3.17-rc2-89-g59753a8 with Trinity as the root user
> in qemu, when I hit the following assertion failures.
> 
> Tommi
> 
> 
> [init] Started watchdog process, PID is 4841
> [main] Main thread is alive.
> [   77.229699] sctp: [Deprecated]: trinity-main (pid 4842) Use of int
> in max_burst socket option deprecated.
> [   77.229699] Use struct sctp_assoc_value instead
> [   77.297196] RTNL: assertion failed at net/ipv6/addrconf.c (1699)
> [   77.298080] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30
> [   77.299039] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   77.299789]  88003d76a618 880026133c50 8238ba79
> 880037c84520
> [   77.300829]  880026133c90 820bd52b 
> 82d86c40
> [   77.301869]   f76fd1e1 8800382d8000
> 8800382d8220
> [   77.302906] Call Trace:
> [   77.303246]  [] dump_stack+0x4d/0x66
> [   77.303928]  [] addrconf_join_solict+0x4b/0xb0
> [   77.304731]  [] ipv6_dev_ac_inc+0x2bb/0x330
> [   77.305498]  [] ? ac6_seq_start+0x260/0x260
> [   77.306257]  [] ipv6_sock_ac_join+0x26e/0x360
> [   77.307046]  [] ? ipv6_sock_ac_join+0x99/0x360
> [   77.307798]  [] do_ipv6_setsockopt.isra.5+0xa70/0xf20
> [   77.308570]  [] ? sched_clock_local+0x1d/0x80
> [   77.309260]  [] ? kvm_clock_read+0x27/0x40
> [   77.309915]  [] ? sched_clock+0x9/0x10
> [   77.310537]  [] ? sock_has_perm+0x168/0x1e0
> [   77.311204]  [] ? sched_clock_cpu+0xa8/0xf0
> [   77.311866]  [] ? local_clock+0x1b/0x30
> [   77.312501]  [] ? lock_release_holdtime+0x1d/0x170
> [   77.313241]  [] ? sock_has_perm+0x180/0x1e0
> [   77.313905]  [] ?
> selinux_msg_queue_alloc_security+0xa0/0xa0
> [   77.314746]  [] ipv6_setsockopt+0x53/0xb0
> [   77.315397]  [] udpv6_setsockopt+0x25/0x30
> [   77.316058]  [] sock_common_setsockopt+0xf/0x20
> [   77.316764]  [] SyS_setsockopt+0x8e/0xd0
> [   77.317406]  [] system_call_fastpath+0x16/0x1b
> [main] 375 sockets created based on info from socket cachefile.
> [main] Generating file descriptors
> [main] Added 129 filenames from /dev
> [main] Added 44048 filenames from /proc
> [main] Added 18192 filenames from /sys
> [main] Enabled 9 fd providers.
> [watchdog] Watchdog is alive. (pid:4841)
> [child3:4846] finit_module (313) returned ENOSYS, marking as inactive.
> [child1:4844] kcmp (312) returned ENOSYS, marking as inactive.
> [child2:4845] uselib (134) returned ENOSYS, marking as inactive.
> [child1:4844] nfsservctl (180) returned ENOSYS, marking as inactive.
> [child2:4845] delete_module (129:[32BIT]) returned ENOSYS, marking as 
> inactive.
> [child2:4845] init_module (175) returned ENOSYS, marking as inactive.
> [   84.126609] trinity-c7: vm86 mode not supported on 64 bit kernel
> [child7:4850] vm86 (166:[32BIT]) returned ENOSYS, marking as inactive.
> [main] Bailing main loop because ctrl-c.
> [   84.345840] RTNL: assertion failed at net/ipv6/addrconf.c (1712)
> [   84.346615] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30
> [   84.347426] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   84.348102]  88003d76a618 880026133d10 8238ba79
> 8800382d8000
> [   84.349018]  880026133d50 820bd5db 81141555
> 8800382d8220
> [   84.349935]  8800382d8000 f76fd1e1 88003d76a618
> 8800382d8000
> [   84.350848] Call Trace:
> [   84.351149]  [] dump_stack+0x4d/0x66
> [   84.351751]  [] addrconf_leave_solict+0x4b/0xb0
> [   84.352574]  [] ? __local_bh_enable_ip+0xa5/0xf0
> [   84.353315]  [] __ipv6_dev_ac_dec+0xc3/0x140
> [   84.354019]  [] ipv6_dev_ac_dec+0x98/0xb0
> [   84.354687]  [] ipv6_sock_ac_close+0x10d/0x1a0
> [   84.355410]  [] ? ipv6_sock_ac_close+0x2e/0x1a0
> [   84.356147]  [] inet6_release+0x23/0x40
> [   84.356789]  [] sock_release+0x14/0x80
> [   84.357410]  [] sock_close+0xd/0x20
> [   84.358042]  [] __fput+0x111/0x1e0
> [   84.358622]  [] fput+0x9/0x10
> [   84.359196]  [] task_work_run+0x9e/0xd0
> [   84.359825]  [] do_exit+0x456/0xb30
> [   84.360419]  [] ? retint_swapgs+0x13/0x1b
> [   84.361075]  [] do_group_exit+0x84/0xd0
> [   84.361705]  [] SyS_exit_group+0xf/0x10
> [   84.362338]  [] system_call_fastpath+0x16/0x1b
> [watchdog] [4841] Watchdog exiting because ctrl-c.
> [init] Ran 775 syscalls. Successes: 179  Failures: 596
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Yep,  looks like ipv6_dev_ac_inc() and __ipv6_dev_ac_dec() are called
without RNTL in the socket option path and with RTNL in the address
configuration path.  So it look like this this can actually trigger
list corruptions.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at

Re: RTNL: assertion failed at net/ipv6/addrconf.c (1699)

2014-08-29 Thread Vlad Yasevich

On 08/29/2014 11:26 AM, Tommi Rantala wrote:
 Hi,
 
 Was fuzzing Linus v3.17-rc2-89-g59753a8 with Trinity as the root user
 in qemu, when I hit the following assertion failures.
 
 Tommi
 
 
 [init] Started watchdog process, PID is 4841
 [main] Main thread is alive.
 [   77.229699] sctp: [Deprecated]: trinity-main (pid 4842) Use of int
 in max_burst socket option deprecated.
 [   77.229699] Use struct sctp_assoc_value instead
 [   77.297196] RTNL: assertion failed at net/ipv6/addrconf.c (1699)
 [   77.298080] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30
 [   77.299039] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   77.299789]  88003d76a618 880026133c50 8238ba79
 880037c84520
 [   77.300829]  880026133c90 820bd52b 
 82d86c40
 [   77.301869]   f76fd1e1 8800382d8000
 8800382d8220
 [   77.302906] Call Trace:
 [   77.303246]  [8238ba79] dump_stack+0x4d/0x66
 [   77.303928]  [820bd52b] addrconf_join_solict+0x4b/0xb0
 [   77.304731]  [820b031b] ipv6_dev_ac_inc+0x2bb/0x330
 [   77.305498]  [820b0060] ? ac6_seq_start+0x260/0x260
 [   77.306257]  [820b05fe] ipv6_sock_ac_join+0x26e/0x360
 [   77.307046]  [820b0429] ? ipv6_sock_ac_join+0x99/0x360
 [   77.307798]  [820cdd60] do_ipv6_setsockopt.isra.5+0xa70/0xf20
 [   77.308570]  [8117097d] ? sched_clock_local+0x1d/0x80
 [   77.309260]  [810a8a27] ? kvm_clock_read+0x27/0x40
 [   77.309915]  [810736d9] ? sched_clock+0x9/0x10
 [   77.310537]  [815afff8] ? sock_has_perm+0x168/0x1e0
 [   77.311204]  [81170bb8] ? sched_clock_cpu+0xa8/0xf0
 [   77.311866]  [81170d1b] ? local_clock+0x1b/0x30
 [   77.312501]  [811872cd] ? lock_release_holdtime+0x1d/0x170
 [   77.313241]  [815b0010] ? sock_has_perm+0x180/0x1e0
 [   77.313905]  [815afe90] ?
 selinux_msg_queue_alloc_security+0xa0/0xa0
 [   77.314746]  [820ce263] ipv6_setsockopt+0x53/0xb0
 [   77.315397]  [820d3135] udpv6_setsockopt+0x25/0x30
 [   77.316058]  [81f9930f] sock_common_setsockopt+0xf/0x20
 [   77.316764]  [81f9305e] SyS_setsockopt+0x8e/0xd0
 [   77.317406]  [823a47e9] system_call_fastpath+0x16/0x1b
 [main] 375 sockets created based on info from socket cachefile.
 [main] Generating file descriptors
 [main] Added 129 filenames from /dev
 [main] Added 44048 filenames from /proc
 [main] Added 18192 filenames from /sys
 [main] Enabled 9 fd providers.
 [watchdog] Watchdog is alive. (pid:4841)
 [child3:4846] finit_module (313) returned ENOSYS, marking as inactive.
 [child1:4844] kcmp (312) returned ENOSYS, marking as inactive.
 [child2:4845] uselib (134) returned ENOSYS, marking as inactive.
 [child1:4844] nfsservctl (180) returned ENOSYS, marking as inactive.
 [child2:4845] delete_module (129:[32BIT]) returned ENOSYS, marking as 
 inactive.
 [child2:4845] init_module (175) returned ENOSYS, marking as inactive.
 [   84.126609] trinity-c7: vm86 mode not supported on 64 bit kernel
 [child7:4850] vm86 (166:[32BIT]) returned ENOSYS, marking as inactive.
 [main] Bailing main loop because ctrl-c.
 [   84.345840] RTNL: assertion failed at net/ipv6/addrconf.c (1712)
 [   84.346615] CPU: 0 PID: 4842 Comm: trinity-main Not tainted 3.17.0-rc2+ #30
 [   84.347426] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   84.348102]  88003d76a618 880026133d10 8238ba79
 8800382d8000
 [   84.349018]  880026133d50 820bd5db 81141555
 8800382d8220
 [   84.349935]  8800382d8000 f76fd1e1 88003d76a618
 8800382d8000
 [   84.350848] Call Trace:
 [   84.351149]  [8238ba79] dump_stack+0x4d/0x66
 [   84.351751]  [820bd5db] addrconf_leave_solict+0x4b/0xb0
 [   84.352574]  [81141555] ? __local_bh_enable_ip+0xa5/0xf0
 [   84.353315]  [820b07b3] __ipv6_dev_ac_dec+0xc3/0x140
 [   84.354019]  [820b08c8] ipv6_dev_ac_dec+0x98/0xb0
 [   84.354687]  [820b0bcd] ipv6_sock_ac_close+0x10d/0x1a0
 [   84.355410]  [820b0aee] ? ipv6_sock_ac_close+0x2e/0x1a0
 [   84.356147]  [820ae9d3] inet6_release+0x23/0x40
 [   84.356789]  [81f91834] sock_release+0x14/0x80
 [   84.357410]  [81f918ad] sock_close+0xd/0x20
 [   84.358042]  [8127fa91] __fput+0x111/0x1e0
 [   84.358622]  [8127fba9] fput+0x9/0x10
 [   84.359196]  [8115e3ee] task_work_run+0x9e/0xd0
 [   84.359825]  [8113f4b6] do_exit+0x456/0xb30
 [   84.360419]  [823a541c] ? retint_swapgs+0x13/0x1b
 [   84.361075]  [8113fc54] do_group_exit+0x84/0xd0
 [   84.361705]  [8113fcaf] SyS_exit_group+0xf/0x10
 [   84.362338]  [823a47e9] system_call_fastpath+0x16/0x1b
 [watchdog] [4841] Watchdog exiting because ctrl-c.
 [init] Ran 775 syscalls. Successes: 179  Failures: 596
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the

Re: [PATCH 1/1] sctp: not send SCTP_PEER_ADDR_CHANGE notifications with failed probe

2014-08-20 Thread Vlad Yasevich

On 08/20/2014 05:31 AM, Zhu Yanjun wrote:
> Since the transport has always been in state SCTP_UNCONFIRMED, it
> therefore wasn't active before and hasn't been used before, and it
> always has been, so it is unnecessary to bug the user with a 
> notification.
> 
> Reported-by: Deepak Khandelwal   
> Suggested-by: Vlad Yasevich  
> Suggested-by: Michael Tuexen 
> Suggested-by: Daniel Borkmann 
> Signed-off-by: Zhu Yanjun 

Acked-by: Vlad Yasevich 

Thanks
-vlad
> ---
>  net/sctp/associola.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 9de23a2..2e23f6b 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -813,6 +813,7 @@ void sctp_assoc_control_transport(struct sctp_association 
> *asoc,
>   else {
>   dst_release(transport->dst);
>   transport->dst = NULL;
> + ulp_notify = false;
>   }
>  
>   spc_state = SCTP_ADDR_UNREACHABLE;
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] sctp: not send SCTP_PEER_ADDR_CHANGE notifications with failed probe

2014-08-20 Thread Vlad Yasevich

On 08/20/2014 05:31 AM, Zhu Yanjun wrote:
 Since the transport has always been in state SCTP_UNCONFIRMED, it
 therefore wasn't active before and hasn't been used before, and it
 always has been, so it is unnecessary to bug the user with a 
 notification.
 
 Reported-by: Deepak Khandelwal khandelwal.deepak.1...@gmail.com  
 Suggested-by: Vlad Yasevich vyasev...@gmail.com 
 Suggested-by: Michael Tuexen tue...@fh-muenster.de
 Suggested-by: Daniel Borkmann dbork...@redhat.com
 Signed-off-by: Zhu Yanjun yanjun@windriver.com

Acked-by: Vlad Yasevich vyasev...@gmail.com

Thanks
-vlad
 ---
  net/sctp/associola.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/net/sctp/associola.c b/net/sctp/associola.c
 index 9de23a2..2e23f6b 100644
 --- a/net/sctp/associola.c
 +++ b/net/sctp/associola.c
 @@ -813,6 +813,7 @@ void sctp_assoc_control_transport(struct sctp_association 
 *asoc,
   else {
   dst_release(transport-dst);
   transport-dst = NULL;
 + ulp_notify = false;
   }
  
   spc_state = SCTP_ADDR_UNREACHABLE;
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

3.16-rc1 kernel BUG triggered at mutex.c:586

2014-06-16 Thread Vlad Yasevich

Trying to boot 3.16.0-rc1+ (the + is a small vlan related patch) on a
tests system triggers the following BUG:

Jun 16 13:54:50 scratch kernel: BUG: sleeping function called from
invalid context at kernel/locking/mutex.c:586
Jun 16 13:54:50 scratch kernel: in_atomic(): 1, irqs_disabled(): 1, pid:
0, name: swapper/3
Jun 16 13:54:50 scratch kernel: 1 lock held by swapper/3/0:
Jun 16 13:54:50 scratch kernel: #0:
(&(>counts.lock)->rlock){..-...}, at: []
bitmap_endwrite+0x58/0x230
Jun 16 13:54:50 scratch kernel: irq event stamp: 211959
Jun 16 13:54:50 scratch kernel: hardirqs last  enabled at (211958):
[] _raw_spin_unlock_irqrestore+0x36/0x70
Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959):
[] _raw_spin_lock_irqsave+0x29/0x70
Jun 16 13:54:50 scratch kernel: softirqs last  enabled at (211952):
[] _local_bh_enable+0x22/0x50
Jun 16 13:54:50 scratch kernel: softirqs last disabled at (211953):
[] irq_exit+0xc5/0xd0
Jun 16 13:54:50 scratch kernel: CPU: 3 PID: 0 Comm: swapper/3 Not
tainted 3.16.0-rc1+ #35
Jun 16 13:54:50 scratch kernel: Hardware name: Hewlett-Packard HP xw8400
Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010
Jun 16 13:54:50 scratch kernel:  88047fcc3bd0
81706ef2 81c857e0
Jun 16 13:54:50 scratch kernel: 88047fcc3be0 810b4dba
88047fcc3c50 8170c79c
Jun 16 13:54:50 scratch kernel:  81c85938
0046 0092
Jun 16 13:54:50 scratch kernel: Call Trace:
Jun 16 13:54:50 scratch kernel:   []
dump_stack+0x4d/0x66
Jun 16 13:54:50 scratch kernel: []
__might_sleep+0xfa/0x130
Jun 16 13:54:50 scratch kernel: []
mutex_lock_nested+0x3c/0x3a0
Jun 16 13:54:50 scratch kernel: []
kernfs_notify+0x8a/0x140
Jun 16 13:54:50 scratch kernel: []
bitmap_endwrite+0xb9/0x230
Jun 16 13:54:50 scratch kernel: []
close_write+0x93/0xb0 [raid1]
Jun 16 13:54:50 scratch kernel: []
r1_bio_write_done+0x29/0x50 [raid1]
Jun 16 13:54:50 scratch kernel: []
raid1_end_write_request+0xcf/0x220 [raid1]
Jun 16 13:54:50 scratch kernel: [] bio_endio+0x5b/0xa0
Jun 16 13:54:50 scratch kernel: []
blk_update_request+0x90/0x330
Jun 16 13:54:50 scratch kernel: []
blk_update_bidi_request+0x1c/0x80
Jun 16 13:54:50 scratch kernel: []
blk_end_bidi_request+0x1f/0x60
Jun 16 13:54:50 scratch kernel: []
blk_end_request+0x10/0x20
Jun 16 13:54:50 scratch kernel: []
scsi_io_completion+0xf4/0x6e0
Jun 16 13:54:50 scratch kernel: []
scsi_finish_command+0xb3/0x110
Jun 16 13:54:50 scratch kernel: []
scsi_softirq_done+0x137/0x160
Jun 16 13:54:50 scratch kernel: []
blk_done_softirq+0x90/0xb0
Jun 16 13:54:50 scratch kernel: []
__do_softirq+0x125/0x300
Jun 16 13:54:50 scratch kernel: [] irq_exit+0xc5/0xd0
Jun 16 13:54:50 scratch kernel: []
smp_call_function_single_interrupt+0x35/0x40
Jun 16 13:54:50 scratch kernel: []
call_function_single_interrupt+0x72/0x80
Jun 16 13:54:50 scratch kernel:   [] ?
native_safe_halt+0x6/0x10
Jun 16 13:54:50 scratch kernel: [] ?
trace_hardirqs_on+0xd/0x10
Jun 16 13:54:50 scratch kernel: [] default_idle+0x24/0xe0
Jun 16 13:54:50 scratch kernel: [] arch_cpu_idle+0xf/0x20
Jun 16 13:54:50 scratch kernel: []
cpu_startup_entry+0x2cb/0x450
Jun 16 13:54:50 scratch kernel: [] ?
clockevents_register_device+0xbc/0x120
Jun 16 13:54:50 scratch kernel: []
start_secondary+0x1de/0x290
Jun 16 13:54:50 scratch kernel:
Jun 16 13:54:50 scratch kernel: =
Jun 16 13:54:50 scratch kernel: [ INFO: inconsistent lock state ]
Jun 16 13:54:50 scratch kernel: 3.16.0-rc1+ #35 Not tainted
Jun 16 13:54:50 scratch kernel: -
Jun 16 13:54:50 scratch kernel: inconsistent {SOFTIRQ-ON-W} ->
{IN-SOFTIRQ-W} usage.
Jun 16 13:54:50 scratch kernel: swapper/3/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
Jun 16 13:54:50 scratch kernel: (kernfs_mutex){+.?.+.}, at:
[] kernfs_notify+0x8a/0x140
Jun 16 13:54:50 scratch kernel: {SOFTIRQ-ON-W} state was registered at:
Jun 16 13:54:50 scratch kernel:  []
__lock_acquire+0x316/0x1a60
Jun 16 13:54:50 scratch kernel:  []
lock_acquire+0xa2/0x130
Jun 16 13:54:50 scratch kernel:  []
mutex_lock_nested+0x6a/0x3a0
Jun 16 13:54:50 scratch kernel:  []
kernfs_activate+0x1f/0xf0
Jun 16 13:54:50 scratch kernel:  []
kernfs_create_root+0xe8/0x110
Jun 16 13:54:50 scratch kernel:  [] sysfs_init+0x13/0x51
Jun 16 13:54:50 scratch kernel:  [] mnt_init+0x118/0x225
Jun 16 13:54:50 scratch kernel:  []
vfs_caches_init+0x99/0x115
Jun 16 13:54:50 scratch kernel:  []
start_kernel+0x3ec/0x443
Jun 16 13:54:50 scratch kernel:  []
x86_64_start_reservations+0x2a/0x2c
Jun 16 13:54:50 scratch kernel:  []
x86_64_start_kernel+0x13e/0x14d
Jun 16 13:54:50 scratch kernel: irq event stamp: 211959
Jun 16 13:54:50 scratch kernel: hardirqs last  enabled at (211958):
[] _raw_spin_unlock_irqrestore+0x36/0x70
Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959):
[] _raw_spin_lock_irqsave+0x29/0x70
Jun 16 13:54:50 scratch kernel: softirqs last  enabled at (211952):
[] _local_bh_enable+0x22/0x50
Jun 16 13:54:50

3.16-rc1 kernel BUG triggered at mutex.c:586

2014-06-16 Thread Vlad Yasevich

Trying to boot 3.16.0-rc1+ (the + is a small vlan related patch) on a
tests system triggers the following BUG:

Jun 16 13:54:50 scratch kernel: BUG: sleeping function called from
invalid context at kernel/locking/mutex.c:586
Jun 16 13:54:50 scratch kernel: in_atomic(): 1, irqs_disabled(): 1, pid:
0, name: swapper/3
Jun 16 13:54:50 scratch kernel: 1 lock held by swapper/3/0:
Jun 16 13:54:50 scratch kernel: #0:
((bitmap-counts.lock)-rlock){..-...}, at: [815729e8]
bitmap_endwrite+0x58/0x230
Jun 16 13:54:50 scratch kernel: irq event stamp: 211959
Jun 16 13:54:50 scratch kernel: hardirqs last  enabled at (211958):
[8170f4d6] _raw_spin_unlock_irqrestore+0x36/0x70
Jun 16 13:54:50 scratch kernel: hardirqs last disabled at (211959):
[8170fba9] _raw_spin_lock_irqsave+0x29/0x70
Jun 16 13:54:50 scratch kernel: softirqs last  enabled at (211952):
[8108bab2] _local_bh_enable+0x22/0x50
Jun 16 13:54:50 scratch kernel: softirqs last disabled at (211953):
[8108cc15] irq_exit+0xc5/0xd0
Jun 16 13:54:50 scratch kernel: CPU: 3 PID: 0 Comm: swapper/3 Not
tainted 3.16.0-rc1+ #35
Jun 16 13:54:50 scratch kernel: Hardware name: Hewlett-Packard HP xw8400
Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010
Jun 16 13:54:50 scratch kernel:  88047fcc3bd0
81706ef2 81c857e0
Jun 16 13:54:50 scratch kernel: 88047fcc3be0 810b4dba
88047fcc3c50 8170c79c
Jun 16 13:54:50 scratch kernel:  81c85938
0046 0092
Jun 16 13:54:50 scratch kernel: Call Trace:
Jun 16 13:54:50 scratch kernel: IRQ  [81706ef2]
dump_stack+0x4d/0x66
Jun 16 13:54:50 scratch kernel: [810b4dba]
__might_sleep+0xfa/0x130
Jun 16 13:54:50 scratch kernel: [8170c79c]
mutex_lock_nested+0x3c/0x3a0
Jun 16 13:54:50 scratch kernel: [81270f5a]
kernfs_notify+0x8a/0x140
Jun 16 13:54:50 scratch kernel: [81572a49]
bitmap_endwrite+0xb9/0x230
Jun 16 13:54:50 scratch kernel: [a027a823]
close_write+0x93/0xb0 [raid1]
Jun 16 13:54:50 scratch kernel: [a027af89]
r1_bio_write_done+0x29/0x50 [raid1]
Jun 16 13:54:50 scratch kernel: [a027c98f]
raid1_end_write_request+0xcf/0x220 [raid1]
Jun 16 13:54:50 scratch kernel: [8132f0bb] bio_endio+0x5b/0xa0
Jun 16 13:54:50 scratch kernel: [81335f60]
blk_update_request+0x90/0x330
Jun 16 13:54:50 scratch kernel: [8133621c]
blk_update_bidi_request+0x1c/0x80
Jun 16 13:54:50 scratch kernel: [8133656f]
blk_end_bidi_request+0x1f/0x60
Jun 16 13:54:50 scratch kernel: [813365c0]
blk_end_request+0x10/0x20
Jun 16 13:54:50 scratch kernel: [81497194]
scsi_io_completion+0xf4/0x6e0
Jun 16 13:54:50 scratch kernel: [8148c613]
scsi_finish_command+0xb3/0x110
Jun 16 13:54:50 scratch kernel: [81496fb7]
scsi_softirq_done+0x137/0x160
Jun 16 13:54:50 scratch kernel: [8133cfe0]
blk_done_softirq+0x90/0xb0
Jun 16 13:54:50 scratch kernel: [8108c705]
__do_softirq+0x125/0x300
Jun 16 13:54:50 scratch kernel: [8108cc15] irq_exit+0xc5/0xd0
Jun 16 13:54:50 scratch kernel: [810439d5]
smp_call_function_single_interrupt+0x35/0x40
Jun 16 13:54:50 scratch kernel: [81711552]
call_function_single_interrupt+0x72/0x80
Jun 16 13:54:50 scratch kernel: EOI  [81053fa6] ?
native_safe_halt+0x6/0x10
Jun 16 13:54:50 scratch kernel: [810d891d] ?
trace_hardirqs_on+0xd/0x10
Jun 16 13:54:50 scratch kernel: [8101e0c4] default_idle+0x24/0xe0
Jun 16 13:54:50 scratch kernel: [8101ea6f] arch_cpu_idle+0xf/0x20
Jun 16 13:54:50 scratch kernel: [810d0d1b]
cpu_startup_entry+0x2cb/0x450
Jun 16 13:54:50 scratch kernel: [810fed4c] ?
clockevents_register_device+0xbc/0x120
Jun 16 13:54:50 scratch kernel: [810442ee]
start_secondary+0x1de/0x290
Jun 16 13:54:50 scratch kernel:
Jun 16 13:54:50 scratch kernel: =
Jun 16 13:54:50 scratch kernel: [ INFO: inconsistent lock state ]
Jun 16 13:54:50 scratch kernel: 3.16.0-rc1+ #35 Not tainted
Jun 16 13:54:50 scratch kernel: -
Jun 16 13:54:50 scratch kernel: inconsistent {SOFTIRQ-ON-W} -
{IN-SOFTIRQ-W} usage.
Jun 16 13:54:50 scratch kernel: swapper/3/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
Jun 16 13:54:50 scratch kernel: (kernfs_mutex){+.?.+.}, at:
[81270f5a] kernfs_notify+0x8a/0x140
Jun 16 13:54:50 scratch kernel: {SOFTIRQ-ON-W} state was registered at:
Jun 16 13:54:50 scratch kernel:  [810d8f66]
__lock_acquire+0x316/0x1a60
Jun 16 13:54:50 scratch kernel:  [810dae72]
lock_acquire+0xa2/0x130
Jun 16 13:54:50 scratch kernel:  [8170c7ca]
mutex_lock_nested+0x6a/0x3a0
Jun 16 13:54:50 scratch kernel:  [8127066f]
kernfs_activate+0x1f/0xf0
Jun 16 13:54:50 scratch kernel:  [81270a08]
kernfs_create_root+0xe8/0x110
Jun 16 13:54:50 scratch kernel:  [81d73567] sysfs_init+0x13/0x51
Jun 16 13:54:50 scratch kernel:  [81d70d06]

Re: [PATCH v2] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-12 Thread Vlad Yasevich

On 06/11/2014 10:53 PM, Xufeng Zhang wrote:
> Consider the scenario:
> For a TCP-style socket, while processing the COOKIE_ECHO chunk in
> sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
> a new association would be created in sctp_unpack_cookie(), but afterwards,
> some processing maybe failed, and sctp_association_free() will be called to
> free the previously allocated association, in sctp_association_free(),
> sk_ack_backlog value is decremented for this socket, since the initial
> value for sk_ack_backlog is 0, after the decrement, it will be 65535,
> a wrap-around problem happens, and if we want to establish new associations
> afterward in the same socket, ABORT would be triggered since sctp deem the
> accept queue as full.
> Fix this issue by only decrementing sk_ack_backlog for associations in
> the endpoint's list.
> 
> Fix-suggested-by: Neil Horman 
> Signed-off-by: Xufeng Zhang 

Acked-by: Vlad Yasevich 

Thanks
-vlad


> ---
> Change for v2:
>   Drop the redundant test for temp suggested by Vlad Yasevich.
> 
>  net/sctp/associola.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 39579c3..0b8 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
>   /* Only real associations count against the endpoint, so
>* don't bother for if this is a temporary association.
>*/
> - if (!asoc->temp) {
> + if (!list_empty(>asocs)) {
>   list_del(>asocs);
>  
>   /* Decrement the backlog value for a TCP-style listening
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-12 Thread Vlad Yasevich

On 06/11/2014 10:53 PM, Xufeng Zhang wrote:
 Consider the scenario:
 For a TCP-style socket, while processing the COOKIE_ECHO chunk in
 sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
 a new association would be created in sctp_unpack_cookie(), but afterwards,
 some processing maybe failed, and sctp_association_free() will be called to
 free the previously allocated association, in sctp_association_free(),
 sk_ack_backlog value is decremented for this socket, since the initial
 value for sk_ack_backlog is 0, after the decrement, it will be 65535,
 a wrap-around problem happens, and if we want to establish new associations
 afterward in the same socket, ABORT would be triggered since sctp deem the
 accept queue as full.
 Fix this issue by only decrementing sk_ack_backlog for associations in
 the endpoint's list.
 
 Fix-suggested-by: Neil Horman nhor...@tuxdriver.com
 Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com

Acked-by: Vlad Yasevich vyasev...@gmail.com

Thanks
-vlad


 ---
 Change for v2:
   Drop the redundant test for temp suggested by Vlad Yasevich.
 
  net/sctp/associola.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/net/sctp/associola.c b/net/sctp/associola.c
 index 39579c3..0b8 100644
 --- a/net/sctp/associola.c
 +++ b/net/sctp/associola.c
 @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
   /* Only real associations count against the endpoint, so
* don't bother for if this is a temporary association.
*/
 - if (!asoc-temp) {
 + if (!list_empty(asoc-asocs)) {
   list_del(asoc-asocs);
  
   /* Decrement the backlog value for a TCP-style listening
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-11 Thread Vlad Yasevich

On 06/11/2014 08:55 AM, Vlad Yasevich wrote:
> On 06/10/2014 10:37 PM, Xufeng Zhang wrote:
>> Consider the scenario:
>> For a TCP-style socket, while processing the COOKIE_ECHO chunk in
>> sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
>> a new association would be created in sctp_unpack_cookie(), but afterwards,
>> some processing maybe failed, and sctp_association_free() will be called to
>> free the previously allocated association, in sctp_association_free(),
>> sk_ack_backlog value is decremented for this socket, since the initial
>> value for sk_ack_backlog is 0, after the decrement, it will be 65535,
>> a wrap-around problem happens, and if we want to establish new associations
>> afterward in the same socket, ABORT would be triggered since sctp deem the
>> accept queue as full.
>> Fix this issue by only decrementing sk_ack_backlog for associations in
>> the endpoint's list.
>>
>> Fix-suggested-by: Neil Horman 
>> Signed-off-by: Xufeng Zhang 
>> ---
>>  net/sctp/associola.c |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
>> index 39579c3..60564f2 100644
>> --- a/net/sctp/associola.c
>> +++ b/net/sctp/associola.c
>> @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
>>  /* Only real associations count against the endpoint, so
>>   * don't bother for if this is a temporary association.
>>   */
>> -if (!asoc->temp) {
>> +if (!asoc->temp && !list_empty(>asocs)) {
>>  list_del(>asocs);
>>  
>>  /* Decrement the backlog value for a TCP-style listening
>>
> 
> I am not crazy about this patch.   It's been suggested before that may
> be duplicate cookie processing should really be creating a temporary
> association since that's is how that association is being used.

I had another look at the description for triggering this issue and
realized that I was thinking about something else when looking at
this solution.

There is however no need to test both the list and temp value.
We can simply always test the that list is not empty before doing
list_del().

-vlad

>  It
> might be nice at that approach.  It actually benefits us as the
> association destruction would happen immediately instead of being delayed.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-11 Thread Vlad Yasevich

On 06/10/2014 10:37 PM, Xufeng Zhang wrote:
> Consider the scenario:
> For a TCP-style socket, while processing the COOKIE_ECHO chunk in
> sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
> a new association would be created in sctp_unpack_cookie(), but afterwards,
> some processing maybe failed, and sctp_association_free() will be called to
> free the previously allocated association, in sctp_association_free(),
> sk_ack_backlog value is decremented for this socket, since the initial
> value for sk_ack_backlog is 0, after the decrement, it will be 65535,
> a wrap-around problem happens, and if we want to establish new associations
> afterward in the same socket, ABORT would be triggered since sctp deem the
> accept queue as full.
> Fix this issue by only decrementing sk_ack_backlog for associations in
> the endpoint's list.
> 
> Fix-suggested-by: Neil Horman 
> Signed-off-by: Xufeng Zhang 
> ---
>  net/sctp/associola.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 39579c3..60564f2 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
>   /* Only real associations count against the endpoint, so
>* don't bother for if this is a temporary association.
>*/
> - if (!asoc->temp) {
> + if (!asoc->temp && !list_empty(>asocs)) {
>   list_del(>asocs);
>  
>   /* Decrement the backlog value for a TCP-style listening
> 

I am not crazy about this patch.   It's been suggested before that may
be duplicate cookie processing should really be creating a temporary
association since that's is how that association is being used.  It
might be nice at that approach.  It actually benefits us as the
association destruction would happen immediately instead of being delayed.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-11 Thread Vlad Yasevich

On 06/10/2014 10:37 PM, Xufeng Zhang wrote:
 Consider the scenario:
 For a TCP-style socket, while processing the COOKIE_ECHO chunk in
 sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
 a new association would be created in sctp_unpack_cookie(), but afterwards,
 some processing maybe failed, and sctp_association_free() will be called to
 free the previously allocated association, in sctp_association_free(),
 sk_ack_backlog value is decremented for this socket, since the initial
 value for sk_ack_backlog is 0, after the decrement, it will be 65535,
 a wrap-around problem happens, and if we want to establish new associations
 afterward in the same socket, ABORT would be triggered since sctp deem the
 accept queue as full.
 Fix this issue by only decrementing sk_ack_backlog for associations in
 the endpoint's list.
 
 Fix-suggested-by: Neil Horman nhor...@tuxdriver.com
 Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com
 ---
  net/sctp/associola.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/net/sctp/associola.c b/net/sctp/associola.c
 index 39579c3..60564f2 100644
 --- a/net/sctp/associola.c
 +++ b/net/sctp/associola.c
 @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
   /* Only real associations count against the endpoint, so
* don't bother for if this is a temporary association.
*/
 - if (!asoc-temp) {
 + if (!asoc-temp  !list_empty(asoc-asocs)) {
   list_del(asoc-asocs);
  
   /* Decrement the backlog value for a TCP-style listening
 

I am not crazy about this patch.   It's been suggested before that may
be duplicate cookie processing should really be creating a temporary
association since that's is how that association is being used.  It
might be nice at that approach.  It actually benefits us as the
association destruction would happen immediately instead of being delayed.

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sctp: Fix sk_ack_backlog wrap-around problem

2014-06-11 Thread Vlad Yasevich

On 06/11/2014 08:55 AM, Vlad Yasevich wrote:
 On 06/10/2014 10:37 PM, Xufeng Zhang wrote:
 Consider the scenario:
 For a TCP-style socket, while processing the COOKIE_ECHO chunk in
 sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
 a new association would be created in sctp_unpack_cookie(), but afterwards,
 some processing maybe failed, and sctp_association_free() will be called to
 free the previously allocated association, in sctp_association_free(),
 sk_ack_backlog value is decremented for this socket, since the initial
 value for sk_ack_backlog is 0, after the decrement, it will be 65535,
 a wrap-around problem happens, and if we want to establish new associations
 afterward in the same socket, ABORT would be triggered since sctp deem the
 accept queue as full.
 Fix this issue by only decrementing sk_ack_backlog for associations in
 the endpoint's list.

 Fix-suggested-by: Neil Horman nhor...@tuxdriver.com
 Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com
 ---
  net/sctp/associola.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/net/sctp/associola.c b/net/sctp/associola.c
 index 39579c3..60564f2 100644
 --- a/net/sctp/associola.c
 +++ b/net/sctp/associola.c
 @@ -330,7 +330,7 @@ void sctp_association_free(struct sctp_association *asoc)
  /* Only real associations count against the endpoint, so
   * don't bother for if this is a temporary association.
   */
 -if (!asoc-temp) {
 +if (!asoc-temp  !list_empty(asoc-asocs)) {
  list_del(asoc-asocs);
  
  /* Decrement the backlog value for a TCP-style listening

 
 I am not crazy about this patch.   It's been suggested before that may
 be duplicate cookie processing should really be creating a temporary
 association since that's is how that association is being used.

I had another look at the description for triggering this issue and
realized that I was thinking about something else when looking at
this solution.

There is however no need to test both the list and temp value.
We can simply always test the that list is not empty before doing
list_del().

-vlad

  It
 might be nice at that approach.  It actually benefits us as the
 association destruction would happen immediately instead of being delayed.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PULL 2/2] vhost: replace rcu with mutex

2014-06-03 Thread Vlad Yasevich

On 06/03/2014 08:48 AM, Paolo Bonzini wrote:
> Il 02/06/2014 23:58, Eric Dumazet ha scritto:
>> This looks dubious
>>
>> What about using kfree_rcu() instead ?
> 
> It would lead to unbound allocation from userspace.
> 
>> translate_desc() still uses rcu_read_lock(), its not clear if the mutex
>> is really held.
> 
> Yes, vhost_get_vq_desc must be called with the vq mutex held.
> 
> The rcu_read_lock/unlock in translate_desc is unnecessary.
> 

If that's true, then does dev->memory really needs to be rcu protected?
It appears to always be read under mutex.

-vlad

> Paolo
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PULL 2/2] vhost: replace rcu with mutex

2014-06-03 Thread Vlad Yasevich

On 06/03/2014 08:48 AM, Paolo Bonzini wrote:
 Il 02/06/2014 23:58, Eric Dumazet ha scritto:
 This looks dubious

 What about using kfree_rcu() instead ?
 
 It would lead to unbound allocation from userspace.
 
 translate_desc() still uses rcu_read_lock(), its not clear if the mutex
 is really held.
 
 Yes, vhost_get_vq_desc must be called with the vq mutex held.
 
 The rcu_read_lock/unlock in translate_desc is unnecessary.
 

If that's true, then does dev-memory really needs to be rcu protected?
It appears to always be read under mutex.

-vlad

 Paolo
 -- 
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net] bridge: notify user space after fdb update

2014-05-28 Thread Vlad Yasevich

On 05/27/2014 06:45 PM, Jon Maxwell wrote:
> There has been a number incidents recently where customers running KVM have
> reported that VM hosts on different Hypervisors are unreachable. Based on
> pcap traces we found that the bridge was broadcasting the ARP request out
> onto the network. However some NICs have an inbuilt switch which on occasions
> were broadcasting the VMs ARP request back through the physical NIC on the
> Hypervisor. This resulted in the bridge changing ports and incorrectly 
> learning
> that the VMs mac address was external. As a result the ARP reply was directed
> back onto the external network and VM never updated it's ARP cache. This patch
> will notify the bridge command, after a fdb has been updated to identify such
> port toggling.
> 
> Signed-off-by: Jon Maxwell 
> ---
>  net/bridge/br_fdb.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index 9203d5a..f3ee2da 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -487,6 +487,7 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
>  {
>   struct hlist_head *head = >hash[br_mac_hash(addr, vid)];
>   struct net_bridge_fdb_entry *fdb;
> + bool fdb_modified = 0;
>  
>   /* some users want to always flood. */
>   if (hold_time(br) == 0)
> @@ -507,10 +508,15 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
>   source->dev->name);
>   } else {
>   /* fastpath: update of existing entry */
> - fdb->dst = source;
> + if (unlikely(source != fdb->dst)) {
> + fdb->dst = source;
> + fdb_modified = 1;
> + }

This looks over-indented.

-vlad

>   fdb->updated = jiffies;
>   if (unlikely(added_by_user))
>   fdb->added_by_user = 1;
> + if (unlikely(fdb_modified))
> + fdb_notify(br, fdb, RTM_NEWNEIGH);
>   }
>   } else {
>   spin_lock(>hash_lock);
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net] bridge: notify user space after fdb update

2014-05-28 Thread Vlad Yasevich

On 05/27/2014 06:45 PM, Jon Maxwell wrote:
 There has been a number incidents recently where customers running KVM have
 reported that VM hosts on different Hypervisors are unreachable. Based on
 pcap traces we found that the bridge was broadcasting the ARP request out
 onto the network. However some NICs have an inbuilt switch which on occasions
 were broadcasting the VMs ARP request back through the physical NIC on the
 Hypervisor. This resulted in the bridge changing ports and incorrectly 
 learning
 that the VMs mac address was external. As a result the ARP reply was directed
 back onto the external network and VM never updated it's ARP cache. This patch
 will notify the bridge command, after a fdb has been updated to identify such
 port toggling.
 
 Signed-off-by: Jon Maxwell jmaxwel...@gmail.com
 ---
  net/bridge/br_fdb.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)
 
 diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
 index 9203d5a..f3ee2da 100644
 --- a/net/bridge/br_fdb.c
 +++ b/net/bridge/br_fdb.c
 @@ -487,6 +487,7 @@ void br_fdb_update(struct net_bridge *br, struct 
 net_bridge_port *source,
  {
   struct hlist_head *head = br-hash[br_mac_hash(addr, vid)];
   struct net_bridge_fdb_entry *fdb;
 + bool fdb_modified = 0;
  
   /* some users want to always flood. */
   if (hold_time(br) == 0)
 @@ -507,10 +508,15 @@ void br_fdb_update(struct net_bridge *br, struct 
 net_bridge_port *source,
   source-dev-name);
   } else {
   /* fastpath: update of existing entry */
 - fdb-dst = source;
 + if (unlikely(source != fdb-dst)) {
 + fdb-dst = source;
 + fdb_modified = 1;
 + }

This looks over-indented.

-vlad

   fdb-updated = jiffies;
   if (unlikely(added_by_user))
   fdb-added_by_user = 1;
 + if (unlikely(fdb_modified))
 + fdb_notify(br, fdb, RTM_NEWNEIGH);
   }
   } else {
   spin_lock(br-hash_lock);
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: skbuff truesize incorrect.

2014-05-22 Thread Vlad Yasevich

On 05/22/2014 03:07 PM, Jim Baxter wrote:
> Hi, I was hoping you can help me with some questions.
> 
> I have been investigating a network issue with bursts of network traffic
> over USB CDC-NCM, the issue is that the kernel is dropping packets
> because sk_rcvqueues_full() returns true due to skb2->truesize is always
> 32960 instead of SKB_TRUESIZE(skb2->len) which is about 1800.
> 
> The code I am trying to fix is this code below, it is splitting a set of
> multiple network packets compressed into a single 16k packet into
> individual skb's and sending them up the network stack.
> 
> skb2 = skb_clone(skb, GFP_ATOMIC);
> if (skb2 == NULL)
> goto err;
> 
> if (!skb_pull(skb2, index)) {
> ret = -EOVERFLOW;
> goto err;
> }

This assumes that you original 16K packet is linear.  Is that
always the case?

> 
> skb_trim(skb2, dg_len - crc_len);
> 
> My questions are:
> 
> 1) Which buffer size does truesize represent, is it the total buffer or
> just the data related to the relevant skb?

Total buffer size (including the struct sk_buff).

> 
> 2) If truesize is for the skb it is contained within should it be
> updated during the call to skb_trim?

No, because the the buffer/memory is still there.  skb_trim just
sets the tail pointer and adjusts skb->len.


> 
> 3) Why does the truesize default to 32960?

Probably because that's how much buffer space was actually allocated
for the original skb.  When you cloned it, you inherited the truesize
since a clone is just the struct sk_buff that points into the data
of the original.

This is the very same problem that I ran into with SCTP since it
has similar code in it.  You can play games with truesize manually,
but you have to be very careful here.

-vlad
> 
> 
> Thank you for any help,
> Jim Baxter.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: skbuff truesize incorrect.

2014-05-22 Thread Vlad Yasevich

On 05/22/2014 03:07 PM, Jim Baxter wrote:
 Hi, I was hoping you can help me with some questions.
 
 I have been investigating a network issue with bursts of network traffic
 over USB CDC-NCM, the issue is that the kernel is dropping packets
 because sk_rcvqueues_full() returns true due to skb2-truesize is always
 32960 instead of SKB_TRUESIZE(skb2-len) which is about 1800.
 
 The code I am trying to fix is this code below, it is splitting a set of
 multiple network packets compressed into a single 16k packet into
 individual skb's and sending them up the network stack.
 
 skb2 = skb_clone(skb, GFP_ATOMIC);
 if (skb2 == NULL)
 goto err;
 
 if (!skb_pull(skb2, index)) {
 ret = -EOVERFLOW;
 goto err;
 }

This assumes that you original 16K packet is linear.  Is that
always the case?

 
 skb_trim(skb2, dg_len - crc_len);
 
 My questions are:
 
 1) Which buffer size does truesize represent, is it the total buffer or
 just the data related to the relevant skb?

Total buffer size (including the struct sk_buff).

 
 2) If truesize is for the skb it is contained within should it be
 updated during the call to skb_trim?

No, because the the buffer/memory is still there.  skb_trim just
sets the tail pointer and adjusts skb-len.


 
 3) Why does the truesize default to 32960?

Probably because that's how much buffer space was actually allocated
for the original skb.  When you cloned it, you inherited the truesize
since a clone is just the struct sk_buff that points into the data
of the original.

This is the very same problem that I ran into with SCTP since it
has similar code in it.  You can play games with truesize manually,
but you have to be very careful here.

-vlad
 
 
 Thank you for any help,
 Jim Baxter.
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)

2014-05-20 Thread Vlad Yasevich

On 05/20/2014 12:55 AM, valdis.kletni...@vt.edu wrote:
> On Mon, 19 May 2014 23:49:22 +0930, David Newall said:
> 
>> How does a packet get fragmented in this case?  Does it only happen when
>> bridging to a device with smaller MTU?  That scenario sounds quite
>> un-bridge-like.  It also sounds like something that can be handled by
>> real routing.
> 
> Which doesn't change the fact that you *will* get clowns who take a box that
> has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
> 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
> (Did you know that you can actually mount an NFS filesystem across that? And
> that ls and cat and friends will work *just fine*? Until you hit a file that's
> more than 1.5 in size, that is. And when you do a traceroute to the wedged
> client, it tells you it's on the 10G network, so you have no idea why you're
> seeing an MTU issue.  Don't ask how I know this - let's just say that
> supporting HPC users is never boring. :)
> 
> So yes, we *do* need to do something sensible there - either frag the packet
> on the way out, or something.  It *would* be nice if we could drop the
> packet and send an ICMP Frag Needed back - except it's unclear what IP
> you use as the source address for the ICMP
> 

If there is no netfilter, then the bridge will just drop the packet
(see br_dev_queue_push_xmit).  It should probably also do that with
netfilter.

On the question of ICMP, I've also debated about sending ICMP Frag
Needed, but that's really beyond the scope of the bridge device.

Recording a stat might be sufficient to help troubleshoot these types
of issues.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the net tree

2014-05-20 Thread Vlad Yasevich

On 05/19/2014 08:59 PM, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> 
> In file included from arch/powerpc/net/bpf_jit_comp.c:16:0:
> include/linux/if_vlan.h: In function 'vlan_get_encap_level':
> include/linux/if_vlan.h:491:2: error: implicit declaration of function 
> 'vlan_dev_priv' [-Werror=implicit-function-declaration]
>   return vlan_dev_priv(dev)->nest_level;
>   ^
> include/linux/if_vlan.h:491:27: error: invalid type argument of '->' (have 
> 'int')
>   return vlan_dev_priv(dev)->nest_level;
>^
> 
> Caused by commit 44a4085538c8 ("bonding: Fix stacked device detection
> in arp monitoring").
> 
> vlan_dev_priv() is protected by:
> 
> #if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
> 
> but the newly added call in vlan_get_encap_level() is not.
> 
> I have used the net tree from next-20140519 for today.
> 

Right.  Apologies.  I missed the above fact.  Fix forthcoming.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the net tree

2014-05-20 Thread Vlad Yasevich

On 05/19/2014 08:59 PM, Stephen Rothwell wrote:
 Hi all,
 
 After merging the net tree, today's linux-next build (powerpc
 ppc64_defconfig) failed like this:
 
 
 In file included from arch/powerpc/net/bpf_jit_comp.c:16:0:
 include/linux/if_vlan.h: In function 'vlan_get_encap_level':
 include/linux/if_vlan.h:491:2: error: implicit declaration of function 
 'vlan_dev_priv' [-Werror=implicit-function-declaration]
   return vlan_dev_priv(dev)-nest_level;
   ^
 include/linux/if_vlan.h:491:27: error: invalid type argument of '-' (have 
 'int')
   return vlan_dev_priv(dev)-nest_level;
^
 
 Caused by commit 44a4085538c8 (bonding: Fix stacked device detection
 in arp monitoring).
 
 vlan_dev_priv() is protected by:
 
 #if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
 
 but the newly added call in vlan_get_encap_level() is not.
 
 I have used the net tree from next-20140519 for today.
 

Right.  Apologies.  I missed the above fact.  Fix forthcoming.

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)

2014-05-20 Thread Vlad Yasevich

On 05/20/2014 12:55 AM, valdis.kletni...@vt.edu wrote:
 On Mon, 19 May 2014 23:49:22 +0930, David Newall said:
 
 How does a packet get fragmented in this case?  Does it only happen when
 bridging to a device with smaller MTU?  That scenario sounds quite
 un-bridge-like.  It also sounds like something that can be handled by
 real routing.
 
 Which doesn't change the fact that you *will* get clowns who take a box that
 has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
 (Did you know that you can actually mount an NFS filesystem across that? And
 that ls and cat and friends will work *just fine*? Until you hit a file that's
 more than 1.5 in size, that is. And when you do a traceroute to the wedged
 client, it tells you it's on the 10G network, so you have no idea why you're
 seeing an MTU issue.  Don't ask how I know this - let's just say that
 supporting HPC users is never boring. :)
 
 So yes, we *do* need to do something sensible there - either frag the packet
 on the way out, or something.  It *would* be nice if we could drop the
 packet and send an ICMP Frag Needed back - except it's unclear what IP
 you use as the source address for the ICMP
 

If there is no netfilter, then the bridge will just drop the packet
(see br_dev_queue_push_xmit).  It should probably also do that with
netfilter.

On the question of ICMP, I've also debated about sending ICMP Frag
Needed, but that's really beyond the scope of the bridge device.

Recording a stat might be sufficient to help troubleshoot these types
of issues.

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] macvlan: Fix checksum errors when ip_summed is CHECKSUM_PARTIAL

2014-05-19 Thread Vlad Yasevich

On 05/19/2014 01:24 PM, Michael Spang wrote:
> Changing ip_summed from CHECKSUM_PARTIAL to CHECKSUM_UNNECESSARY
> will result in an incorrect checksum if the packet is sent off the box.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Michael Spang 
> ---
>  drivers/net/macvlan.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 753a8c2..806b56a 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -267,7 +267,9 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
> net_device *dev)
>  
>   if (vlan->mode == MACVLAN_MODE_BRIDGE) {
>   const struct ethhdr *eth = (void *)skb->data;
> - skb->ip_summed = CHECKSUM_UNNECESSARY;
> +
> + if (skb->ip_summed == CHECKSUM_NONE)
> + skb->ip_summed = CHECKSUM_UNNECESSARY;
>  
>   /* send to other bridge ports directly */
>   if (is_multicast_ether_addr(eth->h_dest)) {
> 

Already done.  See commit f114890cdf84d753f6b41cd0cc44ba51d16313da

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] macvlan: Fix checksum errors when ip_summed is CHECKSUM_PARTIAL

2014-05-19 Thread Vlad Yasevich

On 05/19/2014 01:24 PM, Michael Spang wrote:
 Changing ip_summed from CHECKSUM_PARTIAL to CHECKSUM_UNNECESSARY
 will result in an incorrect checksum if the packet is sent off the box.
 
 Cc: sta...@vger.kernel.org
 Signed-off-by: Michael Spang sp...@chromium.org
 ---
  drivers/net/macvlan.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
 index 753a8c2..806b56a 100644
 --- a/drivers/net/macvlan.c
 +++ b/drivers/net/macvlan.c
 @@ -267,7 +267,9 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
 net_device *dev)
  
   if (vlan-mode == MACVLAN_MODE_BRIDGE) {
   const struct ethhdr *eth = (void *)skb-data;
 - skb-ip_summed = CHECKSUM_UNNECESSARY;
 +
 + if (skb-ip_summed == CHECKSUM_NONE)
 + skb-ip_summed = CHECKSUM_UNNECESSARY;
  
   /* send to other bridge ports directly */
   if (is_multicast_ether_addr(eth-h_dest)) {
 

Already done.  See commit f114890cdf84d753f6b41cd0cc44ba51d16313da

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes

2014-04-30 Thread Vlad Yasevich

On 04/30/2014 06:59 PM, Luis R. Rodriguez wrote:
> On Wed, Apr 30, 2014 at 04:04:34PM -0400, Vlad Yasevich wrote:
>> On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
>>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
>>> index 54d207d..dcd9378 100644
>>> --- a/net/bridge/br_if.c
>>> +++ b/net/bridge/br_if.c
>>> @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct 
>>> net_bridge *br,
>>> features &= ~NETIF_F_ONE_FOR_ALL;
>>>  
>>> list_for_each_entry(p, >port_list, list) {
>>> +   if (p->flags & BR_ROOT_BLOCK)
>>> +   continue;
>>> features = netdev_increment_features(features,
>>>  p->dev->features, mask);
>>> }
>>>
>> Hi Luis
>>
>> The hunk above isn't right.  Just because you set ROOT_BLOCK on the port
>> doesn't mean that you should ignore it's device features.  If the device
>> you just added happens to disable or enable some device offload feature,
>> you should take that into account.
> 
> OK thanks, how about this part:
> 
> On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
>> On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote:
>> On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote:
>>> On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez  wrote:
>>>> On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote:
>>>>> On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez
>>>>>  wrote:
>>>>>> spin_unlock_bh(>br->lock);
>>>>>> +   if (changed)
>>>>>> +   
>>>>>> call_netdevice_notifiers(NETDEV_CHANGEADDR,
>>>>>> +p->br->dev);
>>>>>> +   netdev_update_features(p->br->dev);
>>>>>

This is actually just a part of it.  You also need to handle the sysfs
changing the flag.

Look at the first 2 patches in this series:
http://www.spinics.net/lists/netdev/msg280863.html

You might need that functionality.

-vlad


>>>>> I think this is supposed to be in netdev event handler of br->dev
>>>>> instead of here.
>>>>
>>>> Do you mean netdev_update_features() ? I mimic'd what was being done on
>>>> br_del_if() given that root blocking is doing something similar. If
>>>> we need to change something for the above then I suppose it means we need
>>>> to change br_del_if() too. Let me know if you see any reason for something
>>>> else.
>>>>
>>>
>>> Yeah, for me it looks like it's better to call netdev_update_features()
>>> in the event handler of br->dev, rather than where calling
>>> call_netdevice_notifiers(..., br->dev);.
>>
>> I still don't see why, in fact trying to verify this I am wondering now
>> if instead we should actually fix br_features_recompute() to take into
>> consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features()
>> is called above even if the MAC address did not change, just as is done
>> on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more
>> appropriate we just call
>>
>> call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p->br->dev)
>>
>> for both the above then and also br_del_if()?
> 
>   Luis
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes

2014-04-30 Thread Vlad Yasevich

On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
> On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote:
>> On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote:
>>> On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez  wrote:
 On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote:
> On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez
>  wrote:
>> spin_unlock_bh(>br->lock);
>> +   if (changed)
>> +   
>> call_netdevice_notifiers(NETDEV_CHANGEADDR,
>> +p->br->dev);
>> +   netdev_update_features(p->br->dev);
>
> I think this is supposed to be in netdev event handler of br->dev
> instead of here.

 Do you mean netdev_update_features() ? I mimic'd what was being done on
 br_del_if() given that root blocking is doing something similar. If
 we need to change something for the above then I suppose it means we need
 to change br_del_if() too. Let me know if you see any reason for something
 else.

>>>
>>> Yeah, for me it looks like it's better to call netdev_update_features()
>>> in the event handler of br->dev, rather than where calling
>>> call_netdevice_notifiers(..., br->dev);.
>>
>> I still don't see why, in fact trying to verify this I am wondering now
>> if instead we should actually fix br_features_recompute() to take into
>> consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features()
>> is called above even if the MAC address did not change, just as is done
>> on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more
>> appropriate we just call
>>
>> call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p->br->dev)
>>
>> for both the above then and also br_del_if()? How about the below
>> change?
>>
>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
>> index 54d207d..dcd9378 100644
>> --- a/net/bridge/br_if.c
>> +++ b/net/bridge/br_if.c
>> @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct 
>> net_bridge *br,
>>  features &= ~NETIF_F_ONE_FOR_ALL;
>>  
>>  list_for_each_entry(p, >port_list, list) {
>> +if (p->flags & BR_ROOT_BLOCK)
>> +continue;
>>  features = netdev_increment_features(features,
>>   p->dev->features, mask);
>>  }
> 
> Cong, can you provide feedback on this? I tried to grow confidence on the
> hunk above but its not clear but the other points still hold and I'd love
> your feedback on those.
> 

Hi Luis

The hunk above isn't right.  Just because you set ROOT_BLOCK on the port
doesn't mean that you should ignore it's device features.  If the device
you just added happens to disable or enable some device offload feature,
you should take that into account.

Thanks
-vlad

>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] bridge: preserve random init MAC address

2014-04-30 Thread Vlad Yasevich

On 04/22/2014 03:41 PM, Luis R. Rodriguez wrote:
> On Wed, Mar 19, 2014 at 7:05 PM, Luis R. Rodriguez  wrote:
>> On Tue, Mar 18, 2014 at 08:10:56PM -0700, Stephen Hemminger wrote:
>>> On Wed, 12 Mar 2014 20:15:25 -0700
>>> "Luis R. Rodriguez"  wrote:
>>>
 As it is now if you add create a bridge it gets started
 with a random MAC address and if you then add a net_device
 as a slave but later kick it out you end up with a zero
 MAC address. Instead preserve the original random MAC
 address and use it.
>>>
>>> What is supposed to happen is that the recalculate chooses
>>> the lowest MAC address of the slaves. If there are no slaves
>>> it might as well just calculate a new random value. There is
>>> not great merit in preserving the original defunct address.
>>>
>>> Or something like this
>>> --- a/net/bridge/br_stp_if.c  2014-02-12 08:21:56.733857356 -0800
>>> +++ b/net/bridge/br_stp_if.c  2014-03-18 20:09:09.334388826 -0700
>>> @@ -235,6 +235,9 @@ bool br_stp_recalculate_bridge_id(struct
>>>   addr = p->dev->dev_addr;
>>>
>>>   }
>>> +
>>> + if (addr == br_mac_zero)
>>> + return false;  /* keep original address */
>>>
>>>   if (ether_addr_equal(br->bridge_id.addr, addr))
>>>   return false;   /* no change */
>>>
>>> that just keeps the old value.
>>
>> The old value could be a port which got root blocked, I think
>> it can be confusing to see that happen. Either way feel free to
>> make the call, I'll provide more details below on perhaps one reason
>> to keep the original MAC address.
> 
> Stephen, I'd like to respin this series to address all pending
> feedback, I'd still like your feedback / call / judgement on this
> part. I'm fine either way, just wanted to ensure I highlight the
> reasoning of why I kept the original random MAC address. Please keep
> in mind that at this point I'm convinced bridging is the *wrong*
> solution to networking with guests but it is being used in a lot of
> current topologies, this would just help with smoothing out corner
> cases.
> 

Hi Luis

I took at this series again, I think it might make to generate
a new random address instead of storing the original.
There is not much point in storing the original since it's been
gone, and most likely not advertised anywhere anyway.

As for virtualization configuration you described, a lot of times
there is a single port in the down state that remains.  At least
that's how libvirt manages it's bridged networks.

Thanks
-vlad

>>> The bridge is in a meaningless state when there are no ports,
>>
>> Some virtualization topologies may want a backend with no link (or
>> perhaps one which is dynamic, with the option to have none) to the
>> internet but just a bridge so guests sharing the bridge can talk to
>> each other. In this case bridging can be used only to link the
>> batch of guests.
>>
>> In this case the bridge simply acts as a switch, but also can be used as the
>> interface for DHCP, for example. In such a case the guests will be doing
>> ARP to get to the DHCP server. There's a flurry of ways one can try to get
>> all this meshed together including enablign an ARP proxy but I'm looking
>> at ways to optimize this further -- but I'd like to address the current
>> usage cases first.
>>
>>> and when the first port is added back it will be used as the
>>> new bridge id.
>>
>> Sure. Let me know how you think we should proceed with this patch based
>> on the above.
> 
> Thanks in advance.
> 
>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] bridge: preserve random init MAC address

2014-04-30 Thread Vlad Yasevich

On 04/22/2014 03:41 PM, Luis R. Rodriguez wrote:
 On Wed, Mar 19, 2014 at 7:05 PM, Luis R. Rodriguez mcg...@suse.com wrote:
 On Tue, Mar 18, 2014 at 08:10:56PM -0700, Stephen Hemminger wrote:
 On Wed, 12 Mar 2014 20:15:25 -0700
 Luis R. Rodriguez mcg...@do-not-panic.com wrote:

 As it is now if you add create a bridge it gets started
 with a random MAC address and if you then add a net_device
 as a slave but later kick it out you end up with a zero
 MAC address. Instead preserve the original random MAC
 address and use it.

 What is supposed to happen is that the recalculate chooses
 the lowest MAC address of the slaves. If there are no slaves
 it might as well just calculate a new random value. There is
 not great merit in preserving the original defunct address.

 Or something like this
 --- a/net/bridge/br_stp_if.c  2014-02-12 08:21:56.733857356 -0800
 +++ b/net/bridge/br_stp_if.c  2014-03-18 20:09:09.334388826 -0700
 @@ -235,6 +235,9 @@ bool br_stp_recalculate_bridge_id(struct
   addr = p-dev-dev_addr;

   }
 +
 + if (addr == br_mac_zero)
 + return false;  /* keep original address */

   if (ether_addr_equal(br-bridge_id.addr, addr))
   return false;   /* no change */

 that just keeps the old value.

 The old value could be a port which got root blocked, I think
 it can be confusing to see that happen. Either way feel free to
 make the call, I'll provide more details below on perhaps one reason
 to keep the original MAC address.
 
 Stephen, I'd like to respin this series to address all pending
 feedback, I'd still like your feedback / call / judgement on this
 part. I'm fine either way, just wanted to ensure I highlight the
 reasoning of why I kept the original random MAC address. Please keep
 in mind that at this point I'm convinced bridging is the *wrong*
 solution to networking with guests but it is being used in a lot of
 current topologies, this would just help with smoothing out corner
 cases.
 

Hi Luis

I took at this series again, I think it might make to generate
a new random address instead of storing the original.
There is not much point in storing the original since it's been
gone, and most likely not advertised anywhere anyway.

As for virtualization configuration you described, a lot of times
there is a single port in the down state that remains.  At least
that's how libvirt manages it's bridged networks.

Thanks
-vlad

 The bridge is in a meaningless state when there are no ports,

 Some virtualization topologies may want a backend with no link (or
 perhaps one which is dynamic, with the option to have none) to the
 internet but just a bridge so guests sharing the bridge can talk to
 each other. In this case bridging can be used only to link the
 batch of guests.

 In this case the bridge simply acts as a switch, but also can be used as the
 interface for DHCP, for example. In such a case the guests will be doing
 ARP to get to the DHCP server. There's a flurry of ways one can try to get
 all this meshed together including enablign an ARP proxy but I'm looking
 at ways to optimize this further -- but I'd like to address the current
 usage cases first.

 and when the first port is added back it will be used as the
 new bridge id.

 Sure. Let me know how you think we should proceed with this patch based
 on the above.
 
 Thanks in advance.
 
   Luis
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes

2014-04-30 Thread Vlad Yasevich

On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
 On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote:
 On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote:
 On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez mcg...@suse.com wrote:
 On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote:
 On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez
 mcg...@do-not-panic.com wrote:
 spin_unlock_bh(p-br-lock);
 +   if (changed)
 +   
 call_netdevice_notifiers(NETDEV_CHANGEADDR,
 +p-br-dev);
 +   netdev_update_features(p-br-dev);

 I think this is supposed to be in netdev event handler of br-dev
 instead of here.

 Do you mean netdev_update_features() ? I mimic'd what was being done on
 br_del_if() given that root blocking is doing something similar. If
 we need to change something for the above then I suppose it means we need
 to change br_del_if() too. Let me know if you see any reason for something
 else.


 Yeah, for me it looks like it's better to call netdev_update_features()
 in the event handler of br-dev, rather than where calling
 call_netdevice_notifiers(..., br-dev);.

 I still don't see why, in fact trying to verify this I am wondering now
 if instead we should actually fix br_features_recompute() to take into
 consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features()
 is called above even if the MAC address did not change, just as is done
 on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more
 appropriate we just call

 call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p-br-dev)

 for both the above then and also br_del_if()? How about the below
 change?

 diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
 index 54d207d..dcd9378 100644
 --- a/net/bridge/br_if.c
 +++ b/net/bridge/br_if.c
 @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct 
 net_bridge *br,
  features = ~NETIF_F_ONE_FOR_ALL;
  
  list_for_each_entry(p, br-port_list, list) {
 +if (p-flags  BR_ROOT_BLOCK)
 +continue;
  features = netdev_increment_features(features,
   p-dev-features, mask);
  }
 
 Cong, can you provide feedback on this? I tried to grow confidence on the
 hunk above but its not clear but the other points still hold and I'd love
 your feedback on those.
 

Hi Luis

The hunk above isn't right.  Just because you set ROOT_BLOCK on the port
doesn't mean that you should ignore it's device features.  If the device
you just added happens to disable or enable some device offload feature,
you should take that into account.

Thanks
-vlad

   Luis
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] bridge: trigger a bridge calculation upon port changes

2014-04-30 Thread Vlad Yasevich

On 04/30/2014 06:59 PM, Luis R. Rodriguez wrote:
 On Wed, Apr 30, 2014 at 04:04:34PM -0400, Vlad Yasevich wrote:
 On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
 diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
 index 54d207d..dcd9378 100644
 --- a/net/bridge/br_if.c
 +++ b/net/bridge/br_if.c
 @@ -315,6 +315,8 @@ netdev_features_t br_features_recompute(struct 
 net_bridge *br,
 features = ~NETIF_F_ONE_FOR_ALL;
  
 list_for_each_entry(p, br-port_list, list) {
 +   if (p-flags  BR_ROOT_BLOCK)
 +   continue;
 features = netdev_increment_features(features,
  p-dev-features, mask);
 }

 Hi Luis

 The hunk above isn't right.  Just because you set ROOT_BLOCK on the port
 doesn't mean that you should ignore it's device features.  If the device
 you just added happens to disable or enable some device offload feature,
 you should take that into account.
 
 OK thanks, how about this part:
 
 On 04/22/2014 03:43 PM, Luis R. Rodriguez wrote:
 On Tue, Mar 18, 2014 at 02:22:43PM -0700, Luis R. Rodriguez wrote:
 On Tue, Mar 18, 2014 at 01:46:49PM -0700, Cong Wang wrote:
 On Fri, Mar 14, 2014 at 6:39 PM, Luis R. Rodriguez mcg...@suse.com wrote:
 On Thu, Mar 13, 2014 at 11:26:25AM -0700, Cong Wang wrote:
 On Wed, Mar 12, 2014 at 8:15 PM, Luis R. Rodriguez
 mcg...@do-not-panic.com wrote:
 spin_unlock_bh(p-br-lock);
 +   if (changed)
 +   
 call_netdevice_notifiers(NETDEV_CHANGEADDR,
 +p-br-dev);
 +   netdev_update_features(p-br-dev);


This is actually just a part of it.  You also need to handle the sysfs
changing the flag.

Look at the first 2 patches in this series:
http://www.spinics.net/lists/netdev/msg280863.html

You might need that functionality.

-vlad


 I think this is supposed to be in netdev event handler of br-dev
 instead of here.

 Do you mean netdev_update_features() ? I mimic'd what was being done on
 br_del_if() given that root blocking is doing something similar. If
 we need to change something for the above then I suppose it means we need
 to change br_del_if() too. Let me know if you see any reason for something
 else.


 Yeah, for me it looks like it's better to call netdev_update_features()
 in the event handler of br-dev, rather than where calling
 call_netdevice_notifiers(..., br-dev);.

 I still don't see why, in fact trying to verify this I am wondering now
 if instead we should actually fix br_features_recompute() to take into
 consideration BR_ROOT_BLOCK as below. Notice how netdev_update_features()
 is called above even if the MAC address did not change, just as is done
 on br_del_if(). There is an NETDEV_FEAT_CHANGE event so would it be more
 appropriate we just call

 call_netdevice_notifiers(NETDEV_FEAT_CHANGE, p-br-dev)

 for both the above then and also br_del_if()?
 
   Luis
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2] sctp: reset flowi4_oif parameter on route lookup

2014-04-25 Thread Vlad Yasevich

On 04/25/2014 04:55 AM, Xufeng Zhang wrote:
> commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is
> in output route lookups.) introduces another regression which
> is very similar to the problem of commit e6b45241c (ipv4: reset
> flowi parameters on route connect) wants to fix:
> Before we call ip_route_output_key() in sctp_v4_get_dst() to
> get a dst that matches a bind address as the source address,
> we have already called this function previously and the flowi
> parameters have been initialized including flowi4_oif, so when
> we call this function again, the process in __ip_route_output_key()
> will be different because of the setting of flowi4_oif, and we'll
> get a networking device which corresponds to the inputted flowi4_oif
> as the output device, this is wrong because we'll never hit this
> place if the previously returned source address of dst match one
> of the bound addresses.
> 
> To reproduce this problem, a vlan setting is enough:
>   # ifconfig eth0 up
>   # route del default
>   # vconfig add eth0 2
>   # vconfig add eth0 3
>   # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0
>   # route add default gw 10.0.1.254 dev eth0.2
>   # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0
>   # ip rule add from 10.0.0.14 table 4
>   # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3
>   # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I
> You'll detect that all the flow are routed to eth0.2(10.0.1.254).
> 
> Signed-off-by: Xufeng Zhang 
> Signed-off-by: Julian Anastasov 

Acked-by: Vlad Yasevich 

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2] sctp: reset flowi4_oif parameter on route lookup

2014-04-25 Thread Vlad Yasevich

On 04/25/2014 04:55 AM, Xufeng Zhang wrote:
 commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is
 in output route lookups.) introduces another regression which
 is very similar to the problem of commit e6b45241c (ipv4: reset
 flowi parameters on route connect) wants to fix:
 Before we call ip_route_output_key() in sctp_v4_get_dst() to
 get a dst that matches a bind address as the source address,
 we have already called this function previously and the flowi
 parameters have been initialized including flowi4_oif, so when
 we call this function again, the process in __ip_route_output_key()
 will be different because of the setting of flowi4_oif, and we'll
 get a networking device which corresponds to the inputted flowi4_oif
 as the output device, this is wrong because we'll never hit this
 place if the previously returned source address of dst match one
 of the bound addresses.
 
 To reproduce this problem, a vlan setting is enough:
   # ifconfig eth0 up
   # route del default
   # vconfig add eth0 2
   # vconfig add eth0 3
   # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0
   # route add default gw 10.0.1.254 dev eth0.2
   # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0
   # ip rule add from 10.0.0.14 table 4
   # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3
   # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I
 You'll detect that all the flow are routed to eth0.2(10.0.1.254).
 
 Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com
 Signed-off-by: Julian Anastasov j...@ssi.bg

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-03 Thread Vlad Yasevich

On 03/03/2014 04:13 AM, Christian Borntraeger wrote:
> On 02/03/14 02:21, Vlad Yasevich wrote:
>> On 03/01/2014 02:27 PM, Vlad Yasevich wrote:
>>> On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
>>>> On 28/02/14 23:14, Vlad Yasevich wrote:
>>>>> On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
>>>>>> Vlad,
>>>>>>
>>>>>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
>>>>>> macvtap: Add support of packet capture on macvtap device.
>>>>>>
>>>>>> causes a performance regression for iperf traffic between two KVM guests
>>>>>> on my s390 system. Both guests are connected via two macvtaps on the 
>>>>>> same OSA
>>>>>> network card.
>>>>>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
>>>>>> ~4Gbit/sec
>>>>>>
>>>>>> Latency seems to be unchanges (uperf 1byte ping pong).
>>>>>>
>>>>>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with 
>>>>>> this
>>>>>> patch and ~  4 bytes without. So for some reason this patch causes 
>>>>>> the
>>>>>> network stack to do segmentation. (the guest kernel stays the same, only 
>>>>>> host 
>>>>>> kernel is changed).
>>>>>>
>>>>>> Any ideas?
>>>>>
>>>>> I am looking.  It shouldn't cause addition segmentations and when I ran
>>>>> netperf on the code I didn't see any difference in the throughput.
>>>>
>>>> Dont know if the different bytes/packets ratio is really the reason or
>>>> just a side effect. As a hint: the underlying network device does not 
>>>> support
>>>> segmentation, but this should not matter for traffic between to guests.
>>>
>>> Could you post 'ethtool -k' output for both lower-level device and the
>>> macvtap device?
>>>
>>> Thanks
>>> -vlad
>>>
>>
>> Ok.  I think I see what's happening.  Since you turn off offloads on
>> lower device, that's propagated to macvlan device.  As a result, when
>> when we call dev_queue_xmit on the vlan->dev, we end up segmenting since
>> lower level says it does support segmentation.
>>
>> One way to fix this is to never disable offloads on macvlan.  macvlan
>> will always try to use __dev_queue_xmit() with it's lower device, so any
>> segmentation can happen there.
> 
> If you have anything that I should test, let me know.

Hi Christian

Just sent out a patch to fix this.  I tried it with namespaces and
kvm guests and it seems to restore performance for me.

Please give it a try.

Thanks
-vlad
> 
> Christian
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-03 Thread Vlad Yasevich

On 03/03/2014 04:13 AM, Christian Borntraeger wrote:
 On 02/03/14 02:21, Vlad Yasevich wrote:
 On 03/01/2014 02:27 PM, Vlad Yasevich wrote:
 On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
 On 28/02/14 23:14, Vlad Yasevich wrote:
 On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
 Vlad,

 commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 macvtap: Add support of packet capture on macvtap device.

 causes a performance regression for iperf traffic between two KVM guests
 on my s390 system. Both guests are connected via two macvtaps on the 
 same OSA
 network card.
 Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
 ~4Gbit/sec

 Latency seems to be unchanges (uperf 1byte ping pong).

 According to ifconfig in the guest, I have ~ 1500 bytes per packet with 
 this
 patch and ~  4 bytes without. So for some reason this patch causes 
 the
 network stack to do segmentation. (the guest kernel stays the same, only 
 host 
 kernel is changed).

 Any ideas?

 I am looking.  It shouldn't cause addition segmentations and when I ran
 netperf on the code I didn't see any difference in the throughput.

 Dont know if the different bytes/packets ratio is really the reason or
 just a side effect. As a hint: the underlying network device does not 
 support
 segmentation, but this should not matter for traffic between to guests.

 Could you post 'ethtool -k' output for both lower-level device and the
 macvtap device?

 Thanks
 -vlad


 Ok.  I think I see what's happening.  Since you turn off offloads on
 lower device, that's propagated to macvlan device.  As a result, when
 when we call dev_queue_xmit on the vlan-dev, we end up segmenting since
 lower level says it does support segmentation.

 One way to fix this is to never disable offloads on macvlan.  macvlan
 will always try to use __dev_queue_xmit() with it's lower device, so any
 segmentation can happen there.
 
 If you have anything that I should test, let me know.

Hi Christian

Just sent out a patch to fix this.  I tried it with namespaces and
kvm guests and it seems to restore performance for me.

Please give it a try.

Thanks
-vlad
 
 Christian
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-01 Thread Vlad Yasevich

On 03/01/2014 02:27 PM, Vlad Yasevich wrote:
> On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
>> On 28/02/14 23:14, Vlad Yasevich wrote:
>>> On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
>>>> Vlad,
>>>>
>>>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
>>>> macvtap: Add support of packet capture on macvtap device.
>>>>
>>>> causes a performance regression for iperf traffic between two KVM guests
>>>> on my s390 system. Both guests are connected via two macvtaps on the same 
>>>> OSA
>>>> network card.
>>>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
>>>> ~4Gbit/sec
>>>>
>>>> Latency seems to be unchanges (uperf 1byte ping pong).
>>>>
>>>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with 
>>>> this
>>>> patch and ~  4 bytes without. So for some reason this patch causes the
>>>> network stack to do segmentation. (the guest kernel stays the same, only 
>>>> host 
>>>> kernel is changed).
>>>>
>>>> Any ideas?
>>>
>>> I am looking.  It shouldn't cause addition segmentations and when I ran
>>> netperf on the code I didn't see any difference in the throughput.
>>
>> Dont know if the different bytes/packets ratio is really the reason or
>> just a side effect. As a hint: the underlying network device does not support
>> segmentation, but this should not matter for traffic between to guests.
> 
> Could you post 'ethtool -k' output for both lower-level device and the
> macvtap device?
> 
> Thanks
> -vlad
> 

Ok.  I think I see what's happening.  Since you turn off offloads on
lower device, that's propagated to macvlan device.  As a result, when
when we call dev_queue_xmit on the vlan->dev, we end up segmenting since
lower level says it does support segmentation.

One way to fix this is to never disable offloads on macvlan.  macvlan
will always try to use __dev_queue_xmit() with it's lower device, so any
segmentation can happen there.

-vlad

>>
>> Maybe you remember, we had a similar situation with commit 
>> 3e4f8b787370978733ca6cae452720a4f0c296b8
>> (macvtap: Perform GSO on forwarding path), the setup is basically the same.
>>
>>
>> Christian
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-01 Thread Vlad Yasevich

On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
> On 28/02/14 23:14, Vlad Yasevich wrote:
>> On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
>>> Vlad,
>>>
>>> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
>>> macvtap: Add support of packet capture on macvtap device.
>>>
>>> causes a performance regression for iperf traffic between two KVM guests
>>> on my s390 system. Both guests are connected via two macvtaps on the same 
>>> OSA
>>> network card.
>>> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
>>> ~4Gbit/sec
>>>
>>> Latency seems to be unchanges (uperf 1byte ping pong).
>>>
>>> According to ifconfig in the guest, I have ~ 1500 bytes per packet with this
>>> patch and ~  4 bytes without. So for some reason this patch causes the
>>> network stack to do segmentation. (the guest kernel stays the same, only 
>>> host 
>>> kernel is changed).
>>>
>>> Any ideas?
>>
>> I am looking.  It shouldn't cause addition segmentations and when I ran
>> netperf on the code I didn't see any difference in the throughput.
> 
> Dont know if the different bytes/packets ratio is really the reason or
> just a side effect. As a hint: the underlying network device does not support
> segmentation, but this should not matter for traffic between to guests.

Could you post 'ethtool -k' output for both lower-level device and the
macvtap device?

Thanks
-vlad

> 
> Maybe you remember, we had a similar situation with commit 
> 3e4f8b787370978733ca6cae452720a4f0c296b8
> (macvtap: Perform GSO on forwarding path), the setup is basically the same.
> 
> 
> Christian
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-01 Thread Vlad Yasevich

On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
 On 28/02/14 23:14, Vlad Yasevich wrote:
 On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
 Vlad,

 commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 macvtap: Add support of packet capture on macvtap device.

 causes a performance regression for iperf traffic between two KVM guests
 on my s390 system. Both guests are connected via two macvtaps on the same 
 OSA
 network card.
 Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
 ~4Gbit/sec

 Latency seems to be unchanges (uperf 1byte ping pong).

 According to ifconfig in the guest, I have ~ 1500 bytes per packet with this
 patch and ~  4 bytes without. So for some reason this patch causes the
 network stack to do segmentation. (the guest kernel stays the same, only 
 host 
 kernel is changed).

 Any ideas?

 I am looking.  It shouldn't cause addition segmentations and when I ran
 netperf on the code I didn't see any difference in the throughput.
 
 Dont know if the different bytes/packets ratio is really the reason or
 just a side effect. As a hint: the underlying network device does not support
 segmentation, but this should not matter for traffic between to guests.

Could you post 'ethtool -k' output for both lower-level device and the
macvtap device?

Thanks
-vlad

 
 Maybe you remember, we had a similar situation with commit 
 3e4f8b787370978733ca6cae452720a4f0c296b8
 (macvtap: Perform GSO on forwarding path), the setup is basically the same.
 
 
 Christian
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-03-01 Thread Vlad Yasevich

On 03/01/2014 02:27 PM, Vlad Yasevich wrote:
 On 03/01/2014 06:15 AM, Christian Borntraeger wrote:
 On 28/02/14 23:14, Vlad Yasevich wrote:
 On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
 Vlad,

 commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 macvtap: Add support of packet capture on macvtap device.

 causes a performance regression for iperf traffic between two KVM guests
 on my s390 system. Both guests are connected via two macvtaps on the same 
 OSA
 network card.
 Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
 ~4Gbit/sec

 Latency seems to be unchanges (uperf 1byte ping pong).

 According to ifconfig in the guest, I have ~ 1500 bytes per packet with 
 this
 patch and ~  4 bytes without. So for some reason this patch causes the
 network stack to do segmentation. (the guest kernel stays the same, only 
 host 
 kernel is changed).

 Any ideas?

 I am looking.  It shouldn't cause addition segmentations and when I ran
 netperf on the code I didn't see any difference in the throughput.

 Dont know if the different bytes/packets ratio is really the reason or
 just a side effect. As a hint: the underlying network device does not support
 segmentation, but this should not matter for traffic between to guests.
 
 Could you post 'ethtool -k' output for both lower-level device and the
 macvtap device?
 
 Thanks
 -vlad
 

Ok.  I think I see what's happening.  Since you turn off offloads on
lower device, that's propagated to macvlan device.  As a result, when
when we call dev_queue_xmit on the vlan-dev, we end up segmenting since
lower level says it does support segmentation.

One way to fix this is to never disable offloads on macvlan.  macvlan
will always try to use __dev_queue_xmit() with it's lower device, so any
segmentation can happen there.

-vlad


 Maybe you remember, we had a similar situation with commit 
 3e4f8b787370978733ca6cae452720a4f0c296b8
 (macvtap: Perform GSO on forwarding path), the setup is basically the same.


 Christian

 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-02-28 Thread Vlad Yasevich

On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
> Vlad,
> 
> commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
> macvtap: Add support of packet capture on macvtap device.
> 
> causes a performance regression for iperf traffic between two KVM guests
> on my s390 system. Both guests are connected via two macvtaps on the same OSA
> network card.
> Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
> ~4Gbit/sec
> 
> Latency seems to be unchanges (uperf 1byte ping pong).
> 
> According to ifconfig in the guest, I have ~ 1500 bytes per packet with this
> patch and ~  4 bytes without. So for some reason this patch causes the
> network stack to do segmentation. (the guest kernel stays the same, only host 
> kernel is changed).
> 
> Any ideas?

I am looking.  It shouldn't cause addition segmentations and when I ran
netperf on the code I didn't see any difference in the throughput.

-vlad

> 
> 
> Christian
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: macvtap performance regression (bisected) between 3.13 and 3.14-rc1

2014-02-28 Thread Vlad Yasevich

On 02/27/2014 03:52 PM, Christian Borntraeger wrote:
 Vlad,
 
 commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 macvtap: Add support of packet capture on macvtap device.
 
 causes a performance regression for iperf traffic between two KVM guests
 on my s390 system. Both guests are connected via two macvtaps on the same OSA
 network card.
 Before that patch I get ~20 Gbit/sec between two guests, afterwards I get
 ~4Gbit/sec
 
 Latency seems to be unchanges (uperf 1byte ping pong).
 
 According to ifconfig in the guest, I have ~ 1500 bytes per packet with this
 patch and ~  4 bytes without. So for some reason this patch causes the
 network stack to do segmentation. (the guest kernel stays the same, only host 
 kernel is changed).
 
 Any ideas?

I am looking.  It shouldn't cause addition segmentations and when I ran
netperf on the code I didn't see any difference in the throughput.

-vlad

 
 
 Christian
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval

2014-02-18 Thread Vlad Yasevich

On 02/18/2014 10:39 AM, David Laight wrote:
>>> +
>>> +   /* Update the heartbeat timer immediately. */
>>> +   if (!mod_timer(>hb_timer,
>>> +   sctp_transport_timeout(trans)))
>>> +   sctp_transport_hold(trans);
>>
>> This is causes a rather inconsistent behavior in that it doesn't
>> account of the amount of time left on the hb timer.  Thus it doesn't
>> always do what commit log implies.  If the remaining time is shorter
>> then the new timeout value, it will take longer till the next HB
>> the application expects.
> 
> Being able to stop heartbeats being sent by repeatedly configuring
> the interval is certainly not a good idea!
> 
>> If the application has very strict timing requirement on the HBs, it
>> should trigger the HB manually.
>>
>> We could rework the code to allow the combination of
>> SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows:
>>   1) update the timer to the new value
>>   2) Send an immediate HB
>>  a) Update the hb timeout.
>>
>> 2a should probably be done regardless since it can cause 2 HB to be
>> outstanding at the same time.
> 
> Sending a heartbeat 'on demand' might be problematic if one
> has also just been sent (from the timer).

Yes, you are right.  This will trigger an rto doubling since we'll
be in a state where we have an outstanding HB without an ack.

> 
> I'd have thought that you wouldn't want to send a heartbeat while
> still expecting a response from the previous one (this might require
> splitting the time interval in two).

This one is interesting.  Technically, a user can request an immediate
HB any time, and if the user requests it, we should honor the request.
Also, the user may not know that there is an outstanding HB.  I'll need
to think about this a bit.

Thanks for bringing this up.
-vlad
> 
>   David
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval

2014-02-18 Thread Vlad Yasevich

On 02/18/2014 12:56 AM, Xufeng Zhang wrote:
> For an established association, if user has updated the HB.interval
> parameter by setsockopt(), this new heartbeat interval will not
> take effect until:
>   - the expiry of the heartbeat timer and new hearbeat is sent.
>   - DATA chunk has been sent and the transport resets the timer.
> This could not meet the requirement of the user who need to
> get HEARTBEAT sent at the specified time.
> 
> Thus, we need to update the heartbeat timer immediately after
> user has changed HB.interval.
> 
> Signed-off-by: Xufeng Zhang 
> ---
>  net/sctp/socket.c |5 +
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 9e91d6e..699ae1e 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -2344,6 +2344,11 @@ static int sctp_apply_peer_addr_params(struct 
> sctp_paddrparams *params,
>   if (trans) {
>   trans->hbinterval =
>   msecs_to_jiffies(params->spp_hbinterval);
> +
> + /* Update the heartbeat timer immediately. */
> + if (!mod_timer(>hb_timer,
> + sctp_transport_timeout(trans)))
> + sctp_transport_hold(trans);

This is causes a rather inconsistent behavior in that it doesn't
account of the amount of time left on the hb timer.  Thus it doesn't
always do what commit log implies.  If the remaining time is shorter
then the new timeout value, it will take longer till the next HB
the application expects.

If the application has very strict timing requirement on the HBs, it
should trigger the HB manually.

We could rework the code to allow the combination of
SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows:
  1) update the timer to the new value
  2) Send an immediate HB
 a) Update the hb timeout.

2a should probably be done regardless since it can cause 2 HB to be
outstanding at the same time.

-vlad

>   } else if (asoc) {
>   asoc->hbinterval =
>   msecs_to_jiffies(params->spp_hbinterval);
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval

2014-02-18 Thread Vlad Yasevich

On 02/18/2014 12:56 AM, Xufeng Zhang wrote:
 For an established association, if user has updated the HB.interval
 parameter by setsockopt(), this new heartbeat interval will not
 take effect until:
   - the expiry of the heartbeat timer and new hearbeat is sent.
   - DATA chunk has been sent and the transport resets the timer.
 This could not meet the requirement of the user who need to
 get HEARTBEAT sent at the specified time.
 
 Thus, we need to update the heartbeat timer immediately after
 user has changed HB.interval.
 
 Signed-off-by: Xufeng Zhang xufeng.zh...@windriver.com
 ---
  net/sctp/socket.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/net/sctp/socket.c b/net/sctp/socket.c
 index 9e91d6e..699ae1e 100644
 --- a/net/sctp/socket.c
 +++ b/net/sctp/socket.c
 @@ -2344,6 +2344,11 @@ static int sctp_apply_peer_addr_params(struct 
 sctp_paddrparams *params,
   if (trans) {
   trans-hbinterval =
   msecs_to_jiffies(params-spp_hbinterval);
 +
 + /* Update the heartbeat timer immediately. */
 + if (!mod_timer(trans-hb_timer,
 + sctp_transport_timeout(trans)))
 + sctp_transport_hold(trans);

This is causes a rather inconsistent behavior in that it doesn't
account of the amount of time left on the hb timer.  Thus it doesn't
always do what commit log implies.  If the remaining time is shorter
then the new timeout value, it will take longer till the next HB
the application expects.

If the application has very strict timing requirement on the HBs, it
should trigger the HB manually.

We could rework the code to allow the combination of
SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows:
  1) update the timer to the new value
  2) Send an immediate HB
 a) Update the hb timeout.

2a should probably be done regardless since it can cause 2 HB to be
outstanding at the same time.

-vlad


   } else if (asoc) {
   asoc-hbinterval =
   msecs_to_jiffies(params-spp_hbinterval);
 



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval

2014-02-18 Thread Vlad Yasevich

On 02/18/2014 10:39 AM, David Laight wrote:
 +
 +   /* Update the heartbeat timer immediately. */
 +   if (!mod_timer(trans-hb_timer,
 +   sctp_transport_timeout(trans)))
 +   sctp_transport_hold(trans);

 This is causes a rather inconsistent behavior in that it doesn't
 account of the amount of time left on the hb timer.  Thus it doesn't
 always do what commit log implies.  If the remaining time is shorter
 then the new timeout value, it will take longer till the next HB
 the application expects.
 
 Being able to stop heartbeats being sent by repeatedly configuring
 the interval is certainly not a good idea!
 
 If the application has very strict timing requirement on the HBs, it
 should trigger the HB manually.

 We could rework the code to allow the combination of
 SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows:
   1) update the timer to the new value
   2) Send an immediate HB
  a) Update the hb timeout.

 2a should probably be done regardless since it can cause 2 HB to be
 outstanding at the same time.
 
 Sending a heartbeat 'on demand' might be problematic if one
 has also just been sent (from the timer).

Yes, you are right.  This will trigger an rto doubling since we'll
be in a state where we have an outstanding HB without an ack.

 
 I'd have thought that you wouldn't want to send a heartbeat while
 still expecting a response from the previous one (this might require
 splitting the time interval in two).

This one is interesting.  Technically, a user can request an immediate
HB any time, and if the user requests it, we should honor the request.
Also, the user may not know that there is an outstanding HB.  I'll need
to think about this a bit.

Thanks for bringing this up.
-vlad
 
   David
 
 
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap

2014-01-10 Thread Vlad Yasevich

On 01/10/2014 02:06 AM, Jason Wang wrote:
> On 01/10/2014 05:39 AM, Stephen Hemminger wrote:
>> On Thu, 09 Jan 2014 16:55:07 +0800
>> Jason Wang  wrote:
>>
>>> What if use do want a qdisc and want to change the its queue length for
>>> tun/macvlan? And the the name tx_queue_length is misleading. For tun it
>>> may make sense since it was used in transmission path. For macvtap it
>>> was not. So maybe what we need is just a new ioctl for both tun/macvtap
>>> and a new feature flag. If user create the device with new feature flag,
>>> the socket receive queue length could be changed by ioctl instead of
>>> dev->tx_queue_length. If not, the old behaviour could be kept.
>> The overloading of tx_queue_len in macvtap was the original design mistake.
>> Can't this just be undone and expose rx_queue_len as sysfs attribute?
> 
> That works. But we current allow user to change the socket sndbuf
> through TUNSNDBUF. Maybe we need a similar one for receive.
> 

That would make sense.  Since the user interacts with tun fd almost as a
socket and there is actually a socket hiding in the kernel, it almost
begs for actual SO_SNDBUF/SO_RCVBUF support :)

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap

2014-01-10 Thread Vlad Yasevich

On 01/10/2014 02:06 AM, Jason Wang wrote:
 On 01/10/2014 05:39 AM, Stephen Hemminger wrote:
 On Thu, 09 Jan 2014 16:55:07 +0800
 Jason Wang jasow...@redhat.com wrote:

 What if use do want a qdisc and want to change the its queue length for
 tun/macvlan? And the the name tx_queue_length is misleading. For tun it
 may make sense since it was used in transmission path. For macvtap it
 was not. So maybe what we need is just a new ioctl for both tun/macvtap
 and a new feature flag. If user create the device with new feature flag,
 the socket receive queue length could be changed by ioctl instead of
 dev-tx_queue_length. If not, the old behaviour could be kept.
 The overloading of tx_queue_len in macvtap was the original design mistake.
 Can't this just be undone and expose rx_queue_len as sysfs attribute?
 
 That works. But we current allow user to change the socket sndbuf
 through TUNSNDBUF. Maybe we need a similar one for receive.
 

That would make sense.  Since the user interacts with tun fd almost as a
socket and there is actually a socket hiding in the kernel, it almost
begs for actual SO_SNDBUF/SO_RCVBUF support :)

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch net-next v3] ipv6: log autoconfiguration failures

2013-12-13 Thread Vlad Yasevich

On 12/13/2013 10:45 AM, Denys Vlasenko wrote:
> If ipv6 auto-configuration does not work, currently it's hard
> to track what's going on. This change adds log messages
> (at debug level) on every code path where ipv6 autoconf fails.
> 
> v3: changed pr_debug's to pr_warn's.
> 
> Signed-off-by: Denys Vlasenko 
> ---
>  net/ipv6/addrconf.c | 45 -
>  1 file changed, 36 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 3c3425e..0b354f0 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1694,8 +1694,11 @@ static void addrconf_leave_anycast(struct inet6_ifaddr 
> *ifp)
>  
>  static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
>  {
> - if (dev->addr_len != ETH_ALEN)
> + if (dev->addr_len != ETH_ALEN) {
> + pr_warn("IPv6 addrconf: %s: address length %d != %s\n",
> + dev->name, dev->addr_len, "ETH_ALEN");
>   return -1;
> + }
>   memcpy(eui, dev->dev_addr, 3);
>   memcpy(eui + 5, dev->dev_addr + 3, 3);
>  
> @@ -1725,8 +1728,11 @@ static int addrconf_ifid_eui48(u8 *eui, struct 
> net_device *dev)
>  
>  static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
>  {
> - if (dev->addr_len != IEEE802154_ADDR_LEN)
> + if (dev->addr_len != IEEE802154_ADDR_LEN) {
> + pr_warn("IPv6 addrconf: %s: address length %d != %s\n",
> + dev->name, dev->addr_len, "IEEE802154_ADDR_LEN");
>   return -1;
> + }
>   memcpy(eui, dev->dev_addr, 8);
>   eui[0] ^= 2;
>   return 0;
> @@ -1736,8 +1742,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct 
> net_device *dev)
>  {
>   union fwnet_hwaddr *ha;
>  
> - if (dev->addr_len != FWNET_ALEN)
> + if (dev->addr_len != FWNET_ALEN) {
> + pr_warn("IPv6 addrconf: %s: address length %d != %s\n",
> + dev->name, dev->addr_len, "FWNET_ALEN");
>   return -1;
> + }
>  
>   ha = (union fwnet_hwaddr *)dev->dev_addr;
>  
> @@ -1749,8 +1758,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct 
> net_device *dev)
>  static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev)
>  {
>   /* XXX: inherit EUI-64 from other interface -- yoshfuji */
> - if (dev->addr_len != ARCNET_ALEN)
> + if (dev->addr_len != ARCNET_ALEN) {
> + pr_warn("IPv6 addrconf: %s: address length %d != %s\n",
> + dev->name, dev->addr_len, "ARCNET_ALEN");
>   return -1;
> + }
>   memset(eui, 0, 7);
>   eui[7] = *(u8 *)dev->dev_addr;
>   return 0;
> @@ -1758,17 +1770,25 @@ static int addrconf_ifid_arcnet(u8 *eui, struct 
> net_device *dev)
>  
>  static int addrconf_ifid_infiniband(u8 *eui, struct net_device *dev)
>  {
> - if (dev->addr_len != INFINIBAND_ALEN)
> + if (dev->addr_len != INFINIBAND_ALEN) {
> + pr_warn("IPv6 addrconf: %s: address length %d != %s\n",
> + dev->name, dev->addr_len, "INFINIBAND_ALEN");
>   return -1;
> + }
>   memcpy(eui, dev->dev_addr + 12, 8);
>   eui[0] |= 2;
>   return 0;
>  }
>  
> -static int __ipv6_isatap_ifid(u8 *eui, __be32 addr)
> +static int __ipv6_isatap_ifid(u8 *eui, struct net_device *dev)
>  {
> - if (addr == 0)
> + __be32 addr = *(__be32 *)dev->dev_addr;
> +
> + if (addr == 0) {
> + pr_warn("IPv6 addrconf: %s: bad dev_addr %pM\n",
> + dev->name, dev->dev_addr);
>   return -1;
> + }
>   eui[0] = (ipv4_is_zeronet(addr) || ipv4_is_private_10(addr) ||
> ipv4_is_loopback(addr) || ipv4_is_linklocal_169(addr) ||
> ipv4_is_private_172(addr) || ipv4_is_test_192(addr) ||
> @@ -1785,13 +1805,14 @@ static int __ipv6_isatap_ifid(u8 *eui, __be32 addr)
>  static int addrconf_ifid_sit(u8 *eui, struct net_device *dev)
>  {
>   if (dev->priv_flags & IFF_ISATAP)
> - return __ipv6_isatap_ifid(eui, *(__be32 *)dev->dev_addr);
> + return __ipv6_isatap_ifid(eui, dev);
> + pr_warn("IPv6 addrconf: %s: IFF_ISATAP is unset\n", dev->name);
>   return -1;
>  }
>  
>  static int addrconf_ifid_gre(u8 *eui, struct net_device *dev)
>  {
> - return __ipv6_isatap_ifid(eui, *(__be32 *)dev->dev_addr);
> + return __ipv6_isatap_ifid(eui, dev);
>  }
>  
>  static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev)
> @@ -1825,6 +1846,8 @@ static int ipv6_generate_eui64(u8 *eui, struct 
> net_device *dev)
>   case ARPHRD_TUNNEL6:
>   return addrconf_ifid_ip6tnl(eui, dev);
>   }
> + pr_warn("IPv6 addrconf: %s: dev->type %d is not supported\n",
> + dev->name, dev->type);
>   return -1;
>  }

This one should probably be a pr_debug or counter.  This is not a
critical issue and it makes no sense to spam the log if IPv6 is not
supported on a particular interface.

>  
> @@ -1842,6 +1865,10 @@

Re: [patch net-next v3] ipv6: log autoconfiguration failures

2013-12-13 Thread Vlad Yasevich

On 12/13/2013 10:45 AM, Denys Vlasenko wrote:
 If ipv6 auto-configuration does not work, currently it's hard
 to track what's going on. This change adds log messages
 (at debug level) on every code path where ipv6 autoconf fails.
 
 v3: changed pr_debug's to pr_warn's.
 
 Signed-off-by: Denys Vlasenko dvlas...@redhat.com
 ---
  net/ipv6/addrconf.c | 45 -
  1 file changed, 36 insertions(+), 9 deletions(-)
 
 diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
 index 3c3425e..0b354f0 100644
 --- a/net/ipv6/addrconf.c
 +++ b/net/ipv6/addrconf.c
 @@ -1694,8 +1694,11 @@ static void addrconf_leave_anycast(struct inet6_ifaddr 
 *ifp)
  
  static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
  {
 - if (dev-addr_len != ETH_ALEN)
 + if (dev-addr_len != ETH_ALEN) {
 + pr_warn(IPv6 addrconf: %s: address length %d != %s\n,
 + dev-name, dev-addr_len, ETH_ALEN);
   return -1;
 + }
   memcpy(eui, dev-dev_addr, 3);
   memcpy(eui + 5, dev-dev_addr + 3, 3);
  
 @@ -1725,8 +1728,11 @@ static int addrconf_ifid_eui48(u8 *eui, struct 
 net_device *dev)
  
  static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
  {
 - if (dev-addr_len != IEEE802154_ADDR_LEN)
 + if (dev-addr_len != IEEE802154_ADDR_LEN) {
 + pr_warn(IPv6 addrconf: %s: address length %d != %s\n,
 + dev-name, dev-addr_len, IEEE802154_ADDR_LEN);
   return -1;
 + }
   memcpy(eui, dev-dev_addr, 8);
   eui[0] ^= 2;
   return 0;
 @@ -1736,8 +1742,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct 
 net_device *dev)
  {
   union fwnet_hwaddr *ha;
  
 - if (dev-addr_len != FWNET_ALEN)
 + if (dev-addr_len != FWNET_ALEN) {
 + pr_warn(IPv6 addrconf: %s: address length %d != %s\n,
 + dev-name, dev-addr_len, FWNET_ALEN);
   return -1;
 + }
  
   ha = (union fwnet_hwaddr *)dev-dev_addr;
  
 @@ -1749,8 +1758,11 @@ static int addrconf_ifid_ieee1394(u8 *eui, struct 
 net_device *dev)
  static int addrconf_ifid_arcnet(u8 *eui, struct net_device *dev)
  {
   /* XXX: inherit EUI-64 from other interface -- yoshfuji */
 - if (dev-addr_len != ARCNET_ALEN)
 + if (dev-addr_len != ARCNET_ALEN) {
 + pr_warn(IPv6 addrconf: %s: address length %d != %s\n,
 + dev-name, dev-addr_len, ARCNET_ALEN);
   return -1;
 + }
   memset(eui, 0, 7);
   eui[7] = *(u8 *)dev-dev_addr;
   return 0;
 @@ -1758,17 +1770,25 @@ static int addrconf_ifid_arcnet(u8 *eui, struct 
 net_device *dev)
  
  static int addrconf_ifid_infiniband(u8 *eui, struct net_device *dev)
  {
 - if (dev-addr_len != INFINIBAND_ALEN)
 + if (dev-addr_len != INFINIBAND_ALEN) {
 + pr_warn(IPv6 addrconf: %s: address length %d != %s\n,
 + dev-name, dev-addr_len, INFINIBAND_ALEN);
   return -1;
 + }
   memcpy(eui, dev-dev_addr + 12, 8);
   eui[0] |= 2;
   return 0;
  }
  
 -static int __ipv6_isatap_ifid(u8 *eui, __be32 addr)
 +static int __ipv6_isatap_ifid(u8 *eui, struct net_device *dev)
  {
 - if (addr == 0)
 + __be32 addr = *(__be32 *)dev-dev_addr;
 +
 + if (addr == 0) {
 + pr_warn(IPv6 addrconf: %s: bad dev_addr %pM\n,
 + dev-name, dev-dev_addr);
   return -1;
 + }
   eui[0] = (ipv4_is_zeronet(addr) || ipv4_is_private_10(addr) ||
 ipv4_is_loopback(addr) || ipv4_is_linklocal_169(addr) ||
 ipv4_is_private_172(addr) || ipv4_is_test_192(addr) ||
 @@ -1785,13 +1805,14 @@ static int __ipv6_isatap_ifid(u8 *eui, __be32 addr)
  static int addrconf_ifid_sit(u8 *eui, struct net_device *dev)
  {
   if (dev-priv_flags  IFF_ISATAP)
 - return __ipv6_isatap_ifid(eui, *(__be32 *)dev-dev_addr);
 + return __ipv6_isatap_ifid(eui, dev);
 + pr_warn(IPv6 addrconf: %s: IFF_ISATAP is unset\n, dev-name);
   return -1;
  }
  
  static int addrconf_ifid_gre(u8 *eui, struct net_device *dev)
  {
 - return __ipv6_isatap_ifid(eui, *(__be32 *)dev-dev_addr);
 + return __ipv6_isatap_ifid(eui, dev);
  }
  
  static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev)
 @@ -1825,6 +1846,8 @@ static int ipv6_generate_eui64(u8 *eui, struct 
 net_device *dev)
   case ARPHRD_TUNNEL6:
   return addrconf_ifid_ip6tnl(eui, dev);
   }
 + pr_warn(IPv6 addrconf: %s: dev-type %d is not supported\n,
 + dev-name, dev-type);
   return -1;
  }

This one should probably be a pr_debug or counter.  This is not a
critical issue and it makes no sense to spam the log if IPv6 is not
supported on a particular interface.

  
 @@ -1842,6 +1865,10 @@ static int ipv6_inherit_eui64(u8 *eui, struct 
 inet6_dev *idev)
   }
   }
   read_unlock_bh(idev-lock);
 + if (err)
 + pr_warn(IPv6

Re: [patch net-next v2] ipv6: log autoconfiguration failures

2013-12-12 Thread Vlad Yasevich

On 12/12/2013 12:22 PM, David Miller wrote:
> From: Denys Vlasenko 
> Date: Thu, 12 Dec 2013 12:17:42 +0100
> 
>> I can easily imagine their frustration. Kernel _knows_ why
>> it didn't work, and it's not expected to normally pappen,
>> why didn't it tell anything about it?
> 
> Packets are dropped silently, ARP fails and entries go stale silently,
> none of this is logged with kernel messages, why is ipv6 autoconf so
> unique and important to justify different behavior?

But most of the changes from V2 report an actual error condition in
the kernel: addr_len is not set correctly for the device type.
These changes are worth keeping and not just as pr_debug.  Having
a counter of these failures will not provide much useful data if you
happen to have multiple such device and where one happens to work and
the other doesn't (very unlikely, I know).

> 
> Give it statistics just like we have for every other kind of similar
> event.

The fact that link-local address could not be generated for a specific
device type does not lend itself well to counting.  There is no way
from to tell from the counter which device does not support IPv6 autoconfig.

-vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch net-next v2] ipv6: log autoconfiguration failures

2013-12-12 Thread Vlad Yasevich

On 12/12/2013 12:22 PM, David Miller wrote:
 From: Denys Vlasenko dvlas...@redhat.com
 Date: Thu, 12 Dec 2013 12:17:42 +0100

 I can easily imagine their frustration. Kernel _knows_ why
 it didn't work, and it's not expected to normally pappen,
 why didn't it tell anything about it?

 Packets are dropped silently, ARP fails and entries go stale silently,
 none of this is logged with kernel messages, why is ipv6 autoconf so
 unique and important to justify different behavior?

But most of the changes from V2 report an actual error condition in
the kernel: addr_len is not set correctly for the device type.
These changes are worth keeping and not just as pr_debug.  Having
a counter of these failures will not provide much useful data if you
happen to have multiple such device and where one happens to work and
the other doesn't (very unlikely, I know).

 Give it statistics just like we have for every other kind of similar
 event.

The fact that link-local address could not be generated for a specific
device type does not lend itself well to counting.  There is no way
from to tell from the counter which device does not support IPv6 autoconfig.

-vlad
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net V3 2/2] macvtap: signal truncated packets

2013-12-11 Thread Vlad Yasevich

On 12/11/2013 12:08 AM, Jason Wang wrote:
> macvtap_put_user() never return a value grater than iov length, this in fact
> bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
> returning the size of packet plus the possible vlan header to let the trunca
> checking work.
> 
> Cc: Vlad Yasevich 
> Cc: Zhi Yong Wu 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 

Acked-by: Vlad Yasevich 

-vlad
> ---
> Changes from V2:
> - use total as the bytes of packet returned
> Changes from V1:
> - increase total unconditionally
> - do not move the structure veth out of the vlan handling block
> ---
>  drivers/net/macvtap.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 957cc5c..2a89da0 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -770,7 +770,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>   int ret;
>   int vnet_hdr_len = 0;
>   int vlan_offset = 0;
> - int copied;
> + int copied, total;
>  
>   if (q->flags & IFF_VNET_HDR) {
>   struct virtio_net_hdr vnet_hdr;
> @@ -785,7 +785,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>   if (memcpy_toiovecend(iv, (void *)_hdr, 0, 
> sizeof(vnet_hdr)))
>   return -EFAULT;
>   }
> - copied = vnet_hdr_len;
> + total = copied = vnet_hdr_len;
> + total += skb->len;
>  
>   if (!vlan_tx_tag_present(skb))
>   len = min_t(int, skb->len, len);
> @@ -800,6 +801,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  
>   vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
>   len = min_t(int, skb->len + VLAN_HLEN, len);
> + total += VLAN_HLEN;
>  
>   copy = min_t(int, vlan_offset, len);
>   ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy);
> @@ -817,10 +819,9 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>   }
>  
>   ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len);
> - copied += len;
>  
>  done:
> - return ret ? ret : copied;
> + return ret ? ret : total;
>  }
>  
>  static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
> @@ -875,7 +876,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const 
> struct iovec *iv,
>   }
>  
>   ret = macvtap_do_read(q, iocb, iv, len, file->f_flags & O_NONBLOCK);
> - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
> + ret = min_t(ssize_t, ret, len);
>   if (ret > 0)
>   iocb->ki_pos = ret;
>  out:
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net V3 2/2] macvtap: signal truncated packets

2013-12-11 Thread Vlad Yasevich

On 12/11/2013 12:08 AM, Jason Wang wrote:
 macvtap_put_user() never return a value grater than iov length, this in fact
 bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
 returning the size of packet plus the possible vlan header to let the trunca
 checking work.
 
 Cc: Vlad Yasevich vyasev...@gmail.com
 Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad
 ---
 Changes from V2:
 - use total as the bytes of packet returned
 Changes from V1:
 - increase total unconditionally
 - do not move the structure veth out of the vlan handling block
 ---
  drivers/net/macvtap.c | 11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
 index 957cc5c..2a89da0 100644
 --- a/drivers/net/macvtap.c
 +++ b/drivers/net/macvtap.c
 @@ -770,7 +770,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
   int ret;
   int vnet_hdr_len = 0;
   int vlan_offset = 0;
 - int copied;
 + int copied, total;
  
   if (q-flags  IFF_VNET_HDR) {
   struct virtio_net_hdr vnet_hdr;
 @@ -785,7 +785,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
   if (memcpy_toiovecend(iv, (void *)vnet_hdr, 0, 
 sizeof(vnet_hdr)))
   return -EFAULT;
   }
 - copied = vnet_hdr_len;
 + total = copied = vnet_hdr_len;
 + total += skb-len;
  
   if (!vlan_tx_tag_present(skb))
   len = min_t(int, skb-len, len);
 @@ -800,6 +801,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
  
   vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
   len = min_t(int, skb-len + VLAN_HLEN, len);
 + total += VLAN_HLEN;
  
   copy = min_t(int, vlan_offset, len);
   ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy);
 @@ -817,10 +819,9 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
   }
  
   ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len);
 - copied += len;
  
  done:
 - return ret ? ret : copied;
 + return ret ? ret : total;
  }
  
  static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
 @@ -875,7 +876,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const 
 struct iovec *iv,
   }
  
   ret = macvtap_do_read(q, iocb, iv, len, file-f_flags  O_NONBLOCK);
 - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
 + ret = min_t(ssize_t, ret, len);
   if (ret  0)
   iocb-ki_pos = ret;
  out:
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()

2013-12-10 Thread Vlad Yasevich

On 12/09/2013 08:36 PM, David Miller wrote:
> From: Zhi Yong Wu 
> Date: Sat,  7 Dec 2013 04:55:00 +0800
> 
>> From: Zhi Yong Wu 
>>
>> By checking related codes, it is impossible that ret > len or total_len,
>> so we should remove some useless codes in both above functions.
>>
>> Signed-off-by: Zhi Yong Wu 
> 
> Applied.

Wait a sec.  We want to be able to return a value bigger then len
to trigger a MSG_TRUNC.  Jason has patches for to fix this.  If you
apply this, we'll have to re-introduce this code back in.

Same goes for patch 1/2.

-vlad

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()

2013-12-10 Thread Vlad Yasevich

On 12/09/2013 08:36 PM, David Miller wrote:
 From: Zhi Yong Wu zwu.ker...@gmail.com
 Date: Sat,  7 Dec 2013 04:55:00 +0800

 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

 By checking related codes, it is impossible that ret  len or total_len,
 so we should remove some useless codes in both above functions.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com

 Applied.

Wait a sec.  We want to be able to return a value bigger then len
to trigger a MSG_TRUNC.  Jason has patches for to fix this.  If you
apply this, we'll have to re-introduce this code back in.

Same goes for patch 1/2.

-vlad

 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net 1/2] tun: unbreak truncated packet signalling

2013-12-09 Thread Vlad Yasevich

On 12/09/2013 05:55 AM, Michael S. Tsirkin wrote:
> On Mon, Dec 09, 2013 at 06:25:16PM +0800, Jason Wang wrote:
>> Commit 6680ec68eff47d36f67b4351bc9836fd6cba9532
>> (tuntap: hardware vlan tx support) breaks the truncated packet signal
by never
>> return a length greater than iov length in tun_put_user(). This patch
fixes this
>> by always return the length of packet plus possible vlan header.
Caller can
>> detect the truncated packet by comparing the return value and the
size of iov
>> length.
>>
>> Reported-by: Vlad Yasevich 
>> Cc: Vlad Yasevich 
>> Cc: Zhi Yong Wu 
>> Signed-off-by: Jason Wang 
>
> So writer gets back a value greater than what was written?
>
>> ---
>> The patch is needed for stable.
>> ---
>>  drivers/net/tun.c | 23 ---
>>  1 file changed, 12 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index e26cbea..dd1bd7a 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1183,7 +1183,11 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
>>  const struct iovec *iv, int len)
>>  {
>>  struct tun_pi pi = { 0, skb->protocol };
>> -ssize_t total = 0;
>> +struct {
>> +__be16 h_vlan_proto;
>> +__be16 h_vlan_TCI;
>> +} veth;
>> +ssize_t total = 0, off = 0;
>
> Why off = 0 here?
> We initialize it to total unconditionally, don't we?
>
>>  int vlan_offset = 0;
>>
>>  if (!(tun->flags & TUN_NO_PI)) {
>> @@ -1248,14 +1252,11 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
>>  total += tun->vnet_hdr_sz;
>>  }
>>
>> +off = total;
>>  if (!vlan_tx_tag_present(skb)) {
>>  len = min_t(int, skb->len, len);
>>  } else {
>>  int copy, ret;
>> -struct {
>> -__be16 h_vlan_proto;
>> -__be16 h_vlan_TCI;
>> -} veth;
>>
>>  veth.h_vlan_proto = skb->vlan_proto;
>>  veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));
>> @@ -1264,22 +1265,22 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
>>  len = min_t(int, skb->len + VLAN_HLEN, len);
>>
>>  copy = min_t(int, vlan_offset, len);
>> -ret = skb_copy_datagram_const_iovec(skb, 0, iv, total, copy);
>> +ret = skb_copy_datagram_const_iovec(skb, 0, iv, off, copy);
>>  len -= copy;
>> -total += copy;
>> +off += copy;
>>  if (ret || !len)
>>  goto done;
>>
>>  copy = min_t(int, sizeof(veth), len);
>> -ret = memcpy_toiovecend(iv, (void *), total, copy);
>> +ret = memcpy_toiovecend(iv, (void *), off, copy);
>>  len -= copy;
>> -total += copy;
>> +off += copy;
>>  if (ret || !len)
>>  goto done;
>
> This seems wrong: if one of the branches above is taken, total is
> never incremented.
>
>>  }
>>
>> -skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len);
>> -total += len;
>> +skb_copy_datagram_const_iovec(skb, vlan_offset, iv, off, len);
>> +total += skb->len + (vlan_offset ? sizeof(veth) : 0);
>>
>>  done:
>>  tun->dev->stats.tx_packets++;
>
> I also think it's inelegant that the veth struct is now in the
> outside scope, and the extra ? is also ugly.
>
> Here's a smaller patch to fix all these problems - what do you think?
>
>
>
> Signed-off-by: Michael S. Tsirkin 
>
> ---
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 782e38b..3297e41 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1183,7 +1183,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>   const struct iovec *iv, int len)
>  {
>   struct tun_pi pi = { 0, skb->protocol };
> - ssize_t total = 0;
> + ssize_t total = 0, offset;
>   int vlan_offset = 0;
>
>   if (!(tun->flags & TUN_NO_PI)) {
> @@ -1248,6 +1248,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>   total += tun->vnet_hdr_sz;
>   }
>
> + offset = total;
> + total += skb->len;
>   if (!vlan_tx_tag_present(skb)) {
>   len = min_t(int, skb->len, len);
>   } else {
> @@ -1257,6 +1259,8 @@ static ssize_t tun_put_user(struct tun_s

Re: [PATCH net 1/2] tun: unbreak truncated packet signalling

2013-12-09 Thread Vlad Yasevich

On 12/09/2013 05:55 AM, Michael S. Tsirkin wrote:
 On Mon, Dec 09, 2013 at 06:25:16PM +0800, Jason Wang wrote:
 Commit 6680ec68eff47d36f67b4351bc9836fd6cba9532
 (tuntap: hardware vlan tx support) breaks the truncated packet signal
by never
 return a length greater than iov length in tun_put_user(). This patch
fixes this
 by always return the length of packet plus possible vlan header.
Caller can
 detect the truncated packet by comparing the return value and the
size of iov
 length.

 Reported-by: Vlad Yasevich vyasev...@gmail.com
 Cc: Vlad Yasevich vyasev...@gmail.com
 Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com
 Signed-off-by: Jason Wang jasow...@redhat.com

 So writer gets back a value greater than what was written?

 ---
 The patch is needed for stable.
 ---
  drivers/net/tun.c | 23 ---
  1 file changed, 12 insertions(+), 11 deletions(-)

 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index e26cbea..dd1bd7a 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -1183,7 +1183,11 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
  const struct iovec *iv, int len)
  {
  struct tun_pi pi = { 0, skb-protocol };
 -ssize_t total = 0;
 +struct {
 +__be16 h_vlan_proto;
 +__be16 h_vlan_TCI;
 +} veth;
 +ssize_t total = 0, off = 0;

 Why off = 0 here?
 We initialize it to total unconditionally, don't we?

  int vlan_offset = 0;

  if (!(tun-flags  TUN_NO_PI)) {
 @@ -1248,14 +1252,11 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
  total += tun-vnet_hdr_sz;
  }

 +off = total;
  if (!vlan_tx_tag_present(skb)) {
  len = min_t(int, skb-len, len);
  } else {
  int copy, ret;
 -struct {
 -__be16 h_vlan_proto;
 -__be16 h_vlan_TCI;
 -} veth;

  veth.h_vlan_proto = skb-vlan_proto;
  veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));
 @@ -1264,22 +1265,22 @@ static ssize_t tun_put_user(struct tun_struct
*tun,
  len = min_t(int, skb-len + VLAN_HLEN, len);

  copy = min_t(int, vlan_offset, len);
 -ret = skb_copy_datagram_const_iovec(skb, 0, iv, total, copy);
 +ret = skb_copy_datagram_const_iovec(skb, 0, iv, off, copy);
  len -= copy;
 -total += copy;
 +off += copy;
  if (ret || !len)
  goto done;

  copy = min_t(int, sizeof(veth), len);
 -ret = memcpy_toiovecend(iv, (void *)veth, total, copy);
 +ret = memcpy_toiovecend(iv, (void *)veth, off, copy);
  len -= copy;
 -total += copy;
 +off += copy;
  if (ret || !len)
  goto done;

 This seems wrong: if one of the branches above is taken, total is
 never incremented.

  }

 -skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len);
 -total += len;
 +skb_copy_datagram_const_iovec(skb, vlan_offset, iv, off, len);
 +total += skb-len + (vlan_offset ? sizeof(veth) : 0);

  done:
  tun-dev-stats.tx_packets++;

 I also think it's inelegant that the veth struct is now in the
 outside scope, and the extra ? is also ugly.

 Here's a smaller patch to fix all these problems - what do you think?



 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 ---

 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index 782e38b..3297e41 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -1183,7 +1183,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
   const struct iovec *iv, int len)
  {
   struct tun_pi pi = { 0, skb-protocol };
 - ssize_t total = 0;
 + ssize_t total = 0, offset;
   int vlan_offset = 0;

   if (!(tun-flags  TUN_NO_PI)) {
 @@ -1248,6 +1248,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
   total += tun-vnet_hdr_sz;
   }

 + offset = total;
 + total += skb-len;
   if (!vlan_tx_tag_present(skb)) {
   len = min_t(int, skb-len, len);
   } else {
 @@ -1257,6 +1259,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
   __be16 h_vlan_TCI;
   } veth;

 + total += sizeof(veth);
 +
   veth.h_vlan_proto = skb-vlan_proto;
   veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));

 @@ -1279,7 +1283,6 @@ static ssize_t tun_put_user(struct tun_struct *tun,
   }

   skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len);
 - total += len;

  done:
   tun-dev-stats.tx_packets++;



You have to use 'offset' instead of 'total' when doing skb_copy and
adjust offset as you write the vlan header.

I think something like this will fix it:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 782e38b..d71c393 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1183,7 +1183,7 @@ static ssize_t

Re: [PATCH 1/2] macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()

2013-12-06 Thread Vlad Yasevich

On 12/06/2013 03:54 PM, Zhi Yong Wu wrote:
> From: Zhi Yong Wu 
> 
> By checking related codes, it is impossible that ret > len or total_len,
> so we should remove some useless coeds in both above functions.

Looks like commit 6680ec68eff47d36f67b4351bc9836fd6cba9532
Author: Jason Wang 
Date:   Thu Jul 25 13:00:33 2013 +0800

tuntap: hardware vlan tx support

Introduced a change in tun_put_user() where we can
never return a length longer then len or total_len.
This has an effect that is now impossible to signal
truncated status.  It seems like a potential loss
of functionality and it might make sense to restore it.

-vlad
> 
> Signed-off-by: Zhi Yong Wu 
> ---
>  drivers/net/macvtap.c |5 -
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 4c6f84c..7f4ccdd 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -871,7 +871,6 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const 
> struct iovec *iv,
>   }
>  
>   ret = macvtap_do_read(q, iv, len, file->f_flags & O_NONBLOCK);
> - ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
>  out:
>   return ret;
>  }
> @@ -1104,10 +1103,6 @@ static int macvtap_recvmsg(struct kiocb *iocb, struct 
> socket *sock,
>   return -EINVAL;
>   ret = macvtap_do_read(q, m->msg_iov, total_len,
> flags & MSG_DONTWAIT);
> - if (ret > total_len) {
> - m->msg_flags |= MSG_TRUNC;
> - ret = flags & MSG_TRUNC ? ret : total_len;
> - }
>   return ret;
>  }
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 2/2] tun: update file current position

2013-12-06 Thread Vlad Yasevich

On 12/06/2013 12:45 PM, David Miller wrote:
> From: Zhi Yong Wu 
> Date: Fri,  6 Dec 2013 17:08:50 +0800
> 
>> From: Zhi Yong Wu 
>>
>> Signed-off-by: Zhi Yong Wu 
> 
> Also applied and queued up for -stable, thanks.
> 
> I noticed in these two cases that that min_t() adjustment of 'ret'
> seems strange.  I can't understand why it's needed.
> 
> If, for example, tun_do_read() really did read more than 'len'
> bytes:
> 
> 1) That would write past the end of the buffer.
> 
> 2) Writing a different value to the ->ki_pos would mean
>that ->ki_pos is now inaccurate.
> 
> Unless someone can explain why the min_t() is needed, we should remove
> it.

So, back when that code was added, it was actually possible for the
tun_do_read to return a value larger then user specified length, but
the copy would only be done the length bytes.  This was used to signal
MSG_TRUNC.

Specifically we had the following code in tun_put_user()
  len = min_t(int, skb->len, len);

  skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
  total += skb->len;

This has since changed to:
  len = min_t(int, skb->len, len);
  ...
  skb_copy_datagram_const_iovec(skb, vlan_offset, iv, total, len);
  total += len;

So, no it seems impossible to return a value larger then user specified
length, so MSG_TRUNC will never be set.  It probably makes sense to
signal message truncation, but that seems broken right now.

-vlad

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 >

1 - 100 of 338 matches

Mail list logo