Re: [PATCH V2 net-next 4/6] tun: Add support for SCTP checksum offload

2018-05-02 Thread Vlad Yasevich
On 05/02/2018 10:56 AM, Marcelo Ricardo Leitner wrote:
> On Wed, May 02, 2018 at 11:53:47AM -0300, Marcelo Ricardo Leitner wrote:
>> On Tue, May 01, 2018 at 10:07:37PM -0400, Vladislav Yasevich wrote:
>>> Adds a new tun offload flag to allow for SCTP checksum offload.
>>> The flag has to be set by the user and defaults to "no offload".
>>
>> I'm confused here:
>>
>>> +++ b/drivers/net/tun.c
>>> @@ -216,7 +216,7 @@ struct tun_struct {
>>> struct net_device   *dev;
>>> netdev_features_t   set_features;
>>>  #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
>>> - NETIF_F_TSO6)
>>> + NETIF_F_TSO6|NETIF_F_SCTP_CRC)
>>
>> Doesn't adding it here mean it defaults to "offload", instead?
>>
>> later on, it does:
>> dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>TUN_USER_FEATURES | 
>> NETIF_F_HW_VLAN_CTAG_TX |
>>NETIF_F_HW_VLAN_STAG_TX;
> 
> Missed to paste the next line too:
> dev->features = dev->hw_features | NETIF_F_LLTX;
> 

Yes, as a software device, we can default it to on.  However, qemu will zero
out the features and then set them up based on the guest (just like the
regular checksum).

-vlad
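
The reset-then-configure sequence described above can be sketched in C. This is only a sketch, not qemu's actual code: `struct guest_offloads` and `build_tun_offload_mask()` are hypothetical names, and the value given to the SCTP flag is a placeholder, since TUN_F_SCTP_CSUM is proposed by this series and is not assumed to exist in any released <linux/if_tun.h>. The other values mirror the existing TUN_F_* flags.

```c
#include <stdbool.h>

/* Values of the existing flags mirror <linux/if_tun.h>. */
#define SK_TUN_F_CSUM      0x01u
#define SK_TUN_F_TSO4      0x02u
#define SK_TUN_F_TSO6      0x04u
/* Placeholder value: the flag is only proposed by this patch series. */
#define SK_TUN_F_SCTP_CSUM 0x20u

/* Hypothetical summary of what the guest negotiated. */
struct guest_offloads {
    bool csum;
    bool tso4;
    bool tso6;
    bool sctp_csum;
};

/* Build the offload mask from scratch, so anything the guest did not
 * negotiate ends up off regardless of the device defaults. */
unsigned int build_tun_offload_mask(const struct guest_offloads *g)
{
    unsigned int mask = 0; /* start from "no offload" */

    if (g->csum) {
        mask |= SK_TUN_F_CSUM;
        /* Assumption: TSO is only requested together with checksum offload. */
        if (g->tso4)
            mask |= SK_TUN_F_TSO4;
        if (g->tso6)
            mask |= SK_TUN_F_TSO6;
    }
    if (g->sctp_csum)
        mask |= SK_TUN_F_SCTP_CSUM;

    return mask;
}
```

The resulting mask is what a userspace program would pass to ioctl(fd, TUNSETOFFLOAD, mask) on the tun/tap file descriptor.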




Re: [PATCH V2 net-next 5/6] macvlan/macvtap: Add support for SCTP checksum offload.

2018-05-02 Thread Vlad Yasevich
On 05/02/2018 10:17 AM, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 10:00:14AM -0400, Vlad Yasevich wrote:
>> On 05/02/2018 09:46 AM, Michael S. Tsirkin wrote:
>>> On Wed, May 02, 2018 at 09:27:00AM -0400, Vlad Yasevich wrote:
>>>> On 05/01/2018 11:24 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, May 01, 2018 at 10:07:38PM -0400, Vladislav Yasevich wrote:
>>>>>> Since we now have support for software CRC32c offload, turn it on
>>>>>> for macvlan and macvtap devices so that guests can take advantage
>>>>>> of offload SCTP checksums to the host or host hardware.
>>>>>>
>>>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>>>> ---
>>>>>>  drivers/net/macvlan.c | 5 +++--
>>>>>>  drivers/net/tap.c | 8 +---
>>>>>>  2 files changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
>>>>>> index 725f4b4..646b730 100644
>>>>>> --- a/drivers/net/macvlan.c
>>>>>> +++ b/drivers/net/macvlan.c
>>>>>> @@ -834,7 +834,7 @@ static struct lock_class_key 
>>>>>> macvlan_netdev_addr_lock_key;
>>>>>>  
>>>>>>  #define ALWAYS_ON_OFFLOADS \
>>>>>>  (NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
>>>>>> - NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
>>>>>> + NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
>>>>>>  
>>>>>>  #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
>>>>>>  
>>>>>> @@ -842,7 +842,8 @@ static struct lock_class_key 
>>>>>> macvlan_netdev_addr_lock_key;
>>>>>>  (NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | 
>>>>>> NETIF_F_FRAGLIST | \
>>>>>>   NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
>>>>>>   NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM 
>>>>>> | \
>>>>>> - NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
>>>>>> + NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
>>>>>> + NETIF_F_SCTP_CRC)
>>>>>>  
>>>>>>  #define MACVLAN_STATE_MASK \
>>>>>>  ((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>> index 9b6cb78..2c8512b 100644
>>>>>> --- a/drivers/net/tap.c
>>>>>> +++ b/drivers/net/tap.c
>>>>>> @@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff 
>>>>>> **pskb)
>>>>>>   *check, we either support them all or none.
>>>>>>   */
>>>>>>  if (skb->ip_summed == CHECKSUM_PARTIAL &&
>>>>>> -!(features & NETIF_F_CSUM_MASK) &&
>>>>>> -skb_checksum_help(skb))
>>>>>> +skb_csum_hwoffload_help(skb, features))
>>>>>>  goto drop;
>>>>>>  if (ptr_ring_produce(&q->ring, skb))
>>>>>>  goto drop;
>>>>>> @@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned 
>>>>>> long arg)
>>>>>>  }
>>>>>>  }
>>>>>>  
>>>>>> +if (arg & TUN_F_SCTP_CSUM)
>>>>>> +feature_mask |= NETIF_F_SCTP_CRC;
>>>>>> +
>>>>>
>>>>> so this still affects TX, shouldn't this affect RX instead?
>>>>
>>>> There is no bit to set on the RX path, just like there is no bit to set
>>>> on the RX path for TUN_F_CSUM.
>>>>
>>>> We only invert TSO offloads, not checksum offloads, as the comment below
>>>> states.  For checksum, macvtap has to compute the checksum itself in
>>>> tap_handle_frame() above.  It uses the TX feature bits to see if it needs
>>>> to do the checksum.
>>>>
>>>> If you think we need another flag to macvtap to control RXCSUM, that would 
>>>> need to be
>

Re: [PATCH V2 net-next 5/6] macvlan/macvtap: Add support for SCTP checksum offload.

2018-05-02 Thread Vlad Yasevich
On 05/02/2018 09:46 AM, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 09:27:00AM -0400, Vlad Yasevich wrote:
>> On 05/01/2018 11:24 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 01, 2018 at 10:07:38PM -0400, Vladislav Yasevich wrote:
>>>> Since we now have support for software CRC32c offload, turn it on
>>>> for macvlan and macvtap devices so that guests can take advantage
>>>> of offload SCTP checksums to the host or host hardware.
>>>>
>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>> ---
>>>>  drivers/net/macvlan.c | 5 +++--
>>>>  drivers/net/tap.c | 8 +---
>>>>  2 files changed, 8 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
>>>> index 725f4b4..646b730 100644
>>>> --- a/drivers/net/macvlan.c
>>>> +++ b/drivers/net/macvlan.c
>>>> @@ -834,7 +834,7 @@ static struct lock_class_key 
>>>> macvlan_netdev_addr_lock_key;
>>>>  
>>>>  #define ALWAYS_ON_OFFLOADS \
>>>>(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
>>>> -   NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
>>>> +   NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
>>>>  
>>>>  #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
>>>>  
>>>> @@ -842,7 +842,8 @@ static struct lock_class_key 
>>>> macvlan_netdev_addr_lock_key;
>>>>(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
>>>> NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
>>>> NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
>>>> -   NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
>>>> +   NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
>>>> +   NETIF_F_SCTP_CRC)
>>>>  
>>>>  #define MACVLAN_STATE_MASK \
>>>>((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>> index 9b6cb78..2c8512b 100644
>>>> --- a/drivers/net/tap.c
>>>> +++ b/drivers/net/tap.c
>>>> @@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff 
>>>> **pskb)
>>>> *check, we either support them all or none.
>>>> */
>>>>if (skb->ip_summed == CHECKSUM_PARTIAL &&
>>>> -  !(features & NETIF_F_CSUM_MASK) &&
>>>> -  skb_checksum_help(skb))
>>>> +  skb_csum_hwoffload_help(skb, features))
>>>>goto drop;
>>>>if (ptr_ring_produce(&q->ring, skb))
>>>>goto drop;
>>>> @@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned 
>>>> long arg)
>>>>}
>>>>}
>>>>  
>>>> +  if (arg & TUN_F_SCTP_CSUM)
>>>> +  feature_mask |= NETIF_F_SCTP_CRC;
>>>> +
>>>
>>> so this still affects TX, shouldn't this affect RX instead?
>>
>> There is no bit to set on the RX path, just like there is no bit to set
>> on the RX path for TUN_F_CSUM.
>>
>> We only invert TSO offloads, not checksum offloads, as the comment below
>> states.  For checksum, macvtap has to compute the checksum itself in
>> tap_handle_frame() above.  It uses the TX feature bits to see if it needs
>> to do the checksum.
>>
>> If you think we need another flag for macvtap to control RXCSUM, that would
>> need to be separate and cover standard TCP checksum as well.
>>
>> -vlad
> 
> Confused. What is the meaning of TUN_F_SCTP_CSUM? I assume this is
> a way for userspace to tell tun device: "I can handle
> packets without SCTP checksum, pls send them my way".

Yes,  just as TUN_F_CSUM means that tun device can handle packets with
partial tcp/udp checksum.

> 
> Now what is the implication for macvtap? 

The implication is exactly the same as for TUN_F_CSUM.  If the
flag is set on the macvtap device, the TX checksum feature is
turned on.

> And why  are
> you setting NETIF_F_SCTP_CRC which is a flag
> that affects packets sent by guest to host?

Mainly it's because we are using just 1 flag to control checksum
offloading and we need to be able to control both TX and RX paths.

What you are suggesting is that we either invert what TUN_F_CSUM
is doing in the macvtap case, or have another flag that lets us control
TX and RX paths separately.

In either case, that would be separate work.
-vlad

> 
> 
>>>
>>>
>>>>/* tun/tap driver inverts the usage for TSO offloads, where
>>>> * setting the TSO bit means that the userspace wants to
>>>> * accept TSO frames and turning it off means that user space
>>>> @@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned 
>>>> int cmd,
>>>>case TUNSETOFFLOAD:
>>>>/* let the user check for future flags */
>>>>if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
>>>> -  TUN_F_TSO_ECN | TUN_F_UFO))
>>>> +  TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
>>>>return -EINVAL;
>>>>  
>>>>rtnl_lock();
>>>> -- 
>>>> 2.9.5



Re: [PATCH V2 net-next 5/6] macvlan/macvtap: Add support for SCTP checksum offload.

2018-05-02 Thread Vlad Yasevich
On 05/01/2018 11:24 PM, Michael S. Tsirkin wrote:
> On Tue, May 01, 2018 at 10:07:38PM -0400, Vladislav Yasevich wrote:
>> Since we now have support for software CRC32c offload, turn it on
>> for macvlan and macvtap devices so that guests can take advantage
>> of offload SCTP checksums to the host or host hardware.
>>
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  drivers/net/macvlan.c | 5 +++--
>>  drivers/net/tap.c | 8 +---
>>  2 files changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
>> index 725f4b4..646b730 100644
>> --- a/drivers/net/macvlan.c
>> +++ b/drivers/net/macvlan.c
>> @@ -834,7 +834,7 @@ static struct lock_class_key 
>> macvlan_netdev_addr_lock_key;
>>  
>>  #define ALWAYS_ON_OFFLOADS \
>>  (NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
>> - NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
>> + NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
>>  
>>  #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
>>  
>> @@ -842,7 +842,8 @@ static struct lock_class_key 
>> macvlan_netdev_addr_lock_key;
>>  (NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
>>   NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
>>   NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
>> - NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
>> + NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
>> + NETIF_F_SCTP_CRC)
>>  
>>  #define MACVLAN_STATE_MASK \
>>  ((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index 9b6cb78..2c8512b 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff 
>> **pskb)
>>   *check, we either support them all or none.
>>   */
>>  if (skb->ip_summed == CHECKSUM_PARTIAL &&
>> -!(features & NETIF_F_CSUM_MASK) &&
>> -skb_checksum_help(skb))
>> +skb_csum_hwoffload_help(skb, features))
>>  goto drop;
>>  if (ptr_ring_produce(&q->ring, skb))
>>  goto drop;
>> @@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned 
>> long arg)
>>  }
>>  }
>>  
>> +if (arg & TUN_F_SCTP_CSUM)
>> +feature_mask |= NETIF_F_SCTP_CRC;
>> +
> 
> so this still affects TX, shouldn't this affect RX instead?

There is no bit to set on the RX path, just like there is no bit to set on
the RX path for TUN_F_CSUM.

We only invert TSO offloads, not checksum offloads, as the comment below
states.  For checksum, macvtap has to compute the checksum itself in
tap_handle_frame() above.  It uses the TX feature bits to see if it needs
to do the checksum.

If you think we need another flag for macvtap to control RXCSUM, that would
need to be separate and cover standard TCP checksum as well.

-vlad

> 
> 
>>  /* tun/tap driver inverts the usage for TSO offloads, where
>>   * setting the TSO bit means that the userspace wants to
>>   * accept TSO frames and turning it off means that user space
>> @@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned int 
>> cmd,
>>  case TUNSETOFFLOAD:
>>  /* let the user check for future flags */
>>  if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
>> -TUN_F_TSO_ECN | TUN_F_UFO))
>> +TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
>>  return -EINVAL;
>>  
>>  rtnl_lock();
>> -- 
>> 2.9.5
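
The decision described in the reply above (macvtap computes the checksum itself in tap_handle_frame() unless the TX feature bits say the reader accepts partial checksums) can be modelled in userspace C. This is a simplified sketch under stated assumptions, not kernel code: `struct pkt`, the `F_*` bits and `sw_checksum_needed()` are hypothetical stand-ins for the skb fields, the NETIF_F_* features and skb_csum_hwoffload_help().

```c
#include <stdbool.h>

/* Stand-ins for kernel feature bits; the values are arbitrary. */
#define F_CSUM     0x1u /* models NETIF_F_CSUM_MASK */
#define F_SCTP_CRC 0x2u /* models NETIF_F_SCTP_CRC */

/* Which software checksum, if any, must run before queueing the frame. */
enum sw_csum { SW_NONE, SW_INET, SW_CRC32C };

struct pkt {
    bool csum_partial;  /* models skb->ip_summed == CHECKSUM_PARTIAL */
    bool csum_not_inet; /* models skb->csum_not_inet (SCTP CRC32c) */
};

enum sw_csum sw_checksum_needed(const struct pkt *p, unsigned int features)
{
    if (!p->csum_partial)
        return SW_NONE; /* checksum already complete */

    if (p->csum_not_inet)
        return (features & F_SCTP_CRC) ? SW_NONE : SW_CRC32C;

    return (features & F_CSUM) ? SW_NONE : SW_INET;
}
```

In this model, setting the SCTP bit through the offload ioctl only changes whether the CRC32c is computed in software before the frame is handed to the reader, which is the behaviour the reply attributes to TUN_F_CSUM for TCP/UDP.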



Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support

2018-04-23 Thread Vlad Yasevich
On 04/20/2018 01:22 PM, Marcelo Ricardo Leitner wrote:
> On Wed, Apr 18, 2018 at 05:06:46PM +0300, Michael S. Tsirkin wrote:
>> On Tue, Apr 17, 2018 at 04:35:18PM -0400, Vlad Yasevich wrote:
>>> On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
>>>> On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
>>>>> Now that we have SCTP offload capabilities in the kernel, we can add
>>>>> them to virtio as well.  First step is SCTP checksum.
>>>>
>>>> Thanks.
>>>>
>>>>> As for GSO, the way sctp GSO is currently implemented buys us nothing
>>>>> in added support to virtio.  To add true GSO, would require a lot of
>>>>> re-work inside of SCTP and would require extensions to the virtio
>>>>> net header to carry extra sctp data.
>>>>
>>>> Can you please elaborate more on this? Is this because SCTP GSO relies
>>>> on the gso skb format for knowing how to segment it instead of having
>>>> a list of sizes?
>>>>
>>>
>>> It's mainly because all the true segmentation, placing data into chunks,
>>> has already happened.  All that GSO does is allow for a higher bundling
>>> rate between VMs.  If that is all SCTP GSO is ever going to do, that's fine,
>>> but the goal is to do real GSO eventually and potentially reduce the
>>> amount of memory copying we are doing.
>>> If we do that, any current attempt at GSO in virtio would have to be
>>> deprecated and we'd need GSO2 or something like that.
>>
>> Batching helps virtualization *a lot* though.
> 
> Yep. The results posted by Xin in the other email give good insights
> on it.
> 
>> Are there actual plans for GSO2? Is it just for SCTP?
> 
> No plans. In this context, at least, yes, just for SCTP.
> 
> It was a supposition in case we start doing a different GSO for SCTP,
> one more like what we have for TCP.
> 
> Currently, as the SCTP GSO code doesn't leave the system, we can
> update it if we want. But by the moment we add support for it in
> virtio, we will have to be backwards compatible if we end up doing
> SCTP GSO differently.

So, just because the Linux code doesn't do it differently doesn't mean
that someone else doesn't.  Since the device has to work across different
possible implementations, it needs to be generic enough.  If we simply
document the current Linux practice, that may not be ideal in the future.

I was hesitant to introduce this without studying the feasibility of
doing late segmentation.

-vlad

> 
> But again, I don't think such an approach for SCTP GSO would be either
> feasible or worth it. The complexity for it, to work across stream
> schedules and late TSN allocation, would do more harm than good IMO.
> 
>>
>>>
>>> This is why, after doing the GSO support, I decided not to include it.
>>>
>>> -vlad
>>>>   Marcelo
>>>>



Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support

2018-04-17 Thread Vlad Yasevich
On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
> On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
>> Now that we have SCTP offload capabilities in the kernel, we can add
>> them to virtio as well.  First step is SCTP checksum.
> 
> Thanks.
> 
>> As for GSO, the way sctp GSO is currently implemented buys us nothing
>> in added support to virtio.  To add true GSO, would require a lot of
>> re-work inside of SCTP and would require extensions to the virtio
>> net header to carry extra sctp data.
> 
> Can you please elaborate more on this? Is this because SCTP GSO relies
> on the gso skb format for knowing how to segment it instead of having
> a list of sizes?
> 

It's mainly because all the true segmentation, placing data into chunks,
has already happened.  All that GSO does is allow for a higher bundling
rate between VMs.  If that is all SCTP GSO is ever going to do, that's fine,
but the goal is to do real GSO eventually and potentially reduce the
amount of memory copying we are doing.
If we do that, any current attempt at GSO in virtio would have to be
deprecated and we'd need GSO2 or something like that.

This is why, after doing the GSO support, I decided not to include it.

-vlad

>   Marcelo
> 



Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading

2018-04-17 Thread Vlad Yasevich
On 04/16/2018 01:09 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2018 at 09:45:48AM -0400, Vlad Yasevich wrote:
>> On 04/11/2018 06:49 PM, Michael S. Tsirkin wrote:
>>> On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
>>>> To support SCTP checksum offloading, we need to add a new feature
>>>> to virtio_net, so we can negotiate support between the hypervisor
>>>> and the guest.
>>>>
>>>> The signalling to the guest that an alternate checksum needs to
>>>> be used is done via a new flag in the virtio_net_hdr.  If the
>>>> flag is set, the host will know to perform an alternate checksum
>>>> calculation, which right now is only CRC32c.
>>>>
>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>> ---
>>>>  drivers/net/virtio_net.c| 11 ---
>>>>  include/linux/virtio_net.h  |  6 ++
>>>>  include/uapi/linux/virtio_net.h |  2 ++
>>>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 7b187ec..b601294 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>>/* Do we support "hardware" checksums? */
>>>>if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>>>>/* This opens up the world of extra features. */
>>>> -  dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
>>>> +  netdev_features_t sctp = 0;
>>>> +
>>>> +  if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
>>>> +  sctp |= NETIF_F_SCTP_CRC;
>>>> +
>>>> +  dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>>>>if (csum)
>>>> -  dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
>>>> +  dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>>>>  
>>>>if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>>>>dev->hw_features |= NETIF_F_TSO
>>>> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>>>>  };
>>>>  
>>>>  #define VIRTNET_FEATURES \
>>>> -  VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
>>>> +  VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>>>>VIRTIO_NET_F_MAC, \
>>>>VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>>>>VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, 
>>>> VIRTIO_NET_F_GUEST_TSO6, \
>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>>>> index f144216..2e7a64a 100644
>>>> --- a/include/linux/virtio_net.h
>>>> +++ b/include/linux/virtio_net.h
>>>> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff 
>>>> *skb,
>>>>  
>>>>if (!skb_partial_csum_set(skb, start, off))
>>>>return -EINVAL;
>>>> +
>>>> +  if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
>>>> +  skb->csum_not_inet = 1;
>>>>}
>>>>  
>>>>if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>>>> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct 
>>>> sk_buff *skb,
>>>>hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>>>>} /* else everything is zero */
>>>>  
>>>> +  if (skb->csum_not_inet)
>>>> +  hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
>>>> +
>>>>return 0;
>>>>  }
>>>>  
>>>> diff --git a/include/uapi/linux/virtio_net.h 
>>>> b/include/uapi/linux/virtio_net.h
>>>> index 5de6ed3..3f279c8 100644
>>>> --- a/include/uapi/linux/virtio_net.h
>>>> +++ b/include/uapi/linux/virtio_net.h
>>>> @@ -36,6 +36,7 @@
>>>>  #define VIRTIO_NET_F_GUEST_CSUM   1   /* Guest handles pkts w/ 
>>>> partial csum */
>>>>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload 
>>>> configuration. */
>>>>  #define VIRTIO_NET_F_MTU  3   /* Initial MTU advice */
>>>> +#define VIRTIO_NET_F_SCTP_CSUM  4 /* SCTP

Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading

2018-04-16 Thread Vlad Yasevich
On 04/11/2018 06:49 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
>> To support SCTP checksum offloading, we need to add a new feature
>> to virtio_net, so we can negotiate support between the hypervisor
>> and the guest.
>>
>> The signalling to the guest that an alternate checksum needs to
>> be used is done via a new flag in the virtio_net_hdr.  If the
>> flag is set, the host will know to perform an alternate checksum
>> calculation, which right now is only CRC32c.
>>
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  drivers/net/virtio_net.c| 11 ---
>>  include/linux/virtio_net.h  |  6 ++
>>  include/uapi/linux/virtio_net.h |  2 ++
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 7b187ec..b601294 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>>  /* Do we support "hardware" checksums? */
>>  if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>>  /* This opens up the world of extra features. */
>> -dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
>> +netdev_features_t sctp = 0;
>> +
>> +if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
>> +sctp |= NETIF_F_SCTP_CRC;
>> +
>> +dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>>  if (csum)
>> -dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
>> +dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>>  
>>  if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>>  dev->hw_features |= NETIF_F_TSO
>> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>>  };
>>  
>>  #define VIRTNET_FEATURES \
>> -VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
>> +VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>>  VIRTIO_NET_F_MAC, \
>>  VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>>  VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, 
>> VIRTIO_NET_F_GUEST_TSO6, \
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index f144216..2e7a64a 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff 
>> *skb,
>>  
>>  if (!skb_partial_csum_set(skb, start, off))
>>  return -EINVAL;
>> +
>> +if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
>> +skb->csum_not_inet = 1;
>>  }
>>  
>>  if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct 
>> sk_buff *skb,
>>  hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>>  } /* else everything is zero */
>>  
>> +if (skb->csum_not_inet)
>> +hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
>> +
>>  return 0;
>>  }
>>  
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index 5de6ed3..3f279c8 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -36,6 +36,7 @@
>>  #define VIRTIO_NET_F_GUEST_CSUM 1   /* Guest handles pkts w/ 
>> partial csum */
>>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload 
>> configuration. */
>>  #define VIRTIO_NET_F_MTU3   /* Initial MTU advice */
>> +#define VIRTIO_NET_F_SCTP_CSUM  4   /* SCTP checksum offload support */
>>  #define VIRTIO_NET_F_MAC5   /* Host has given MAC address. */
>>  #define VIRTIO_NET_F_GUEST_TSO4 7   /* Guest can handle TSOv4 in. */
>>  #define VIRTIO_NET_F_GUEST_TSO6 8   /* Guest can handle TSOv6 in. */
> 
> Is this a guest or a host checksum? We should differentiate between the
> two.

I suppose this is a HOST checksum, since it behaves like VIRTIO_NET_F_CSUM,
only for SCTP.  I couldn't find a use for a GUEST-side flag, since there
technically isn't any validation and there are no additional mappings from
a VIRTIO flag to a NETIF flag.

If the feature is negotiated, the guest ends up generating partially
checksummed packets, and the host turns on the appropriate flags on its side.
The host will also make sure the checksum is fixed up if the guest doesn't
support it.  So 1 flag is currently all that is needed.

-vlad

> 
> 
>> @@ -101,6 +102,7 @@ struct virtio_net_config {
>>  struct virtio_net_hdr_v1 {
>>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1   /* Use csum_start, csum_offset 
>> */
>>  #define VIRTIO_NET_HDR_F_DATA_VALID 2   /* Csum is valid */
>> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4   /* Checksum is not inet */
>>  __u8 flags;
>>  #define VIRTIO_NET_HDR_GSO_NONE 0   /* Not a GSO frame */
>>  #define 
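
For reference, the "alternate checksum" the commit message mentions is CRC32c. A minimal bitwise sketch in C follows; it is not the kernel's implementation (which is table-driven or hardware-assisted), and the function name `crc32c()` is chosen here for illustration. It uses the reflected Castagnoli polynomial 0x82F63B78 with the usual all-ones initial value and final inversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32c (Castagnoli), reflected form. Slow but short. */
uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu; /* initial value: all ones */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return ~crc; /* final inversion */
}
```

The standard check value for this CRC is 0xE3069283 for the nine ASCII bytes "123456789". Because this is not the 16-bit ones'-complement Internet checksum, the receiver of a partially checksummed packet needs the extra header flag to know which algorithm to finish.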

Re: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-26 Thread Vlad Yasevich
On 10/20/2017 08:06 PM, Keller, Jacob E wrote:
>> -Original Message-
>> From: Keller, Jacob E
>> Sent: Friday, October 20, 2017 10:23 AM
>> To: netdev@vger.kernel.org
>> Cc: Malek, Patryk <patryk.ma...@intel.com>; 'Vlad Yasevich'
>> <vyase...@redhat.com>
>> Subject: removing bridge in vlan_filtering mode requests delete of attached
>> ports main MAC address
>>
>> Hi,
>>
>> We've run into an issue with bridges set in vlan_filtering mode. Basically,
>> if we attach a device to a bridge which has enabled vlan_filtering, and then
>> remove the bridge, we end up requesting the driver of the attached device to
>> remove its own MAC HW address.
>>
>> In i40e, at least, this causes the driver to actually delete such an address,
>> and then it will no longer receive any traffic.
>>
>> To reproduce this:
>>
>> a) brctl addbr br0
>> b) brctl addif br0 enp
>> # enable vlan filtering
>> c) echo 1 >/sys/class/net/br0/bridge/vlan_filtering
>> d) brctl delbr br0
>>
>> Specifically this appears to happen because of how we automatically enter
>> static configuration for routes when vlan_filtering is enabled, and we call
>> br_fdb_unsync_static which will clear all the routes from the fdb table for
>> the device. See commit 2796d0c648c9 ("bridge: Automatically manage port
>> promiscuous mode.", 2014-05-16) for more details.
>>
>> This happens to include the device's own default address, which results in
>> the bug.
>>
>> I'm not sure if this is a driver bug, or if it's a bug in the bridging code.
>>
>> Who would know more about this and what to do about this?
>>
>> One obvious solution is to hard code the i40e device driver so that it does
>> not actually delete the HW address from the unicast filter list. This could
>> work, but seems to me like it's papering over the problem. Is this just a
>> known thing that drivers should be aware of? I don't really know...
>>
>> An alternative solution would be to possibly ignore any fdb addresses which
>> specifically target that port?
>>
>> Any ideas?
> 
> For the record, adding a check to prevent unsync_static from removing
> addresses which are targeting the specific port does work to resolve this
> specific issue, but I'm sure it's not the correct solution as I expect that
> would cause other problems.
> 

Hi Jake

I think adding a !fdb->local check should work.  Local fdb entries contain the
addresses assigned to the ports of the bridge, and those shouldn't be directly
removed.

If that works, that looks like the right solution.

-vlad

> Thanks,
> Jake
> 
>>
>> Regards,
>> Jake
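
The suggested !fdb->local check can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the bridge code: `struct fdb_entry` and `unsync_static_model()` are hypothetical, and the `removed[]` array stands in for asking the port's driver to drop an address from its unicast filter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a bridge fdb entry. */
struct fdb_entry {
    unsigned char addr[6];
    bool is_static;
    bool is_local; /* address belongs to a bridge port itself */
};

/* Mark which entries would be removed from the port's unicast filter.
 * Returns how many were marked. */
size_t unsync_static_model(const struct fdb_entry *tbl, size_t n, bool *removed)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        removed[i] = false;
        if (!tbl[i].is_static)
            continue;
        if (tbl[i].is_local)
            continue; /* the suggested check: keep the port's own MAC */
        removed[i] = true;
        count++;
    }
    return count;
}
```

With the extra check, a port's own address is never handed to the driver for deletion, so the i40e symptom described in the report would not be triggered by removing the bridge.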



Re: [PATCH net 5/6] rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink

2017-10-26 Thread Vlad Yasevich
On 10/16/2017 02:20 PM, Nicolas Dichtel wrote:
> Le 16/10/2017 à 03:17, David Ahern a écrit :
>> [ cc'ed Nicolas ]
>>
>> On 10/15/17 4:13 AM, Xin Long wrote:
>>> The check 'status & DO_SETLINK_NOTIFY' in do_setlink doesn't really
>>> work after status & DO_SETLINK_MODIFIED, as:
>>>
>>>   DO_SETLINK_MODIFIED 0x1
>>>   DO_SETLINK_NOTIFY 0x3
>>>
>>> Considering that notifications are supposed to be sent only when
>>> status have the flag DO_SETLINK_NOTIFY, the right check would be:
>>>
>>>   (status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY
>>>
>>> This would avoid lots of duplicated notifications when setting some
>>> properties of a link.
>>>
>>> Fixes: ba9989069f4e ("rtnl/do_setlink(): notify when a netdev is modified")
>>> Signed-off-by: Xin Long 
> Good catch, thank you.
> 
> Acked-by: Nicolas Dichtel 
> 

So I found this the first time around when looking at this code, but was told
that notifications are expected anytime we modify any setting, thus the code
was simply checking for the MODIFIED bit.  Has that thinking changed?

-vlad
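
The difference between the two checks under discussion is easy to see in isolation. The sketch below reuses the two flag values quoted in the patch description; the helper names `notify_any_bit()` and `notify_all_bits()` are made up for illustration.

```c
#include <stdbool.h>

#define DO_SETLINK_MODIFIED 0x1
#define DO_SETLINK_NOTIFY   0x3

/* Existing behaviour: true as soon as any bit of NOTIFY is set,
 * which includes the MODIFIED bit on its own. */
bool notify_any_bit(int status)
{
    return (status & DO_SETLINK_NOTIFY) != 0;
}

/* Proposed behaviour: true only when every bit of NOTIFY is set. */
bool notify_all_bits(int status)
{
    return (status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY;
}
```

For status == DO_SETLINK_MODIFIED the first helper returns true and the second returns false, so whether the change is a fix or a behaviour change depends on the question raised above: are notifications expected for every modification?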


Re: [PATCH net] net: account for current skb length when deciding about UFO

2017-06-19 Thread Vlad Yasevich
On 06/19/2017 07:03 AM, Michal Kubecek wrote:
> Our customer encountered stuck NFS writes for blocks starting at specific
> offsets w.r.t. page boundary caused by networking stack sending packets via
> UFO enabled device with wrong checksum. The problem can be reproduced by
> composing a long UDP datagram from multiple parts using MSG_MORE flag:
> 
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 1000, MSG_MORE, ...);
>   sendto(sd, buff, 3000, 0, ...);
> 
> Assume this packet is to be routed via a device with MTU 1500 and
> NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
> this condition is tested (among others) to decide whether to call
> ip_ufo_append_data():
> 
>   ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
> 
> At the moment, we already have skb with 1028 bytes of data which is not
> marked for GSO so that the test is false (fragheaderlen is usually 20).
> Thus we append second 1000 bytes to this skb without invoking UFO. Third
> sendto(), however, has sufficient length to trigger the UFO path so that we
> end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
> uses udp_csum() to calculate the checksum but that assumes all fragments
> have correct checksum in skb->csum which is not true for UFO fragments.
> 
> When checking against MTU, we need to add skb->len to length of new segment
> if we already have a partially filled skb and fragheaderlen only if there
> isn't one.
> 
> In the IPv6 case, skb can only be null if this is the first segment so that
> we have to use headersize (length of the first IPv6 header) rather than
> fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
> 
> Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
> Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
>   ip6 fragment between __ip6_append_data and ip6_finish_output")
> Signed-off-by: Michal Kubecek <mkube...@suse.cz>

LGTM.

Acked-by: Vlad Yasevich <vyase...@redhat.com>

-vlad

> ---
>  net/ipv4/ip_output.c  | 3 ++-
>  net/ipv6/ip6_output.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7a3fd25e8913..532b36e9ce2a 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -964,7 +964,8 @@ static int __ip_append_data(struct sock *sk,
>   csummode = CHECKSUM_PARTIAL;
>  
>   cork->length += length;
> - if ((((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
> + if ((((length + (skb ? skb->len : fragheaderlen)) > mtu) ||
> +  (skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(&rt->dst) &&
>   (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index bf8a58a1c32d..1699acb2fa2c 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1390,7 +1390,7 @@ static int __ip6_append_data(struct sock *sk,
>*/
>  
>   cork->length += length;
> - if ((((length + fragheaderlen) > mtu) ||
> + if ((((length + (skb ? skb->len : headersize)) > mtu) ||
>    (skb && skb_is_gso(skb))) &&
>   (sk->sk_protocol == IPPROTO_UDP) &&
>   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(&rt->dst) &&
> 
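
As a rough illustration of the length test discussed in this thread, here is a
minimal userspace sketch. The struct and function names are hypothetical
stand-ins, not kernel code, and the MTU of 1500 used in the usage note is an
assumption; only the shape of the condition is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pending skb: only the two properties
 * the length test looks at. */
struct pending_skb {
    unsigned int len;  /* bytes already queued, headers included */
    bool is_gso;       /* already marked for GSO/UFO */
};

/* Old test: always adds fragheaderlen, so data already sitting in a
 * partially filled skb is not counted against the MTU. */
static bool ufo_len_test_old(unsigned int length, unsigned int fragheaderlen,
                             unsigned int mtu, const struct pending_skb *skb)
{
    assert(mtu > 0);
    return (length + fragheaderlen > mtu) || (skb && skb->is_gso);
}

/* New test: if a partially filled skb exists, count its current length;
 * otherwise count the header length of the first fragment. */
static bool ufo_len_test_new(unsigned int length, unsigned int fragheaderlen,
                             unsigned int mtu, const struct pending_skb *skb)
{
    unsigned int base = skb ? skb->len : fragheaderlen;

    assert(mtu > 0);
    return (length + base > mtu) || (skb && skb->is_gso);
}
```

With a pending non-GSO skb of 1028 bytes, a second 1000-byte write, and an
assumed MTU of 1500, the old test stays false (1000 + 20) while the new one
becomes true (1000 + 1028). For IPv6 the patch passes headersize instead of
fragheaderlen for the no-skb case.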



Re: [PATCH net 4/4] macvlan: Let passthru macvlan correctly restore lower mac address

2017-06-17 Thread Vlad Yasevich
On 06/17/2017 12:35 AM, Girish Moodalbail wrote:
> Sorry, it took some time to wrap my head around this patch series since they all
> change one file and at times the same function :).
> 
> 
> On 6/16/17 6:36 AM, Vladislav Yasevich wrote:
>> Passthru macvlans directly change the mac address of the lower
>> level device.  That's OK, but after the macvlan is deleted,
>> the lower device is left with changed address and one needs to
>> reboot to bring back the origina HW addresses.
> 
> s/origina/original/
> 
>>
>> This scenario is actually quite common with passthru macvtap devices.
>>
>> This patch attempts to solve this, by storing the mac address
>> of the lower device in macvlan_port structure and keeping track of
>> it through the changes.
>>
>> After this patch, any changes to the lower device mac address
>> done through the macvlan device, will be reverted back.  Any
>> changes done directly to the lower device mac address will be kept.
>>
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  drivers/net/macvlan.c | 47 ---
>>  1 file changed, 44 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
>> index eb956ff..c551165 100644
>> --- a/drivers/net/macvlan.c
>> +++ b/drivers/net/macvlan.c
>> @@ -40,6 +40,7 @@
>>  #define MACVLAN_BC_QUEUE_LEN 1000
>>
>>  #define MACVLAN_F_PASSTHRU 1
>> +#define MACVLAN_F_ADDRCHANGE 2
>>
>>  struct macvlan_port {
>>  struct net_device *dev;
>> @@ -51,6 +52,7 @@ struct macvlan_port {
>>  int count;
>>  struct hlist_head vlan_source_hash[MACVLAN_HASH_SIZE];
>>  DECLARE_BITMAP(mc_filter, MACVLAN_MC_FILTER_SZ);
>> +unsigned char   perm_addr[ETH_ALEN];
>>  };
>>
>>  struct macvlan_source_entry {
>> @@ -78,6 +80,21 @@ static inline void macvlan_set_passthru(struct macvlan_port *port)
>>  port->flags |= MACVLAN_F_PASSTHRU;
>>  }
>>
>> +static inline bool macvlan_addr_change(const struct macvlan_port *port)
>> +{
>> +return port->flags & MACVLAN_F_ADDRCHANGE;
>> +}
>> +
>> +static inline void macvlan_set_addr_change(struct macvlan_port *port)
>> +{
>> +port->flags |= MACVLAN_F_ADDRCHANGE;
>> +}
>> +
>> +static inline void macvlan_clear_addr_change(struct macvlan_port *port)
>> +{
>> +port->flags &= ~MACVLAN_F_ADDRCHANGE;
>> +}
>> +
>>  /* Hash Ethernet address */
>>  static u32 macvlan_eth_hash(const unsigned char *addr)
>>  {
>> @@ -193,11 +210,11 @@ static void macvlan_hash_change_addr(struct macvlan_dev *vlan,
>>  static bool macvlan_addr_busy(const struct macvlan_port *port,
>>const unsigned char *addr)
>>  {
>> -/* Test to see if the specified multicast address is
>> +/* Test to see if the specified address is
>>   * currently in use by the underlying device or
>>   * another macvlan.
>>   */
>> -if (!macvlan_passthru(port) &&
>> +if (!macvlan_passthru(port) && !macvlan_addr_change(port) &&
>>  ether_addr_equal_64bits(port->dev->dev_addr, addr))
>>  return true;
>>
>> @@ -685,6 +702,7 @@ static int macvlan_sync_address(struct net_device *dev, unsigned char *addr)
>>  {
>>  struct macvlan_dev *vlan = netdev_priv(dev);
>>  struct net_device *lowerdev = vlan->lowerdev;
>> +struct macvlan_port *port = vlan->port;
>>  int err;
>>
>>  if (!(dev->flags & IFF_UP)) {
>> @@ -695,7 +713,7 @@ static int macvlan_sync_address(struct net_device *dev, unsigned char *addr)
>>  if (macvlan_addr_busy(vlan->port, addr))
>>  return -EBUSY;
>>
>> -if (!macvlan_passthru(vlan->port)) {
>> +if (!macvlan_passthru(port)) {
>>  err = dev_uc_add(lowerdev, addr);
>>  if (err)
>>  return err;
>> @@ -705,6 +723,15 @@ static int macvlan_sync_address(struct net_device *dev, unsigned char *addr)
>>
>>  macvlan_hash_change_addr(vlan, addr);
>>  }
>> +if (macvlan_passthru(port) && !macvlan_addr_change(port)) {
>> +/* Since addr_change isn't set, we are here due to lower
>> + * device change.  Save the lower-dev address so we can
>> + * restore it later.
>> + */
>> +ether_addr_copy(vlan->port->perm_addr,
>> +dev->dev_addr);
> 
> Did you mean to copy `addr' here? dev->dev_addr is that of the macvlan device,
> whilst `addr' is from the lower parent device.
> 

At this point, it doesn't really matter since dev_addr has already been set in
hash_change_addr().  However, I see your point, and the intent would be clarified
if I used lower_dev->addr.

Thanks
-vlad
> 
> Thanks,
> ~Girish
> 
> 
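
The save-and-restore behaviour described in the commit message can be modelled
in a few lines of userspace C. This is only a sketch under stated assumptions:
the struct and function names are hypothetical, and the control flow is a
simplification of the idea (remember the lower device address, ignore changes
that came through the macvlan, restore on teardown), not the patch itself.

```c
#include <assert.h>
#include <string.h>

#define ETH_ALEN 6
#define F_PASSTHRU   1u  /* mirrors MACVLAN_F_PASSTHRU */
#define F_ADDRCHANGE 2u  /* mirrors MACVLAN_F_ADDRCHANGE */

/* Hypothetical stand-ins for the lower device and the macvlan port. */
struct lower_dev {
    unsigned char dev_addr[ETH_ALEN];
};

struct port_model {
    struct lower_dev *dev;
    unsigned int flags;
    unsigned char perm_addr[ETH_ALEN]; /* address to restore later */
};

static void port_create(struct port_model *port, struct lower_dev *dev)
{
    assert(port && dev);
    port->dev = dev;
    port->flags = F_PASSTHRU;
    memcpy(port->perm_addr, dev->dev_addr, ETH_ALEN);
}

/* Address change requested through the macvlan: the flag marks the
 * change as ours, so perm_addr is left untouched. */
static void port_set_addr_via_macvlan(struct port_model *port,
                                      const unsigned char *addr)
{
    port->flags |= F_ADDRCHANGE;
    memcpy(port->dev->dev_addr, addr, ETH_ALEN);
    port->flags &= ~F_ADDRCHANGE;
}

/* Address changed directly on the lower device: since the flag is not
 * set, record it as the new address to restore. */
static void port_lower_addr_changed(struct port_model *port,
                                    const unsigned char *addr)
{
    memcpy(port->dev->dev_addr, addr, ETH_ALEN);
    if ((port->flags & F_PASSTHRU) && !(port->flags & F_ADDRCHANGE))
        memcpy(port->perm_addr, addr, ETH_ALEN);
}

/* Teardown: put back whatever address the lower device should have. */
static void port_destroy(struct port_model *port)
{
    if (port->flags & F_PASSTHRU)
        memcpy(port->dev->dev_addr, port->perm_addr, ETH_ALEN);
}
```

In this model, an address set through the macvlan is reverted by
port_destroy(), while an address set directly on the lower device survives it.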



Re: [PATCH V6 net-next iproute] ip: Add support for netdev events to monitor

2017-05-31 Thread Vlad Yasevich
On 05/30/2017 02:26 PM, Vlad Yasevich wrote:
> On 05/30/2017 01:12 PM, Stephen Hemminger wrote:
>> On Sat, 27 May 2017 10:14:36 -0400
>> Vladislav Yasevich <vyasev...@gmail.com> wrote:
>>
>>>  
>>> +static const char *netdev_events[] = {"NONE",
>>> + "REBOOT",
>>> + "FEATURE CHANGE",
>>> + "BONDING FAILOVER",
>>> + "NOTIFY PEERS",
>>> + "RESEND IGMP",
>>> + "BONDING OPTION"};
>>
>> Overall this looks fine, I will pickup the if_link.h from net-next.
>>
>> One stylistic change.
>>
>> Please add simple line break, and initialize by value:
>>
>> static const char *netdev_events[] = {
>>  [IFLA_EVENT_NONE]   = "NONE",
>> ...
>>
>> Do you want some prefix or bounding around the event output?
> 
> Don't really care about output from my side.  If you think some prefix
> would be good, I can surely add it.
> 
>> Also a little concerned that the output format change may break some program
>> could the new output be at the end of the line?
>>
> 
> I can try moving it to the end.
>

Hi Stephen

So, I looked again at this patch and the output change I am proposing would look
something like this in 'ip monitor':

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 6500 qdisc noqueue state UNKNOWN group default event FEATURE CHANGE
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


It's already at the end of the line and has an 'event' prefix similar to all the other
entries on the top line.  Does that look OK?  I don't think that would break anything.

Thanks
-vlad
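
For reference, the "initialize by value" style requested in the review could
look roughly like the sketch below. The enum is declared locally using the
names proposed elsewhere in this series, so it is an assumption about the final
uapi names, and the "UNKNOWN" fallback in the lookup helper is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the event values, using the names proposed in review;
 * the real definitions would come from if_link.h. */
enum {
    IFLA_EVENT_NONE,
    IFLA_EVENT_REBOOT,
    IFLA_EVENT_FEATURES,
    IFLA_EVENT_BONDING_FAILOVER,
    IFLA_EVENT_NOTIFY_PEERS,
    IFLA_EVENT_IGMP_RESEND,
    IFLA_EVENT_BONDING_OPTIONS,
};

/* Designated initializers tie each string to its enum value, so the
 * table stays correct even if entries are reordered. */
static const char *const netdev_events[] = {
    [IFLA_EVENT_NONE]             = "NONE",
    [IFLA_EVENT_REBOOT]           = "REBOOT",
    [IFLA_EVENT_FEATURES]         = "FEATURE CHANGE",
    [IFLA_EVENT_BONDING_FAILOVER] = "BONDING FAILOVER",
    [IFLA_EVENT_NOTIFY_PEERS]     = "NOTIFY PEERS",
    [IFLA_EVENT_IGMP_RESEND]      = "RESEND IGMP",
    [IFLA_EVENT_BONDING_OPTIONS]  = "BONDING OPTION",
};

/* Bounds-checked lookup; values outside the table map to "UNKNOWN". */
static const char *netdev_event_name(unsigned int event)
{
    size_t n = sizeof(netdev_events) / sizeof(netdev_events[0]);

    if (event >= n || !netdev_events[event])
        return "UNKNOWN";
    return netdev_events[event];
}
```

The bounds check matters because a newer kernel may report an event value that
an older userspace table does not know about.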


Re: [PATCH V6 net-next iproute] ip: Add support for netdev events to monitor

2017-05-30 Thread Vlad Yasevich
On 05/30/2017 01:12 PM, Stephen Hemminger wrote:
> On Sat, 27 May 2017 10:14:36 -0400
> Vladislav Yasevich  wrote:
> 
>>  
>> +static const char *netdev_events[] = {"NONE",
>> +  "REBOOT",
>> +  "FEATURE CHANGE",
>> +  "BONDING FAILOVER",
>> +  "NOTIFY PEERS",
>> +  "RESEND IGMP",
>> +  "BONDING OPTION"};
> 
> Overall this looks fine, I will pickup the if_link.h from net-next.
> 
> One stylistic change.
> 
> Please add simple line break, and initialize by value:
> 
> static const char *netdev_events[] = {
>   [IFLA_EVENT_NONE]   = "NONE",
> ...
> 
> Do you want some prefix or bounding around the event output?

Don't really care about output from my side.  If you think some prefix
would be good, I can surely add it.

> Also a little concerned that the output format change may break some program
> could the new output be at the end of the line?
> 

I can try moving it to the end.

Thanks
-vlad


Re: [PATCH V5 1/2] rtnl: Add support for netdev event to link messages

2017-05-26 Thread Vlad Yasevich
On 05/26/2017 03:40 PM, David Ahern wrote:
> On 5/25/17 9:31 AM, Vladislav Yasevich wrote:
>> @@ -911,4 +912,14 @@ enum {
>>  
>>  #define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
>>  
>> +enum {
>> +IFLA_EVENT_UNSPEC,
>> +IFLA_EVENT_REBOOT,
>> +IFLA_EVENT_FEAT_CHANGE,
>> +IFLA_EVENT_BONDING_FAILOVER,
>> +IFLA_EVENT_NOTIFY_PEERS,
>> +IFLA_EVENT_RESEND_IGMP,
>> +IFLA_EVENT_CHANGE_INFO_DATA,
>> +};
>> +
>>  #endif /* _UAPI_LINUX_IF_LINK_H */
> 
> I agree these are unique events that userspace might care about.
> 
> I'd prefer better names for the userspace api for a couple of those
> along with a description in the header file so userspace knows why the
> event was generated.
> 
> How about something like this:
> 
> enum {
>   IFLA_EVENT_NONE,
>   IFLA_EVENT_REBOOT,  /* internal reset / reboot */
>   IFLA_EVENT_FEATURES,/* change in offload features */
>   IFLA_EVENT_BONDING_FAILOVER,/* change in active slave */
>   IFLA_EVENT_NOTIFY_PEERS,/* re-sent grat. arp/ndisc */
>   IFLA_EVENT_IGMP_RESEND, /* re-sent IGMP JOIN */
>   IFLA_EVENT_BONDING_OPTIONS, /* change in bonding options */
> };
> 

Ok.  I'll change and re-submit.

> Also, generically the IFLA_EVENT attribute should be considered
> independent of NETDEV_ events.
> 
> For example, userspace should be notified if the speed / duplex for a
> device changes, so we could have another one of these -- e.g.,
> IFLA_EVENT_SPEED -- that does not correlate to NETDEV_SPEED since
> nothing internal to the network stack cares about speed changes, or
> perhaps more generically it is IFLA_EVENT_LINK_SETTING.
> 

Ok.  We could do a translation between netdev event and IFLA_EVENT attribute value
earlier (say in rtnetlink_event) and pass that along.  This would allow calls
from other places, assuming proper IFLA_EVENT attribute value and translation
is defined.

Would that address your concerns?

> The rest of the patch looks ok to me.
> 

Thanks
-vlad
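
A minimal sketch of the translation idea above, assuming hypothetical stand-in
enums (ND_* for internal notifier events, EV_* for the userspace-visible
attribute values) and a toy message struct. None of these names are kernel
API; the point is only that the mapping happens once, early, and the message
builder takes the already translated value.

```c
#include <assert.h>

/* Hypothetical stand-ins for internal notifier events. */
enum nd_event {
    ND_UP,
    ND_REBOOT,
    ND_FEAT_CHANGE,
    ND_BONDING_FAILOVER,
    ND_NOTIFY_PEERS,
    ND_RESEND_IGMP,
    ND_CHANGEINFODATA,
};

/* Hypothetical stand-ins for the userspace-visible event values. */
enum link_event {
    EV_NONE,
    EV_REBOOT,
    EV_FEATURES,
    EV_BONDING_FAILOVER,
    EV_NOTIFY_PEERS,
    EV_IGMP_RESEND,
    EV_BONDING_OPTIONS,
};

struct link_msg {
    int ifindex;
    enum link_event event;
};

/* Translate once, at the notifier entry point. Events with no
 * userspace meaning map to EV_NONE. */
static enum link_event nd_to_link_event(enum nd_event e)
{
    switch (e) {
    case ND_REBOOT:           return EV_REBOOT;
    case ND_FEAT_CHANGE:      return EV_FEATURES;
    case ND_BONDING_FAILOVER: return EV_BONDING_FAILOVER;
    case ND_NOTIFY_PEERS:     return EV_NOTIFY_PEERS;
    case ND_RESEND_IGMP:      return EV_IGMP_RESEND;
    case ND_CHANGEINFODATA:   return EV_BONDING_OPTIONS;
    default:                  return EV_NONE;
    }
}

/* The builder only knows about link_event, so callers that have no
 * notifier event can pass a value directly. */
static struct link_msg build_link_msg(int ifindex, enum link_event event)
{
    struct link_msg msg = { ifindex, event };
    return msg;
}

static struct link_msg on_netdev_event(int ifindex, enum nd_event e)
{
    return build_link_msg(ifindex, nd_to_link_event(e));
}
```

Keeping the builder independent of the notifier enum is what would let a
future event such as a link-setting change be reported without adding a
matching internal notifier.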


Re: [PATCH net v2] sctp: fix ICMP processing if skb is non-linear

2017-05-25 Thread Vlad Yasevich
On 05/25/2017 01:14 PM, Davide Caratti wrote:
> sometimes ICMP replies to INIT chunks are ignored by the client, even if
> the encapsulated SCTP headers match an open socket. This happens when the
> ICMP packet is carried by a paged skb: use skb_header_pointer() to read
> packet contents beyond the SCTP header, so that chunk header and initiate
> tag are validated correctly.
> 
> v2:
> - don't use skb_header_pointer() to read the transport header, since
>   icmp_socket_deliver() already puts these 8 bytes in the linear area.
> - change commit message to make specific reference to INIT chunks.
> 
> Signed-off-by: Davide Caratti <dcara...@redhat.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad


> ---
>  net/sctp/input.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 0e06a27..ba9ad32 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -473,15 +473,14 @@ struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *skb,
>struct sctp_association **app,
>struct sctp_transport **tpp)
>  {
> + struct sctp_init_chunk *chunkhdr, _chunkhdr;
>   union sctp_addr saddr;
>   union sctp_addr daddr;
>   struct sctp_af *af;
>   struct sock *sk = NULL;
>   struct sctp_association *asoc;
>   struct sctp_transport *transport = NULL;
> - struct sctp_init_chunk *chunkhdr;
>   __u32 vtag = ntohl(sctphdr->vtag);
> - int len = skb->len - ((void *)sctphdr - (void *)skb->data);
>  
>   *app = NULL; *tpp = NULL;
>  
> @@ -516,13 +515,16 @@ struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *skb,
>* discard the packet.
>*/
>   if (vtag == 0) {
> - chunkhdr = (void *)sctphdr + sizeof(struct sctphdr);
> - if (len < sizeof(struct sctphdr) + sizeof(sctp_chunkhdr_t)
> -   + sizeof(__be32) ||
> + /* chunk header + first 4 octects of init header */
> + chunkhdr = skb_header_pointer(skb, skb_transport_offset(skb) +
> +   sizeof(struct sctphdr),
> +   sizeof(struct sctp_chunkhdr) +
> +   sizeof(__be32), &_chunkhdr);
> + if (!chunkhdr ||
>   chunkhdr->chunk_hdr.type != SCTP_CID_INIT ||
> - ntohl(chunkhdr->init_hdr.init_tag) != asoc->c.my_vtag) {
> + ntohl(chunkhdr->init_hdr.init_tag) != asoc->c.my_vtag)
>   goto out;
> - }
> +
>   } else if (vtag != asoc->c.peer_vtag) {
>   goto out;
>   }
> 
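
To show why a plain pointer cast fails on a paged skb while
skb_header_pointer() works, here is a simplified userspace model. The buffer
struct and the demo_* names are hypothetical; the helper only imitates the
documented contract (return a direct pointer when the range is in the linear
area, otherwise copy into the caller's scratch buffer, NULL if out of range).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-part buffer: a linear head plus one paged fragment. */
struct demo_buf {
    const unsigned char *linear;
    size_t linear_len;
    const unsigned char *paged;
    size_t paged_len;
};

/* Return a pointer to len bytes at offset. If the range lies entirely
 * in the linear part, point straight at it; if it touches the paged
 * part, gather the bytes into scratch and return scratch. */
static const void *demo_header_pointer(const struct demo_buf *b,
                                       size_t offset, size_t len,
                                       void *scratch)
{
    size_t total = b->linear_len + b->paged_len;
    unsigned char *out = (unsigned char *)scratch;
    size_t done = 0;

    if (offset > total || len > total - offset)
        return NULL;                 /* range not in the packet */
    if (offset + len <= b->linear_len)
        return b->linear + offset;   /* fast path, no copy */

    if (offset < b->linear_len) {    /* range straddles the boundary */
        done = b->linear_len - offset;
        memcpy(out, b->linear + offset, done);
        offset = b->linear_len;
    }
    memcpy(out + done, b->paged + (offset - b->linear_len), len - done);
    return scratch;
}
```

A caller reading the chunk header plus the first four bytes of the INIT header
would pass the offset just past the SCTP common header and a scratch buffer of
that size, and must check the result for NULL before dereferencing it.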



Re: [PATCH net 2/2] sctp: set new_asoc temp when processing dupcookie

2017-05-23 Thread Vlad Yasevich
On 05/23/2017 01:28 AM, Xin Long wrote:
> After sctp changed to use transport hashtable, a transport would be
> added into global hashtable when adding the peer to an asoc, then
> the asoc can be found by searching for the transport in the hashtable.
> 
> The problem is when processing dupcookie in sctp_sf_do_5_2_4_dupcook,
> a new asoc would be created. A peer with the same addr and port as
> the one in the old asoc might be added into the new asoc, but fail
> to be added into the hashtable, as they also belong to the same sk.
> 
> As a result, sctp's dupcookie processing cannot really work.
> 
> Since the new asoc will be freed after copying its information to
> the old asoc, it's more like a temp asoc. So this patch fixes
> it by setting it as a temp asoc, to avoid adding any of its transports
> into the hashtable and also to avoid allocating an assoc_id.
> 
> An extra thing it has to do is to also alloc stream info for any
> temp asoc, as sctp dupcookie process needs it to update old asoc.
> But I don't think it would hurt anything, as a temp asoc would
> always be freed after finishing processing cookie echo packet.
> 
> Reported-by: Jianwen Ji <j...@redhat.com>
> Signed-off-by: Xin Long <lucien@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/sm_make_chunk.c | 13 -
>  net/sctp/sm_statefuns.c  |  3 +++
>  2 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 8a08f13..92e332e 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2454,16 +2454,11 @@ int sctp_process_init(struct sctp_association *asoc, struct sctp_chunk *chunk,
>* stream sequence number shall be set to 0.
>*/
>  
> - /* Allocate storage for the negotiated streams if it is not a temporary
> -  * association.
> -  */
> - if (!asoc->temp) {
> - if (sctp_stream_init(asoc, gfp))
> - goto clean_up;
> + if (sctp_stream_init(asoc, gfp))
> + goto clean_up;
>  
> - if (sctp_assoc_set_id(asoc, gfp))
> - goto clean_up;
> - }
> + if (!asoc->temp && sctp_assoc_set_id(asoc, gfp))
> + goto clean_up;
>  
>   /* ADDIP Section 4.1 ASCONF Chunk Procedures
>*
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index 4f5e6cf..f863b55 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -2088,6 +2088,9 @@ sctp_disposition_t sctp_sf_do_5_2_4_dupcook(struct net *net,
>   }
>   }
>  
> + /* Set temp so that it won't be added into hashtable */
> + new_asoc->temp = 1;
> +
>   /* Compare the tie_tag in cookie with the verification tag of
>* current association.
>*/
> 



Re: [PATCH net 1/2] sctp: fix stream update when processing dupcookie

2017-05-23 Thread Vlad Yasevich
On 05/23/2017 01:28 AM, Xin Long wrote:
> Since commit 3dbcc105d556 ("sctp: alloc stream info when initializing
> asoc"), stream and stream.out info are always alloced when creating
> an asoc.
> 
> So it's not correct to check !asoc->stream before updating stream
> info when processing dupcookie, but would be better to check asoc
> state instead.
> 
> Fixes: 3dbcc105d556 ("sctp: alloc stream info when initializing asoc")
> Signed-off-by: Xin Long <lucien@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/associola.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index a9708da..9523828 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1176,7 +1176,9 @@ void sctp_assoc_update(struct sctp_association *asoc,
>  
>   asoc->ctsn_ack_point = asoc->next_tsn - 1;
>   asoc->adv_peer_ack_point = asoc->ctsn_ack_point;
> - if (!asoc->stream) {
> +
> + if (sctp_state(asoc, COOKIE_WAIT)) {
> + sctp_stream_free(asoc->stream);
>   asoc->stream = new->stream;
>   new->stream = NULL;
>   }
> 



Re: [PATCH net 1/3] vlan: Fix tcp checksums offloads for Q-in-Q vlan.

2017-05-23 Thread Vlad Yasevich
On 05/22/2017 07:59 PM, David Miller wrote:
> From: Vladislav Yasevich 
> Date: Thu, 18 May 2017 09:31:03 -0400
> 
>> It appears that since commit 8cb65d000, Q-in-Q vlans have been
>> broken.  The series that commit is part of enabled TSO and checksum
>> offloading on Q-in-Q vlans.  However, most HW we support can't handle
>> it.  To work around the issue, the above commit added a function that
>> turns off offloads on Q-in-Q devices, but it left the checksum offload.
>> That will cause issues with most older devices that support very basic
>> checksum offload capabilities as well as some newer devices (we've
>> reproduced the problem with both be2net and bnx).
>>
>> To solve this for everyone, turn off checksum offloading feature
>> by default when sending Q-in-Q traffic.  Devices that are proven to
>> work can provide a corrected ndo_features_check implementation.
>>
>> Fixes: 8cb65d000 ("net: Move check for multiple vlans to drivers")
>> CC: Toshiaki Makita 
>> Signed-off-by: Vladislav Yasevich 
> 
> This is a tough one.  I can certainly sympathize with your frustration
> trying to track this down.
> 
> Clearing NETIF_F_HW_CSUM completely is the most conservative change.
> 
> However, for all the (perhaps many) cards upon which the checksumming
> does work properly in Q-in-Q situations, this change could be
> introducing non-trivial performance regressions.
> 
> So I think Toshiaki's suggestion to drop IP_CSUM and IPV6_CSUM is,
> on balance, the best way forward.
> 

Thanks.  I'll update and re-submit.

-vlad

> Thanks.
> 



Re: [PATCH 3/5] sctp: Fix a typo in a comment line in sctp_init()

2017-05-22 Thread Vlad Yasevich
On 05/22/2017 12:39 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:43:44 +0200
> 
> Add a missing character in this description.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 5e7c8a344770..64756c42cec9 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1454,7 +1454,7 @@ static __init int sctp_init(void)
>   }
>  
>   /* Allocate and initialize the SCTP port hash table.
> -  * Note that order is initalized to start at the max sized
> +  * Note that order is initialized to start at the max sized
>* table we want to support.  If we can't get that many pages
>    * reduce the order and try again
>*/
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad


Re: [PATCH 5/5] sctp: Adjust one function call together with a variable assignment

2017-05-22 Thread Vlad Yasevich
On 05/22/2017 12:41 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 18:15:12 +0200
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> ERROR: do not use assignment in if condition
> 
> Thus fix the affected source code place.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 057479b7bd72..be2fe3ebae78 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -141,7 +141,8 @@ static void sctp_v4_copy_addrlist(struct list_head *addrlist,
>   struct sctp_sockaddr_entry *addr;
>  
>   rcu_read_lock();
> - if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
> + in_dev = __in_dev_get_rcu(dev);
> +     if (!in_dev) {
>   rcu_read_unlock();
>   return;
>   }
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad


Re: [PATCH 4/5] sctp: Improve a size determination in sctp_inetaddr_event()

2017-05-22 Thread Vlad Yasevich
On 05/22/2017 12:40 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 18:08:24 +0200
> 
> Replace the specification of a data structure by a pointer dereference
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 64756c42cec9..057479b7bd72 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -784,7 +784,7 @@ static int sctp_inetaddr_event(struct notifier_block *this, unsigned long ev,
>  
>   switch (ev) {
>   case NETDEV_UP:
> - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> + addr = kmalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v4.sin_family = AF_INET;
>   addr->a.v4.sin_port = 0;
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad


Re: [PATCH 2/5] sctp: Delete an error message for a failed memory allocation in sctp_init()

2017-05-22 Thread Vlad Yasevich
On 05/22/2017 12:38 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:28:14 +0200
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.
> 
> Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  net/sctp/protocol.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 2b1a6215bd2f..5e7c8a344770 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1447,5 +1447,4 @@ static __init int sctp_init(void)
>   if (!sctp_ep_hashtable) {
> - pr_err("Failed endpoint_hash alloc\n");
>       status = -ENOMEM;
>   goto err_ehash_alloc;
>   }
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

At the time this was written, it was patterned after TCP.  Since then TCP has changed
significantly.  We can surely clean up the pr_err() here and possibly update the
code as well later.

-vlad


Re: [PATCH 1/5] sctp: Use kmalloc_array() in sctp_init()

2017-05-22 Thread Vlad Yasevich
On 05/22/2017 12:37 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 22 May 2017 17:20:11 +0200
> 
> * A multiplication for the size determination of a memory allocation
>   indicated that an array data structure should be processed.
>   Thus use the corresponding function "kmalloc_array".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of a data structure by a pointer dereference
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/protocol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 989a900383b5..2b1a6215bd2f 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1442,6 +1442,6 @@ static __init int sctp_init(void)
>  
>   /* Allocate and initialize the endpoint hash table.  */
>   sctp_ep_hashsize = 64;
> - sctp_ep_hashtable =
> - kmalloc(64 * sizeof(struct sctp_hashbucket), GFP_KERNEL);
> + sctp_ep_hashtable = kmalloc_array(64, sizeof(*sctp_ep_hashtable),
> +   GFP_KERNEL);
>   if (!sctp_ep_hashtable) {
> 



Re: [PATCH net 1/3] vlan: Fix tcp checksums offloads for Q-in-Q vlan.

2017-05-19 Thread Vlad Yasevich
On 05/19/2017 04:16 AM, Toshiaki Makita wrote:
> On 2017/05/19 16:09, Vlad Yasevich wrote:
>> On 05/18/2017 10:13 PM, Toshiaki Makita wrote:
>>> On 2017/05/18 22:31, Vladislav Yasevich wrote:
>>>> It appears that since commit 8cb65d000, Q-in-Q vlans have been
>>>> broken.  The series that commit is part of enabled TSO and checksum
>>>> offloading on Q-in-Q vlans.  However, most HW we support can't handle
>>>> it.  To work around the issue, the above commit added a function that
>>>> turns off offloads on Q-in-Q devices, but it left the checksum offload.
>>>> That will cause issues with most older devices that support very basic
>>>> checksum offload capabilities as well as some newer devices (we've
>>>> reproduced the problem with both be2net and bnx).
>>>>
>>>> To solve this for everyone, turn off checksum offloading feature
>>>> by default when sending Q-in-Q traffic.  Devices that are proven to
>>>> work can provide a corrected ndo_features_check implementation.
>>>>
>>>> Fixes: 8cb65d000 ("net: Move check for multiple vlans to drivers")
>>>> CC: Toshiaki Makita <makita.toshi...@lab.ntt.co.jp>
>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>> ---
>>>>  include/linux/if_vlan.h | 1 -
>>>>  1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
>>>> index 8d5fcd6..ae537f0 100644
>>>> --- a/include/linux/if_vlan.h
>>>> +++ b/include/linux/if_vlan.h
>>>> @@ -619,7 +619,6 @@ static inline netdev_features_t vlan_features_check(const struct sk_buff *skb,
>>>> NETIF_F_SG |
>>>> NETIF_F_HIGHDMA |
>>>> NETIF_F_FRAGLIST |
>>>> -   NETIF_F_HW_CSUM |
>>>> NETIF_F_HW_VLAN_CTAG_TX |
>>>> NETIF_F_HW_VLAN_STAG_TX);
>>>>  
>>>
>>> I guess HW_CSUM theoretically can handle Q-in-Q packets and the problem
>>> is IP_CSUM and IPV6_CSUM.
>>> So wouldn't it be better to leave HW_CSUM and drop IP_CSUM/IPV6_CSUM,
>>> i.e. change intersection into bitwise AND?
>>>
>>
>> It wasn't really a problem before accelerations got enabled on q-in-q
>> vlans.
> 
> Right for stacked vlan device.
> But I think the check was there for packets from guests forwarded by
> bridge to vlan device so it was a problem before 8cb65d000.

Not really, since stacked vlans in guests wouldn't have accelerations on.
Haven't really tried a new guest on old hosts.  It might be an issue there...

> 
>>> The intersection was introduced in db115037bb57 ("net: fix checksum
>>> features handling in netif_skb_features()"), but I guess for this
>>> particular check the intersection was not needed.
>>>
>>
>> So, to put it another way, leave the intersection with HW_CSUM in the mask,
>> and then do:
>>
>>   return features & ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>>
>> This might work, but it assumes that everyone who announces HW_CSUM can
>> do q-in-q vlans.  It's been a bit of a pain tracking this down and I'd rather
>> fix it for everyone and let individual driver authors verify that Q-in-Q works
>> correctly with HW checksum.  However, I am willing to do the above if
>> that's what people want.
> 
> At least HW_CSUM in the check was introduced intentionally.
> https://www.spinics.net/lists/netdev/msg152016.html
> 
> And I think HW_CSUM should work with any packets.
> You know, include/linux/skbuff.h says
>>  *   NETIF_F_HW_CSUM - The driver (or its device) is able to compute one
>>  * IP (one's complement) checksum for any combination
>>  * of protocols or protocol layering.
> 
> We should be able to safely enable it.
> 
> ...But you are so worried about layer2 protocol handling of HW_CSUM
> devices, I'm ok with disabling it for now.
> 

It's a concern after running across this issue.  Granted, the few devices
we've seen this bug on use IP/IPV6 checksum features.  I am hoping someone
else might weigh in here.

-vlad


Re: [PATCH net 1/3] vlan: Fix tcp checksums offloads for Q-in-Q vlan.

2017-05-19 Thread Vlad Yasevich
On 05/18/2017 10:13 PM, Toshiaki Makita wrote:
> On 2017/05/18 22:31, Vladislav Yasevich wrote:
>> It appears that since commit 8cb65d000, Q-in-Q vlans have been
>> broken.  The series that commit is part of enabled TSO and checksum
>> offloading on Q-in-Q vlans.  However, most HW we support can't handle
>> it.  To work around the issue, the above commit added a function that
>> turns off offloads on Q-in-Q devices, but it left the checksum offload.
>> That will cause issues with most older devices that support very basic
>> checksum offload capabilities as well as some newer devices (we've
>> reproduced the problem with both be2net and bnx).
>>
>> To solve this for everyone, turn off checksum offloading feature
>> by default when sending Q-in-Q traffic.  Devices that are proven to
>> work can provide a corrected ndo_features_check implementation.
>>
>> Fixes: 8cb65d000 ("net: Move check for multiple vlans to drivers")
>> CC: Toshiaki Makita 
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  include/linux/if_vlan.h | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
>> index 8d5fcd6..ae537f0 100644
>> --- a/include/linux/if_vlan.h
>> +++ b/include/linux/if_vlan.h
>> @@ -619,7 +619,6 @@ static inline netdev_features_t vlan_features_check(const struct sk_buff *skb,
>>   NETIF_F_SG |
>>   NETIF_F_HIGHDMA |
>>   NETIF_F_FRAGLIST |
>> - NETIF_F_HW_CSUM |
>>   NETIF_F_HW_VLAN_CTAG_TX |
>>   NETIF_F_HW_VLAN_STAG_TX);
>>  
> 
> I guess HW_CSUM theoretically can handle Q-in-Q packets and the problem
> is IP_CSUM and IPV6_CSUM.
> So wouldn't it be better to leave HW_CSUM and drop IP_CSUM/IPV6_CSUM,
> i.e. change intersection into bitwise AND?
> 

It wasn't really a problem before accelerations got enabled on q-in-q
vlans.

> The intersection was introduced in db115037bb57 ("net: fix checksum
> features handling in netif_skb_features()"), but I guess for this
> particular check the intersection was not needed.
> 

So, to put it another way, leave the intersection with HW_CSUM in the mask,
and then do:

  return features & ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);

This might work, but it assumes that everyone who announces HW_CSUM can
do q-in-q vlans.  It's been a bit of a pain tracking this down and I'd rather
fix it for everyone and let individual driver authors verify that Q-in-Q works
correctly with HW checksum.  However, I am willing to do the above if
that's what people want.

-vlad
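
A small userspace model of the alternative discussed above (keep HW_CSUM in the
Q-in-Q mask, then strip the protocol-specific checksum bits). The bit positions
and demo_* names are hypothetical, and demo_intersect_features() is modelled on
my understanding of netdev_intersect_features(), so check it against the kernel
source before relying on it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t features_t;

/* Hypothetical bit assignments; the real NETIF_F_* values differ. */
#define F_SG        (1ULL << 0)
#define F_IP_CSUM   (1ULL << 1)
#define F_HW_CSUM   (1ULL << 2)
#define F_IPV6_CSUM (1ULL << 3)
#define F_HIGHDMA   (1ULL << 4)
#define F_FRAGLIST  (1ULL << 5)
#define F_CTAG_TX   (1ULL << 6)
#define F_STAG_TX   (1ULL << 7)
#define F_TSO       (1ULL << 8)

/* Features allowed on multi-tagged packets in this model. */
#define QINQ_MASK (F_SG | F_HIGHDMA | F_FRAGLIST | F_HW_CSUM | \
                   F_CTAG_TX | F_STAG_TX)

/* Intersection that treats HW_CSUM as a superset of IP/IPv6 checksum:
 * if only one side has HW_CSUM, that side is taken to also cover the
 * protocol-specific bits before the AND. */
static features_t demo_intersect_features(features_t f1, features_t f2)
{
    if ((f1 ^ f2) & F_HW_CSUM) {
        if (f1 & F_HW_CSUM)
            f1 |= F_IP_CSUM | F_IPV6_CSUM;
        else
            f2 |= F_IP_CSUM | F_IPV6_CSUM;
    }
    return f1 & f2;
}

/* Keep generic checksum offload for Q-in-Q, but drop the
 * protocol-specific variants that the thread reports as unreliable. */
static features_t demo_qinq_features(features_t features)
{
    features = demo_intersect_features(features, QINQ_MASK);
    return features & ~(F_IP_CSUM | F_IPV6_CSUM);
}
```

In this model a device advertising only IP/IPv6 checksum offload ends up with
no checksum offload on Q-in-Q traffic, while a device advertising HW_CSUM keeps
it. That is exactly the assumption Vlad is questioning above.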


Re: [PATCH] vlan: Keep NETIF_F_HW_CSUM similar to other software devices

2017-05-05 Thread Vlad Yasevich
On 05/05/2017 04:01 PM, Alexander Duyck wrote:
> On Fri, May 5, 2017 at 12:20 PM, Vladislav Yasevich wrote:
>> Vlan devices, like all other software devices, enable
>> NETIF_F_HW_CSUM feature.  However, unlike all the other
>> software devices, vlans will switch to using IP|IPV6_CSUM
>> features, if the underlying device uses them.  In these situations,
>> checksum offload features on the vlan device can't be controlled
>> via ethtool.
>>
>> This patch makes vlans keep HW_CSUM feature if the underlying
>> device supports checksum offloading.  This makes vlan devices
>> behave like other software devices, and restores control to the
>> user.
>>
>> A side-effect is that some offload settings (typically UFO)
>> may be enabled on the vlan device while being disabled on the HW.
>> However, the GSO code will correctly process the packets. This
>> actually results in slightly better raw throughput.
>>
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  net/8021q/vlan_dev.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
>> index 9ee5787..ffc8167 100644
>> --- a/net/8021q/vlan_dev.c
>> +++ b/net/8021q/vlan_dev.c
>> @@ -626,10 +626,16 @@ static netdev_features_t vlan_dev_fix_features(struct net_device *dev,
>>  {
>> struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
>> netdev_features_t old_features = features;
>> +   netdev_features_t real_dev_features = real_dev->features;
>>
>> -   features = netdev_intersect_features(features, real_dev->vlan_features);
>> +   features = netdev_intersect_features(features,
>> +(real_dev->vlan_features |
>> + NETIF_F_HW_CSUM));
> 
> You might want to change the ordering on all this.
> 
> You could start out with a value based on the intersection of
> real_dev->features and real_dev->vlan_features. Then you don't need to
> mess around with this extra piece where you are having OR in HW_CSUM.

You know,  I did that and that was the patch I meant to send... I had
3 different versions of this thing trying to find the best way...

Let me repost, since some of the rest of the changes go away.

-vlad

> That way you don't risk losing track of the state of the hardware
> checksum offload in terms of vlan_features as it should completely
> clear all of the checksums if none of them are supported in hardware.
> 
>> features |= NETIF_F_RXCSUM;
> 
> This line would probably need to be changed to OR NETIF_F_RXCSUM with
> the real_dev->vlan_features when we perform the first intersect test.
> That way we are guaranteed to report RXCSUM if the underlying device
> supports it. It might just be worth while to force this into the
> vlan_features for all devices in register_netdevice() then we wouldn't
> need this line at all and it probably makes sense since it would allow
> us to save a few extra cycles/instructions by combining it with the
> line that was adding high dma support.
> 
>> -   features = netdev_intersect_features(features, real_dev->features);
>> +   if (real_dev_features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))
>> +   real_dev_features |= NETIF_F_HW_CSUM;
>> +
>> +   features = netdev_intersect_features(features, real_dev_features);
> 
> This part all looks good.
> 
> My only advice like I said would be to record the vlan_features
> intersection first based on the real_dev. That way you don't risk
> losing state data from real device if for some reason it doesn't
> support checksum offload with VLAN tagged frames.
> 
>> features |= old_features & (NETIF_F_SOFT_FEATURES | NETIF_F_GSO_SOFTWARE);
>> features |= NETIF_F_LLTX;
>> --
>> 2.7.4
>>



Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages

2017-05-01 Thread Vlad Yasevich
On 04/28/2017 12:38 PM, Jiri Pirko wrote:
> Thu, Apr 27, 2017 at 09:59:38PM CEST, d...@cumulusnetworks.com wrote:
>> On 4/27/17 1:43 PM, Vlad Yasevich wrote:
>>>> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
>>>> about the name suggests it is a bonding notification. This one was added
>>>> specifically to notify userspace (d4261e5650004), yet seems to happen
>>>> only during a changelink and that already generates a RTM_NEWLINK
>>>> message via do_setlink. Since the rtnetlink_event message does not
>>>> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
>>>> really serve besides duplicating netlink messages to userspace.
>>>>
>>>
>>> I am not sure about this one, but if you have an app trying to monitor
>>> for this event, it can't really since there is no info in the netlink message.
>>
>> I cc'ed Jiri on this thread hoping he would explain the intent.
>>
>> I propose it gets removed.
> 
> Hmm, I don't really recall. But looking at it now, I agree it is
> redundant.
> 

So, it looks like the notifier might be there to account for the ioctl/sysfs
interfaces.

Additionally, the message is not generated from do_setlink() if the device is
down, so the notifier accounts for that as well.

I guess the basic question is whether the data carried in
NETDEV_CHANGEINFODATA is useful to user space.  If yes (I can possibly see
some management apps that might be interested in it), then we shouldn't just
remove it.  A possible solution to eliminate duplicates would be to move the
notifier call into non-rtnl code paths...

Also, maybe netdev_state_change() should call rtmsg_ifinfo() unconditionally?

-vlad


Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages

2017-04-27 Thread Vlad Yasevich
On 04/24/2017 11:14 AM, Roopa Prabhu wrote:
> On Sun, Apr 23, 2017 at 6:07 PM, David Ahern  wrote:
>>
>> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>>   return err;
>>>  }
>>>
>>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>>> +{
>>> + u32 rtnl_event;
>>> +
>>> + switch (event) {
>>> + case NETDEV_REBOOT:
>>> + rtnl_event = IFLA_EVENT_REBOOT;
>>> + break;
>>> + case NETDEV_FEAT_CHANGE:
>>> + rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>>> + break;
>>> + case NETDEV_BONDING_FAILOVER:
>>> + rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>>> + break;
>>> + case NETDEV_NOTIFY_PEERS:
>>> + rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>>> + break;
>>> + case NETDEV_RESEND_IGMP:
>>> + rtnl_event = IFLA_EVENT_RESEND_IGMP;
>>> + break;
>>> + case NETDEV_CHANGEINFODATA:
>>> + rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>>> + break;
>>> + default:
>>> + return 0;
>>> + }
>>> +
>>> + return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>>> +}
>>> +
>>
>> I still have doubts about encoding kernel events into a uapi.
> 
> agree. I don't see why user-space will need NETDEV_CHANGEINFODATA and
> others david listed.
> 

Well, I am not sure about CHANGEINFODATA either, but I can see use
cases for the others.

> My other concerns are, once we have this exposed to user-space and
> user-space starts relying on it, it will need accurate information and
> will expect to have this event information all the time.
> IIUC, we cannot cover multiple events in a single notification and not
> all link notifications will contain an IFLA_EVENT attribute.

Uhm...  If the rtnetlink message was a result of an event, it will have
an IFLA_EVENT.  If a message is something else, then it will not have
an event.  That's the point.  Not all netlink attributes are in every
netlink message.

> In other
> words, we will be telling user-space to not expect that the kernel
> will send IFLA_EVENT every time.
> 

No, we are telling the user that if it is interested in a specific event
(let's say NOTIFY_PEERS or RESEND_IGMP), then it now can monitor netlink
traffic for those events.
As things stand right now, that's not possible.

I've done this specifically for all events for which we currently generate
a netlink message.

The only concern I have is that if in the future we remove a certain netdev
event, it may impact applications.  But we may be doing it right now as well,
only silently, and the apps may have to find some ways to work around it.

There is also a potential to improve libnl caching and not invalidate the
cached data for certain events.

-vlad
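
The point made above is that a listener can look for IFLA_EVENT in each link
message and simply ignore messages that do not carry it.  The sketch below
illustrates that with a hand-rolled attribute walker.  It assumes the usual
netlink attribute layout (a 16-bit length, a 16-bit type, payload padded to
4 bytes).  EXAMPLE_ATTR_EVENT and find_u32_attr() are hypothetical names, and
the type value is a placeholder rather than the real IFLA_EVENT number.  A
real application would more likely use the RTA_* macros or a library such as
libnl or libmnl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN        4u
#define ATTR_ALIGN(len)    (((len) + 3u) & ~3u)
#define EXAMPLE_ATTR_EVENT 100 /* placeholder, NOT the real IFLA_EVENT value */

/*
 * Walk a buffer of attributes (host byte order, as netlink uses).
 * Returns 1 and stores the value if a u32 attribute of 'type' is found,
 * 0 if the message simply does not carry one.
 */
static inline int find_u32_attr(const unsigned char *buf, size_t len,
				uint16_t type, uint32_t *val)
{
	size_t off = 0;

	while (len - off >= ATTR_HDRLEN) {
		uint16_t alen, atype;
		size_t step;

		memcpy(&alen, buf + off, sizeof(alen));
		memcpy(&atype, buf + off + 2, sizeof(atype));
		if (alen < ATTR_HDRLEN || alen > len - off)
			break;			/* malformed attribute */
		if (atype == type && alen == ATTR_HDRLEN + sizeof(*val)) {
			memcpy(val, buf + off + ATTR_HDRLEN, sizeof(*val));
			return 1;
		}
		step = ATTR_ALIGN(alen);
		if (step > len - off)
			break;			/* padding runs past the end */
		off += step;
	}
	return 0;
}

/* Build two attributes by hand and look for the event attribute. */
static inline void example(void)
{
	unsigned char buf[16];
	uint16_t len = 8, type = 7;		/* an unrelated attribute */
	uint32_t val = 42, out = 0;

	memcpy(buf, &len, 2);
	memcpy(buf + 2, &type, 2);
	memcpy(buf + 4, &val, 4);
	type = EXAMPLE_ATTR_EVENT;
	val = 3;
	memcpy(buf + 8, &len, 2);
	memcpy(buf + 10, &type, 2);
	memcpy(buf + 12, &val, 4);

	assert(find_u32_attr(buf, sizeof(buf), EXAMPLE_ATTR_EVENT, &out) && out == 3);
	assert(!find_u32_attr(buf, 8, EXAMPLE_ATTR_EVENT, &out)); /* attribute absent */
}
```

The second assertion is the "not every message has the attribute" case: the
walker reports absence rather than an error.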
> 
> 
>>
>> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
>> about the name suggests it is a bonding notification. This one was added
>> specifically to notify userspace (d4261e5650004), yet seems to happen
>> only during a changelink and that already generates a RTM_NEWLINK
>> message via do_setlink. Since the rtnetlink_event message does not
>> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
>> really serve besides duplicating netlink messages to userspace.
>>
>> The REBOOT, IGMP, FEAT_CHANGE and BONDING_FAILOVER seem to be unique
>> messages (code analysis only) which I get for notifying userspace.
>>
>> NETDEV_NOTIFY_PEERS is not so clear in how often it duplicates other
>> messages.



Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages

2017-04-27 Thread Vlad Yasevich
On 04/23/2017 09:07 PM, David Ahern wrote:
> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>  return err;
>>  }
>>  
>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>> +{
>> +u32 rtnl_event;
>> +
>> +switch (event) {
>> +case NETDEV_REBOOT:
>> +rtnl_event = IFLA_EVENT_REBOOT;
>> +break;
>> +case NETDEV_FEAT_CHANGE:
>> +rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>> +break;
>> +case NETDEV_BONDING_FAILOVER:
>> +rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>> +break;
>> +case NETDEV_NOTIFY_PEERS:
>> +rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>> +break;
>> +case NETDEV_RESEND_IGMP:
>> +rtnl_event = IFLA_EVENT_RESEND_IGMP;
>> +break;
>> +case NETDEV_CHANGEINFODATA:
>> +rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>> +break;
>> +default:
>> +return 0;
>> +}
>> +
>> +return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>> +}
>> +
> 
> I still have doubts about encoding kernel events into a uapi.
> 
> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
> about the name suggests it is a bonding notification. This one was added
> specifically to notify userspace (d4261e5650004), yet seems to happen
> only during a changelink and that already generates a RTM_NEWLINK
> message via do_setlink. Since the rtnetlink_event message does not
> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
> really serve besides duplicating netlink messages to userspace.
> 

I am not sure about this one, but if you have an app trying to monitor
for this event, it can't really since there is no info in the netlink message.

> The REBOOT, IGMP, FEAT_CHANGE and BONDING_FAILOVER seem to be unique
> messages (code analysis only) which I get for notifying userspace.
> 
> NETDEV_NOTIFY_PEERS is not so clear in how often it duplicates other
> messages.
> 

This one sometimes happens in addition to bonding failover, but not always
(it depends on bonding mode).
For me, having access to this particular event is important as it will
be used to trigger guest announcements.

-vlad



Re: [PATCH net-next 1/2] rtnetlink: Disable notification for NETDEV_NAMECHANGE event

2017-04-27 Thread Vlad Yasevich
On 04/21/2017 02:08 PM, David Ahern wrote:
> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>> The data signaling name change is already provided at
>> the end of do_setlink().  This event handler just generates
>> a duplicate announcement.  Disable it.
>>
>> CC: David Ahern 
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  net/core/rtnetlink.c | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 0ee5479..e8e6816 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -4123,7 +4123,6 @@ static int rtnetlink_event(struct notifier_block 
>> *this, unsigned long event, voi
>>  
>>  switch (event) {
>>  case NETDEV_REBOOT:
>> -case NETDEV_CHANGENAME:
>>  case NETDEV_FEAT_CHANGE:
>>  case NETDEV_BONDING_FAILOVER:
>>  case NETDEV_NOTIFY_PEERS:
>>
> 
> 
> I only see one using the ip monitor.
> 
> $ ip li set foobar name fubar
> 
> generates these 3 messages:
> 
> [LINK]12: fubar:  mtu 1500 qdisc noqueue state DOWN
> group default
> link/ether 76:cd:72:dd:2a:cb brd ff:ff:ff:ff:ff:ff
> Unknown message: type=0x0051(81) flags=0x(0)len=0x001c(28)
> [NETCONF]ipv4 dev dummy2 forwarding on rp_filter off mc_forwarding off
> proxy_neigh off ignore_routes_with_linkdown off
> Unknown message: type=0x0051(81) flags=0x(0)len=0x001c(28)
> [NETCONF]ipv6 dev dummy2 forwarding on mc_forwarding off proxy_neigh off
> ignore_routes_with_linkdown off
> 
> do_setlink only sets DO_SETLINK_MODIFIED so a name change alone will not
> generate 2 messages.
> 

Actually, it has nothing to do with the above flag.  Setting DO_SETLINK_MODIFIED
will still generate notifications, but only if the device is UP.  However,
it looks like link name change can only be done when the link is down.  As
a result, netdev_state_change will not report it, so we only see the 'event'
one.

So this patch isn't needed, but only as a kind-of side-effect.

-vlad


Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-21 Thread Vlad Yasevich
On 04/21/2017 12:05 AM, Jason Wang wrote:
> 
> 
> On 2017-04-20 23:34, Vlad Yasevich wrote:
>> On 04/17/2017 11:01 PM, Jason Wang wrote:
>>>
>>> On 2017-04-16 00:38, Vladislav Yasevich wrote:
>>>> Currently the virtio net header is a fixed size and adding things to it is
>>>> rather difficult to do.  This series attempts to add the infrastructure as
>>>> well as some extensions that try to resolve some deficiencies we currently
>>>> have.
>>>>
>>>> First, the vnet header only has space for 16 flags.  This may not be enough
>>>> in the future.  The extensions will provide space for 32 possible extension
>>>> flags and 32 possible extensions.   These flags will be carried in the
>>>> first pseudo extension header, the presence of which will be determined by
>>>> the flag in the virtio net header.
>>>>
>>>> The extensions themselves will immediately follow the extension header
>>>> itself.  They will be added to the packet in the same order as they appear
>>>> in the extension flags.  No padding is placed between the extensions, and
>>>> any extensions negotiated but not used by a given packet will convert to
>>>> trailing padding.
>>> Do we need an explicit padding (e.g. an extension) which could be
>>> controlled by each side?
>> I don't think so.  The size of the vnet header is set based on the
>> extensions negotiated.  The one part I am not crazy about is that in the
>> case of a packet not using any extensions, the data is still placed after
>> the entire vnet header, which essentially adds a lot of padding.  However,
>> that's really no different than if we simply grew the vnet header.
>>
>> The other thing I've tried before is putting extensions into their own sg
>> buffer, but that made it slower.
> 
> Yes.
> 
>>
>>>> For example:
>>>>    | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data |
>>> Just some rough thoughts:
>>>
>>> - Is it better to use TLV instead of a bitmap here? One advantage of TLV is
>>> that the length is not limited by the length of the bitmap.
>> but the disadvantage is that we add at least 4 bytes per extension of just
>> TL data.  That makes this thing even longer.
> 
> Yes, and it looks like the length is still limited by e.g the length of T.

Not only that, but it is also limited by the skb->cb as a whole.  So putting
extensions into a TLV style means we have fewer extensions for now, until we
get rid of skb->cb usage.

> 
>>
>>> - For 1.1, do we really want something like the vnet header? AFAIK, it was
>>> not used by modern NICs; is it better to pack all meta-data into the
>>> descriptor itself? This may need some changes in tun/macvtap, but looks
>>> more PCIE friendly.
>> That would really be ideal and I've looked at this.  There are small issues
>> of exposing the 'net metadata' of the descriptor to taps so they can be
>> filled in.  The alternative is to use a different control structure for the
>> tap->qemu|vhost channel (that can be implementation specific) and have
>> qemu|vhost populate the 'net metadata' of the descriptor.
> 
> Yes, this needs some thought. For vhost, things look a little bit easier; we
> can probably use msg_control.
> 

We can use msg_control in qemu as well, can't we?  It really is a question of
who is doing the work and the number of copies.

I can take a closer look at how it would look if we extend the descriptor with
type specific data.  I don't know if other users of virtio would benefit from
it.

-vlad
> Thanks
> 
>> Thanks
>> -vlad
>>
>>> Thanks
>>>
>>>> Extensions proposed in this series are:
>>>>    - IPv6 fragment id extension
>>>>      * Currently, the guest generated fragment id is discarded and the host
>>>>        generates an IPv6 fragment id if the packet has to be fragmented.
>>>>        The code attempts to add time based perturbation to id generation
>>>>        to make it harder to guess the next fragment id to be used.
>>>>        However, doing this on the host may result in less perturbation
>>>>        (due to different timing) and might make id guessing easier.
>>>>        Ideally, the ids generated by the guest s

Re: [PATCH V2 net] netdevice: Include NETIF_F_HW_CSUM when intersecting features

2017-04-20 Thread Vlad Yasevich
On 04/20/2017 06:31 PM, Alexander Duyck wrote:
> On Thu, Apr 20, 2017 at 12:17 PM, Vladislav Yasevich wrote:
>> While hardware devices use either NETIF_F_(IP|IPV6)_CSUM or
>> NETIF_F_HW_CSUM, all of the software devices use HW_CSUM.
>> This results in an interesting situation when the software
>> device is configured on top of a hw device using (IP|IPV6)_CSUM.
>> In this situation, the user can't turn off checksum offloading
>> features on the software device.
> 
> Why wouldn't they be able to? It seems like the software device
> shouldn't be setting IP_CSUM or IPV6_CSUM feature flags, but they will
> be set in the intersect features. The result won't have
> NETIF_F_HW_CSUM set, but it should advertise the features of the lower
> device instead.

They can't, because the upper software device typically has HW_CSUM set in
hw_features and features.  When we intersect with a lower device which has
IP|IPV6 set, we end up with a software device that has IP|IPV6 set.  However,
since hw_features has HW_CSUM, you end up with a "fixed" setting of IP|IPV6
which can't be controlled on the software device.

I've had a situation where, while trying to debug something, I wanted to turn
off checksum offloading on a vlan and had to turn it off on the hw itself,
which affects all traffic.

We should be able to control it properly.

> 
>> This patch resolves that by adding NETIF_F_HW_CSUM to the mask
>> if a feature set includes only IP|IPV6 csum.  This allows the user
>> to control the upper (software) device checksum, while at the same
>> time correctly propagating lower device changes up.
> 
> You can't go this way. HW_CSUM is an all inclusive feature flag,
> whereas IP_CSUM and IPV6_CSUM specify only specific offload types.
> With your change the lower device could disable IPV6_CSUM for instance
> but you would still end up advertising checksum offload via HW_CSUM.
> 
>> CC: Michal Kubecek 
>> CC: Alexander Duyck 
>> CC: Tom Herbert 
>> Signed-off-by: Vladislav Yasevich 
>>
>> ---
>>
>> V2: Addressed comments from Alex Duyck.  I tested this with hacked virtio
>> device that set IP|IPV6 checksums instead of HW.  Configuring a vlan on
>> top gave the vlan device with 'ip-generic: on' setting (using HW checksum).
>> This allows me to change vlan checksum offloads independent of virt-io nic.
>> Changes to virtio-nic propagated up to vlan, turning off the offloading
>> correctly.
>>
>>  include/linux/netdevice.h | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index b0aa089..81aed2f 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -4009,10 +4009,10 @@ static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
>>                                                           netdev_features_t f2)
>>  {
>> if ((f1 ^ f2) & NETIF_F_HW_CSUM) {
>> -   if (f1 & NETIF_F_HW_CSUM)
>> -   f1 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> -   else
>> -   f2 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> +   if (f1 & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))
>> +   f1 |= NETIF_F_HW_CSUM;
>> +   if(f2 & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))
>> +   f2 |= NETIF_F_HW_CSUM;
>> }
>>
>> return f1 & f2;
>> --
>> 2.7.4
>>
> 
> This doesn't work. "NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM" doesn't
> equate to "NETIF_F_HW_CSUM". The problem is NETIF_F_HW_CSUM is a
> catch-all, the IP and IPV6 CSUM flags are not. Right now you would
> introduce escapes on devices that enable IP but not IPv6, or the other
> way around.
> 
> Can you point to the exact case where this code is an issue? It seems
> like maybe you are wanting to have NETIF_F_HW_CSUM set if you support
> offloading a given protocol. You might want to look at how we dealt
> with this in the GSO code path so that if we could offload the
> checksum we set NETIF_F_HW_CSUM based on protocol and the CSUM offload
> feature bit for that protocol.
> 

What I'd like to be able to do is control features on the software device
without having to turn them off on the HW.  As it stands right now, we can't
do that.  If you want to try, configure a vlan on top of any device that
sets IP|IPV6 csum features.

I was trying to fix it in the common place.  It actually works correctly
since the software checksum gets computed correctly lower in the stack,
so there is no actual escape.

Having said that, the other alternative is to inherit hw_features from
lower devices.  BTW, I think bonding has an "issue" similar to the one you are
describing, since it prefers HW_CSUM if any of the slaves have it set.

-vlad

> - Alex
> 
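
The disagreement in this message can be reproduced with a small user-space
model of the V2 hunk.  This is only a sketch: the bit values are made up for
illustration (the kernel defines the real NETIF_F_* flags), and intersect_v2()
is a hypothetical name for a simplified copy of the "+" lines quoted above.
It only models which bits end up advertised.  It says nothing about whether
the checksum is later computed in software further down the stack, which is
the counter-argument made in the reply.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only, not the kernel's. */
#define F_IP_CSUM   (1u << 0)
#define F_IPV6_CSUM (1u << 1)
#define F_HW_CSUM   (1u << 2)

typedef uint32_t features_t;

/* Model of netdev_intersect_features() with the V2 patch applied. */
static inline features_t intersect_v2(features_t f1, features_t f2)
{
	if ((f1 ^ f2) & F_HW_CSUM) {
		if (f1 & (F_IP_CSUM | F_IPV6_CSUM))
			f1 |= F_HW_CSUM;
		if (f2 & (F_IP_CSUM | F_IPV6_CSUM))
			f2 |= F_HW_CSUM;
	}
	return f1 & f2;
}

static inline void example(void)
{
	features_t upper = F_HW_CSUM;	/* software device, e.g. a vlan */

	/* Lower device turns checksum offload off: the upper loses it too. */
	assert(intersect_v2(upper, 0) == 0);

	/* Lower device offloads IPv4 and IPv6: the upper keeps HW_CSUM. */
	assert(intersect_v2(upper, F_IP_CSUM | F_IPV6_CSUM) == F_HW_CSUM);

	/* Lower device offloads IPv4 only: the upper still reports HW_CSUM. */
	assert(intersect_v2(upper, F_IP_CSUM) == F_HW_CSUM);
}
```

The first two cases are the behaviour described in the V2 notes.  The third is
the case the review objects to: the result advertises the catch-all flag even
though the lower device only declared IPv4 checksum offload.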



Re: [PATCH net] netdevice: Prefer NETIF_F_HW_CSUM when intersecting features

2017-04-20 Thread Vlad Yasevich
On 04/20/2017 02:36 PM, Alexander Duyck wrote:
> On Wed, Apr 19, 2017 at 6:12 PM, Vladislav Yasevich wrote:
>> While hardware devices use either NETIF_F_(IP|IPV6)_CSUM or
>> NETIF_F_HW_CSUM, all of the software devices use HW_CSUM.
>> This results in an interesting situation when the software
>> device is configured on top of a hw device using (IP|IPV6)_CSUM.
>> In this situation, the user can't turn off checksum offloading
>> features on the software device.
>>
>> This patch resolves that by preferring the NETIF_F_HW_CSUM setting
>> when computing a feature intersect.
>>
>> Signed-off-by: Vladislav Yasevich 
>> ---
>>  include/linux/netdevice.h | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 97456b25..3d811c1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -4019,9 +4019,9 @@ static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
>>  {
>> if ((f1 ^ f2) & NETIF_F_HW_CSUM) {
>> if (f1 & NETIF_F_HW_CSUM)
>> -   f1 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> +   f2 |= NETIF_F_HW_CSUM;
>> else
>> -   f2 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> +   f1 |= NETIF_F_HW_CSUM;
>> }
>>
>> return f1 & f2;
>> --
>> 2.7.4
>>
> 
> This doesn't seem right to me. I can see promoting NETIF_F_HW_CSUM to
> also include IP_CSUM and IPV6_CSUM but all this seems to do is force
> NETIF_F_HW_CSUM on always if either side has it enabled. That doesn't
> seem like an intersection at all to me.

I think it needs to be the other way around.  At least that's what I've
been observing...


> 
> What this code is supposed to do is make it so that if we are testing
> for NETIF_F_IP_CSUM and NETIF_F_HW_CSUM is set it will return true.
> The code as it is now will always return true if we test for
> NETIF_F_HW_CSUM since it will just set it on the other side for us.
> 

Yes, you are right.  The way the code is now, if we turn off checksum
offload on the lower device, the upper device doesn't change.

What we really want to do is add HW_CSUM to the mask if IP|IPV6 is set.
That way, if the lower device has IP|IPV6, it will end up with IP|IPV6|HW.
Controlling IP csums on the lower device will then turn off the checksum
on the upper, while at the same time giving us the ability to control the
checksum offload on the upper device itself.

Thanks.  I'll resubmit a v2.

-vlad
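
The objection conceded above can be checked with a small user-space model.
This is only a sketch: the bit values are made up for illustration (the
kernel defines the real NETIF_F_* flags), and intersect_old() and
intersect_v1() are hypothetical names for simplified copies of the "-" and
"+" sides of the quoted diff.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only, not the kernel's. */
#define F_IP_CSUM   (1u << 0)
#define F_IPV6_CSUM (1u << 1)
#define F_HW_CSUM   (1u << 2)

typedef uint32_t features_t;

/* Helper as it reads before the patch (the "-" lines of the diff). */
static inline features_t intersect_old(features_t f1, features_t f2)
{
	if ((f1 ^ f2) & F_HW_CSUM) {
		if (f1 & F_HW_CSUM)
			f1 |= (F_IP_CSUM | F_IPV6_CSUM);
		else
			f2 |= (F_IP_CSUM | F_IPV6_CSUM);
	}
	return f1 & f2;
}

/* Helper with the v1 patch applied (the "+" lines of the diff). */
static inline features_t intersect_v1(features_t f1, features_t f2)
{
	if ((f1 ^ f2) & F_HW_CSUM) {
		if (f1 & F_HW_CSUM)
			f2 |= F_HW_CSUM;
		else
			f1 |= F_HW_CSUM;
	}
	return f1 & f2;
}

/* Upper software device has HW_CSUM; lower device has checksum offload off. */
static inline void example(void)
{
	features_t upper = F_HW_CSUM, lower = 0;

	assert(intersect_old(upper, lower) == 0);	  /* nothing in common */
	assert(intersect_v1(upper, lower) == F_HW_CSUM); /* HW_CSUM survives anyway */
}
```

With the v1 change, HW_CSUM is forced onto whichever side lacks it, so
turning checksum offload off on the lower device no longer shows up in the
result.  That matches both the review comment and the observation in the
reply.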



Re: [PATCH RFC (resend) net-next 0/6] virtio-net: Add support for virtio-net header extensions

2017-04-20 Thread Vlad Yasevich
On 04/17/2017 11:01 PM, Jason Wang wrote:
> 
> 
> On 2017-04-16 00:38, Vladislav Yasevich wrote:
>> Currently the virtio net header is a fixed size and adding things to it is
>> rather difficult to do.  This series attempts to add the infrastructure as
>> well as some extensions that try to resolve some deficiencies we currently
>> have.
>>
>> First, the vnet header only has space for 16 flags.  This may not be enough
>> in the future.  The extensions will provide space for 32 possible extension
>> flags and 32 possible extensions.   These flags will be carried in the
>> first pseudo extension header, the presence of which will be determined by
>> the flag in the virtio net header.
>>
>> The extensions themselves will immediately follow the extension header
>> itself.  They will be added to the packet in the same order as they appear
>> in the extension flags.  No padding is placed between the extensions, and
>> any extensions negotiated but not used by a given packet will convert to
>> trailing padding.
> 
> Do we need an explicit padding (e.g. an extension) which could be controlled
> by each side?

I don't think so.  The size of the vnet header is set based on the extensions
negotiated.  The one part I am not crazy about is that in the case of a packet
not using any extensions, the data is still placed after the entire vnet
header, which essentially adds a lot of padding.  However, that's really no
different than if we simply grew the vnet header.

The other thing I've tried before is putting extensions into their own sg
buffer, but that made it slower.

> 
>>
>> For example:
>>   | vnet mrg hdr | ext hdr | ext 1 | ext 2 | ext 5 | .. pad .. | packet data |
> 
> Just some rough thoughts:
> 
> - Is it better to use TLV instead of a bitmap here? One advantage of TLV is
> that the length is not limited by the length of the bitmap.

but the disadvantage is that we add at least 4 bytes per extension of just TL
data.  That makes this thing even longer.

> - For 1.1, do we really want something like the vnet header? AFAIK, it was
> not used by modern NICs; is it better to pack all meta-data into the
> descriptor itself? This may need some changes in tun/macvtap, but looks more
> PCIE friendly.

That would really be ideal and I've looked at this.  There are small issues of
exposing the 'net metadata' of the descriptor to taps so they can be filled
in.  The alternative is to use a different control structure for the
tap->qemu|vhost channel (that can be implementation specific) and have
qemu|vhost populate the 'net metadata' of the descriptor.

Thanks
-vlad

> 
> Thanks
> 
>>
>> Extensions proposed in this series are:
>>   - IPv6 fragment id extension
>>     * Currently, the guest generated fragment id is discarded and the host
>>       generates an IPv6 fragment id if the packet has to be fragmented.  The
>>       code attempts to add time based perturbation to id generation to make
>>       it harder to guess the next fragment id to be used.  However, doing
>>       this on the host may result in less perturbation (due to different
>>       timing) and might make id guessing easier.  Ideally, the ids generated
>>       by the guest should be used.  One could also argue that we are
>>       "violating" the IPv6 protocol under a _strict_ interpretation of the
>>       spec.
>>
>>   - VLAN header acceleration
>>     * Currently virtio does not do vlan header acceleration and instead
>>       uses software tagging.  One of the first things that the host will do
>>       is strip the vlan header out.  When passing the packet to a guest the
>>       vlan header is re-inserted into the packet.  We can skip all that work
>>       if we can pass the vlan data in accelerated format.  Then the host will
>>       not do any extra work.  However, so far, this yielded a very small
>>       perf bump (only ~1%).  I am still looking into this.
>>
>>   - UDP tunnel offload
>>     * Similar to vlan acceleration, with this extension we can pass
>>       additional data to the host to support GSO with udp tunnels and
>>       possibly other encapsulations.  This yields a significant performance
>>       improvement (still testing remote checksum code).
>>
>> An additional extension that is unfinished (due to still testing for any
>> side-effects) is checksum passthrough to support drivers that set
>> CHECKSUM_COMPLETE.  This would eliminate the need for guests to compute
>> the software checksum.
>>
>> This series only takes care of virtio net.  I have additional patches for the
>> host side (vhost and tap/macvtap as well as qemu), but wanted to get feedback
>> on the general approach first.
>>
>> Vladislav Yasevich (6):
>>virtio-net: Remove the use the padded vnet_header structure
>>virtio-net: make header length handling uniform
>>virtio_net: Add basic skeleton for handling vnet header extensions.
>>virtio-net: Add support for IPv6 fragment id vnet header extension.
>>virtio-net: Add support for vlan 

Re: [PATCH net] netdevice: Prefer NETIF_F_HW_CSUM when intersecting features

2017-04-20 Thread Vlad Yasevich
On 04/20/2017 02:13 AM, Michal Kubecek wrote:
> On Wed, Apr 19, 2017 at 09:12:23PM -0400, Vladislav Yasevich wrote:
>> While hardware devices use either NETIF_F_(IP|IPV6)_CSUM or
>> NETIF_F_HW_CSUM, all of the software devices use HW_CSUM.
>> This results in an interesting situation when the software
>> device is configured on top of a hw device using (IP|IPV6)_CSUM.
>> In this situation, the user can't turn off checksum offloading
>> features on the software device.
>>
>> This patch resolves that by preferring the NETIF_F_HW_CSUM setting
>> when computing a feature intersect.
> 
> The reasoning makes sense to me but perhaps we should rename the helper
> then to make it obvious that it doesn't work like intersection anymore
> (not even in the logical sense).

It's still an intersect in the end since we return (f1 & f2).  We are just
munging the checksum bits so that we get HW_CSUM in the end if one of the
devices supported it.

> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 97456b25..3d811c1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -4019,9 +4019,9 @@ static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
>>  {
>>  if ((f1 ^ f2) & NETIF_F_HW_CSUM) {
>>  if (f1 & NETIF_F_HW_CSUM)
>> -f1 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> +f2 |= NETIF_F_HW_CSUM;
>>  else
>> -f2 |= (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
>> +f1 |= NETIF_F_HW_CSUM;
>>  }
>>  
>>  return f1 & f2;
> 
> I hope I didn't miss something but it seems we can avoid nested ifs now:
> 
>   if ((f1 ^ f2) & NETIF_F_HW_CSUM) {
>   f1 |= NETIF_F_HW_CSUM;
>   f2 |= NETIF_F_HW_CSUM;
>   }
> 

Yes, we can do that.  We could have done something like this before as well,
I suppose.

Thanks
-vlad

> Michal Kubecek
> 



Re: [PATCH net-next 1/8] rtnetlink: Do not generate notifications for MTU events

2017-04-14 Thread Vlad Yasevich
On 04/10/2017 11:49 AM, David Ahern wrote:
> On 4/10/17 9:39 AM, Vlad Yasevich wrote:
>> OK, so this will work for the events that are generated as a result of
>> device state change (like mtu, address, and others).
>>
>> However, the original event data may be needed for other events that may be
>> of use to userspace like NETDEV_NOTIFY_PEERS and NETDEV_RESEND_IGMP
>> (possibly others...)
> 
> sure. My objection is to multiple messages with identical content.
> 
> I think the rtnetlink_event message is unique for those 2 netdev events,
> so no objections if it has value.
> 

So, I've been looking at adding a bitmap and collecting all modifications;
however, I ran into an interesting issue in do_setlink().

Currently the notifications from do_setlink() don't appear to work as one would
expect and it's somewhat confusing upon deeper inspection.

We have 2 values, DO_SETLINK_MODIFIED (1) and DO_SETLINK_NOTIFY (3).  These 2
'attempt' to do different jobs, but really fail at it.  The function will
generate notifications regardless of which of the above values is used.

Those changes were done in commit ba9989069f4e426b1e0ed7018eacc9e1ba607095 (cc
Nicolas just in case he remembers the history).

I am not sure which changes should really trigger a call to
netdev_state_change(), thus this message.  Right now, all changes done in this
function trigger them.  If that's how it should function, then I can simplify
the code.  If not, then some of the changes may require us to export the event
to the user.

To use the dreaded NETDEV_CHANGEMTU event as an example, we used to generate 3
messages (PRECHANGEMTU, CHANGEMTU, and a message from netdev_state_change).
With recent changes, we now only generate a message from netdev_state_change.
However, the mtu change is tagged with DO_SETLINK_MODIFIED, which doesn't
include the notify bit.  So, should there be a NETDEV_CHANGE event associated
with this change and a rtnl message (as it is now) or not?  It's unclear.

-vlad


Re: [PATCH net-next 1/8] rtnetlink: Do not generate notifications for MTU events

2017-04-10 Thread Vlad Yasevich
On 04/08/2017 02:18 PM, Roopa Prabhu wrote:
> On 4/8/17, 11:13 AM, David Ahern wrote:
>> On 4/8/17 2:06 PM, Roopa Prabhu wrote:
>>> On 4/7/17, 2:25 PM, David Ahern wrote:
>>>> Changing MTU on a link currently causes 3 messages to be sent to userspace:
>>>>
>>>> [LINK]11: dummy1:  mtu 1490 qdisc noqueue state UNKNOWN group default event PRE_CHANGE_MTU
>>>> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
>>>>
>>>> [LINK]11: dummy1:  mtu 1500 qdisc noqueue state UNKNOWN group default event CHANGE_MTU
>>>> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
>>>>
>>>> [LINK]11: dummy1:  mtu 1500 qdisc noqueue state UNKNOWN group default
>>>> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
>>>>
>>>> Remove the PRE_CHANGE_MTU and CHANGE_MTU messages.


>>> This change is good... multiple notifications for the same event do not
>>> help in large scale links setups. However, this reverts what vlad was
>>> trying to do with his patchset. Vlad's patch-set relies on the rtnl
>>> notifications generated from notifiers (rtnetlink_event) to add specific
>>> event (IFLA_EVENT) in notifications.
>>>
>>> The third notification in your example above is the correct one and is an
>>> aggregate notification for a set of changes, but it cannot really fill in
>>> all types of events in the single IFLA_EVENT attribute as it stands today.
>>> IFLA_EVENT should be a bitmask to include all events in this case (i had
>>> indicated this in vlads first version).
>>>
>> Agreed. I think it would be best to revert def12888c161 before the UAPI
>> goes out.
>>
>> The change can instead add the IFLA_EVENT as a bitmask mentioned here to
>> note the changes in a setlink. On top of that, remove the notifications
>> for the events I mentioned in this set to reduce the overhead on userspace.
> 
> ack
> 

OK, so this will work for the events that are generated as a result of device
state change (like mtu, address, and others).

However, the original event data may be needed for other events that may be
of use to userspace like NETDEV_NOTIFY_PEERS and NETDEV_RESEND_IGMP (possibly
others...)

-vlad


Re: [PATCH net-next 7/8] rtnetlink: Do not generate notifications for NOTIFY_PEERS event

2017-04-08 Thread Vlad Yasevich
On 04/07/2017 05:25 PM, David Ahern wrote:
> NOTIFY_PEERS is an internal event; do not generate userspace
> notifications.

We actually need this event to support macvtap over bonding
as well as to resolve some issues with VMs using a bonded uplink
on the host.

-vlad

> 
> Signed-off-by: David Ahern 
> ---
>  include/uapi/linux/if_link.h | 1 -
>  net/core/rtnetlink.c | 4 
>  2 files changed, 5 deletions(-)
> 
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 4fa3bf3eb21d..8f23e9dde667 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -906,7 +906,6 @@ enum {
>   IFLA_EVENT_CHANGE_NAME,
>   IFLA_EVENT_FEAT_CHANGE,
>   IFLA_EVENT_BONDING_FAILOVER,
> - IFLA_EVENT_NOTIFY_PEERS,
>   IFLA_EVENT_CHANGE_UPPER,
>   IFLA_EVENT_RESEND_IGMP,
>   IFLA_EVENT_CHANGE_LOWER_STATE,
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index e2b0ec5174e7..d2587aa339c4 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1294,9 +1294,6 @@ static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>   case NETDEV_BONDING_FAILOVER:
>   rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>   break;
> - case NETDEV_NOTIFY_PEERS:
> - rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
> - break;
>   case NETDEV_CHANGEUPPER:
>   rtnl_event = IFLA_EVENT_CHANGE_UPPER;
>   break;
> @@ -4173,7 +4170,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
>   case NETDEV_CHANGENAME:
>   case NETDEV_FEAT_CHANGE:
>   case NETDEV_BONDING_FAILOVER:
> - case NETDEV_NOTIFY_PEERS:
>   case NETDEV_CHANGEUPPER:
>   case NETDEV_RESEND_IGMP:
>   case NETDEV_CHANGELOWERSTATE:
> 



Re: [PATCH net-next 1/8] rtnetlink: Do not generate notifications for MTU events

2017-04-08 Thread Vlad Yasevich
On 04/07/2017 05:25 PM, David Ahern wrote:
> Changing MTU on a link currently causes 3 messages to be sent to userspace:
> 
> [LINK]11: dummy1:  mtu 1490 qdisc noqueue state UNKNOWN group default event PRE_CHANGE_MTU
> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
> 
> [LINK]11: dummy1:  mtu 1500 qdisc noqueue state UNKNOWN group default event CHANGE_MTU
> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
> 
> [LINK]11: dummy1:  mtu 1500 qdisc noqueue state UNKNOWN group default
> link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff
> 
> Remove the PRE_CHANGE_MTU and CHANGE_MTU messages.

Actually, I have plans for the CHANGE_MTU event.  The last message is not an
event.  If you remove the event, it is much harder to track mtu changes.

-vlad
> 
> Signed-off-by: David Ahern 
> ---
>  include/uapi/linux/if_link.h | 2 --
>  net/core/rtnetlink.c | 8 ---
>  2 files changed, 10 deletions(-)
> 

> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 97f6d302f627..e8b7e9342cc0 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -903,7 +903,6 @@ enum {
>  enum {
>   IFLA_EVENT_UNSPEC,
>   IFLA_EVENT_REBOOT,
> - IFLA_EVENT_CHANGE_MTU,
>   IFLA_EVENT_CHANGE_ADDR,
>   IFLA_EVENT_CHANGE_NAME,
>   IFLA_EVENT_FEAT_CHANGE,
> @@ -912,7 +911,6 @@ enum {
>   IFLA_EVENT_NOTIFY_PEERS,
>   IFLA_EVENT_CHANGE_UPPER,
>   IFLA_EVENT_RESEND_IGMP,
> - IFLA_EVENT_PRE_CHANGE_MTU,
>   IFLA_EVENT_CHANGE_INFO_DATA,
>   IFLA_EVENT_PRE_CHANGE_UPPER,
>   IFLA_EVENT_CHANGE_LOWER_STATE,
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index b2bd4c9ee860..72d365ae14b3 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1285,9 +1285,6 @@ static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>   case NETDEV_REBOOT:
>   rtnl_event = IFLA_EVENT_REBOOT;
>   break;
> - case NETDEV_CHANGEMTU:
> - rtnl_event = IFLA_EVENT_CHANGE_MTU;
> - break;
>   case NETDEV_CHANGEADDR:
>   rtnl_event = IFLA_EVENT_CHANGE_ADDR;
>   break;
> @@ -1312,9 +1309,6 @@ static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>   case NETDEV_RESEND_IGMP:
>   rtnl_event = IFLA_EVENT_RESEND_IGMP;
>   break;
> - case NETDEV_PRECHANGEMTU:
> - rtnl_event = IFLA_EVENT_PRE_CHANGE_MTU;
> - break;
>   case NETDEV_CHANGEINFODATA:
>   rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>   break;
> @@ -4191,7 +4185,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
>  
>   switch (event) {
>   case NETDEV_REBOOT:
> - case NETDEV_CHANGEMTU:
>   case NETDEV_CHANGEADDR:
>   case NETDEV_CHANGENAME:
>   case NETDEV_FEAT_CHANGE:
> @@ -4200,7 +4193,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
>   case NETDEV_NOTIFY_PEERS:
>   case NETDEV_CHANGEUPPER:
>   case NETDEV_RESEND_IGMP:
> - case NETDEV_PRECHANGEMTU:
>   case NETDEV_CHANGEINFODATA:
>   case NETDEV_PRECHANGEUPPER:
>   case NETDEV_CHANGELOWERSTATE:
> 



Re: [PATCH net-next] rtnl: Add support for netdev event to link messages

2017-03-30 Thread Vlad Yasevich
On 03/30/2017 09:39 AM, Vlad Yasevich wrote:
> On 03/29/2017 03:11 PM, David Ahern wrote:
>> On 3/29/17 11:05 AM, Vlad Yasevich wrote:
>>> On 03/29/2017 12:37 PM, Roopa Prabhu wrote:
>>>> On 3/29/17, 5:23 AM, Vlad Yasevich wrote:
>>>>> [ resending to list.  hit the wrong reply button last time ]
>>>>>
>>>>> On 03/27/2017 06:58 PM, David Miller wrote:
>>>>>> From: Vladislav Yasevich <vyasev...@gmail.com>
>>>>>> Date: Sat, 25 Mar 2017 21:59:47 -0400
>>>>>>
>>>>>>> RTNL currently generates notifications on some netdev notifier events.
>>>>>>> However, user space has no idea what changed.  All it sees is the
>>>>>>> data and has to infer what has changed.  For some events that is not
>>>>>>> possible.
>>>>>>>
>>>>>>> This patch adds a new field to RTM_NEWLINK message called IFLA_EVENT
>>>>>>> that would have an encoding of which event triggered this
>>>>>>> notification.  Currently, only 2 events (NETDEV_NOTIFY_PEERS and
>>>>>>> NETDEV_MTUCHANGED) are supported.  These events could be interesting
>>>>>>> in the virt space to trigger additional configuration commands to VMs.
>>>>>>> Other events of interest may be added later.
>>>>>>>
>>>>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>>>> At what point do we start providing the metadata for the changed
>>>>>> values as well?  You'd probably need to provide both the old and
>>>>>> new values to cover all cases.
>>>>> I don't think if that would be possible because of when events are 
>>>>> triggered.
>>>>> We send these notifications after all the changes have already been made, 
>>>>> so
>>>>> it might be tough to carry old data.
>>>>>
>>>>> Looking at just the two events I am supporting in this patch, we could 
>>>>> actually
>>>>> supply the old mtu data through a NETDEV_PRECHANGEMTU event, if it is 
>>>>> necessary.
>>>>
>>>> But, NETDEV_PRECHANGEMTU will be a unnecessary notification to userspace 
>>>> without
>>>> changes. There are already enough notifications generated for links (I 
>>>> know you are not
>>>> suggesting adding it here)
>>>
>>> Actually, this one already triggers a link notification to userspace.  It 
>>> just has
>>> no event data in it to tell you that. :)
>>
>> Is it intentional or unintentional? perhaps rtnetlink_event should be a
>> whitelist -- events that userspace should be notified about. Seems like
>> NETDEV_ events have been added without rtnetlink_event getting updated.
> 
> I think a 'whitelist' was attempted, but as you mentioned, it hasn't been 
> updated...
> I'll defer the definitive answer to someone else.  It seems Patrick added a 
> comment
> in commit a2835763 to update the white list and it's been a few times.
>

This is actually an interesting point.  Looking at some commits that have added
events to the black list in rtnetlink_event, it might have been much easier to
debug those issues if we had the 'event' encoding in the netlink message.

I think it might be worthwhile to add all allowed event types to this new
encoding so that userspace can see just what it's getting.

-vlad

>> For example, does userspace care about NETDEV_UDP_TUNNEL_PUSH_INFO or
>> NETDEV_CHANGE_TX_QUEUE_LEN?
>>
> 
> Probably not the first, but possibly the second.  If txquelen is changed on a 
> device,
> some apps might want to know about it.
>



Re: [PATCH net-next] rtnl: Add support for netdev event to link messages

2017-03-30 Thread Vlad Yasevich
On 03/29/2017 03:11 PM, David Ahern wrote:
> On 3/29/17 11:05 AM, Vlad Yasevich wrote:
>> On 03/29/2017 12:37 PM, Roopa Prabhu wrote:
>>> On 3/29/17, 5:23 AM, Vlad Yasevich wrote:
>>>> [ resending to list.  hit the wrong reply button last time ]
>>>>
>>>> On 03/27/2017 06:58 PM, David Miller wrote:
>>>>> From: Vladislav Yasevich <vyasev...@gmail.com>
>>>>> Date: Sat, 25 Mar 2017 21:59:47 -0400
>>>>>
>>>>>> RTNL currently generates notifications on some netdev notifier events.
>>>>>> However, user space has no idea what changed.  All it sees is the
>>>>>> data and has to infer what has changed.  For some events that is not
>>>>>> possible.
>>>>>>
>>>>>> This patch adds a new field to RTM_NEWLINK message called IFLA_EVENT
>>>>>> that would have an encoding of which event triggered this
>>>>>> notification.  Currently, only 2 events (NETDEV_NOTIFY_PEERS and
>>>>>> NETDEV_MTUCHANGED) are supported.  These events could be interesting
>>>>>> in the virt space to trigger additional configuration commands to VMs.
>>>>>> Other events of interest may be added later.
>>>>>>
>>>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>>>> At what point do we start providing the metadata for the changed
>>>>> values as well?  You'd probably need to provide both the old and
>>>>> new values to cover all cases.
>>>> I don't think if that would be possible because of when events are 
>>>> triggered.
>>>> We send these notifications after all the changes have already been made, 
>>>> so
>>>> it might be tough to carry old data.
>>>>
>>>> Looking at just the two events I am supporting in this patch, we could 
>>>> actually
>>>> supply the old mtu data through a NETDEV_PRECHANGEMTU event, if it is 
>>>> necessary.
>>>
>>> But, NETDEV_PRECHANGEMTU will be a unnecessary notification to userspace 
>>> without
>>> changes. There are already enough notifications generated for links (I know 
>>> you are not
>>> suggesting adding it here)
>>
>> Actually, this one already triggers a link notification to userspace.  It 
>> just has
>> no event data in it to tell you that. :)
> 
> Is it intentional or unintentional? perhaps rtnetlink_event should be a
> whitelist -- events that userspace should be notified about. Seems like
> NETDEV_ events have been added without rtnetlink_event getting updated.

I think a 'whitelist' was attempted, but as you mentioned, it hasn't been
updated...  I'll defer the definitive answer to someone else.  It seems Patrick
added a comment in commit a2835763 to update the white list and it's been a few
times.

> For example, does userspace care about NETDEV_UDP_TUNNEL_PUSH_INFO or
> NETDEV_CHANGE_TX_QUEUE_LEN?
> 

Probably not the first, but possibly the second.  If txqueuelen is changed on a
device, some apps might want to know about it.

-vlad



Re: [PATCH net-next] rtnl: Add support for netdev event to link messages

2017-03-29 Thread Vlad Yasevich
On 03/29/2017 12:37 PM, Roopa Prabhu wrote:
> On 3/29/17, 5:23 AM, Vlad Yasevich wrote:
>> [ resending to list.  hit the wrong reply button last time ]
>>
>> On 03/27/2017 06:58 PM, David Miller wrote:
>>> From: Vladislav Yasevich <vyasev...@gmail.com>
>>> Date: Sat, 25 Mar 2017 21:59:47 -0400
>>>
>>>> RTNL currently generates notifications on some netdev notifier events.
>>>> However, user space has no idea what changed.  All it sees is the
>>>> data and has to infer what has changed.  For some events that is not
>>>> possible.
>>>>
>>>> This patch adds a new field to RTM_NEWLINK message called IFLA_EVENT
>>>> that would have an encoding of which event triggered this
>>>> notification.  Currently, only 2 events (NETDEV_NOTIFY_PEERS and
>>>> NETDEV_MTUCHANGED) are supported.  These events could be interesting
>>>> in the virt space to trigger additional configuration commands to VMs.
>>>> Other events of interest may be added later.
>>>>
>>>> Signed-off-by: Vladislav Yasevich <vyase...@redhat.com>
>>> At what point do we start providing the metadata for the changed
>>> values as well?  You'd probably need to provide both the old and
>>> new values to cover all cases.
>> I don't think if that would be possible because of when events are triggered.
>> We send these notifications after all the changes have already been made, so
>> it might be tough to carry old data.
>>
>> Looking at just the two events I am supporting in this patch, we could 
>> actually
>> supply the old mtu data through a NETDEV_PRECHANGEMTU event, if it is 
>> necessary.
> 
> But, NETDEV_PRECHANGEMTU will be a unnecessary notification to userspace 
> without
> changes. There are already enough notifications generated for links (I know 
> you are not
> suggesting adding it here)

Actually, this one already triggers a link notification to userspace.  It just
has no event data in it to tell you that. :)

>> For the use cases I am looking at, it isn't usefull, but easy enough to add.
>>
> Most of the times a single notification can carry multiple changes, this 
> helps user-space..by
> cutting down on notifications in systems with large number of links. I don't 
> see IFLA_EVENT attribute
> handle multiple changes..
> 

No, it doesn't handle multiple changes, mainly because we already generate a
link notification for a lot of the events.  This patch doesn't add any
additional user space notifications.  All it does is add the "event"
information to existing ones so that user space may know what happened.

For instance, if you change the mtu on an interface, you get the following:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The above is the result when you run
 # ip l s lo mtu 1500

With this patch, you'd be able to tell that the notification 2 above was a
result of mtu change.  The first one was a result of "PRECHANGEMTU".  Didn't
look to see what the third one is.

> Given the number of attributes for which events are generated, I think a 
> model where user-space
> maintains a cache and diff's the new link object with the old one works best 
> in all cases.
> 

This patch doesn't preclude this.  It doesn't change how many notifications are
generated.  All it does is carry a hint as to why a particular notification is
generated.

It's also impossible to tell what happened if the data did not change.  As an
example, how does one know that a NETDEV_NOTIFY_PEERS or NETDEV_RESEND_IGMP
netdev event occurred?
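
To illustrate the point, here is a minimal sketch of the cache-and-diff model
in user space.  The structure, field names and flag names are hypothetical and
chosen only for this example; a real cache would track many more attributes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cached view of a link; fields are illustrative only. */
struct link_snapshot {
	uint32_t mtu;
	uint8_t  addr[6];
};

#define LINK_DIFF_MTU	(1u << 0)
#define LINK_DIFF_ADDR	(1u << 1)

/* Compare the cached snapshot with the one from a new notification and
 * return a mask of the attributes that differ.
 */
static unsigned int link_diff(const struct link_snapshot *old,
			      const struct link_snapshot *cur)
{
	unsigned int changed = 0;

	assert(old != NULL && cur != NULL);
	if (old->mtu != cur->mtu)
		changed |= LINK_DIFF_MTU;
	if (memcmp(old->addr, cur->addr, sizeof(old->addr)) != 0)
		changed |= LINK_DIFF_ADDR;
	return changed;
}
```

A notification whose attributes match the cache yields an empty mask, so the
diff alone cannot say which netdev event caused it.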

And if you ask why we need those, consider a case where a VM is connected to
the network through a bridge or macvtap on top of an active-backup bond.  Now,
there is a failover in the bond and the bond generates the above events.  The
hypervisor will update the switches with its own mac/multicast groups, but the
VM has no idea this happened.

-vlad




Re: [PATCH net-next] rtnl: Add support for netdev event to link messages

2017-03-29 Thread Vlad Yasevich
[ resending to list.  hit the wrong reply button last time ]

On 03/27/2017 06:58 PM, David Miller wrote:
> From: Vladislav Yasevich 
> Date: Sat, 25 Mar 2017 21:59:47 -0400
> 
>> RTNL currently generates notifications on some netdev notifier events.
>> However, user space has no idea what changed.  All it sees is the
>> data and has to infer what has changed.  For some events that is not
>> possible.
>>
>> This patch adds a new field to RTM_NEWLINK message called IFLA_EVENT
>> that would have an encoding of which event triggered this
>> notification.  Currently, only 2 events (NETDEV_NOTIFY_PEERS and
>> NETDEV_MTUCHANGED) are supported.  These events could be interesting
>> in the virt space to trigger additional configuration commands to VMs.
>> Other events of interest may be added later.
>>
>> Signed-off-by: Vladislav Yasevich 
> 
> At what point do we start providing the metadata for the changed
> values as well?  You'd probably need to provide both the old and
> new values to cover all cases.

I don't think that would be possible because of when events are triggered.
We send these notifications after all the changes have already been made, so
it might be tough to carry old data.

Looking at just the two events I am supporting in this patch, we could actually
supply the old mtu data through a NETDEV_PRECHANGEMTU event, if it is necessary.
For the use cases I am looking at, it isn't useful, but easy enough to add.

> 
>> @@ -4044,6 +4076,7 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
>>  return skb->len;
>>  }
>>  
>> +
>>  /* Process one rtnetlink message. */
>>  
>>  static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> 
> Please don't add more empty lines between functions, one is enough.
> 

Sorry, that was left over after moving the code around.  Will remove when
resubmitting.

-vlad


Re: [RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.

2016-12-19 Thread Vlad Yasevich
On 12/15/2016 02:07 AM, Or Gerlitz wrote:
> On Fri, Nov 18, 2016 at 1:01 AM, Jarno Rajahalme  wrote:
>> This patch is a proof-of-concept I did a few months ago for UDP tunnel
>> offload support in virtio_net interface [..]
> 
> What's the use case you were considering for a guest running a UDP based VTEP?

Two cases that I've been aware of are nested virt or simply a guest acting
as router/bridge with possible different tunnel devices.

-vlad

> 
>> Real implementation needs to extend the virtio_net header rather than
>> piggy-backing on existing fields.  Inner MAC length (or inner network
>> offset) also needs to be passed as a new field.  Control plane (QEMU)
>> also needs to be updated.
>>
>> All testing was done using Geneve, but this should work for all UDP
>> tunnels the same.



Re: [PATCH net 1/3] sctp: fix the transport dead race check by using atomic_add_unless on refcnt

2016-01-22 Thread Vlad Yasevich
On 01/22/2016 12:18 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Jan 22, 2016 at 11:50:20AM -0500, Vlad Yasevich wrote:
>> On 01/21/2016 12:49 PM, Xin Long wrote:
>>> Now when __sctp_lookup_association is running in BH, it will try to
>>> check if t->dead is set, but meanwhile other CPUs may be freeing this
>>> transport and this assoc and if it happens that
>>> __sctp_lookup_association checked t->dead a bit too early, it may think
>>> that the association is still good while it was already freed.
>>>
>>> So we fix this race by using atomic_add_unless in sctp_transport_hold.
>>> After we get one transport from hashtable, we will hold it only when
>>> this transport's refcnt is not 0, so that we can make sure t->asoc
>>> cannot be freed before we hold the asoc again.
>>
>> atomic_add_unless() uses atomic_read() to check the value.  Since there
>> don't appear to be any barriers, what guarantees that the value
>> read will not have been modified in another thread under a proper lock?
>>
> 
> atomic_read() is used only as a starting point.  If it got changed in
> between, the new current value (return of atomic_cmpxchg) will be used
> then.
> 
>>>
>>> Note that sctp association is not freed using RCU so we can't use
>>> atomic_add_unless() with it as it may just be too late for that either.
>>>
>>> Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
>>> Reported-by: Vlad Yasevich <vyasev...@gmail.com>
>>> Signed-off-by: Xin Long <lucien@gmail.com>
>>> ---
>>>  include/net/sctp/structs.h |  2 +-
>>>  net/sctp/input.c   | 17 +++--
>>>  net/sctp/transport.c   |  4 ++--
>>>  3 files changed, 14 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>>> index 20e7212..344da04 100644
>>> --- a/include/net/sctp/structs.h
>>> +++ b/include/net/sctp/structs.h
>>> @@ -955,7 +955,7 @@ void sctp_transport_route(struct sctp_transport *, 
>>> union sctp_addr *,
>>>  void sctp_transport_pmtu(struct sctp_transport *, struct sock *sk);
>>>  void sctp_transport_free(struct sctp_transport *);
>>>  void sctp_transport_reset_timers(struct sctp_transport *);
>>> -void sctp_transport_hold(struct sctp_transport *);
>>> +int sctp_transport_hold(struct sctp_transport *);
>>>  void sctp_transport_put(struct sctp_transport *);
>>>  void sctp_transport_update_rto(struct sctp_transport *, __u32);
>>>  void sctp_transport_raise_cwnd(struct sctp_transport *, __u32, __u32);
>>> diff --git a/net/sctp/input.c b/net/sctp/input.c
>>> index bf61dfb..49d2cc7 100644
>>> --- a/net/sctp/input.c
>>> +++ b/net/sctp/input.c
>>> @@ -935,15 +935,22 @@ static struct sctp_association 
>>> *__sctp_lookup_association(
>>> struct sctp_transport **pt)
>>>  {
>>> struct sctp_transport *t;
>>> +   struct sctp_association *asoc = NULL;
>>>  
>>> +   rcu_read_lock();
>>> t = sctp_addrs_lookup_transport(net, local, peer);
>>> -   if (!t || t->dead)
>>> -   return NULL;
>>> +   if (!t || !sctp_transport_hold(t))
>>> +   goto out;
>>>  
>>> -   sctp_association_hold(t->asoc);
>>> +   asoc = t->asoc;
>>> +   sctp_association_hold(asoc);
>>
>> I don't think you can modify the reference count on a transport, let alone
>> the association outside of a lock.
> 
> The transport memory is not freed, as it's protected by rcu_read_lock(),
> so we are safe to use it yet.
> atomic_ operations include an embedded lock instruction protecting the
> counter itself, there shouldn't be a need to use another lock around it.
> 
> And in the code above, as we could grab a hold on the transport, means
> the association was not freed yet because transports hold a ref on
> assoc. That's why the dance: hold(transport) hold(assoc) put(transport)
> 

OK, I see how that holds together, but I think there might be a hole wrt icmp
handling.  Some icmp processes assume the transport can't disappear on them, but
in this case that last put(transport) may result in a call to
sctp_transport_destroy() and that might be bad.  I am looking at it now.

Thanks
-vlad

>   Marcelo
> 
>>
>> -vlad
>>
>>> *pt = t;
>>>  
>>> -   return t->asoc;
>>> +   sctp_transport_put(t);
>>> +
>>> +out:
>>> +   rcu_rea

Re: [PATCH net 1/3] sctp: fix the transport dead race check by using atomic_add_unless on refcnt

2016-01-22 Thread Vlad Yasevich
On 01/21/2016 12:49 PM, Xin Long wrote:
> Now when __sctp_lookup_association is running in BH, it will try to
> check if t->dead is set, but meanwhile other CPUs may be freeing this
> transport and this assoc and if it happens that
> __sctp_lookup_association checked t->dead a bit too early, it may think
> that the association is still good while it was already freed.
> 
> So we fix this race by using atomic_add_unless in sctp_transport_hold.
> After we get one transport from hashtable, we will hold it only when
> this transport's refcnt is not 0, so that we can make sure t->asoc
> cannot be freed before we hold the asoc again.

atomic_add_unless() uses atomic_read() to check the value.  Since there
don't appear to be any barriers, what guarantees that the value
read will not have been modified in another thread under a proper lock?


> 
> Note that sctp association is not freed using RCU so we can't use
> atomic_add_unless() with it as it may just be too late for that either.
> 
> Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
> Reported-by: Vlad Yasevich <vyasev...@gmail.com>
> Signed-off-by: Xin Long <lucien@gmail.com>
> ---
>  include/net/sctp/structs.h |  2 +-
>  net/sctp/input.c   | 17 +++--
>  net/sctp/transport.c   |  4 ++--
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 20e7212..344da04 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -955,7 +955,7 @@ void sctp_transport_route(struct sctp_transport *, union sctp_addr *,
>  void sctp_transport_pmtu(struct sctp_transport *, struct sock *sk);
>  void sctp_transport_free(struct sctp_transport *);
>  void sctp_transport_reset_timers(struct sctp_transport *);
> -void sctp_transport_hold(struct sctp_transport *);
> +int sctp_transport_hold(struct sctp_transport *);
>  void sctp_transport_put(struct sctp_transport *);
>  void sctp_transport_update_rto(struct sctp_transport *, __u32);
>  void sctp_transport_raise_cwnd(struct sctp_transport *, __u32, __u32);
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index bf61dfb..49d2cc7 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -935,15 +935,22 @@ static struct sctp_association *__sctp_lookup_association(
>   struct sctp_transport **pt)
>  {
>   struct sctp_transport *t;
> + struct sctp_association *asoc = NULL;
>  
> + rcu_read_lock();
>   t = sctp_addrs_lookup_transport(net, local, peer);
> - if (!t || t->dead)
> - return NULL;
> + if (!t || !sctp_transport_hold(t))
> + goto out;
>  
> - sctp_association_hold(t->asoc);
> + asoc = t->asoc;
> + sctp_association_hold(asoc);

I don't think you can modify the reference count on a transport, let alone
the association outside of a lock.

-vlad

>   *pt = t;
>  
> - return t->asoc;
> + sctp_transport_put(t);
> +
> +out:
> + rcu_read_unlock();
> + return asoc;
>  }
>  
>  /* Look up an association. protected by RCU read lock */
> @@ -955,9 +962,7 @@ struct sctp_association *sctp_lookup_association(struct net *net,
>  {
>   struct sctp_association *asoc;
>  
> - rcu_read_lock();
>   asoc = __sctp_lookup_association(net, laddr, paddr, transportp);
> - rcu_read_unlock();
>  
>   return asoc;
>  }
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index aab9e3f..69f3799 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -296,9 +296,9 @@ void sctp_transport_route(struct sctp_transport *transport,
>  }
>  
>  /* Hold a reference to a transport.  */
> -void sctp_transport_hold(struct sctp_transport *transport)
> +int sctp_transport_hold(struct sctp_transport *transport)
>  {
> - atomic_inc(&transport->refcnt);
> + return atomic_add_unless(&transport->refcnt, 1, 0);
>  }
>  
>  /* Release a reference to a transport and clean up
> 



Re: [PATCH net-next 1/5] sctp: add the rhashtable apis for sctp global transport hashtable

2016-01-05 Thread Vlad Yasevich
On 12/30/2015 10:50 AM, Xin Long wrote:
> transport hashtable will replace the association hashtable to do the
> lookup for transport, and then get association by t->assoc, rhashtable
> apis will be used because it's resizable, scalable and uses rcu.
> 
> lport + rport + paddr will be the base hashkey to locate the chain,
> with net to protect one netns from another, then plus the laddr to
> compare to get the target.
> 
> this patch will provide the lookup functions:
> - sctp_epaddr_lookup_transport
> - sctp_addrs_lookup_transport
> 
> hash/unhash functions:
> - sctp_hash_transport
> - sctp_unhash_transport
> 
> init/destroy functions:
> - sctp_transport_hashtable_init
> - sctp_transport_hashtable_destroy
> 
> Signed-off-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  include/net/sctp/sctp.h|  11 
>  include/net/sctp/structs.h |   5 ++
>  net/sctp/input.c   | 131 
> +
>  3 files changed, 147 insertions(+)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index ce13cf2..7bbdfba 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -143,6 +143,17 @@ void sctp_icmp_proto_unreachable(struct sock *sk,
>struct sctp_transport *t);
>  void sctp_backlog_migrate(struct sctp_association *assoc,
> struct sock *oldsk, struct sock *newsk);
> +int sctp_transport_hashtable_init(void);
> +void sctp_transport_hashtable_destroy(void);
> +void sctp_hash_transport(struct sctp_transport *t);
> +void sctp_unhash_transport(struct sctp_transport *t);
> +struct sctp_transport *sctp_addrs_lookup_transport(
> + struct net *net,
> + const union sctp_addr *laddr,
> + const union sctp_addr *paddr);
> +struct sctp_transport *sctp_epaddr_lookup_transport(
> + const struct sctp_endpoint *ep,
> + const union sctp_addr *paddr);
>  
>  /*
>   * sctp/proc.c
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index eea9bde..4ab87d0 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -48,6 +48,7 @@
>  #define __sctp_structs_h__
>  
>  #include <linux/ktime.h>
> +#include <linux/rhashtable.h>
>  #include <linux/socket.h>	/* linux/in.h needs this!!    */
>  #include <linux/in.h>		/* We get struct sockaddr_in. */
>  #include <linux/in6.h>		/* We get struct in6_addr     */
> @@ -123,6 +124,8 @@ extern struct sctp_globals {
>   struct sctp_hashbucket *assoc_hashtable;
>   /* This is the sctp port control hash.  */
>   struct sctp_bind_hashbucket *port_hashtable;
> + /* This is the hash of all transports. */
> + struct rhashtable transport_hashtable;
>  
>   /* Sizes of above hashtables. */
>   int ep_hashsize;
> @@ -147,6 +150,7 @@ extern struct sctp_globals {
>  #define sctp_assoc_hashtable (sctp_globals.assoc_hashtable)
>  #define sctp_port_hashsize   (sctp_globals.port_hashsize)
>  #define sctp_port_hashtable  (sctp_globals.port_hashtable)
> +#define sctp_transport_hashtable (sctp_globals.transport_hashtable)
>  #define sctp_checksum_disable(sctp_globals.checksum_disable)
>  
>  /* SCTP Socket type: UDP or TCP style. */
> @@ -753,6 +757,7 @@ static inline int sctp_packet_empty(struct sctp_packet *packet)
>  struct sctp_transport {
>   /* A list of transports. */
>   struct list_head transports;
> + struct rhash_head node;
>  
>   /* Reference counting. */
>   atomic_t refcnt;
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index b6493b3..bac8278 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -782,6 +782,137 @@ hit:
>   return ep;
>  }
>  
> +/* rhashtable for transport */
> +struct sctp_hash_cmp_arg {
> + const union sctp_addr   *laddr;
> + const union sctp_addr   *paddr;
> + const struct net*net;
> +};
> +
> +static inline int sctp_hash_cmp(struct rhashtable_compare_arg *arg,
> + const void *ptr)
> +{
> + const struct sctp_hash_cmp_arg *x = arg->key;
> + const struct sctp_transport *t = ptr;
> + struct sctp_association *asoc = t->asoc;
> + const struct net *net = x->net;
> +
> + if (x->laddr->v4.sin_port != htons(asoc->base.bind_addr.port))
> + return 1;
> + if (!sctp_cmp_addr_exact(&t->ipaddr, x->paddr))
> + return 1;
> + if (!net_eq(sock_net(asoc->base.sk), net))
> + return 1;
> + if (!sctp_bind_addr_match(&asoc->base.bind_addr,
> +   x->laddr, sctp_sk(asoc->base.sk)))
> + return 1;
> +
> + return 0;
> +}
> +
> +static inline u32 sctp_hash_obj(const void *data, u32 len, u32 seed)
> +{
> + const struct sctp_transport *t = data;
> + const union sctp_addr *paddr = &t->ipaddr;
> + 

Re: [PATCH net-next 2/5] sctp: apply rhashtable api to send/recv path

2016-01-05 Thread Vlad Yasevich
On 12/30/2015 10:50 AM, Xin Long wrote:
> apply lookup apis to two functions, for __sctp_endpoint_lookup_assoc
> and __sctp_lookup_association, it's invoked in the protection of sock
> lock, it will be safe, but sctp_lookup_association need to call
> rcu_read_lock() and to detect the t->dead to protect it.
> 
> Signed-off-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/associola.c   |  5 +
>  net/sctp/endpointola.c | 35 ---
>  net/sctp/input.c   | 39 ++-
>  net/sctp/protocol.c|  6 ++
>  4 files changed, 29 insertions(+), 56 deletions(-)
> 
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 559afd0..2bf8ec9 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -383,6 +383,7 @@ void sctp_association_free(struct sctp_association *asoc)
>   list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
>   transport = list_entry(pos, struct sctp_transport, transports);
>   list_del_rcu(pos);
> + sctp_unhash_transport(transport);
>   sctp_transport_free(transport);
>   }
>  
> @@ -500,6 +501,8 @@ void sctp_assoc_rm_peer(struct sctp_association *asoc,
>  
>   /* Remove this peer from the list. */
>   list_del_rcu(&peer->transports);
> + /* Remove this peer from the transport hashtable */
> + sctp_unhash_transport(peer);
>  
>   /* Get the first transport of asoc. */
>   pos = asoc->peer.transport_addr_list.next;
> @@ -699,6 +702,8 @@ struct sctp_transport *sctp_assoc_add_peer(struct 
> sctp_association *asoc,
>   /* Attach the remote transport to our asoc.  */
>   list_add_tail_rcu(&peer->transports, &asoc->peer.transport_addr_list);
>   asoc->peer.transport_count++;
> + /* Add this peer into the transport hashtable */
> + sctp_hash_transport(peer);

This is actually problematic.  The issue is that transports are unhashed when
removed.  However, transport removal happens after the association has been
declared dead and should have been removed from the hash and marked unreachable.

As a result, with the code above, you can now find and return a dead association.
Checking for 'dead' state is racy.

The best solution I've come up with is to hash the transports in
sctp_hash_established() and clean up in __sctp_unhash_established(), and then
handle the ADD-IP case separately.

The above would also remove the necessity to check for temporary associations,
since they should never be hashed.
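
To make the ordering concrete, here is a rough userspace sketch (not kernel
code; every model_* name is made up for illustration).  It assumes transports
are hashed and unhashed in the same step as their association, so a lookup can
never hand back an association that is already on its way out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel structures. */
struct model_assoc {
	bool hashed;	/* association is reachable through lookups */
	bool dead;
};

struct model_transport {
	struct model_assoc *asoc;
	bool hashed;	/* transport is present in the lookup table */
};

/* Transports become visible together with their association... */
static void model_hash_established(struct model_assoc *asoc,
				   struct model_transport *t, size_t n)
{
	assert(asoc != NULL && !asoc->dead);
	for (size_t i = 0; i < n; i++)
		t[i].hashed = true;
	asoc->hashed = true;
}

/* ...and disappear together, before the association is marked dead. */
static void model_unhash_established(struct model_assoc *asoc,
				     struct model_transport *t, size_t n)
{
	assert(asoc != NULL);
	for (size_t i = 0; i < n; i++)
		t[i].hashed = false;
	asoc->hashed = false;
}

/* Lookup only ever sees hashed transports, so no 'dead' check is needed. */
static struct model_assoc *model_lookup(const struct model_transport *t)
{
	return t->hashed ? t->asoc : NULL;
}
```

With that ordering the lookup path does not need to inspect a 'dead' flag at
all, which is the point of moving the hashing out of the add/remove peer calls.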

-vlad

>  
>   /* If we do not yet have a primary path, set one.  */
>   if (!asoc->peer.primary_path) {
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index 9da76ba..8838bf4 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -314,8 +314,8 @@ struct sctp_endpoint *sctp_endpoint_is_match(struct 
> sctp_endpoint *ep,
>  }
>  
>  /* Find the association that goes with this chunk.
> - * We do a linear search of the associations for this endpoint.
> - * We return the matching transport address too.
> + * We lookup the transport from hashtable at first, then get association
> + * through t->assoc.
>   */
>  static struct sctp_association *__sctp_endpoint_lookup_assoc(
>   const struct sctp_endpoint *ep,
> @@ -323,12 +323,7 @@ static struct sctp_association 
> *__sctp_endpoint_lookup_assoc(
>   struct sctp_transport **transport)
>  {
>   struct sctp_association *asoc = NULL;
> - struct sctp_association *tmp;
> - struct sctp_transport *t = NULL;
> - struct sctp_hashbucket *head;
> - struct sctp_ep_common *epb;
> - int hash;
> - int rport;
> + struct sctp_transport *t;
>  
>   *transport = NULL;
>  
> @@ -337,26 +332,12 @@ static struct sctp_association 
> *__sctp_endpoint_lookup_assoc(
>*/
>   if (!ep->base.bind_addr.port)
>   goto out;
> + t = sctp_epaddr_lookup_transport(ep, paddr);
> + if (!t || t->asoc->temp)
> + goto out;
>  
> - rport = ntohs(paddr->v4.sin_port);
> -
> - hash = sctp_assoc_hashfn(sock_net(ep->base.sk), ep->base.bind_addr.port,
> -  rport);
> - head = &sctp_assoc_hashtable[hash];
> - read_lock(&head->lock);
> - sctp_for_each_hentry(epb, &head->chain) {
> - tmp = sctp_assoc(epb);
> - if (tmp->ep != ep || rport != tmp->peer.port)
> - continue;
> -
> - t = sctp_assoc_lookup_paddr(tmp, paddr);
> - if (t) {
> - asoc = tmp;
> - *transport = t;
> - break;
> - }
> - }
> - read_unlock(&head->lock);
> + *transport = t;
> + asoc = t->asoc;
>  out:
>   return asoc;
>  }
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index bac8278..6f075d8 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -981,38 +981,19 @@ static struct 

Re: [PATCH net] sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close

2015-12-18 Thread Vlad Yasevich
On 12/17/2015 02:33 PM, Vlad Yasevich wrote:
> On 12/17/2015 02:01 PM, Marcelo Ricardo Leitner wrote:
>> Em 17-12-2015 16:29, Vlad Yasevich escreveu:
>>> On 12/17/2015 09:30 AM, Xin Long wrote:
>>>> In sctp_close, sctp_make_abort_user may return NULL because of memory
>>>> allocation failure. If this happens, it will bypass any state change
>>>> and never free the assoc. The assoc has no chance to be freed and it
>>>> will be kept in memory with the state it had even after the socket is
>>>> closed by sctp_close().
>>>>
>>>> So if sctp_make_abort_user fails to allocate memory, we should just
>>>> free the asoc, as there isn't much else that we can do.
>>>>
>>>> Signed-off-by: Xin Long <lucien@gmail.com>
>>>> Acked-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
>>>> ---
>>>>   net/sctp/socket.c | 6 +-
>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>>> index 9b6cc6d..267b8f8 100644
>>>> --- a/net/sctp/socket.c
>>>> +++ b/net/sctp/socket.c
>>>> @@ -1513,8 +1513,12 @@ static void sctp_close(struct sock *sk, long 
>>>> timeout)
>>>>   struct sctp_chunk *chunk;
>>>>
>>>>   chunk = sctp_make_abort_user(asoc, NULL, 0);
>>>> -if (chunk)
>>>> +if (chunk) {
>>>>   sctp_primitive_ABORT(net, asoc, chunk);
>>>> +} else {
>>>> +sctp_unhash_established(asoc);
>>>> +sctp_association_free(asoc);
>>>> +}
>>>
>>> I don't think you can do that for an association that has not been closed.
>>>
>>> I think a cleaner approach might be to update abort primitive handlers
>>> to handle a NULL chunk value and unconditionally call the primitive.
>>>
>>> This guarantees that any timers or waitqueues that might be active are
>>> stopped correctly.
>>
>> sctp_association_free() is the one who does that job, even that way. All in 
>> between the
>> primitive call and then the call to sctp_association_free() is just status 
>> changes and
>> packet xmit, which doing this way we cut out when we are in memory pressure. 
>> pkt xmit or
>> ULP events are likely going to fail too anyway.
>>
>> sctp_sf_do_9_1_prm_abort() -> SCTP_CMD_ASSOC_FAILED ->
>>   sctp_cmd_assoc_failed -> ULP events, send abort, and SCTP_CMD_DELETE_TCB ->
>> sctp_cmd_delete_tcb ->
>>   sctp_unhash_established(asoc);
>>   sctp_association_free(asoc);
>> and returns.
>>
>> There is a check on sctp_cmd_delete_tcb() that avoids calling that on temp 
>> assocs on
>> listening sockets, but that condition is false due to the check on 
>> sk_shutdown so it will
>> call those two functions anyway.
> 
> The condition I am a bit concerned about is one thread waiting in 
> sctp_wait_for_sndbuf
> while another does an abort.
> 
> I think this is OK though.  I need to look a bit more...

I think the only time this ends up biting us is if SO_SNDTIMEO was used and we
ran out of send buffer.  It looks to me like schedule_timeout() will wait until
the timer expires and, depending on the timer value, you could wait quite a while.

With this path, since you don't transition state, the asoc->wait wait queue is
never notified and it could be hanging around for quite a while.

-vlad   

> 
> -vlad
> 
> 
>>
>>   Marcelo
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close

2015-12-17 Thread Vlad Yasevich
On 12/17/2015 09:30 AM, Xin Long wrote:
> In sctp_close, sctp_make_abort_user may return NULL because of memory
> allocation failure. If this happens, it will bypass any state change
> and never free the assoc. The assoc has no chance to be freed and it
> will be kept in memory with the state it had even after the socket is
> closed by sctp_close().
> 
> So if sctp_make_abort_user fails to allocate memory, we should just
> free the asoc, as there isn't much else that we can do.
> 
> Signed-off-by: Xin Long 
> Acked-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/socket.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 9b6cc6d..267b8f8 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1513,8 +1513,12 @@ static void sctp_close(struct sock *sk, long timeout)
>   struct sctp_chunk *chunk;
>  
>   chunk = sctp_make_abort_user(asoc, NULL, 0);
> - if (chunk)
> + if (chunk) {
>   sctp_primitive_ABORT(net, asoc, chunk);
> + } else {
> + sctp_unhash_established(asoc);
> + sctp_association_free(asoc);
> + }

I don't think you can do that for an association that has not been closed.

I think a cleaner approach might be to update abort primitive handlers
to handle a NULL chunk value and unconditionally call the primitive.

This guarantees that any timers or waitqueues that might be active are
stopped correctly.
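
Roughly what I have in mind, as a userspace sketch only (the model_* types and
functions are hypothetical stand-ins, not the real primitive handlers): the
handler accepts a NULL chunk, skips only the send, and always runs the teardown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the kernel's types. */
struct model_chunk {
	int len;
};

struct model_assoc {
	bool timers_running;
	bool waiters_woken;
	bool closed;
	int aborts_sent;
};

/* Abort handler that tolerates a NULL chunk: the ABORT is sent only if
 * one could be built, but the teardown always runs.
 */
static void model_primitive_abort(struct model_assoc *asoc,
				  const struct model_chunk *chunk)
{
	assert(asoc != NULL);

	if (chunk)
		asoc->aborts_sent++;

	asoc->timers_running = false;	/* stop any active timers */
	asoc->waiters_woken = true;	/* notify anyone sleeping on the asoc */
	asoc->closed = true;		/* state transition still happens */
}

/* The caller no longer branches on allocation failure. */
static void model_close(struct model_assoc *asoc,
			const struct model_chunk *chunk_or_null)
{
	model_primitive_abort(asoc, chunk_or_null);
}
```

The close path then has a single call site, and the out-of-memory case goes
through the same state change as the normal one.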

-vlad


>   } else
>   sctp_primitive_SHUTDOWN(net, asoc, NULL);
>   }
> 



Re: [PATCH net] sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close

2015-12-17 Thread Vlad Yasevich
On 12/17/2015 02:01 PM, Marcelo Ricardo Leitner wrote:
> Em 17-12-2015 16:29, Vlad Yasevich escreveu:
>> On 12/17/2015 09:30 AM, Xin Long wrote:
>>> In sctp_close, sctp_make_abort_user may return NULL because of memory
>>> allocation failure. If this happens, it will bypass any state change
>>> and never free the assoc. The assoc has no chance to be freed and it
>>> will be kept in memory with the state it had even after the socket is
>>> closed by sctp_close().
>>>
>>> So if sctp_make_abort_user fails to allocate memory, we should just
>>> free the asoc, as there isn't much else that we can do.
>>>
>>> Signed-off-by: Xin Long <lucien@gmail.com>
>>> Acked-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
>>> ---
>>>   net/sctp/socket.c | 6 +-
>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 9b6cc6d..267b8f8 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -1513,8 +1513,12 @@ static void sctp_close(struct sock *sk, long timeout)
>>>   struct sctp_chunk *chunk;
>>>
>>>   chunk = sctp_make_abort_user(asoc, NULL, 0);
>>> -if (chunk)
>>> +if (chunk) {
>>>   sctp_primitive_ABORT(net, asoc, chunk);
>>> +} else {
>>> +sctp_unhash_established(asoc);
>>> +sctp_association_free(asoc);
>>> +}
>>
>> I don't think you can do that for an association that has not been closed.
>>
>> I think a cleaner approach might be to update abort primitive handlers
>> to handle a NULL chunk value and unconditionally call the primitive.
>>
>> This guarantees that any timers or waitqueues that might be active are
>> stopped correctly.
> 
> sctp_association_free() is the one who does that job, even that way. All in 
> between the
> primitive call and then the call to sctp_association_free() is just status 
> changes and
> packet xmit, which doing this way we cut out when we are in memory pressure. 
> pkt xmit or
> ULP events are likely going to fail too anyway.
> 
> sctp_sf_do_9_1_prm_abort() -> SCTP_CMD_ASSOC_FAILED ->
>   sctp_cmd_assoc_failed -> ULP events, send abort, and SCTP_CMD_DELETE_TCB ->
> sctp_cmd_delete_tcb ->
>   sctp_unhash_established(asoc);
>   sctp_association_free(asoc);
> and returns.
> 
> There is a check on sctp_cmd_delete_tcb() that avoids calling that on temp 
> assocs on
> listening sockets, but that condition is false due to the check on 
> sk_shutdown so it will
> call those two functions anyway.

The condition I am a bit concerned about is one thread waiting in
sctp_wait_for_sndbuf while another does an abort.

I think this is OK though.  I need to look a bit more...

-vlad


> 
>   Marcelo
> 



Re: use-after-free in sctp_do_sm

2015-12-14 Thread Vlad Yasevich
On 12/14/2015 04:50 AM, David Laight wrote:
> From: Vlad Yasevich
>> Sent: 11 December 2015 18:38
> ...
>>> Found a similar place in abort primitive handling like in this last
>>> patch update, it's probably the issue you're still triggering.
>>>
>>> Also found another place that may lead to this use after free, in case
>>> we receive a packet with a chunk that has no data.
>>>
>>> Oh my.. :)
>>
>> Yes.  This is what I was worried about...  Anything that triggers
>> a DELETE_TCB command has to return a code that we can trap.
>>
>> The other way is to do what Dmitri suggested, but even there, we
>> need to be very careful.
> 
> I'm always wary of anything that queues actions up for later processing.
> It is far too easy (as found here) to end up processing actions
> in invalid states, or to process actions in 'unusual' orders when
> specific events happen close together.
> 
> I wonder how much fallout there'd be from getting the sctp code
> to immediately action things, instead of queuing the actions for later.
> It would certainly remove a lot of the unusual combinations of events.
> 

We've bandied this idea around for a while, but no one has had the time
to tackle this.  This would be a rather time-consuming task, but in the end
might be a good idea.

-vlad

>   David
> 
> 



Re: [V2 PATCH 1/1] net: sctp: dynamically enable or disable pf state

2015-12-14 Thread Vlad Yasevich
On 12/14/2015 01:22 AM, zyjzyj2...@gmail.com wrote:
> From: Zhu Yanjun <zyjzyj2...@gmail.com>
> 
> As we all know, the value of pf_retrans >= max_retrans_path can
> disable pf state. The variables of pf_retrans and max_retrans_path
> can be changed by the user space application.
> 
> Sometimes the user expects to disable pf state while the 2
> variables are changed to enable pf state. So it is necessary to
> introduce a new variable to disable pf state.
> 
> According to the suggestions from Vlad Yasevich, extra1 and extra2
> are removed. The initialization of pf_enable is added.
> 
> Signed-off-by: Zhu Yanjun <zyjzyj2...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  include/net/netns/sctp.h |7 +++
>  net/sctp/protocol.c  |3 +++
>  net/sctp/sm_sideeffect.c |5 -
>  net/sctp/sysctl.c|7 +++
>  4 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> index 8ba379f..c501d67 100644
> --- a/include/net/netns/sctp.h
> +++ b/include/net/netns/sctp.h
> @@ -89,6 +89,13 @@ struct netns_sctp {
>   int pf_retrans;
>  
>   /*
> +  * Disable Potentially-Failed feature, the feature is enabled by default
> +  * pf_enable-  0  : disable pf
> +  *  - >0  : enable pf
> +  */
> + int pf_enable;
> +
> + /*
>* Policy for preforming sctp/socket accounting
>* 0   - do socket level accounting, all assocs share sk_sndbuf
>* 1   - do sctp accounting, each asoc may use sk_sndbuf bytes
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 4d9912f..571a631 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1223,6 +1223,9 @@ static int __net_init sctp_defaults_init(struct net 
> *net)
>   /* Max.Burst- 4 */
>   net->sctp.max_burst = SCTP_DEFAULT_MAX_BURST;
>  
> + /* Enable pf state by default */
> + net->sctp.pf_enable = 1;
> +
>   /* Association.Max.Retrans  - 10 attempts
>* Path.Max.Retrans - 5  attempts (per destination address)
>* Max.Init.Retransmits - 8  attempts
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index 6098d4c..05cd164 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -477,6 +477,8 @@ static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t 
> *commands,
>struct sctp_transport *transport,
>int is_hb)
>  {
> + struct net *net = sock_net(asoc->base.sk);
> +
>   /* The check for association's overall error counter exceeding the
>* threshold is done in the state function.
>*/
> @@ -503,7 +505,8 @@ static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t 
> *commands,
>* is SCTP_ACTIVE, then mark this transport as Partially Failed,
>* see SCTP Quick Failover Draft, section 5.1
>*/
> - if ((transport->state == SCTP_ACTIVE) &&
> + if (net->sctp.pf_enable &&
> +(transport->state == SCTP_ACTIVE) &&
>  (asoc->pf_retrans < transport->pathmaxrxt) &&
>  (transport->error_count > asoc->pf_retrans)) {
>  
> diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
> index 26d50c5..ccbfc93 100644
> --- a/net/sctp/sysctl.c
> +++ b/net/sctp/sysctl.c
> @@ -308,6 +308,13 @@ static struct ctl_table sctp_net_table[] = {
>   .extra1 = _autoclose_min,
>   .extra2 = _autoclose_max,
>   },
> + {
> + .procname   = "pf_enable",
> + .data   = &init_net.sctp.pf_enable,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + },
>  
>   { /* sentinel */ }
>  };
> 



Re: [PATCH 1/1] net: sctp: dynamically enable or disable pf state

2015-12-11 Thread Vlad Yasevich
On 12/11/2015 04:05 AM, zyjzyj2...@gmail.com wrote:
> From: Zhu Yanjun 
> 
> As we all know, the value of pf_retrans >= max_retrans_path can
> disable pf state. The variables of pf_retrans and max_retrans_path
> can be changed by the user space application.
> 
> Sometimes the user expects to disable pf state while the 2
> variables are changed to enable pf state. So it is necessary to
> introduce a new variable to disable pf state.
> 
> Signed-off-by: Zhu Yanjun 
> ---
>  Documentation/networking/ip-sysctl.txt |   17 +
>  include/net/netns/sctp.h   |7 +++
>  net/sctp/sm_sideeffect.c   |5 -
>  net/sctp/sysctl.c  |9 +
>  4 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index f647900..7195c24 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1723,6 +1723,23 @@ addip_enable - BOOLEAN
>  
>   Default: 0
>  
> +pf_enable - INTEGER
> + Enable or disable pf state. A value of pf_retrans > path_max_retrans
> + also disables pf state. That is, one of both pf_enable and
> + pf_retrans > path_max_retrans can disable pf state. Since pf_retrans
> + and path_max_retrans can be changed by userspace application, sometimes
> + user expects to disable pf state by the value of
> + pf_retrans > path_max_retrans, but ocassionally the value of pf_retrans
> + or path_max_retrans is changed by the user application, this pf state is
> + enabled. As such, it is necessary to add this to dynamically enable
> + and disable pf state.
> +
> + 1: Enable pf.
> +
> + 0: Disable pf.
> +
> + Default: 1

You never set the default value anywhere in the patch and thus disable PF
extension by default.
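
A small userspace sketch of why (hypothetical model_* names, and assuming the
per-namespace settings start out zeroed): a documented "Default: 1" has no
effect unless some init code actually assigns it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a per-namespace settings block. */
struct model_netns {
	int pf_retrans;
	int pf_enable;
};

/* Zeroed allocation, which is the assumed starting state of the settings. */
static struct model_netns *model_netns_alloc(void)
{
	return calloc(1, sizeof(struct model_netns));
}

/* Defaults must be assigned explicitly; documentation alone does not set them. */
static void model_defaults_init(struct model_netns *ns)
{
	assert(ns != NULL);
	ns->pf_enable = 1;
}
```

Without the explicit assignment the flag reads as 0, so the new check in the
transport-strike path would turn PF off for everyone.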

> +
>  addip_noauth_enable - BOOLEAN
>   Dynamic Address Reconfiguration (ADD-IP) requires the use of
>   authentication to protect the operations of adding or removing new
> diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
> index 8ba379f..c501d67 100644
> --- a/include/net/netns/sctp.h
> +++ b/include/net/netns/sctp.h
> @@ -89,6 +89,13 @@ struct netns_sctp {
>   int pf_retrans;
>  
>   /*
> +  * Disable Potentially-Failed feature, the feature is enabled by default
> +  * pf_enable-  0  : disable pf
> +  *  - >0  : enable pf
> +  */
> + int pf_enable;
> +
> + /*
>* Policy for preforming sctp/socket accounting
>* 0   - do socket level accounting, all assocs share sk_sndbuf
>* 1   - do sctp accounting, each asoc may use sk_sndbuf bytes
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index 6098d4c..50309ed 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -477,6 +477,8 @@ static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t 
> *commands,
>struct sctp_transport *transport,
>int is_hb)
>  {
> + struct net *net = sock_net(asoc->base.sk);
> +
>   /* The check for association's overall error counter exceeding the
>* threshold is done in the state function.
>*/
> @@ -503,7 +505,8 @@ static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t 
> *commands,
>* is SCTP_ACTIVE, then mark this transport as Partially Failed,
>* see SCTP Quick Failover Draft, section 5.1
>*/
> - if ((transport->state == SCTP_ACTIVE) &&
> + if (net->sctp.pf_enable &&
> +(transport->state == SCTP_ACTIVE) &&
>  (asoc->pf_retrans < transport->pathmaxrxt) &&
>  (transport->error_count > asoc->pf_retrans)) {
>  
> diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
> index 26d50c5..0a4f402 100644
> --- a/net/sctp/sysctl.c
> +++ b/net/sctp/sysctl.c
> @@ -308,6 +308,15 @@ static struct ctl_table sctp_net_table[] = {
>   .extra1 = _autoclose_min,
>   .extra2 = _autoclose_max,
>   },
> + {
> + .procname   = "pf_enable",
> + .data   = &init_net.sctp.pf_enable,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + .extra1 = ,
> + .extra2 = _max
> + },

extra1 and extra2 above are ignored in proc_dointvec.  Don't include them.
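
As a userspace illustration only (the struct and the model_* handlers below are
simplified, made-up stand-ins, not the sysctl implementation): a plain integer
handler never looks at the bounds, while a min/max style handler rejects
out-of-range writes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified table entry; not the kernel's struct ctl_table. */
struct model_ctl {
	int *data;
	const int *extra1;	/* lower bound, if the handler looks at it */
	const int *extra2;	/* upper bound, if the handler looks at it */
};

/* Models a plain integer handler: stores the value, never reads the bounds. */
static int model_dointvec(const struct model_ctl *ctl, int val)
{
	assert(ctl != NULL && ctl->data != NULL);
	*ctl->data = val;
	return 0;
}

/* Models a range-checking handler: rejects values outside [extra1, extra2]. */
static int model_dointvec_minmax(const struct model_ctl *ctl, int val)
{
	assert(ctl != NULL && ctl->data != NULL);
	if (ctl->extra1 && val < *ctl->extra1)
		return -EINVAL;
	if (ctl->extra2 && val > *ctl->extra2)
		return -EINVAL;
	*ctl->data = val;
	return 0;
}
```

So with the plain handler the bounds are dead weight; if a range check is
really wanted, the handler has to be the min/max variant instead.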

-vlad

>  
>   { /* sentinel */ }
>  };
> 



Re: use-after-free in sctp_do_sm

2015-12-11 Thread Vlad Yasevich
On 12/11/2015 09:03 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 11, 2015 at 11:51:21AM -0200, Marcelo Ricardo Leitner wrote:
>> Em 11-12-2015 11:35, Dmitry Vyukov escreveu:
>>> On Wed, Dec 9, 2015 at 5:41 PM, Marcelo Ricardo Leitner
>>>  wrote:
 On Wed, Dec 09, 2015 at 01:03:56PM -0200, Marcelo Ricardo Leitner wrote:
> On Wed, Dec 09, 2015 at 03:41:29PM +0100, Dmitry Vyukov wrote:
>> On Tue, Dec 8, 2015 at 8:22 PM, Dmitry Vyukov  wrote:
>>> On Tue, Dec 8, 2015 at 6:40 PM, Marcelo Ricardo Leitner
>>>  wrote:
> ...
 The patches were combined already, but this last pick by Vlad is just
 not yet patched. It's not necessary for your testing and I didn't want
 to interrupt it in case you were already testing it.

 You can use my last patch here, from 2 emails ago, the one which
 contains this line:
 -   case SCTP_DISPOSITION_ABORT:
>>>
>>>
>>> You are right. I missed that they are combined. Testing with it now.
>>
>>
>>
>>
>> Use-after-free still happens.
>> I am on commit aa53685549a2cfb5f175b0c4a20bc9aa1e5a1b85 (Dec 8) plus
>> the following sctp-related changes:
>
> Changes are fine.  Ugh. Ok, I'll try your new reproducer here.

 Heh I wasn't going to reproduce this by myself anytime soon, I think.
 It's using the same socket to connect to itself, and only happens if the
 connect() gets there before the listen() call. Figured this out because
 I could only reproduce it under strace at first.

 Please give this other patch a try. A state command
 (sctp_sf_cookie_wait_prm_abort) was issuing SCTP_CMD_INIT_FAILED, which
 leads to SCTP_CMD_DELETE_TCB, but returning SCTP_DISPOSITION_CONSUME,
 which fooled the patch.

 ---8<---
 commit 9f84d50e36cee0ce66e4ce9b3b1665e0a1dbcdd3
 Author: Marcelo Ricardo Leitner 
 Date:   Fri Dec 4 15:30:23 2015 -0200

 sctp: fix use-after-free in pr_debug statement

 Dmitry Vyukov reported a use-after-free in the code expanded by the
 macro debug_post_sfx, which is caused by the use of the asoc pointer
 after it was freed within sctp_side_effect() scope.

 This patch fixes it by allowing sctp_side_effect to clear that asoc
 pointer when the TCB is freed.

 As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT 
 case
 because it will trigger DELETE_TCB too on that same loop.

 Also, there was a place issuing SCTP_CMD_INIT_FAILED but returning
 SCTP_DISPOSITION_CONSUME, which would fool the scheme above. Fix it by
 returning SCTP_DISPOSITION_ABORT instead.

 The macro is already prepared to handle such NULL pointer.

 Reported-by: Dmitry Vyukov 

 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index 6098d4c42fa9..be23d5c2074f 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -63,7 +63,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
  static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t 
 subtype,
  sctp_state_t state,
  struct sctp_endpoint *ep,
 -struct sctp_association *asoc,
 +struct sctp_association **asoc,
  void *event_arg,
  sctp_disposition_t status,
  sctp_cmd_seq_t *commands,
 @@ -1123,7 +1123,7 @@ int sctp_do_sm(struct net *net, sctp_event_t 
 event_type, sctp_subtype_t subtype,
 debug_post_sfn();

 error = sctp_side_effects(event_type, subtype, state,
 - ep, asoc, event_arg, status,
 + ep, &asoc, event_arg, status,
   &commands, gfp);
 debug_post_sfx();

 @@ -1136,7 +1136,7 @@ int sctp_do_sm(struct net *net, sctp_event_t 
 event_type, sctp_subtype_t subtype,
  static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t 
 subtype,
  sctp_state_t state,
  struct sctp_endpoint *ep,
 -struct sctp_association *asoc,
 +struct sctp_association **asoc,
  void *event_arg,
  sctp_disposition_t status,
  sctp_cmd_seq_t *commands,
 @@ -1151,7 +1151,7 @@ static int sctp_side_effects(sctp_event_t 
 event_type, sctp_subtype_t subtype,
  * disposition SCTP_DISPOSITION_CONSUME.
  */
 

Re: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-09 Thread Vlad Yasevich
On 12/09/2015 10:25 AM, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
> 
> SCTP is lacking proper np->opt cloning at accept() time.
> 
> TCP and DCCP use ipv6_dup_options() helper, do the same
> in SCTP.
> 
> We might later factorize this code in a common helper to avoid
> future mistakes.
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

This is sufficient for accept() processing, but looks like peeloff is missing
a bunch of ipv6 support.  I'll see if I can cook something up to fix that part.
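
For anyone following along, the underlying problem can be sketched in userspace
like this (model_* names are hypothetical, this is not the ipv6 code): a bulk
memcpy of the parent's per-socket info copies the options pointer, so parent
and child share one allocation until one of them frees it.  The clone has to
replace that pointer with its own copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the option block and per-socket info. */
struct model_opts {
	int hoplimit;
};

struct model_pinfo {
	struct model_opts *opt;
	int flags;
};

/* Private copy of the options; NULL in, NULL out. */
static struct model_opts *model_dup_opts(const struct model_opts *src)
{
	struct model_opts *dst;

	if (!src)
		return NULL;
	dst = malloc(sizeof(*dst));
	if (dst)
		*dst = *src;
	return dst;
}

/* Bulk-copy the parent, then replace the shared pointer with a private copy. */
static void model_clone_pinfo(struct model_pinfo *child,
			      const struct model_pinfo *parent)
{
	assert(child != NULL && parent != NULL);
	memcpy(child, parent, sizeof(*child));
	child->opt = model_dup_opts(parent->opt);
}
```

After the clone, freeing the parent's options leaves the child's copy valid,
which is what the accept() path needs.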

-vlad

> ---
>  net/sctp/ipv6.c |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index d28c0b4c9128..ec529121f38a 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -641,6 +641,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>   struct sock *newsk;
>   struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
>   struct sctp6_sock *newsctp6sk;
> + struct ipv6_txoptions *opt;
>  
>   newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0);
>   if (!newsk)
> @@ -660,6 +661,13 @@ static struct sock *sctp_v6_create_accept_sk(struct sock 
> *sk,
>  
>   memcpy(newnp, np, sizeof(struct ipv6_pinfo));
>  
> + rcu_read_lock();
> + opt = rcu_dereference(np->opt);
> + if (opt)
> + opt = ipv6_dup_options(newsk, opt);
> + RCU_INIT_POINTER(newnp->opt, opt);
> + rcu_read_unlock();
> +
>   /* Initialize sk's sport, dport, rcv_saddr and daddr for getsockname()
>* and getpeername().
>*/
> 
> 



Re: use-after-free in sctp_do_sm

2015-12-07 Thread Vlad Yasevich
On 12/07/2015 01:52 PM, Marcelo Ricardo Leitner wrote:
> On Mon, Dec 07, 2015 at 02:20:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Dec 7, 2015 at 2:15 PM, Marcelo Ricardo Leitner
>> <marcelo.leit...@gmail.com> wrote:
>>> On Mon, Dec 07, 2015 at 12:26:09PM +0100, Dmitry Vyukov wrote:
>>>> On Sat, Dec 5, 2015 at 5:39 PM, Vlad Yasevich <vyasev...@gmail.com> wrote:
> ...
>>>>> Hi Marcelo
>>>>>
>>>>> I think you also need to catch the SCTP_DISPOSITION_ABORT and update
>>>>> the pointer.  There are some issues there though as some functions report
>>>>> that code without actually destroying the association.  This happens when
>>>>> the ABORT chunk may be dropped.
>>>>>
>>>>> I think this might be why we still see the issue.
>>>>
>>>>
>>>> Marcelo,
>>>>
>>>> Is this info enough for you to cook another fix?
>>>
>>> Hi, I think so. I was really wondering how you could trigger that issue
>>> without the timestamp fix and Vlad's comment does shed some light on it.
>>>
>>> I'll do more tests later today, but what did you have connecting to the
>>> listening socket? Somehow you made that accept() call to return..
>>
>> Local connect in another thread I guess.
> 
> Vlad, I reviewed the places on which it returns SCTP_DISPOSITION_ABORT,
> and if I didn't miss something in there all of them either issue
> SCTP_CMD_ASSOC_FAILED or SCTP_CMD_INIT_FAILED before returning it, thus
> delaying DELETE_TCB and with that the asoc free.

They delay it from the perspective of the command interpreter since the command
to delete the TCB happens a little later, but the status code is checked after all
commands are processed and command processing doesn't change it.  So the 'status'
code would still be SCTP_DISPOSITION_ABORT after the DELETE_TCB command was processed.
So, I think we may still have a use-after-free issue here.
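
A userspace sketch of the idea (all model_* names are invented, and the sketch
collapses command processing and pointer clearing into one step, which the real
code does not): the side-effect routine takes a pointer to the caller's asoc
pointer and clears it for every disposition that ends with the TCB deleted, so
a later debug print sees NULL rather than freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical dispositions; only the shape matters here. */
enum model_disposition {
	MODEL_CONSUME,
	MODEL_ABORT,
	MODEL_DELETE_TCB
};

struct model_assoc {
	int state;
};

/* Frees the association for dispositions that delete the TCB and clears
 * the caller's pointer so it cannot be dereferenced afterwards.
 */
static void model_side_effects(struct model_assoc **asoc,
			       enum model_disposition status)
{
	assert(asoc != NULL);

	switch (status) {
	case MODEL_ABORT:
	case MODEL_DELETE_TCB:
		free(*asoc);
		*asoc = NULL;
		break;
	case MODEL_CONSUME:
		break;
	}
}

/* Debug helper that is safe to call after the side effects have run. */
static int model_debug_state(const struct model_assoc *asoc)
{
	return asoc ? asoc->state : -1;
}
```

The catch is that this only works if every path that deletes the TCB also
reports a disposition the routine traps.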

> There is one place,
> though, that may not do it that way, it's sctp_sf_abort_violation(), but
> then that code only runs if asoc is already NULL by then.

I don't believe so.  The violation state function can run with a non-NULL
association if we are encountering protocol violations after the association is
established.

-vlad

> 
> Dmitry, still no luck here, cannot reproduce another hit.
> I'm using sctp_test and a custom test of mine, both on localhost so I
> would catch it in server or client side, nothing..
> 
> I need more info. Please enable the pr_debug() on debug_post_sfn() macro
> and see which status is being reported when you trigger the issue.
> And/or share a traffic capture so we can see what's going on with the
> association.
> 
>   Marcelo
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sctp_do_sm

2015-12-07 Thread Vlad Yasevich
On 12/07/2015 02:50 PM, Marcelo Ricardo Leitner wrote:
> On Mon, Dec 07, 2015 at 02:33:52PM -0500, Vlad Yasevich wrote:
>> On 12/07/2015 01:52 PM, Marcelo Ricardo Leitner wrote:
>>> On Mon, Dec 07, 2015 at 02:20:47PM +0100, Dmitry Vyukov wrote:
>>>> On Mon, Dec 7, 2015 at 2:15 PM, Marcelo Ricardo Leitner
>>>> <marcelo.leit...@gmail.com> wrote:
>>>>> On Mon, Dec 07, 2015 at 12:26:09PM +0100, Dmitry Vyukov wrote:
>>>>>> On Sat, Dec 5, 2015 at 5:39 PM, Vlad Yasevich <vyasev...@gmail.com> wrote:
>>> ...
>>>>>>> Hi Marcelo
>>>>>>>
>>>>>>> I think you also need to catch the SCTP_DISPOSITION_ABORT and update
>>>>>>> the pointer.  There are some issues there though as some functions report
>>>>>>> that code without actually destroying the association.  This happens when
>>>>>>> the ABORT chunk may be dropped.
>>>>>>>
>>>>>>> I think this might be why we still see the issue.
>>>>>>
>>>>>>
>>>>>> Marcelo,
>>>>>>
>>>>>> Is this info enough for you to cook another fix?
>>>>>
>>>>> Hi, I think so. I was really wondering how you could trigger that issue
>>>>> without the timestamp fix and Vlad's comment does shed some light on it.
>>>>>
>>>>> I'll do more tests later today, but what did you have connecting to the
>>>>> listening socket? Somehow you made that accept() call to return..
>>>>
>>>> Local connect in another thread I guess.
>>>
>>> Vlad, I reviewed the places on which it returns SCTP_DISPOSITION_ABORT,
>>> and if I didn't miss something in there all of them either issue
>>> SCTP_CMD_ASSOC_FAILED or SCTP_CMD_INIT_FAILED before returning it, thus
>>> delaying DELETE_TCB and with that the asoc free.
>>
>> They delay it from the perspective of the command interpreter, since the command
>> to delete the TCB happens a little later, but the status code is checked after all
>> commands are processed, and command processing doesn't change it.  So the 'status'
>> code would still be SCTP_DISPOSITION_ABORT after the DELETE_TCB command was processed.
>> So, I think we may still have a use-after-free issue here.
> 
> Gotcha! That's pretty much it then. From that point of view now, there
> shouldn't be a case that it returns _ABORT without freeing the asoc in
> the same loop. (more below)
> 
>>> There is one place,
>>> though, that may not do it that way, it's sctp_sf_abort_violation(), but
>>> then that code only runs if asoc is already NULL by then.
>>
>> I don't believe so.  The violation state function can run with a non-NULL association
>> if we are encountering protocol violations after the association is established.
> 
> Yup, that's correct. I just tried to reference one case on which it
> would return _ABORT without issuing any of those _FAILEDs before doing
> so (meaning the association could still be valid) but that in that case,
> the asoc was already NULL.

I think it is possible to hit the 'discard:' label in that function while still
having a valid association.  That happens when the ABORT chunk is required to be
authenticated.  In that case, instead of generating an ABORT and terminating the
current association, we just drop the packet, but still report an _ABORT disposition code.

This probably needs to change if we are going to catch the _ABORT disposition and
clear the asoc pointer.

-vlad

> 
> Dmitry, please give this one a run, as I still cannot reproduce your use
> case..
> 
> ---8<---
> 
> commit b63ad8dc45257dd6c536ac0227fcc623efd9328b
> Author: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
> Date:   Fri Dec 4 15:30:23 2015 -0200
> 
> sctp: fix use-after-free in pr_debug statement
> 
> Dmitry Vyukov reported a use-after-free in the code expanded by the
> macro debug_post_sfx, which is caused by the use of the asoc pointer
> after it was freed within sctp_side_effect() scope.
> 
> This patch fixes it by allowing sctp_side_effect to clear that asoc
> pointer when the TCB is freed.
> 
> As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT case
> because it will trigger DELETE_TCB too on that same loop.
> 
> The macro is already prepared to handle such NULL pointer.
> 
> Reported-by: Dmi

Re: use-after-free in sctp_do_sm

2015-12-05 Thread Vlad Yasevich
On 12/04/2015 04:34 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 04, 2015 at 09:25:35PM +0100, Dmitry Vyukov wrote:
>> On Fri, Dec 4, 2015 at 6:48 PM, Marcelo Ricardo Leitner
>>  wrote:
>>> Hi Dmitry,
>>>
>>> Can you please test this patch?
>>> I'll re-post with proper subject if it works.
>>
>> Still happening with the same stacks.
> 
> Then there may be another one, I'm afraid.
> 
> I'm using the testapp you shared in the first email, with that debug line
> enabled and added a new one:
> +   pr_debug("%p %d\n", asoc, asoc ? asoc->state : 0);
> debug_post_sfx();
> (should have used %x, but ok)
> 
> Also enabled slub_debug=PUZ, and I get:
> 
> without the patch:
> [   87.873640] sctp: 8800b71533d8 1
> [   87.873647] sctp: sctp_do_sm[post-sfx]: error:0,
> asoc:8800b71533d8[STATE_CLOSED]
> [   87.873739] sctp: 8800b71533d8 1
> [   87.873742] sctp: sctp_do_sm[post-sfx]: error:0,
> asoc:8800b71533d8[STATE_CLOSED]
> [   87.875149] sctp: 8800b71533d8 1802201963
> [   87.875238] sctp: sctp_do_sm[post-sfx]: error:0,
> asoc:8800b71533d8[STATE_CLOSED]
> 
> 1802201963 = 0x6b6b6b6b, poison
> 
> with the patch:
> [   81.071265] sctp: 880137571148 1
> [   81.071273] sctp: sctp_do_sm[post-sfx]: error:0,
> asoc:880137571148[STATE_CLOSED]
> [   81.071372] sctp: 880137571148 1
> [   81.071375] sctp: sctp_do_sm[post-sfx]: error:0,
> asoc:880137571148[STATE_CLOSED]
> [   81.072423] sctp:   (null) 0
> [   81.072427] sctp: sctp_do_sm[post-sfx]: error:0, asoc:
> (null)[STATE_CLOSED]
> 
> This one, at least, is gone with this patch.
> 
>   Marcelo
> 

Hi Marcelo

I think you also need to catch the SCTP_DISPOSITION_ABORT and update
the pointer.  There are some issues there though as some functions report
that code without actually destroying the association.  This happens when
the ABORT chunk may be dropped.

I think this might be why we still see the issue.

-vlad
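
A minimal userspace sketch of the pattern under discussion: the side-effect step that frees the association also clears the caller's pointer, so a later debug statement sees NULL instead of freed memory. All names (toy_asoc, toy_side_effects, toy_do_sm) are hypothetical stand-ins, not the kernel's code, and the sketch does not model the case where _ABORT is reported without the association being destroyed.

```c
#include <stdlib.h>

/* Hypothetical stand-in; not the kernel's struct sctp_association. */
struct toy_asoc { int state; };

enum toy_cmd { TOY_CMD_NONE, TOY_CMD_DELETE_TCB };

/* Runs the queued command. On DELETE_TCB it frees the object and clears
 * the caller's pointer so later code cannot touch freed memory. */
static void toy_side_effects(enum toy_cmd cmd, struct toy_asoc **asocp)
{
	if (cmd == TOY_CMD_DELETE_TCB) {
		free(*asocp);
		*asocp = NULL;
	}
}

/* Returns the state seen by the "debug" step, or -1 if the asoc is gone
 * (or could not be allocated). */
static int toy_do_sm(enum toy_cmd cmd)
{
	struct toy_asoc *asoc = malloc(sizeof(*asoc));
	int seen;

	if (!asoc)
		return -1;
	asoc->state = 1;
	toy_side_effects(cmd, &asoc);

	/* Debug step: must tolerate a NULL asoc. */
	seen = asoc ? asoc->state : -1;

	free(asoc);	/* free(NULL) is a no-op */
	return seen;
}
```

Passing a pointer-to-pointer keeps the "is it still valid?" knowledge in one place: whoever frees the object also invalidates the reference.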


Re: use-after-free in sctp_do_sm

2015-12-04 Thread Vlad Yasevich
On 12/04/2015 07:55 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 04, 2015 at 11:40:02AM +0100, Dmitry Vyukov wrote:
>> On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches  wrote:
>>> (adding lkml as this is likely better discussed there)
>>>
>>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote:
>>>> On 12/03/2015 03:24 PM, Joe Perches wrote:
>>>>> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote:
>>>>>> On 12/03/2015 03:03 PM, Joe Perches wrote:
>>>>>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
>>>>>>>> On 12/03/2015 01:52 PM, Aaron Conole wrote:
>>>>>>>>> I think that as a minimum, the following patch should be evaluated,
>>>>>>>>> but am unsure to whom I should submit it (after I test):
>>>>>>> []
>>>>>>>> Agreed - the intention here is certainly to have no side effects. It
>>>>>>>> looks like 'no_printk()' is used in quite a few other places that would
>>>>>>>> benefit from this change. So we probably want a generic
>>>>>>>> 'really_no_printk()' macro.
>>>>>>>
>>>>>>> https://lkml.org/lkml/2012/6/17/231
>>>>>>
>>>>>> I don't see this in the tree.
>>>>>
>>>>> It never got applied.
>>>>>
>>>>>> Also maybe we should just convert
>>>>>> no_printk() to do what your 'eliminated_printk()'.
>>>>>
>>>>> Some of them at least.
>>>>>
>>>>>> So we can convert all users with this change?
>>>>>
>>>>> I don't think so, I think there are some
>>>>> function evaluation/side effects that are
>>>>> required.  I believe some do hardware I/O.
>>>>>
>>>>> It'd be good to at least isolate them.
>>>>>
>>>>> I'm not sure how to find them via some
>>>>> automated tool/mechanism though.
>>>>>
>>>>> I asked Julia Lawall about it once in this
>>>>> thread:  https://lkml.org/lkml/2014/12/3/696
>>>>>
>>>>
>>>> Seems rather fragile to have side effects that we rely
>>>> upon hidden in a printk().
>>>
>>> Yup.
>>>
>>>> Just convert them and see what breaks :)
>>>
>>> I appreciate your optimism.  It's very 1995.
>>> Try it and see what happens.
>>
>>
>> Whatever is the resolution for pr_debug, we still need to fix this
>> particular use-after-free. It affects stability of debug builds, gives
>> invalid debug output, prevents us from finding more bugs in SCTP. And
>> maybe somebody uses CONFIG_DYNAMIC_DEBUG in production.
> 
> Agreed. I'm already working on a fix for this particular use-after-free.
> 
> Another interesting thing about this is that sctp_do_sm() is called for
> nearly every movement that happens on a sctp socket. Said that, that
> always-running IDR search hidden on that debug statement do have some
> nasty performance impact, specially because it's serialized on a
> spinlock.

YUCK!  I didn't really pay much attention to those debug macros before, but
debug_post_sfx() is truly awful.

This wasn't such a bad thing when these macros depended on CONFIG_SCTP_DEBUG,
but now that they are always built, we need to fix them.

-vlad



> This wouldn't be happening if it was fully ellided and would
> be ok if that pr_debug() was really being printed, but not as it is.
> Kudos to this report that I could notice this. I'm trying to fix this on
> SCTP-side as well.
> 
>   Marcelo
> 
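
The concern above is that work done only to build a debug message can still run when nothing is printed. A minimal sketch of the general C behaviour, assuming a made-up lookup helper and enable flag; it illustrates argument evaluation only and is not the kernel's pr_debug() or debug_post_sfx().

```c
#include <stdio.h>

static int lookups;		/* counts how often the costly helper runs */
static int toy_debug_enabled;	/* 0 = debugging off */

/* Hypothetical stand-in for a costly lookup done only for logging. */
static int toy_expensive_id(void)
{
	lookups++;
	return 42;
}

/* Function-style logger: the argument is evaluated before the call,
 * whether or not anything is printed. */
static void toy_log_always(int id)
{
	if (toy_debug_enabled)
		printf("id=%d\n", id);
}

/* Macro-style logger: the argument expression sits inside the branch,
 * so it is evaluated only when debugging is enabled. */
#define toy_log_guarded(expr)				\
	do {						\
		if (toy_debug_enabled)			\
			printf("id=%d\n", (expr));	\
	} while (0)

/* Returns how many lookups ran with debugging disabled. */
static int toy_count_lookups(void)
{
	lookups = 0;
	toy_log_always(toy_expensive_id());	/* lookup runs */
	toy_log_guarded(toy_expensive_id());	/* lookup skipped */
	return lookups;
}
```

With debugging off, only the function-style call pays for the lookup.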



Re: [PATCH net 2/3] sctp: update the netstamp_needed counter when copying sockets

2015-12-04 Thread Vlad Yasevich
On 12/04/2015 12:14 PM, Marcelo Ricardo Leitner wrote:
> Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy
> related to disabling sock timestamp.
> 
> When SCTP accepts an association or peel one off, it copies sock flags
> but forgot to call net_enable_timestamp() if a packet timestamping flag
> was copied, leading to extra calls to net_disable_timestamp() whenever
> such clones were closed.
> 
> The fix is to call net_enable_timestamp() whenever we copy a sock with
> that flag on, like tcp does.
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  include/net/sock.h | 2 ++
>  net/core/sock.c| 2 --
>  net/sctp/socket.c  | 3 +++
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 52d27ee924f47867026d8f65c65551a9137219d3..b1d475b5db6825e13df3e3e147fed8654e1cf086 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -740,6 +740,8 @@ enum sock_flags {
>   SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
>  };
>  
> +#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> +
>  static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
>  {
>   nsk->sk_flags = osk->sk_flags;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index e31dfcee1729aa23bdd2ed692fda1b90bd75afb8..d01c8f42dbb2f040fd48009b2767bd4e80aea8ab 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -433,8 +433,6 @@ static bool sock_needs_netstamp(const struct sock *sk)
>   }
>  }
>  
> -#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> -
>  static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
>  {
>   if (sk->sk_flags & flags) {
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 03c8256063ec6355fcce034366aa5d005d75b5f7..4c9282bdd06790a0cca7f7c33986e7eb6c541398 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -7199,6 +7199,9 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
>   newinet->mc_ttl = 1;
>   newinet->mc_index = 0;
>   newinet->mc_list = NULL;
> +
> + if (newsk->sk_flags & SK_FLAGS_TIMESTAMP)
> + net_enable_timestamp();
>  }
>  
>  static inline void sctp_copy_descendant(struct sock *sk_to,
> 
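
A toy model of the imbalance the commit message describes, assuming a single global counter and one timestamp flag. The names (toy_netstamp_needed, toy_copy, and so on) are invented for illustration and do not mirror the kernel's data structures.

```c
/* Hypothetical model: a global count of sockets that need timestamps. */
static int toy_netstamp_needed;

struct toy_sock { unsigned long flags; };
#define TOY_FLAG_TIMESTAMP 1UL

static void toy_enable(struct toy_sock *sk)
{
	if (!(sk->flags & TOY_FLAG_TIMESTAMP)) {
		sk->flags |= TOY_FLAG_TIMESTAMP;
		toy_netstamp_needed++;
	}
}

static void toy_close(struct toy_sock *sk)
{
	if (sk->flags & TOY_FLAG_TIMESTAMP) {
		sk->flags &= ~TOY_FLAG_TIMESTAMP;
		toy_netstamp_needed--;
	}
}

/* Copy flags to a clone; 'fixed' selects whether the counter is bumped
 * for the copied flag. */
static void toy_copy(struct toy_sock *newsk, const struct toy_sock *sk,
		     int fixed)
{
	newsk->flags = sk->flags;
	if (fixed && (newsk->flags & TOY_FLAG_TIMESTAMP))
		toy_netstamp_needed++;
}

/* Returns the counter after a parent and its clone are both closed. */
static int toy_balance(int fixed)
{
	struct toy_sock parent = { 0 }, child = { 0 };

	toy_netstamp_needed = 0;
	toy_enable(&parent);
	toy_copy(&child, &parent, fixed);
	toy_close(&child);
	toy_close(&parent);
	return toy_netstamp_needed;
}
```

Without the extra increment on copy, the clone's close decrements a count it never contributed to, and the counter ends up negative.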



Re: [PATCH net 3/3] sctp: also copy sk_tsflags when copying the socket

2015-12-04 Thread Vlad Yasevich
On 12/04/2015 12:14 PM, Marcelo Ricardo Leitner wrote:
> As we are keeping timestamps on when copying the socket, we also have to
> copy sk_tsflags.
> 
> This is needed since b9f40e21ef42 ("net-timestamp: move timestamp flags
> out of sk_flags").
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/socket.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 4c9282bdd06790a0cca7f7c33986e7eb6c541398..1a32ecdb8bae98de2e76591f0f5ffee1441ff04d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -7167,6 +7167,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
>   newsk->sk_type = sk->sk_type;
>   newsk->sk_bound_dev_if = sk->sk_bound_dev_if;
>   newsk->sk_flags = sk->sk_flags;
> + newsk->sk_tsflags = sk->sk_tsflags;
>   newsk->sk_no_check_tx = sk->sk_no_check_tx;
>   newsk->sk_no_check_rx = sk->sk_no_check_rx;
>   newsk->sk_reuse = sk->sk_reuse;
> 



Re: [PATCH net 1/3] sctp: use the same clock as if sock source timestamps were on

2015-12-04 Thread Vlad Yasevich
On 12/04/2015 12:14 PM, Marcelo Ricardo Leitner wrote:
> SCTP echoes a cookie on INIT ACK chunks that contains a timestamp, for
> detecting stale cookies. This cookie is echoed back to the server by the
> client and then that timestamp is checked.
> 
> Thing is, if the listening socket is using packet timestamping, the
> cookie is encoded with ktime_get() value and checked against
> ktime_get_real(), as done by __net_timestamp().
> 
> The fix is to sctp also use ktime_get_real(), so we can compare bananas
> with bananas later no matter if packet timestamping was enabled or not.
> 
> Fixes: 52db882f3fc2 ("net: sctp: migrate cookie life from timeval to ktime")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
>  net/sctp/sm_make_chunk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 763e06a55155b2a9e0a9d918ecc1fe2dd6d9e0c0..5d6a03fad3789a12290f5f14c5a7efa69c98f41a 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -1652,7 +1652,7 @@ static sctp_cookie_param_t *sctp_pack_cookie(const struct sctp_endpoint *ep,
>  
>   /* Set an expiration time for the cookie.  */
>   cookie->c.expiration = ktime_add(asoc->cookie_life,
> -  ktime_get());
> +  ktime_get_real());
>  
>   /* Copy the peer's init packet.  */
>   memcpy(&cookie->c.peer_init[0], init_chunk->chunk_hdr,
> @@ -1780,7 +1780,7 @@ no_hmac:
>   if (sock_flag(ep->base.sk, SOCK_TIMESTAMP))
>   kt = skb_get_ktime(skb);
>   else
> - kt = ktime_get();
> + kt = ktime_get_real();
>  
>   if (!asoc && ktime_before(bear_cookie->expiration, kt)) {
>   /*
> 
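
A small userspace sketch of why mixing the two clocks breaks the stale-cookie check, using POSIX CLOCK_MONOTONIC and CLOCK_REALTIME as rough analogues of ktime_get() and ktime_get_real(). The helper names and the 60-second lifetime in the usage note are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds by which CLOCK_REALTIME is ahead of CLOCK_MONOTONIC.
 * Realtime counts from 1970, monotonic from an arbitrary start
 * (typically boot), so on a normal system this is very large. */
static long long toy_clock_gap_seconds(void)
{
	struct timespec real, mono;

	clock_gettime(CLOCK_REALTIME, &real);
	clock_gettime(CLOCK_MONOTONIC, &mono);
	return (long long)real.tv_sec - (long long)mono.tv_sec;
}

/* A cookie is stale when "now" is past its expiry; this is only
 * meaningful if both values come from the same clock. */
static int toy_cookie_expired(long long expiry, long long now)
{
	return now > expiry;
}
```

For example, an expiry of "monotonic now + 60" compared against a realtime "now" looks expired immediately, while the same comparison on a single clock does not.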



Re: use-after-free in sctp_do_sm

2015-12-03 Thread Vlad Yasevich
On 12/03/2015 01:06 PM, Marcelo wrote:
> 
> 
> On December 3, 2015 15:59:10 BRST, Eric Dumazet wrote:
>> On Thu, 2015-12-03 at 15:43 -0200, Marcelo Ricardo Leitner wrote:
>>
>>> Vlad, others,
>>>
>>> It's been a long time but this was introduced by commit 914e1c8b6980
>>> ("sctp: Inherit all socket options from parent correctly."). This is
>> not
>>> very consistent with how other protocols work and it will be hard to
>>> keep tracking a negative mask of flags that we can't copy.
>>>
>>> I reviewed the list of options and I'm thinking that only
>>> SO_BINDTODEVICE is worth copying, leaving the others for the
>> application
>>> to re-set, as it is for other protocols. So I'm thinking on simply:
>>>
>>> -   newsk->sk_flags = sk->sk_flags;
>>> +   newsk->sk_flags = sk->sk_flags & SO_BINDTODEVICE;
>>>
>>> in the above.
>>>
>>> What do you think?
>>
>> I think SO_BINDTODEVICE is not a flag ;)
>>
>> #define SO_BINDTODEVICE25
> 
> Oops, indeed!
> Idea persists.
> Thx!
> 

Hmm...  sk_clone_lock() appears to copy the flags as well, so it would
appear that TCP accept() sockets would also have timestamping set.

I can see how we probably shouldn't be copying sk_flags, as there isn't
much there that needs to be set.

-vlad
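
To illustrate Eric's point that an option number is not a flag mask, here is a minimal sketch. TOY_OPT_NUMBER simply reuses the value 25 quoted above; the function name is hypothetical.

```c
#define TOY_OPT_NUMBER 25	/* an option number, not a bit mask */

/* Masking with an option number keeps whatever bits happen to be set in
 * its binary form: 25 is 11001 in binary, i.e. bits 0, 3 and 4. */
static unsigned long toy_mask_with_option(unsigned long flags)
{
	return flags & TOY_OPT_NUMBER;
}
```

So a mask like that would keep three unrelated flag bits and drop everything else, rather than selecting one option.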




Re: [PATCH] sctp: convert sack_needed and sack_generation to bits

2015-11-30 Thread Vlad Yasevich
On 11/30/2015 09:17 AM, Marcelo Ricardo Leitner wrote:
> They don't need to be any bigger than that and with this we start a new
> bitfield for tracking association runtime stuff, like zero window
> situation.
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> ---
> 
> Motivated by https://patchwork.ozlabs.org/patch/509836/
> 
>  include/net/sctp/structs.h | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 495c87e367b3f2e8941807f56a77d2e14469bfed..7bbb71081aeb6cfdc9cb049cf7a094dbcd4603bb 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -775,10 +775,10 @@ struct sctp_transport {
>   hb_sent:1,
>  
>   /* Is the Path MTU update pending on this tranport */
> - pmtu_pending:1;
> + pmtu_pending:1,
>  
> - /* Has this transport moved the ctsn since we last sacked */
> - __u32 sack_generation;
> + /* Has this transport moved the ctsn since we last sacked */
> + sack_generation:1;
>   u32 dst_cookie;
>  
>   struct flowi fl;
> @@ -1482,19 +1482,19 @@ struct sctp_association {
>   prsctp_capable:1,   /* Can peer do PR-SCTP? */
>   auth_capable:1; /* Is peer doing SCTP-AUTH? */
>  
> - /* Ack State   : This flag indicates if the next received
> + /* sack_needed : This flag indicates if the next received
>* : packet is to be responded to with a
> -  * : SACK. This is initializedto 0.  When a packet
> -  * : is received it is incremented. If this value
> +  * : SACK. This is initialized to 0.  When a packet
> +  * : is received sack_cnt is incremented. If this value
>* : reaches 2 or more, a SACK is sent and the
>* : value is reset to 0. Note: This is used only
>* : when no DATA chunks are received out of
>* : order.  When DATA chunks are out of order,
>* : SACK's are not delayed (see Section 6).
>*/
> - __u8sack_needed; /* Do we need to sack the peer? */
> + __u8sack_needed:1, /* Do we need to sack the peer? */
> + sack_generation:1;
>   __u32   sack_cnt;
> - __u32   sack_generation;
>  
>   __u32   adaptation_ind;  /* Adaptation Code point. */
>  
> 
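
A rough sketch of what the conversion means in plain C, assuming simplified structs with only the three fields touched by the patch. The struct names are invented, the exact sizes depend on the compiler and ABI, and bit-fields on an unsigned char base type are a common compiler extension rather than strict ISO C.

```c
/* Simplified "before" layout: full-width fields. */
struct toy_before {
	unsigned char sack_needed;
	unsigned int  sack_cnt;
	unsigned int  sack_generation;
};

/* Simplified "after" layout: two 1-bit fields share one byte. */
struct toy_after {
	unsigned char sack_needed:1,
		      sack_generation:1;
	unsigned int  sack_cnt;
};

/* A 1-bit generation can only alternate between 0 and 1:
 * incrementing it wraps modulo 2. */
static unsigned int toy_next_generation(unsigned int gen)
{
	struct toy_after a = { 0 };

	a.sack_generation = gen & 1;
	a.sack_generation++;
	return a.sack_generation;
}
```

The toggle behaviour is the part worth keeping in mind: code that compares generations only ever needs "same" or "different", which one bit can express.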



Re: use-after-free in sctp_do_sm

2015-11-25 Thread Vlad Yasevich
On 11/24/2015 03:45 PM, Neil Horman wrote:
> On Tue, Nov 24, 2015 at 11:10:32AM +0100, Dmitry Vyukov wrote:
>> On Tue, Nov 24, 2015 at 10:31 AM, Dmitry Vyukov  wrote:
>>> On Tue, Nov 24, 2015 at 10:15 AM, Dmitry Vyukov  wrote:
 Hello,

 The following program triggers use-after-free in sctp_do_sm:

 // autogenerated by syzkaller (http://github.com/google/syzkaller)
 #include 
 #include 
 #include 

 int main()
 {
 long r0 = syscall(SYS_socket, 0xaul, 0x80805ul, 0x0ul, 0, 0, 0);
 long r1 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul,
 0x32ul, 0xul, 0x0ul);
 memcpy((void*)0x20002fe4,
 "\x0a\x00\x33\xe7\xeb\x9d\xcf\x61\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xc5\xc8\x88\x64",
 28);
 long r3 = syscall(SYS_bind, r0, 0x20002fe4ul, 0x1cul, 0, 0, 0);
 memcpy((void*)0x2faa,
 "\x9b\x01\x7d\xcd\xb8\x6a\xc7\x3d\x09\x3a\x07\x00\xa7\xc4\xe9\xee\x0a\xd6\xec\xde\x26\x75\x5f\x22\xae\x4e\x33\x00\xb0\x76\x10\x70\xd6\xca\x19\xbc\x15\x83\xcf\x2e\xbc\x99\x0c\x5e\x83\x89\xc1\x44\x9c\x6e\x74\xd8\x5d\x5d\xd0\xf0\xdf\x47\xc0\x00\x71\x0b\x55\x4c\xab\xf0\xd8\x90\xd5\x92\x8c\x6e\x33\x22\x15\x5b\x19\xfb\xed\xdd\xa6\xac\xcb\x60\xcf\xe2\xde\xed\xdb\x95\x5c\xaa\x20\xa3",
 94);
 memcpy((void*)0x233a,
 "\x02\x00\x33\xe2\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
 128);
 long r6 = syscall(SYS_sendto, r0, 0x2faaul, 0x5eul,
 0x81ul, 0x233aul, 0x80ul);
 return 0;
 }


 ==
 BUG: KASAN: use-after-free in sctp_do_sm+0x42f6/0x4f60 at addr 
 880036fa80a8
 Read of size 4 by task a.out/5664
 =
 BUG kmalloc-4096 (Tainted: GB  ): kasan: bad access detected
 -

 INFO: Allocated in sctp_association_new+0x6f/0x1ea0 age=8 cpu=1 pid=5664
 [<  none  >] kmem_cache_alloc_trace+0x1cf/0x220 ./mm/slab.c:3707
 [<  none  >] sctp_association_new+0x6f/0x1ea0
 [<  none  >] sctp_sendmsg+0x1954/0x28e0
 [<  none  >] inet_sendmsg+0x316/0x4f0 ./net/ipv4/af_inet.c:802
 [< inline >] __sock_sendmsg_nosec ./net/socket.c:641
 [< inline >] __sock_sendmsg ./net/socket.c:651
 [<  none  >] sock_sendmsg+0xca/0x110 ./net/socket.c:662
 [<  none  >] SYSC_sendto+0x208/0x350 ./net/socket.c:1841
 [<  none  >] SyS_sendto+0x40/0x50 ./net/socket.c:1862
 [<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a

 INFO: Freed in sctp_association_put+0x150/0x250 age=14 cpu=1 pid=5664
 [<  none  >] kfree+0x199/0x1b0 ./mm/slab.c:1211
 [<  none  >] sctp_association_put+0x150/0x250
 [<  none  >] sctp_association_free+0x498/0x630
 [<  none  >] sctp_do_sm+0xd8b/0x4f60
 [<  none  >] sctp_primitive_SHUTDOWN+0xa9/0xd0
 [<  none  >] sctp_close+0x616/0x790
 [<  none  >] inet_release+0xed/0x1c0 ./net/ipv4/af_inet.c:471
 [<  none  >] inet6_release+0x50/0x70 ./net/ipv6/af_inet6.c:416
 [< inline >] constant_test_bit 
 ././arch/x86/include/asm/bitops.h:321
 [<  none  >] sock_release+0x8d/0x200 ./net/socket.c:601
 [<  none  >] sock_close+0x16/0x20 ./net/socket.c:1188
 [<  none  >] __fput+0x21d/0x6e0 ./fs/file_table.c:265
 [<  none  >] fput+0x15/0x20 ./fs/file_table.c:84
 [<  none  >] task_work_run+0x163/0x1f0 
 ./include/trace/events/rcu.h:20
 [< inline >] __list_add ./include/linux/list.h:42
 [< inline >] list_add_tail ./include/linux/list.h:76
 [< inline >] list_move_tail ./include/linux/list.h:168
 [< inline >] reparent_leader ./kernel/exit.c:618
 [< inline >] forget_original_parent ./kernel/exit.c:669
 [< inline >] exit_notify ./kernel/exit.c:697
 [<  none  >] do_exit+0x809/0x2b90 ./kernel/exit.c:878
 [<  none  >] do_group_exit+0x108/0x320 ./kernel/exit.c:985

 INFO: Slab 0xeadbea00 objects=7 used=1 fp=0x880036fa8000
 flags=0x1004080
 INFO: Object 0x880036fa8000 @offset=0 fp=0x880036fad668

Re: [PATCH net] sctp: translate host order to network order when setting a hmacid

2015-11-12 Thread Vlad Yasevich
On 11/12/2015 12:07 AM, Xin Long wrote:
> now sctp auth cannot work well when setting a hmacid manually, which
> is caused by that we didn't use the network order for hmacid, so fix
> it by adding the transformation in sctp_auth_ep_set_hmacs.
> 
> even we set hmacid with the network order in userspace, it still
> can't work, because of this condition in sctp_auth_ep_set_hmacs():
> 
>   if (id > SCTP_AUTH_HMAC_ID_MAX)
>   return -EOPNOTSUPP;
> 
> so this wasn't working before and thus it won't break compatibility.
> 
> Signed-off-by: Xin Long <lucien@gmail.com>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
> ---
>  net/sctp/auth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 4f15b7d..1543e39 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -809,8 +809,8 @@ int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
>   if (!has_sha1)
>   return -EINVAL;
>  
> - memcpy(ep->auth_hmacs_list->hmac_ids, &hmacs->shmac_idents[0],
> - hmacs->shmac_num_idents * sizeof(__u16));
> + for (i = 0; i < hmacs->shmac_num_idents; i++)
> + ep->auth_hmacs_list->hmac_ids[i] = htons(hmacs->shmac_idents[i]);
>   ep->auth_hmacs_list->param_hdr.length = htons(sizeof(sctp_paramhdr_t) +
>   hmacs->shmac_num_idents * sizeof(__u16));
>   return 0;
> 

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad
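
A minimal userspace sketch of the per-element conversion the patch switches to, in place of a raw memcpy of host-order values. The helper names are hypothetical; only htons() is a real API here.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Copy n 16-bit identifiers, converting each to network byte order. */
static void toy_copy_ids_net(uint16_t *dst, const uint16_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = htons(src[i]);
}

/* Returns the first byte that would go on the wire for identifier 'id'. */
static unsigned char toy_first_wire_byte(uint16_t id)
{
	uint16_t out;
	unsigned char bytes[2];

	toy_copy_ids_net(&out, &id, 1);
	memcpy(bytes, &out, sizeof(bytes));
	return bytes[0];
}
```

After conversion the most significant byte always comes first, regardless of host endianness, which is what a peer parsing the parameter expects.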


Re: [PATCH net] ipv6: don't use CHECKSUM_PARTIAL on MSG_MORE/UDP_CORK sockets

2015-10-20 Thread Vlad Yasevich
On 10/20/2015 10:38 AM, Hannes Frederic Sowa wrote:
> MSG_MORE might cause the packet to get fragmented in the end when
> passed down to the flush function and the transhdrlen check alone is
> not sufficient to protect against fragmentation. Instead check if the
> socket user intends to add more data to the socket on the first packet.
> 
> This broke checksum calculation for UDPv6 for NFS protocols.
> 
> Fixes: 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets")
> Cc: Vlad Yasevich <vyasev...@gmail.com>
> Tested-by: Sabrina Dubroca <s...@quesysnail.net>
> Tested-by: Benjamin Coddington <bcodd...@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org>
> ---
>  net/ipv6/ip6_output.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 61d403e..95c5780 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1317,6 +1317,7 @@ emsgsize:
>* sums only work when transhdrlen is set.
>*/
>   if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
> + !(flags & MSG_MORE) &&
>   length + fragheaderlen < mtu &&
>   rt->dst.dev->features & NETIF_F_V6_CSUM &&
>   !exthdrlen)
> 

Hmm... so while this solves this problem by simply avoiding the combination of
skb #1 having CHECKSUM_PARTIAL and others having CHECKSUM_NONE, I think the actual
problem is a bit deeper.
The above combination seems to work for me since udp6_hwcsum_outgoing() corrects
the checksum.  However, my testing so far has been on NICs that have NETIF_F_V6_CSUM,
but without UFO support.

On such systems, a simple test of using MSG_MORE on an IPv6 UDP socket, sending 200 bytes
followed by 2000 bytes, works correctly.

I am now wondering if this might be UFO related instead and am looking for a NIC that
has UFO support.

-vlad
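
A rough sketch of the kind of test described above: 200 bytes queued with MSG_MORE, then 2000 bytes to flush the datagram. The loopback destination, port 9, and the function name are arbitrary assumptions; this is not the test program that was actually used, and it only checks that the sends are accepted, not the checksum on the wire.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue 200 bytes with MSG_MORE, then flush with 2000 more.
 * Returns the total bytes accepted, or -1 on any error. */
static long toy_send_corked(void)
{
	char buf[2000];
	struct sockaddr_in6 dst;
	ssize_t a, b = -1;
	int fd = socket(AF_INET6, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(buf, 'x', sizeof(buf));
	memset(&dst, 0, sizeof(dst));
	dst.sin6_family = AF_INET6;
	dst.sin6_port = htons(9);
	dst.sin6_addr = in6addr_loopback;

	/* First chunk: held back by the kernel because of MSG_MORE. */
	a = sendto(fd, buf, 200, MSG_MORE,
		   (struct sockaddr *)&dst, sizeof(dst));
	/* Second chunk: no MSG_MORE, so the combined datagram is sent. */
	if (a == 200)
		b = sendto(fd, buf, sizeof(buf), 0,
			   (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return (a == 200 && b == 2000) ? (long)(a + b) : -1;
}
```

To exercise fragmentation and offload paths, the destination would need to be reached through a real NIC with a normal MTU rather than loopback; capturing the traffic on the receiver is then what shows whether the checksum is right.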




Re: [PATCH net] ipv6: don't use CHECKSUM_PARTIAL on MSG_MORE/UDP_CORK sockets

2015-10-20 Thread Vlad Yasevich
On 10/20/2015 10:38 AM, Hannes Frederic Sowa wrote:
> MSG_MORE might cause the packet to get fragmented in the end when
> passed down to the flush function and the transhdrlen check alone is
> not sufficient to protect against fragmentation. Instead check if the
> socket user intends to add more data to the socket on the first packet.
> 
> This broke checksum calculation for UDPv6 for NFS protocols.
> 
> Fixes: 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets")
> Cc: Vlad Yasevich <vyasev...@gmail.com>

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad

> Tested-by: Sabrina Dubroca <s...@quesysnail.net>
> Tested-by: Benjamin Coddington <bcodd...@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org>
> ---
>  net/ipv6/ip6_output.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 61d403e..95c5780 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1317,6 +1317,7 @@ emsgsize:
>* sums only work when transhdrlen is set.
>*/
>   if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
> + !(flags & MSG_MORE) &&
>   length + fragheaderlen < mtu &&
>   rt->dst.dev->features & NETIF_F_V6_CSUM &&
>   !exthdrlen)
> 



Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-15 Thread Vlad Yasevich
On 09/15/2015 02:17 PM, Phil Sutter wrote:
> On Tue, Sep 15, 2015 at 11:11:53AM -0400, Vlad Yasevich wrote:
>> On 09/14/2015 04:06 PM, Phil Sutter wrote:
>>> On Mon, Sep 14, 2015 at 02:21:10PM -0400, Vlad Yasevich wrote:
>>>> On 09/11/2015 04:20 PM, Phil Sutter wrote:
>>>>> On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
>>>>>> On Fri, 11 Sep 2015 21:22:03 +0200
>>>>>> Phil Sutter <p...@nwl.cc> wrote:
>>>>>>
>>>>>>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
>>>>>>> zero, the VLAN header previously inserted by vlan_do_receive() needs to
>>>>>>> be stripped from the packet and the mac_header adjustment undone,
>>>>>>> otherwise a tagged frame with first four bytes missing will be
>>>>>>> transmitted.
>>>>>>>
>>>>>>> Signed-off-by: Phil Sutter <p...@nwl.cc>
>>>>>>> ---
>>>>>>>  net/bridge/br_input.c | 10 ++
>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>>>>>> index f921a5d..e4e3fc7 100644
>>>>>>> --- a/net/bridge/br_input.c
>>>>>>> +++ b/net/bridge/br_input.c
>>>>>>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>>>>>> }
>>>>>>>  
>>>>>>>  forward:
>>>>>>> +   if (is_vlan_dev(skb->dev) &&
>>>>>>> +   !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
>>>>>>> +   unsigned int offset = skb->data - skb_mac_header(skb);
>>>>>>> +
>>>>>>> +   skb_push(skb, offset);
>>>>>>> +   memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
>>>>>>> +   skb->mac_header += VLAN_HLEN;
>>>>>>> +   skb_pull(skb, offset);
>>>>>>> +   skb_reset_mac_len(skb);
>>>>>>> +   }
>>>>>>> switch (p->state) {
>>>>>>> case BR_STATE_FORWARDING:
>>>>>>> rhook = rcu_dereference(br_should_route_hook);
>>>>>>
>>>>>> Thanks for finding this. Is this a new thing or has it always been there?
>>>>>
>>>>> Sorry, I didn't check if this is a regression or not. Seen initially
>>>>> with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
>>>>> is by far not as old as it might seem. But it's surely not a brand new
>>>>> problem of net-next or so.
>>>>>
>>>>> Since nowadays no sane mind touches REORDER_HDR (there was originally a
>>>>> bug in NetworkManager which defaulted this to 0), it may very well be
>>>>> there for a long time already.
>>>>>
>>>>>> Sorry, this looks so special case it doesn't seem like a good idea.
>>>>>> Something is broken in VLAN handling if this is required.
>>>>>
>>>>> It is so ugly, I wish I had found a better way to fix the problem. Well,
>>>>> maybe I miss something:
>>>>>
>>>>> - packet enters __netif_receive_skb_core():
>>>>>   - skb->protocol is set to ETH_P_8021Q, so:
>>>>>     - packet is untagged
>>>>>     - skb->vlan_tci set
>>>>>     - skb->protocol set to 'real' protocol
>>>>>   - skb_vlan_tag_present(skb) == true, so:
>>>>>     - vlan_do_receive() is called:
>>>>>       - tags the packet again
>>>>>       - zeroes vlan_tci
>>>>>     - goto another_round
>>>>> - __netif_receive_skb_core(), round 2:
>>>>>   - skb->protocol is not ETH_P_8021Q -> no untagging
>>>>>   - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
>>>>>   - rx_handler handler (== br_handle_frame) is called
>>>>>
>>>>> IMO the root of all evil is the existence of REORDER_HDR itself. It
>>>>> causes an skb which should have been untagged to being passed along with
>>>>> VLAN header present and code dealing with it needs to cl

Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-15 Thread Vlad Yasevich
On 09/14/2015 04:06 PM, Phil Sutter wrote:
> On Mon, Sep 14, 2015 at 02:21:10PM -0400, Vlad Yasevich wrote:
>> On 09/11/2015 04:20 PM, Phil Sutter wrote:
>>> On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
>>>> On Fri, 11 Sep 2015 21:22:03 +0200
>>>> Phil Sutter <p...@nwl.cc> wrote:
>>>>
>>>>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
>>>>> zero, the VLAN header previously inserted by vlan_do_receive() needs to
>>>>> be stripped from the packet and the mac_header adjustment undone,
>>>>> otherwise a tagged frame with first four bytes missing will be
>>>>> transmitted.
>>>>>
>>>>> Signed-off-by: Phil Sutter <p...@nwl.cc>
>>>>> ---
>>>>>  net/bridge/br_input.c | 10 ++
>>>>>  1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>>>> index f921a5d..e4e3fc7 100644
>>>>> --- a/net/bridge/br_input.c
>>>>> +++ b/net/bridge/br_input.c
>>>>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>>>>  	}
>>>>>  
>>>>>  forward:
>>>>> +	if (is_vlan_dev(skb->dev) &&
>>>>> +	    !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
>>>>> +		unsigned int offset = skb->data - skb_mac_header(skb);
>>>>> +
>>>>> +		skb_push(skb, offset);
>>>>> +		memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
>>>>> +		skb->mac_header += VLAN_HLEN;
>>>>> +		skb_pull(skb, offset);
>>>>> +		skb_reset_mac_len(skb);
>>>>> +	}
>>>>>  	switch (p->state) {
>>>>>  	case BR_STATE_FORWARDING:
>>>>>  		rhook = rcu_dereference(br_should_route_hook);
>>>>
>>>> Thanks for finding this. Is this a new thing or has it always been there?
>>>
>>> Sorry, I didn't check if this is a regression or not. Seen initially
>>> with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
>>> is by far not as old as it might seem. But it's surely not a brand new
>>> problem of net-next or so.
>>>
>>> Since nowadays no sane mind touches REORDER_HDR (there was originally a
>>> bug in NetworkManager which defaulted this to 0), it may very well be
>>> there for a long time already.
>>>
>>>> Sorry, this looks so special case it doesn't seem like a good idea.
>>>> Something is broken in VLAN handling if this is required.
>>>
>>> It is so ugly, I wish I had found a better way to fix the problem. Well,
>>> maybe I miss something:
>>>
>>> - packet enters __netif_receive_skb_core():
>>>   - skb->protocol is set to ETH_P_8021Q, so:
>>>     - packet is untagged
>>>     - skb->vlan_tci set
>>>     - skb->protocol set to 'real' protocol
>>>   - skb_vlan_tag_present(skb) == true, so:
>>>     - vlan_do_receive() is called:
>>>       - tags the packet again
>>>       - zeroes vlan_tci
>>>     - goto another_round
>>> - __netif_receive_skb_core(), round 2:
>>>   - skb->protocol is not ETH_P_8021Q -> no untagging
>>>   - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
>>>   - rx_handler handler (== br_handle_frame) is called
>>>
>>> IMO the root of all evil is the existence of REORDER_HDR itself. It
>>> causes an skb which should have been untagged to being passed along with
>>> VLAN header present and code dealing with it needs to clean up the mess.
>>
>> So the problem here appears to be the code in br_dev_queue_push_xmit().
>> It assumes that ETH_HLEN worth of data has been removed from the skb,
>> which is true for normal VLAN processing.  However, without
>> REORDER_HDR set this is no longer the case.  In this case, the ethernet
>> header is shifted 4 bytes, and when we push it back we miss the first
>> 4 bytes of the destination mac address...
> 
> Please note that vlan_do_receive() also inserts the VLAN header in
> between ethernet header and IP header, therefore:
> 
>> I wonder if it would be safe to just use skb->mac_len.
> 
> Given this works, the bridge would still forward a tagged frame which
> should have been untagged in the first place.
> 
> I just wondered where this added VLAN header is dropped if the interface
> does not belong to a bridge, but then realized that further packet
> processing simply ignores the ethernet header (and everything following
> it). So unless I forget something, this should indeed be a
> bridge-specific problem.
> 

Looks like macvtap is also susceptible to this problem.  It seems to be a bad
idea to allow any upper device configuration on top of a REORDER_HDR=0 vlan.
It is also not enough to just check is_vlan_dev(skb->dev) because the vlan may
be lower in the device stack.

-vlad




> Cheers, Phil
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-14 Thread Vlad Yasevich
On 09/11/2015 04:20 PM, Phil Sutter wrote:
> On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
>> On Fri, 11 Sep 2015 21:22:03 +0200
>> Phil Sutter <p...@nwl.cc> wrote:
>>
>>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
>>> zero, the VLAN header previously inserted by vlan_do_receive() needs to
>>> be stripped from the packet and the mac_header adjustment undone,
>>> otherwise a tagged frame with first four bytes missing will be
>>> transmitted.
>>>
>>> Signed-off-by: Phil Sutter <p...@nwl.cc>
>>> ---
>>>  net/bridge/br_input.c | 10 ++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>> index f921a5d..e4e3fc7 100644
>>> --- a/net/bridge/br_input.c
>>> +++ b/net/bridge/br_input.c
>>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>>  	}
>>>  
>>>  forward:
>>> +	if (is_vlan_dev(skb->dev) &&
>>> +	    !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
>>> +		unsigned int offset = skb->data - skb_mac_header(skb);
>>> +
>>> +		skb_push(skb, offset);
>>> +		memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
>>> +		skb->mac_header += VLAN_HLEN;
>>> +		skb_pull(skb, offset);
>>> +		skb_reset_mac_len(skb);
>>> +	}
>>>  	switch (p->state) {
>>>  	case BR_STATE_FORWARDING:
>>>  		rhook = rcu_dereference(br_should_route_hook);
>>
>> Thanks for finding this. Is this a new thing or has it always been there?
> 
> Sorry, I didn't check if this is a regression or not. Seen initially
> with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
> is by far not as old as it might seem. But it's surely not a brand new
> problem of net-next or so.
> 
> Since nowadays no sane mind touches REORDER_HDR (there was originally a
> bug in NetworkManager which defaulted this to 0), it may very well be
> there for a long time already.
> 
>> Sorry, this looks so special case it doesn't seem like a good idea.
>> Something is broken in VLAN handling if this is required.
> 
> It is so ugly, I wish I had found a better way to fix the problem. Well,
> maybe I miss something:
> 
> - packet enters __netif_receive_skb_core():
>   - skb->protocol is set to ETH_P_8021Q, so:
>     - packet is untagged
>     - skb->vlan_tci set
>     - skb->protocol set to 'real' protocol
>   - skb_vlan_tag_present(skb) == true, so:
>     - vlan_do_receive() is called:
>       - tags the packet again
>       - zeroes vlan_tci
>     - goto another_round
> - __netif_receive_skb_core(), round 2:
>   - skb->protocol is not ETH_P_8021Q -> no untagging
>   - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
>   - rx_handler handler (== br_handle_frame) is called
> 
> IMO the root of all evil is the existence of REORDER_HDR itself. It
> causes an skb which should have been untagged to being passed along with
> VLAN header present and code dealing with it needs to clean up the mess.

So the problem here appears to be the code in br_dev_queue_push_xmit().
It assumes that ETH_HLEN worth of data has been removed from the skb,
which is true for normal VLAN processing.  However, without
REORDER_HDR set this is no longer the case.  In this case, the ethernet
header is shifted 4 bytes, and when we push it back we miss the first
4 bytes of the destination mac address...

I wonder if it would be safe to just use skb->mac_len.

Of course, it looks like vlan filtering also makes this assumption and
could be really broken.  And God forbid someone creates a bunch of
nested encapsulated vlans (Q-in-Q-in...) with REORDER_HDR == 0.
We could end up completely leaving the ethernet header out.

Looks like it's been there for a very long while.
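
To make the arithmetic concrete, here is a rough user-space sketch of the
push. It is not kernel code: struct toy_skb and the toy_* helpers are made up
for illustration, and it assumes the frame layout Phil described (the tag
re-inserted by vlan_do_receive(), skb->data sitting just past the 18-byte
header, and mac_len recording those 18 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN  14	/* dst MAC + src MAC + EtherType */
#define VLAN_HLEN 4	/* 802.1Q TPID + TCI */

/* Toy stand-in for struct sk_buff: only the fields the argument needs. */
struct toy_skb {
	uint8_t buf[64];
	size_t  data;		/* offset of skb->data within buf */
	size_t  mac_len;	/* what skb_reset_mac_len() would have recorded */
};

/* Build the frame as the bridge is assumed to see it with REORDER_HDR == 0:
 * the VLAN tag is still in the header and data points past all 18 bytes. */
static void toy_rx_no_reorder(struct toy_skb *skb)
{
	static const uint8_t hdr[ETH_HLEN + VLAN_HLEN] = {
		0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5,	/* destination MAC */
		0x50, 0x51, 0x52, 0x53, 0x54, 0x55,	/* source MAC */
		0x81, 0x00, 0x00, 0x0a,			/* 802.1Q tag, VID 10 */
		0x08, 0x00,				/* inner EtherType */
	};

	memset(skb, 0, sizeof(*skb));
	memcpy(skb->buf, hdr, sizeof(hdr));
	skb->data = sizeof(hdr);
	skb->mac_len = sizeof(hdr);
}

/* Toy skb_push(): move data back by len bytes and return the new start. */
static const uint8_t *toy_push(struct toy_skb *skb, size_t len)
{
	assert(len <= skb->data);
	skb->data -= len;
	return skb->buf + skb->data;
}

/* First byte that would go on the wire when pushing a fixed ETH_HLEN. */
static uint8_t first_byte_pushing_eth_hlen(void)
{
	struct toy_skb skb;

	toy_rx_no_reorder(&skb);
	return *toy_push(&skb, ETH_HLEN);
}

/* First byte that would go on the wire when pushing the recorded mac_len. */
static uint8_t first_byte_pushing_mac_len(void)
{
	struct toy_skb skb;

	toy_rx_no_reorder(&skb);
	return *toy_push(&skb, skb.mac_len);
}
```

With the fixed ETH_HLEN push the frame starts at the fifth byte of the
destination MAC; pushing mac_len lands on the first byte. Whether mac_len is
trustworthy on every path that reaches the bridge is exactly the open question
above.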

-vlad

> 
> Cheers, Phil


Re: Any way to configure a vlan interface to grab ONLY untagged frames?

2015-09-14 Thread Vlad Yasevich
On 09/13/2015 12:49 PM, Nathan Neulinger wrote:
> It seems like running 'vconfig add IFACE 0' and using IFACE.0 would do this, 
> but it
> doesn't actually seem to work that way.
> 
> If I capture on IFACE directly, I'd expect to get all traffic, including the 
> tagged frames
> (with the tag intact). Looking to be able to bridge/capture/etc. and 
> specifically only
> receive the untagged frames that haven't already been pulled out into a vlan 
> specific
> interface.
> 
> Is there any way to accomplish this without using ebtables or other similar 
> hacks?

If you are dealing with a hw interface, any interface that supports vlan
filtering will by default receive only untagged frames.  Only when you put
it into promiscuous mode will you receive all frames.

With a bridge, you could configure your vlans adjacent to your bridge:

   vlan0...N   bridge
       |          |
       +-- eth0 --+

This way, configured vlan traffic will go to the vlan devices, while all other
traffic will go to the bridge.  You can even limit this "all other traffic"
further by turning on vlan filtering on the bridge, which will allow
you to run eth0 in non-promiscuous mode, thus enforcing HW vlan filters.

-vlad

> 
> -- Nathan
> 
> 
> Nathan Neulinger   nn...@neulinger.org
> Neulinger Consulting   (573) 612-1412


Re: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-10 Thread Vlad Yasevich
On 09/09/2015 05:06 PM, Marcelo Ricardo Leitner wrote:
> Em 09-09-2015 17:30, Vlad Yasevich escreveu:
>> On 09/09/2015 04:03 PM, Marcelo Ricardo Leitner wrote:
>>> Consider sctp module is unloaded and is being requested because a user
>>> is creating a sctp socket.
>>>
>>> During initialization, sctp will add the new protocol type and then
>>> initialize pernet subsys:
>>>
>>>         status = sctp_v4_protosw_init();
>>>         if (status)
>>>                 goto err_protosw_init;
>>>
>>>         status = sctp_v6_protosw_init();
>>>         if (status)
>>>                 goto err_v6_protosw_init;
>>>
>>>         status = register_pernet_subsys(&sctp_net_ops);
>>>
>>> The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
>>> is possible for userspace to create SCTP sockets as if the module is
>>> already fully loaded. If that happens, one of the possible effects is
>>> that we will have readers for net->sctp.local_addr_list list earlier
>>> than expected and sctp_net_init() does not take precautions while
>>> dealing with that list, leading to a potential panic but not limited to
>>> that, as sctp_sock_init() will copy a bunch of blank/partially
>>> initialized values from net->sctp.
>>>
>>> The race happens like this:
>>>
>>>       CPU 0                           |  CPU 1
>>>   socket()                            |
>>>    __sock_create                      | socket()
>>>     inet_create                       |  __sock_create
>>>      list_for_each_entry_rcu(         |
>>>         answer, &inetsw[sock->type],  |
>>>         list) {                       |   inet_create
>>>       /* no hits */                   |
>>>      if (unlikely(err)) {             |
>>>       ...                             |
>>>       request_module()                |
>>>       /* socket creation is blocked   |
>>>        * the module is fully loaded   |
>>>        */                             |
>>>        sctp_init                      |
>>>         sctp_v4_protosw_init          |
>>>          inet_register_protosw        |
>>>           list_add_rcu(&p->list,      |
>>>                        last_perm);    |
>>>                                       |  list_for_each_entry_rcu(
>>>                                       |     answer, &inetsw[sock->type],
>>>         sctp_v6_protosw_init          |     list) {
>>>                                       |     /* hit, so assumes protocol
>>>                                       |      * is already loaded
>>>                                       |      */
>>>                                       |  /* socket creation continues
>>>                                       |   * before netns is initialized
>>>                                       |   */
>>>         register_pernet_subsys        |
>>>
>>> Inverting the initialization order between register_pernet_subsys() and
>>> sctp_v4_protosw_init() is not possible because register_pernet_subsys()
>>> will create a control sctp socket, so the protocol must be already
>>> visible by then. Deferring the socket creation to a work-queue is not
>>> good, especially because we lose the ability to handle its errors.
>>>
>>> So the fix then is to invert the initialization order inside
>>> register_pernet_subsys() so that the control socket is created last
>>> and also block socket creation if netns initialization wasn't yet
>>> performed.
>>>
>>
>> not sure how much I like that...  Wouldn't it be better
>> to pull the control socket initialization stuff out into its
>> own function that does something like
>>
>>         for_each_net_rcu()
>>                 init_control_socket(net, ...)
>>
>>
>> Or maybe even pull the control socket creation
>> stuff completely into its own per-net ops operations structure
>> and initialize it after the protosw stuff has been done.
>>
>> -vlad
> 
> I'm afraid error handling won't be easy then.
> 
> But still, the control socket is not really the problem, because we don't 
> care (much?) if
> it contains zeroed values and the panic happens only if you call connect() on 
> it. I moved
> it solely because of the protection on sctp_init_sock().
> 
> The real problem is new sockets created by an user application while module 
> is still
> loading, because even if th

Re: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-10 Thread Vlad Yasevich
On 09/10/2015 10:22 AM, Marcelo Ricardo Leitner wrote:
> Em 10-09-2015 10:24, Vlad Yasevich escreveu:
>> On 09/09/2015 05:06 PM, Marcelo Ricardo Leitner wrote:
>>> Em 09-09-2015 17:30, Vlad Yasevich escreveu:
>>>> On 09/09/2015 04:03 PM, Marcelo Ricardo Leitner wrote:
>>>>> Consider sctp module is unloaded and is being requested because a user
>>>>> is creating a sctp socket.
>>>>>
>>>>> During initialization, sctp will add the new protocol type and then
>>>>> initialize pernet subsys:
>>>>>
>>>>>         status = sctp_v4_protosw_init();
>>>>>         if (status)
>>>>>                 goto err_protosw_init;
>>>>>
>>>>>         status = sctp_v6_protosw_init();
>>>>>         if (status)
>>>>>                 goto err_v6_protosw_init;
>>>>>
>>>>>         status = register_pernet_subsys(&sctp_net_ops);
>>>>>
>>>>> The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
>>>>> is possible for userspace to create SCTP sockets as if the module is
>>>>> already fully loaded. If that happens, one of the possible effects is
>>>>> that we will have readers for net->sctp.local_addr_list list earlier
>>>>> than expected and sctp_net_init() does not take precautions while
>>>>> dealing with that list, leading to a potential panic but not limited to
>>>>> that, as sctp_sock_init() will copy a bunch of blank/partially
>>>>> initialized values from net->sctp.
>>>>>
>>>>> The race happens like this:
>>>>>
>>>>>       CPU 0                           |  CPU 1
>>>>>   socket()                            |
>>>>>    __sock_create                      | socket()
>>>>>     inet_create                       |  __sock_create
>>>>>      list_for_each_entry_rcu(         |
>>>>>         answer, &inetsw[sock->type],  |
>>>>>         list) {                       |   inet_create
>>>>>       /* no hits */                   |
>>>>>      if (unlikely(err)) {             |
>>>>>       ...                             |
>>>>>       request_module()                |
>>>>>       /* socket creation is blocked   |
>>>>>        * the module is fully loaded   |
>>>>>        */                             |
>>>>>        sctp_init                      |
>>>>>         sctp_v4_protosw_init          |
>>>>>          inet_register_protosw        |
>>>>>           list_add_rcu(&p->list,      |
>>>>>                        last_perm);    |
>>>>>                                       |  list_for_each_entry_rcu(
>>>>>                                       |     answer, &inetsw[sock->type],
>>>>>         sctp_v6_protosw_init          |     list) {
>>>>>                                       |     /* hit, so assumes protocol
>>>>>                                       |      * is already loaded
>>>>>                                       |      */
>>>>>                                       |  /* socket creation continues
>>>>>                                       |   * before netns is initialized
>>>>>                                       |   */
>>>>>         register_pernet_subsys        |
>>>>>
>>>>> Inverting the initialization order between register_pernet_subsys() and
>>>>> sctp_v4_protosw_init() is not possible because register_pernet_subsys()
>>>>> will create a control sctp socket, so the protocol must be already
>>>>> visible by then. Deferring the socket creation to a work-queue is not
>>>>> good, especially because we lose the ability to handle its errors.
>>>>>
>>>>> So the fix then is to invert the initialization order inside
>>>>> register_pernet_subsys() so that the control socket is created last
>>>>> and also block socket creation if netns initialization wasn't yet
>>>>> performed.
>>>>>
>>>>
>>>> not sure how much I like that...  Wouldn't it be better
>>>> to pull the control socket initialization stuff out into its
>>>> own function that does something like
>>>>
>>>> for_each_net_rcu

Re: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-10 Thread Vlad Yasevich
On 09/10/2015 02:35 PM, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 10, 2015 at 01:24:54PM -0300, Marcelo Ricardo Leitner wrote:
>> On Thu, Sep 10, 2015 at 11:50:06AM -0400, Vlad Yasevich wrote:
>>> On 09/10/2015 10:22 AM, Marcelo Ricardo Leitner wrote:
>>>> Em 10-09-2015 10:24, Vlad Yasevich escreveu:
>> ...
>>>>> Then you can order sctp_net_init() such that it happens first, then
>>>>> protosw registration happens, then control socket initialization
>>>>> happens, then inet protocol registration happens.
>>>>>
>>>>> This way, we are always guaranteed that by the time the user calls
>>>>> socket(), protocol defaults are fully initialized.
>>>>
>>>> Okay, that works for the module loading stage, but then how would we
>>>> handle new netns's? We have to create the control socket per netns and
>>>> AFAICT sctp_net_init() is the only hook called when a new netns is
>>>> being created.
>>>>
>>>> Then if we move it to a workqueue that is scheduled by sctp_net_init(),
>>>> we lose the ability to handle its errors by propagating them through
>>>> the sctp_net_init() return value, not good.
>>>
>>> Here is kind of what I had in mind.  It's incomplete and completely
>>> untested (not even compiled), but good enough to describe the idea:
>> ...
>>
>> Ohh, ok now I get it, thanks. If having two pernet_subsys for a given
>> module is fine, that works for me. It's clearer and has no moment of
>> temporary failure.
>>
>> I can finish this patch if everybody agrees with it.
>>
>>>>>> I used the list pointer because that's null, as that memory is
>>>>>> entirely zeroed when alloced and, after initialization, it's never
>>>>>> null again. Works like a lock/condition without using an extra field.
>>>>>>
>>>>>
>>>>> I understand this as well.  What I don't particularly like is that we
>>>>> are re-using a list without really stating why it's now done this way.
>>>>> Additionally, it's not really the last thing that happens, so it seems
>>>>> kind of hacky...  If we need to add new per-net initializers, we now
>>>>> need to make sure that the code is put in the right place.  I'd just
>>>>> really like to have a cleaner solution...
>>>>
>>>> Ok, got you. We could add a dedicated flag/bit for that then, if
>>>> reusing the list is not clear enough. Or, as we are discussing on the
>>>> other part of the thread, we could make it block and wait for the
>>>> initialization, probably using some wait_queue. I'm still thinking on
>>>> something this way, likely something lower than sctp then.
>>>>
>>>
>>> I think if we do the above, the second process calling socket() will
>>> either find the protosw or will try to load the module also.  I think
>>> either is ok: after request_module returns we'll look at the protosw
>>> and will find things.
>>
>> Seems so, yes. Nice.
> 
> I was testing with it, something is not good. I finished your patch and
> testing with a flooder like:
>  # for j in {1..5}; do for i in {1234..1280}; do \
>sctp_darn -H 192.168.122.147 -P $j$i -l & done & done
> 
... snip...
> 
> It seems that request_module will not serialize it as we wanted and we
> would be putting unexpected pressure on it, yet it fixes the original
> issue.

So, wouldn't the same issue exist when running the above with DCCP sockets?

> Maybe we can place a semaphore at inet_create(), protecting the
> request_module()s so only one socket can do it at a time and, after it
> is released, whoever was blocked on it re-checks if the module isn't
> already loaded before attempting again. It makes the loading of
> different modules slower, though, but I'm not sure if that's really a
> problem. Not many modules are loaded at the same time like that. What do
> you think? 

I think this is a different issue.  The fact that we keep trying to probe
the same module is silly.  Maybe use a per-proto semaphore so that SCTP
doesn't block DCCP, for example.
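
Roughly what I mean, as a user-space sketch only: pthread mutexes stand in
for the semaphore, and struct toy_proto, toy_create_socket() and friends are
invented names, not anything in inet_create(). A caller that finds the
protocol missing takes that protocol's lock and re-checks before probing, so
a flood of socket() calls for one protocol should trigger a single probe
while other protocols are not held up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NPROTO 256

struct toy_proto {
	pthread_mutex_t probe_lock;	/* serializes probes of this protocol only */
	atomic_bool     registered;	/* set once the "module" is fully set up */
	int             probes;		/* loader invocations; guarded by probe_lock */
};

static struct toy_proto toy_protos[TOY_NPROTO];
static pthread_once_t toy_once = PTHREAD_ONCE_INIT;

static void toy_init_once(void)
{
	for (int i = 0; i < TOY_NPROTO; i++) {
		pthread_mutex_init(&toy_protos[i].probe_lock, NULL);
		atomic_init(&toy_protos[i].registered, false);
		toy_protos[i].probes = 0;
	}
}

/* Stand-in for the lookup-then-request_module() part of socket creation. */
static void toy_create_socket(int protocol)
{
	struct toy_proto *p;

	pthread_once(&toy_once, toy_init_once);
	p = &toy_protos[protocol];

	/* Fast path: protocol already registered, no lock taken. */
	if (atomic_load(&p->registered))
		return;

	pthread_mutex_lock(&p->probe_lock);
	/* Re-check: someone else may have finished loading while we waited. */
	if (!atomic_load(&p->registered)) {
		p->probes++;	/* stand-in for request_module() + module init */
		atomic_store(&p->registered, true);
	}
	pthread_mutex_unlock(&p->probe_lock);
}

static void *toy_worker(void *arg)
{
	toy_create_socket(*(int *)arg);
	return NULL;
}

/* Fire nthreads concurrent "socket()" calls for one protocol and return
 * how many times the loader ran for that protocol so far. */
static int toy_probes_after_flood(int protocol, int nthreads)
{
	pthread_t tid[64];

	assert(protocol >= 0 && protocol < TOY_NPROTO);
	assert(nthreads >= 0 && nthreads <= 64);

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, toy_worker, &protocol);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	return toy_protos[protocol].probes;
}
```

The sketch skips the failure case (module not found), where the lock holder
would have to leave the protocol unregistered and the waiters would each
probe again unless the failure is remembered somewhere.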

-vlad

> 
>   Marcelo
> 



Re: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-09 Thread Vlad Yasevich
On 09/09/2015 04:03 PM, Marcelo Ricardo Leitner wrote:
> Consider sctp module is unloaded and is being requested because a user
> is creating a sctp socket.
> 
> During initialization, sctp will add the new protocol type and then
> initialize pernet subsys:
> 
>         status = sctp_v4_protosw_init();
>         if (status)
>                 goto err_protosw_init;
> 
>         status = sctp_v6_protosw_init();
>         if (status)
>                 goto err_v6_protosw_init;
> 
>         status = register_pernet_subsys(&sctp_net_ops);
> 
> The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
> is possible for userspace to create SCTP sockets as if the module is
> already fully loaded. If that happens, one of the possible effects is
> that we will have readers for net->sctp.local_addr_list list earlier
> than expected and sctp_net_init() does not take precautions while
> dealing with that list, leading to a potential panic but not limited to
> that, as sctp_sock_init() will copy a bunch of blank/partially
> initialized values from net->sctp.
> 
> The race happens like this:
> 
>       CPU 0                           |  CPU 1
>   socket()                            |
>    __sock_create                      | socket()
>     inet_create                       |  __sock_create
>      list_for_each_entry_rcu(         |
>         answer, &inetsw[sock->type],  |
>         list) {                       |   inet_create
>       /* no hits */                   |
>      if (unlikely(err)) {             |
>       ...                             |
>       request_module()                |
>       /* socket creation is blocked   |
>        * the module is fully loaded   |
>        */                             |
>        sctp_init                      |
>         sctp_v4_protosw_init          |
>          inet_register_protosw        |
>           list_add_rcu(&p->list,      |
>                        last_perm);    |
>                                       |  list_for_each_entry_rcu(
>                                       |     answer, &inetsw[sock->type],
>         sctp_v6_protosw_init          |     list) {
>                                       |     /* hit, so assumes protocol
>                                       |      * is already loaded
>                                       |      */
>                                       |  /* socket creation continues
>                                       |   * before netns is initialized
>                                       |   */
>         register_pernet_subsys        |
> 
> Inverting the initialization order between register_pernet_subsys() and
> sctp_v4_protosw_init() is not possible because register_pernet_subsys()
> will create a control sctp socket, so the protocol must be already
> visible by then. Deferring the socket creation to a work-queue is not
> good, especially because we lose the ability to handle its errors.
> 
> So the fix then is to invert the initialization order inside
> register_pernet_subsys() so that the control socket is created last
> and also block socket creation if netns initialization wasn't yet
> performed.
> 

not sure how much I like that...  Wouldn't it be better
to pull the control socket initialization stuff out into its
own function that does something like

        for_each_net_rcu()
                init_control_socket(net, ...)


Or maybe even pull the control socket creation
stuff completely into its own per-net ops operations structure
and initialize it after the protosw stuff has been done.
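
To spell out the ordering I have in mind, here is a toy user-space model; it
is a sketch under assumptions, not the actual patch. struct toy_net, the
toy_* functions and the 3000 default are all made up for illustration, and
error unwinding is left out. The point is only that per-net defaults are set
up before the protocol becomes visible to socket(), and the control socket
comes from a second per-net step afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-netns state; rto_initial stands in for the defaults
 * that a new socket copies out of net->sctp. */
struct toy_net {
	bool defaults_ready;
	bool ctl_sock_ready;
	int  rto_initial;
};

static bool toy_protosw_visible;	/* can socket() find the protocol? */

/* Stage 1: per-net defaults only, no sockets involved. */
static int toy_net_defaults_init(struct toy_net *net)
{
	net->rto_initial = 3000;	/* illustrative value */
	net->defaults_ready = true;
	return 0;
}

/* Stage 2: make the protocol visible to socket(). */
static int toy_protosw_register(const struct toy_net *net)
{
	assert(net->defaults_ready);	/* the invariant we want to hold */
	toy_protosw_visible = true;
	return 0;
}

/* Stage 3: second per-net step, creates the control socket. */
static int toy_ctl_sock_init(struct toy_net *net)
{
	if (!toy_protosw_visible)
		return -1;	/* creating a socket needs a visible protocol */
	net->ctl_sock_ready = true;
	return 0;
}

/* What a user socket() would copy; -1 means "protocol not found". */
static int toy_socket(const struct toy_net *net)
{
	if (!toy_protosw_visible)
		return -1;
	return net->rto_initial;
}

/* Module init in the proposed order. */
static int toy_sctp_init(struct toy_net *net)
{
	int status;

	status = toy_net_defaults_init(net);
	if (status)
		return status;
	status = toy_protosw_register(net);
	if (status)
		return status;
	return toy_ctl_sock_init(net);
}
```

In this model a socket() that sees the protocol can never read blank
defaults, because stage 2 cannot run before stage 1 has finished.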

-vlad

> Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/protocol.c | 18 +++---
>  net/sctp/socket.c   |  4 
>  2 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 4345790ad3266c353eeac5398593c2a9ce4effda..d8f78165768a75f93f4ce4120dd5475b6a623aaf 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1271,12 +1271,6 @@ static int __net_init sctp_net_init(struct net *net)
>  
>  	sctp_dbg_objcnt_init(net);
>  
> -	/* Initialize the control inode/socket for handling OOTB packets.  */
> -	if ((status = sctp_ctl_sock_init(net))) {
> -		pr_err("Failed to initialize the SCTP control sock\n");
> -		goto err_ctl_sock_init;
> -	}
> -
>  	/* Initialize the local address list. */
>  	INIT_LIST_HEAD(&net->sctp.local_addr_list);
>  	spin_lock_init(&net->sctp.local_addr_lock);
> @@ -1284,11 +1278,21 @@ static int __net_init sctp_net_init(struct net *net)
>  
>  	/* Initialize the address event list */
>  	INIT_LIST_HEAD(&net->sctp.addr_waitq);
> -	INIT_LIST_HEAD(&net->sctp.auto_asconf_splist);
>  	spin_lock_init(&net->sctp.addr_wq_lock);
>  	net->sctp.addr_wq_timer.expires = 0;
>  	setup_timer(&net->sctp.addr_wq_timer, sctp_addr_wq_timeout_handler,
>  		    (unsigned long)net);
> + /* sctp_init_sock() will use this 

Re: [PATCH net 0/2] couple of sctp fixes for 0ca50d12fe46

2015-09-03 Thread Vlad Yasevich
On 09/02/2015 03:20 PM, Marcelo Ricardo Leitner wrote:
> These are two fixes for sctp after my patch on 0ca50d12fe46 ("sctp: fix
> src address selection if using secondary addresses")
> 
> The first fixes a dst leak on those it decided to skip.
> 
> The second adds the fallback on src selection that Vlad had asked
> about. Unfortunately a lot of ipvs setups rely on the old behavior
> and I don't see a better fix for it.
> 
> Please consider both to -stable tree.
> 
> Thanks!
> 
> Marcelo Ricardo Leitner (2):
>   sctp: fix dst leak
>   sctp: add routing output fallback
> 
>  net/sctp/protocol.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 

For the series

Acked-by: Vlad Yasevich <vyasev...@gmail.com>

-vlad


Re: [PATCH net] sctp: support global vtag assochash and per endpoint s(d)port assochash table

2015-08-31 Thread Vlad Yasevich
On 08/31/2015 01:44 PM, Xin Long wrote:
> For a telecom center, the usual case is that a server is connected to by
> thousands of clients. But if the server has only one endpoint (udp style)
> and uses the same sport and dport to communicate with every client, every
> assoc in the server will be hashed into the same chain of the global assoc
> hashtable, because we currently use dport and sport as the hash key.
> 
> When a packet is received, sctp_rcv tries to find the assoc by sport and
> dport. Since that chain is too long to search quickly, performance becomes
> very low. Some test data follows:
> 
> in server:
> $./ss [start a udp server there]
> in client:
> $./cc [start 2500 sockets to connect server with same port and different ip,
>and use one of them to send data to server]
> 
> *spend time is 854s, send pkt is 1000*
> 
> in server: #perf top
>   46.60%  [kernel]  [k] sctp_assoc_is_match
>8.81%  [kernel]  [k] sctp_v4_cmp_addr
>5.15%  [kernel]  [k] sctp_assoc_lookup_paddr
>...
> 
> in client: #perf top
>   88.42%  [kernel][k] __sctp_endpoint_lookup_assoc
>2.06%  libnl-3.so.200.16.1 [.] nl_object_identical
>1.23%  libnl-3.so.200.16.1 [.] nl_addr_cmp
>...
> 
> We need to change the way the hash key is calculated. The vtag is a good
> value for that, instead of the sport and dport. This works well for looking
> up the assoc in sctp_rcv.
> Besides, for the clients, if we turn the dport and sport global hash into a
> per-endpoint one, sctp_sendmsg can look up the assoc more quickly.
> 
> after that, the data of the same test is:
> 
> *spend time is 25s, send pkt is 1000*
> 
> in server: #perf top
>9.92%  libnl-3.so.200.16.1 [.] nl_object_identical
>6.05%  libnl-3.so.200.16.1 [.] nl_addr_cmp
>4.72%  libc-2.17.so[.] __memcmp_sse2
>...
> 
> in client: #perf top
>6.79%  libc-2.17.so[.] __libc_calloc
>6.50%  [kernel][k] memcpy
>6.35%  libc-2.17.so[.] _int_free
>...
> 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/sctp.h|   8 +--
>  include/net/sctp/structs.h |   8 ++-
>  net/sctp/endpointola.c |  22 +--
>  net/sctp/input.c   | 160 
> +
>  4 files changed, 129 insertions(+), 69 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index ce13cf2..df14452 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -524,18 +524,16 @@ static inline int sctp_assoc_hashfn(struct net *net, 
> __u16 lport, __u16 rport)
>  {
>   int h = (lport << 16) + rport + net_hash_mix(net);
>   h ^= h>>8;
> - return h & (sctp_assoc_hashsize - 1);
> + return h & (256 - 1);

In addition to what David said and looking at it from a different angle...
256 buckets may not be enough for someone with a single endpoint and a lot
of associations.  You will still hit a long chain on INIT and COOKIE-ECHO
chunks.

Switching to using rhashtable for association hash would be very nice.

Also, see if you can split this into 2 parts, one that does vtag hash and
the other is refactoring.  It would make it much easier to review.
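
For a feel of the chain lengths involved, here is a small user-space sketch.
The toy_* names are made up, net_hash_mix() is left out, the ports 5000/5001
are arbitrary, and vtags are taken as 1..N for reproducibility (real vtags
are random, so real chains will be somewhat less even). The port hash follows
the arithmetic of sctp_assoc_hashfn() in the patch context above, using
unsigned types; the vtag hash is the masked vtag from the patch. nbuckets
must be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Port-based bucket, following the sctp_assoc_hashfn() arithmetic
 * (without net_hash_mix()). */
static unsigned toy_port_hash(uint16_t lport, uint16_t rport, unsigned nbuckets)
{
	uint32_t h = ((uint32_t)lport << 16) + rport;

	h ^= h >> 8;
	return h & (nbuckets - 1);
}

/* vtag-based bucket as in the patch: just the masked vtag. */
static unsigned toy_vtag_hash(uint32_t vtag, unsigned nbuckets)
{
	return vtag & (nbuckets - 1);
}

/* Hash nassoc associations that all share one lport/rport pair and
 * return the length of the longest resulting chain. */
static unsigned toy_longest_chain(bool use_vtag, unsigned nassoc,
				  unsigned nbuckets)
{
	unsigned *count = calloc(nbuckets, sizeof(*count));
	unsigned longest = 0;

	if (!count)
		return 0;

	for (unsigned i = 0; i < nassoc; i++) {
		unsigned b = use_vtag ? toy_vtag_hash(i + 1, nbuckets)
				      : toy_port_hash(5000, 5001, nbuckets);

		if (++count[b] > longest)
			longest = count[b];
	}

	free(count);
	return longest;
}
```

With 2500 associations on one port pair, the port hash puts all of them in a
single chain, while the vtag hash spreads them over the 256 buckets. Push the
association count up with the table still fixed at 256 and the chains grow
linearly again, which is the concern with hard-coding the size.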

-vlad

>  }
>  
>  /* This is the hash function for the association hash table.  This is
>   * not used yet, but could be used as a better hash function when
>   * we have a vtag.
>   */
> -static inline int sctp_vtag_hashfn(__u16 lport, __u16 rport, __u32 vtag)
> +static inline int sctp_vtag_hashfn(struct net *net, __u32 vtag)
>  {
> - int h = (lport << 16) + rport;
> - h ^= vtag;
> - return h & (sctp_assoc_hashsize - 1);
> + return vtag & (sctp_assoc_hashsize - 1);
>  }
>  
>  #define sctp_for_each_hentry(epb, head) \
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 495c87e..e0edfed 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1150,6 +1150,9 @@ struct sctp_ep_common {
>   struct hlist_node node;
>   int hashent;
>  
> + struct hlist_node node_ep;
> + int hashent_ep;
> +
>   /* Runtime type information.  What kind of endpoint is this? */
>   sctp_endpoint_type_t type;
>  
> @@ -1210,6 +1213,8 @@ struct sctp_endpoint {
>   /* This is really a list of struct sctp_association entries. */
>   struct list_head asocs;
>  
> + struct sctp_hashbucket *asocshash;
> +
>   /* Secret Key: A secret key used by this endpoint to compute
>*the MAC.  This SHOULD be a cryptographic quality
>*random number with a sufficient length.
> @@ -1272,7 +1277,8 @@ int sctp_endpoint_is_peeled_off(struct sctp_endpoint *,
>   const union sctp_addr *);
>  struct sctp_endpoint *sctp_endpoint_is_match(struct sctp_endpoint *,
>   

Re: [PATCH net-next] macvtap/macvlan: use IFF_NO_QUEUE

2015-08-28 Thread Vlad Yasevich
On 08/27/2015 10:42 PM, Jason Wang wrote:
 
 
 On 08/27/2015 06:43 PM, Michael S. Tsirkin wrote:
 On Wed, Aug 26, 2015 at 01:45:30PM +0800, Jason Wang wrote:

 On 08/26/2015 12:32 AM, Vlad Yasevich wrote:
 On 08/25/2015 07:30 AM, Jason Wang wrote:
 On 08/25/2015 06:17 PM, Michael S. Tsirkin wrote:
 On Mon, Aug 24, 2015 at 04:33:12PM +0800, Jason Wang wrote:
 For macvlan, switch to use IFF_NO_QUEUE instead of tx_queue_len = 0.

 For macvtap, after commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523
 (macvtap: Add support of packet capture on macvtap device.), multiqueue
 macvtap suffers from single qdisc lock contention. This is because macvtap
 claims a non-zero tx_queue_len and reuses this value as its socket receive
 queue size. Thanks to IFF_NO_QUEUE, we can remove the lock contention
 without breaking existing socket receive queue length logic.

 Cc: Patrick McHardy ka...@trash.net
 Cc: Vladislav Yasevich vyase...@redhat.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 Seems to make sense. Give me a day or two to get over the jet lag
 (and get out from under the pile of mail accumulated while I was traveling),
 I'll review properly and ack.

 A note on this patch: only the default qdisc was removed; we don't lose
 the ability to attach a qdisc to macvtap (though it may cause lock
 contention in the multiqueue case).

 Wouldn't that lock contention be solved if we really had multiple queues
 for multi-queue macvtaps?

 -vlad
 Yes, but doesn't this introduce another layer of txq lock contention?
 I don't follow - why does it? Could you clarify please?
 
 I believe Vlad wants to remove NETIF_F_LLTX. If yes, core will do an
 extra tx lock at macvlan layer.

No, I don't want to remove it.  In a sense, it would function similar to
how it works when fwd_priv is populated.  I am still testing the code
as it's showing some strange artifacts...  could be due to keeping LLTX.

-vlad

 

 And it also needs macvlan multiqueue support. We used to do something like
 this but finally switched to NETIF_F_LLTX. You may refer to:

 2c11455321f37da6fe6cc36353149f9ac9183334 macvlan: add multiqueue capability
 8ffab51b3dfc54876f145f15b351c41f3f703195 macvlan: lockless tx path
 My concern is that the moment someone configures a non-standard qdisc
 scalability suddenly disappears. That would also be tricky to debug in the
 field as not a lot of developers use non-standard qdiscs.
 What do you think?
 
 Probably not an issue. A non-standard qdisc may need to be attached manually
 after device creation, and we don't lose this ability with this patch
 (unless somebody changes default_qdisc). Actually, users before
 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 did not expect any qdisc to work
 for macvtap, as with other stacked devices. This patch also restores this.
 
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] sctp: ASCONF-ACK with Unresolvable Address should be sent

2015-08-28 Thread Vlad Yasevich
On 08/28/2015 05:45 AM, Xin Long wrote:
 RFC 5061:
 This is an opaque integer assigned by the sender to identify each
 request parameter.  The receiver of the ASCONF Chunk will copy this
 32-bit value into the ASCONF Response Correlation ID field of the
 ASCONF-ACK response parameter.  The sender of the ASCONF can use this
 same value in the ASCONF-ACK to find which request the response is
 for.  Note that the receiver MUST NOT change this 32-bit value.
 
 Address Parameter: TLV
 
 This field contains an IPv4 or IPv6 address parameter, as described
 in Section 3.3.2.1 of [RFC4960].
 
 An ASCONF-ACK chunk with an Error Cause Indication Parameter (Unresolvable
 Address) should be sent if the Delete IP Address is not part of the
 association.
 
   Endpoint A                          Endpoint B
   (ESTABLISHED)                       (ESTABLISHED)
 
   ASCONF                  ----------->
   (Delete IP Address)
                           <-----------  ASCONF-ACK
                                         (Unresolvable Address)
 
 Signed-off-by: Xin Long lucien@gmail.com

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad

 ---
  net/sctp/sm_make_chunk.c | 15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 4068fe1..ce7f343 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3090,8 +3090,19 @@ static __be16 sctp_process_asconf_param(struct 
 sctp_association *asoc,
   sctp_assoc_set_primary(asoc, asconf->transport);
   sctp_assoc_del_nonprimary_peers(asoc,
   asconf->transport);
 - } else
 - sctp_assoc_del_peer(asoc, addr);
 + return SCTP_ERROR_NO_ERROR;
 + }
 +
 + /* If the address is not part of the association, the
 +  * ASCONF-ACK with Error Cause Indication Parameter
 +  * which including cause of Unresolvable Address should
 +  * be sent.
 +  */
 + peer = sctp_assoc_lookup_paddr(asoc, addr);
 + if (!peer)
 + return SCTP_ERROR_DNS_FAILED;
 +
 + sctp_assoc_rm_peer(asoc, peer);
   break;
   case SCTP_PARAM_SET_PRIMARY:
   /* ADDIP Section 4.2.4
 



Re: [PATCH net-next v2] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-28 Thread Vlad Yasevich
On 08/27/2015 10:17 PM, Nikolay Aleksandrov wrote:
 
 On Aug 27, 2015, at 4:47 PM, Vlad Yasevich vyase...@redhat.com wrote:

 On 08/27/2015 05:02 PM, Nikolay Aleksandrov wrote:

 On Aug 26, 2015, at 9:57 PM, roopa ro...@cumulusnetworks.com wrote:

 On 8/26/15, 4:33 AM, Nikolay Aleksandrov wrote:
 On Aug 25, 2015, at 11:06 PM, David Miller da...@davemloft.net wrote:

 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 Date: Tue, 25 Aug 2015 22:28:16 -0700

 Certainly, that should be done and I will look into it, but the
 essence of this patch is a bit different. The problem here is not
 the size of the fdb entries, it’s more the number of them - having
 96000 entries (even if they were 1 byte ones) is just way too much
 especially when the fdb hash size is small and static. We could work
 on making it dynamic though, but still these type of local entries
 per vlan per port can easily be avoided with this option.
 96000 bits can be stored in 12k.  Get where I'm going with this?

 Look at the problem sideways.
 Oh okay, I misunderstood your previous comment. I’ll look into that.

 I just wanted to add the other problems we have had with keeping these 
 macs (mostly from userspace POV):
 - add/del netlink notification storms
 - and large netlink dumps

 In addition to in-kernel optimizations, will be nice to have a solution 
 that reduces the burden on userspace. That will need a newer netlink dump 
 format for fdbs. Considering all the changes needed, Nikolays patch seems 
 less intrusive.

 Right, we need to take these into account as well. I’ll continue the 
 discussion on this (or restart it) because
 I looked into using a bitmap for the local entries only and while it fixes 
 the scalability issue, it presents
 a few new ones which are mostly related to the fact that these entries now 
 exist only without a vlan
 and if a new mac comes along which matches one of these but is in a vlan, 
 the entry will get created
 in br_fdb_update() unless we add a second lookup, but that will slow down 
 the learning path.
 Also this change requires an update of every fdb function that uses the vid 
 as a key (every fdb function?!)
 because now we can have the mac in two places instead of one which is a 
 pretty big churn with lots
 of conditionals all over the place and I don’t like it. Adding this 
 complexity for the local addresses only
 seems like an overkill, so I think to drop this issue for now.

 I seem to recall Roopa and I and maybe a few others were discussing this a
 few years ago at plumbers, I can't remember the details any more.  All these
 local addresses add a ton of confusion.  Does anyone (Stephen?) remember what
 the original reason was for all these local addresses? I wonder if we can
 have a knob to disable all of them (not just per vlan)?  That might be
 cleaner and easier to swallow.

 
 Right, this would be the easiest way and if the others agree - I’ll post a 
 patch for it so we can
 have some way to resolve it today and even if we fix the scalability issue, 
 this is still a valid case
 that some people don’t want local fdbs installed automatically.
 Any objections to this ?
 
 This patch (that works around the initial problem) also has these issues.
 Note that one way to take care of this in a more straight-forward way would 
 be to have each entry
 with some sort of a bitmap (like Vlad has tried earlier) and then we can 
 combine the paths so most
 of these issues disappear, but that will not be easy as was already 
 commented earlier. I’ve looked
 briefly into doing this with rhashtable so we can keep the memory footprint 
 for each entry relatively
 small but it still affects the performance and we can have thousands of 
 resizes happening. 


 So, one of the earlier approaches that I've tried (before rhashtable was
 in the kernel) was to have a hash of vlan ids each with a data structure
 pointing to a list of ports for a given vlan as well as a list of fdbs for
 a given vlan.  As far as scalability goes, that's really the best approach.
 It would also allow us to do packet accounting per vlan.  The only concern
 at the time was performance of ingress lookup.   I think rhashtables might
 help with this as well as ability to grow the footprint of the vlan hash
 table dynamically.

 -vlad

 I’ll look into it but I’m guessing the learning will become a more 
 complicated process with additional 
 allocations and some hash handling.

I don't remember learning being all that complicated.  The hash only changed
under rtnl when vlans were added/removed.  The nice thing is that we wouldn't
need to rebalance, because if the vlan is removed all fdb links get removed
too.  They don't move to another bucket (but that was with a static hash; I
need to look at rhash in more detail).

If you want, I might still have patches hanging around on my machine that had
a hash table implementation.  I can send them to you.

-vlad

 
 On the notification side if we can fix that, we can actually delete

Re: [PATCH net-next v2] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-28 Thread Vlad Yasevich
On 08/28/2015 11:26 AM, Nikolay Aleksandrov wrote:
 
 On Aug 28, 2015, at 5:31 AM, Vlad Yasevich vyase...@redhat.com wrote:

 On 08/27/2015 10:17 PM, Nikolay Aleksandrov wrote:

 On Aug 27, 2015, at 4:47 PM, Vlad Yasevich vyase...@redhat.com wrote:

 On 08/27/2015 05:02 PM, Nikolay Aleksandrov wrote:

 On Aug 26, 2015, at 9:57 PM, roopa ro...@cumulusnetworks.com wrote:

 On 8/26/15, 4:33 AM, Nikolay Aleksandrov wrote:
 On Aug 25, 2015, at 11:06 PM, David Miller da...@davemloft.net wrote:

 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 Date: Tue, 25 Aug 2015 22:28:16 -0700

 Certainly, that should be done and I will look into it, but the
 essence of this patch is a bit different. The problem here is not
 the size of the fdb entries, it’s more the number of them - having
 96000 entries (even if they were 1 byte ones) is just way too much
 especially when the fdb hash size is small and static. We could work
 on making it dynamic though, but still these type of local entries
 per vlan per port can easily be avoided with this option.
 96000 bits can be stored in 12k.  Get where I'm going with this?

 Look at the problem sideways.
 Oh okay, I misunderstood your previous comment. I’ll look into that.

 I just wanted to add the other problems we have had with keeping these 
 macs (mostly from userspace POV):
 - add/del netlink notification storms
 - and large netlink dumps

 In addition to in-kernel optimizations, will be nice to have a solution 
 that reduces the burden on userspace. That will need a newer netlink 
 dump format for fdbs. Considering all the changes needed, Nikolays patch 
 seems less intrusive.

 Right, we need to take these into account as well. I’ll continue the 
 discussion on this (or restart it) because
 I looked into using a bitmap for the local entries only and while it 
 fixes the scalability issue, it presents
 a few new ones which are mostly related to the fact that these entries 
 now exist only without a vlan
 and if a new mac comes along which matches one of these but is in a vlan, 
 the entry will get created
 in br_fdb_update() unless we add a second lookup, but that will slow down 
 the learning path.
 Also this change requires an update of every fdb function that uses the 
 vid as a key (every fdb function?!)
 because now we can have the mac in two places instead of one which is a 
 pretty big churn with lots
 of conditionals all over the place and I don’t like it. Adding this 
 complexity for the local addresses only
 seems like an overkill, so I think to drop this issue for now.

 I seem to recall Roopa and I and maybe a few others were discussing this a
 few years ago at plumbers, I can't remember the details any more.  All these
 local addresses add a ton of confusion.  Does anyone (Stephen?) remember what
 the original reason was for all these local addresses? I wonder if we can
 have a knob to disable all of them (not just per vlan)?  That might be
 cleaner and easier to swallow.


 Right, this would be the easiest way and if the others agree - I’ll post a 
 patch for it so we can
 have some way to resolve it today and even if we fix the scalability issue, 
 this is still a valid case
 that some people don’t want local fdbs installed automatically.
 Any objections to this ?

 This patch (that works around the initial problem) also has these issues.
 Note that one way to take care of this in a more straight-forward way 
 would be to have each entry
 with some sort of a bitmap (like Vlad has tried earlier) and then we can 
 combine the paths so most
 of these issues disappear, but that will not be easy as was already 
 commented earlier. I’ve looked
 briefly into doing this with rhashtable so we can keep the memory 
 footprint for each entry relatively
 small but it still affects the performance and we can have thousands of 
 resizes happening. 


 So, one of the earlier approaches that I've tried (before rhashtable was
 in the kernel) was to have a hash of vlan ids each with a data structure
 pointing to a list of ports for a given vlan as well as a list of fdbs for
 a given vlan.  As far as scalability goes, that's really the best approach.
 It would also allow us to do packet accounting per vlan.  The only concern
 at the time was performance of ingress lookup.   I think rhashtables might
 help with this as well as ability to grow the footprint of the vlan hash
 table dynamically.

 -vlad

 I’ll look into it but I’m guessing the learning will become a more 
 complicated process with additional 
 allocations and some hash handling.

 I don't remember learning being all that complicated.  The hash only changed
 under rtnl when vlans were added/removed.  The nice thing is that we wouldn't
 need to rebalance, because if the vlan is removed all fdb links get removed
 too.  They don't move to another bucket (but that was with a static hash; I
 need to look at rhash in more detail).

 If you want, I might still have patches hanging around on my machine that 
 had

Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 09:19 AM, lucien xin wrote:

 No, these are 2 distinct instances.  In one instance, the peer is reachable
 and is able to communicate its 0 rwnd state to us.  Thus we are being nice and
 granting the peer more time to exit the 0 window state.

 In the other state, the peer is unreachable and we just happen to hit the
 0-window condition based on some estimations of the peer window.  In this
 case, we should be subject to the Max.RTX and terminate the association
 sooner.

 -vlad

 okay, I got you,
 
 we can see that the local side updates peer.rwnd in sctp_packet_append_data()
 and sctp_retransmit_mark() according to a_rwnd and outstanding, so the root
 cause is that it's hard to know whether the peer really closed its window;
 maybe a large amount of outstanding data just led to that.
 
 what we can do is trust that peer.rwnd is the real window of the peer.
 from another angle, even though it may not be real, at least we can reduce
 the *other state* you mentioned by doing this. especially, if there is only
 one small packet that keeps being retransmitted in SHUTDOWN_PENDING state,
 peer.rwnd is more likely to be the real peer window.
 
 I saw that the bsd code does not care about Max.Retrans in SHUTDOWN_PENDING;
 instead it just starts the T5 timer. but now that we choose Max.Retrans + T5,
 it's better to handle more of the unreachable cases by using Max.Retrans. I
 also hope we can do it better there as Marcelo said, but for now I cannot see
 how. :)
 

So one potential way is to have peer.rwnd and peer.a_rwnd, where peer.a_rwnd is
the window advertised by the peer and peer.rwnd is our estimation based on
peer.a_rwnd.  This way we will always know where we stand.

Although I am not sure yet if we want to grow the peer structure any more.

Another way is to have an estimate or 0-window probe bit/flag on the send side
and set it when we do a 0-window probe.  This way we'd know that when the
0-window probe bit is set, the peer returned a 0 window.
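
A rough userspace sketch of the first option, keeping the advertised window separate from the local estimate. The struct and helper names (peer_window, peer_window_on_sack(), and so on) are made up for illustration and do not match the kernel's peer structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slice of per-peer state; field names follow the mail. */
struct peer_window {
	uint32_t a_rwnd;	/* window last advertised by the peer */
	uint32_t rwnd;		/* our estimate: a_rwnd minus data in flight */
};

/* A SACK arrived: store what the peer said, then recompute the estimate. */
static void peer_window_on_sack(struct peer_window *pw, uint32_t a_rwnd,
				uint32_t outstanding)
{
	assert(pw);
	pw->a_rwnd = a_rwnd;
	pw->rwnd = a_rwnd > outstanding ? a_rwnd - outstanding : 0;
}

/* Data was sent: only the estimate shrinks, a_rwnd is left untouched. */
static void peer_window_on_send(struct peer_window *pw, uint32_t len)
{
	assert(pw);
	pw->rwnd = pw->rwnd > len ? pw->rwnd - len : 0;
}

/* True only if the peer itself reported a closed window. */
static bool peer_window_closed_by_peer(const struct peer_window *pw)
{
	return pw->a_rwnd == 0;
}
```

With this split, an estimate of 0 caused purely by outstanding data can be told apart from a window the peer actually closed.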

Just some thoughts.
-vlad


Re: [PATCH net v4] sctp: asconf's process should verify address parameter is in the beginning

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 04:26 AM, Xin Long wrote:
 In sctp_process_asconf(), we get the address parameter from the beginning of
 the addip params, but we never check whether it is really there. If the addr
 param is missing, the chunk can still pass sctp_verify_asconf() and then be
 handled by sctp_process_asconf(), which is not safe.
 
 So add a check in sctp_verify_asconf() that the address parameter is at the
 beginning, and otherwise return false to send an abort.
 
 Note that this also detects multiple address parameters and rejects them.
 
 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Marcelo Ricardo Leitner mleit...@redhat.com

Looks good to me.

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad

 ---
  net/sctp/sm_make_chunk.c | 7 +++
  1 file changed, 7 insertions(+)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..a655ddc 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3132,11 +3132,18 @@ bool sctp_verify_asconf(const struct sctp_association 
 *asoc,
   case SCTP_PARAM_IPV4_ADDRESS:
   if (length != sizeof(sctp_ipv4addr_param_t))
   return false;
 + /* ensure there is only one addr param and it's in the
 +  * beginning of addip_hdr params, or we reject it.
 +  */
 + if (param.v != addip->addip_hdr.params)
 + return false;
   addr_param_seen = true;
   break;
   case SCTP_PARAM_IPV6_ADDRESS:
   if (length != sizeof(sctp_ipv6addr_param_t))
   return false;
 + if (param.v != addip->addip_hdr.params)
 + return false;
   addr_param_seen = true;
   break;
   case SCTP_PARAM_ADD_IP:
 



Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 10:49 AM, lucien xin wrote:

 So one potential way is to have peer.rwnd and peer.a_rwnd, where peer.a_rwnd
 is the window advertised by the peer and peer.rwnd is our estimation based on
 peer.a_rwnd.  This way we will always know where we stand.

 Although I am not sure yet if we want to grow the peer structure any more.

 Another way is to have an estimate or 0-window probe bit/flag on the send
 side and set it when we do a 0-window probe.  This way we'd know that when
 the 0-window probe bit is set, the peer returned a 0 window.

 I think updating 0-window may happen in sctp_process_init() and
 sctp_outq_sack(). I don't think 0-window can be probed, because an
 unreachable peer and a closed window both produce no reply from the peer.
 But we can update the 0-window bit in those two functions. I just do not
 know whether there is an available bit we can use without changing the peer
 structure.

You can ignore INIT as the window will never be 0 (not allowed).

The updates could happen at the end of sctp_outq_sack().  There are some spare
bits in peer if you want to go that way.
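
A minimal sketch of the flag approach, assuming a hypothetical spare bit named zero_window_announced that is refreshed once SACK processing is done. The tie-in to T5 follows the subject of this thread; none of these names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct peer_bits {
	unsigned int zero_window_announced:1;	/* hypothetical spare bit */
};

/* Run at the end of SACK processing (the mail suggests the end of
 * sctp_outq_sack()): remember whether the peer itself reported 0.
 */
static void record_sack_window(struct peer_bits *pb, uint32_t sack_a_rwnd)
{
	assert(pb);
	pb->zero_window_announced = (sack_a_rwnd == 0);
}

/* Arm T5 only if the peer announced a closed window and we are in
 * SHUTDOWN_PENDING; an estimated 0 window alone is not enough.
 */
static bool may_arm_t5(const struct peer_bits *pb, bool shutdown_pending)
{
	return shutdown_pending && pb->zero_window_announced;
}
```

INIT handling is left out on purpose, since the window there can never be 0.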

-vlad


 
 Just some thoughts.
 -vlad



Re: [PATCH net-next v2] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-27 Thread Vlad Yasevich
On 08/27/2015 05:02 PM, Nikolay Aleksandrov wrote:
 
 On Aug 26, 2015, at 9:57 PM, roopa ro...@cumulusnetworks.com wrote:

 On 8/26/15, 4:33 AM, Nikolay Aleksandrov wrote:
 On Aug 25, 2015, at 11:06 PM, David Miller da...@davemloft.net wrote:

 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 Date: Tue, 25 Aug 2015 22:28:16 -0700

 Certainly, that should be done and I will look into it, but the
 essence of this patch is a bit different. The problem here is not
 the size of the fdb entries, it’s more the number of them - having
 96000 entries (even if they were 1 byte ones) is just way too much
 especially when the fdb hash size is small and static. We could work
 on making it dynamic though, but still these type of local entries
 per vlan per port can easily be avoided with this option.
 96000 bits can be stored in 12k.  Get where I'm going with this?

 Look at the problem sideways.
 Oh okay, I misunderstood your previous comment. I’ll look into that.

 I just wanted to add the other problems we have had with keeping these macs 
 (mostly from userspace POV):
 - add/del netlink notification storms
 - and large netlink dumps

 In addition to in-kernel optimizations, will be nice to have a solution that 
 reduces the burden on userspace. That will need a newer netlink dump format 
 for fdbs. Considering all the changes needed, Nikolays patch seems less 
 intrusive.
 
 Right, we need to take these into account as well. I’ll continue the 
 discussion on this (or restart it) because
 I looked into using a bitmap for the local entries only and while it fixes 
 the scalability issue, it presents
 a few new ones which are mostly related to the fact that these entries now 
 exist only without a vlan
 and if a new mac comes along which matches one of these but is in a vlan, the 
 entry will get created
 in br_fdb_update() unless we add a second lookup, but that will slow down the 
 learning path.
 Also this change requires an update of every fdb function that uses the vid 
 as a key (every fdb function?!)
 because now we can have the mac in two places instead of one which is a 
 pretty big churn with lots
 of conditionals all over the place and I don’t like it. Adding this 
 complexity for the local addresses only
 seems like an overkill, so I think to drop this issue for now.

I seem to recall Roopa and I and maybe a few others were discussing this a few
years ago at plumbers, I can't remember the details any more.  All these local
addresses add a ton of confusion.  Does anyone (Stephen?) remember what the
original reason was for all these local addresses? I wonder if we can have
a knob to disable all of them (not just per vlan)?  That might be cleaner and
easier to swallow.

 This patch (that works around the initial problem) also has these issues.
 Note that one way to take care of this in a more straight-forward way would 
 be to have each entry
 with some sort of a bitmap (like Vlad has tried earlier) and then we can 
 combine the paths so most
 of these issues disappear, but that will not be easy as was already commented 
 earlier. I’ve looked
 briefly into doing this with rhashtable so we can keep the memory footprint 
 for each entry relatively
 small but it still affects the performance and we can have thousands of 
 resizes happening. 
 

So, one of the earlier approaches that I've tried (before rhashtable was
in the kernel) was to have a hash of vlan ids each with a data structure
pointing to a list of ports for a given vlan as well as a list of fdbs for
a given vlan.  As far as scalability goes, that's really the best approach.
It would also allow us to do packet accounting per vlan.  The only concern
at the time was performance of ingress lookup.   I think rhashtables might
help with this as well as ability to grow the footprint of the vlan hash
table dynamically.
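
For reference, the layout described above could look roughly like this in plain userspace C. It is a sketch only: a static chained hash with made-up names (vlan_node, vlan_table, vlan_lookup(), ...), not the patches I had and not rhashtable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_HASH_SIZE 64u	/* assumed static table size */

struct fdb_entry { uint8_t mac[6]; struct fdb_entry *next; };
struct port_ref { int ifindex; struct port_ref *next; };

/* One node per vlan id, owning its port list and its fdb list. */
struct vlan_node {
	uint16_t vid;
	struct port_ref *ports;
	struct fdb_entry *fdbs;
	uint64_t rx_packets;	/* per-vlan accounting */
	struct vlan_node *next;	/* hash chain */
};

struct vlan_table { struct vlan_node *buckets[VLAN_HASH_SIZE]; };

static unsigned int vlan_bucket(uint16_t vid)
{
	return vid % VLAN_HASH_SIZE;
}

static struct vlan_node *vlan_lookup(const struct vlan_table *t, uint16_t vid)
{
	struct vlan_node *v;

	for (v = t->buckets[vlan_bucket(vid)]; v; v = v->next)
		if (v->vid == vid)
			return v;
	return NULL;
}

static void vlan_insert(struct vlan_table *t, struct vlan_node *v)
{
	unsigned int b = vlan_bucket(v->vid);

	assert(vlan_lookup(t, v->vid) == NULL);	/* no duplicate vids */
	v->next = t->buckets[b];
	t->buckets[b] = v;
}

/* Unlink a vlan; its ports and fdbs leave the table together with it. */
static struct vlan_node *vlan_remove(struct vlan_table *t, uint16_t vid)
{
	struct vlan_node **pp = &t->buckets[vlan_bucket(vid)];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->vid == vid) {
			struct vlan_node *v = *pp;

			*pp = v->next;
			v->next = NULL;
			return v;
		}
	}
	return NULL;
}
```

The ingress path pays one vlan_lookup() per packet, which is where the performance concern came from.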

-vlad

 On the notification side if we can fix that, we can actually delete the 96000 
 entries without creating a
 huge notification storm and do a user-land workaround of the original issue, 
 so I’ll look into that next.
 
 Any comments or ideas are very welcome.
 
 Thank you,
  Nik
 



Re: [PATCH net v3] sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state

2015-08-27 Thread Vlad Yasevich
On 08/26/2015 04:52 PM, Xin Long wrote:
 Commit f8d960524328 (sctp: Enforce retransmission limit during shutdown)
 fixed a problem with excessive retransmissions in the SHUTDOWN_PENDING by not
 resetting the association overall_error_count.  This allowed the association
 to better enforce assoc.max_retrans limit.
 
 However, the same issue still exists when the association is in
 SHUTDOWN_RECEIVED state.  In this state, HB-ACKs will continue to reset the
 overall_error_count for the association, which would extend the lifetime of
 the association unnecessarily.
 
 This patch solves this by resetting the overall_error_count whenever the
 current state is smaller than SCTP_STATE_SHUTDOWN_PENDING.  As a small
 side-effect, we end up also handling SCTP_STATE_SHUTDOWN_ACK_SENT and
 SCTP_STATE_SHUTDOWN_SENT states, but they are not really impacted because we
 disable Heartbeats in those states.
 
 Fixes: Commit f8d960524328 (sctp: Enforce retransmission limit during 
 shutdown)
 Signed-off-by: Xin Long lucien@gmail.com
 ---
  net/sctp/sm_sideeffect.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index fef2acd..85e6f03 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -702,7 +702,7 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
* outstanding data and rely on the retransmission limit be reached
* to shutdown the association.
*/
 - if (t->asoc->state != SCTP_STATE_SHUTDOWN_PENDING)
 + if (t->asoc->state < SCTP_STATE_SHUTDOWN_PENDING)
   t->asoc->overall_error_count = 0;
  
   /* Clear the hb_sent flag to signal that we had a good
 

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad
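
The effect of the one-line change can be sketched in userspace as follows. The enum order is assumed to mirror the kernel's state numbering, and the two helpers are hypothetical names for the old and new condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to follow the kernel's ordering of association states. */
enum assoc_state {
	ST_CLOSED,
	ST_COOKIE_WAIT,
	ST_COOKIE_ECHOED,
	ST_ESTABLISHED,
	ST_SHUTDOWN_PENDING,
	ST_SHUTDOWN_SENT,
	ST_SHUTDOWN_RECEIVED,
	ST_SHUTDOWN_ACK_SENT
};

/* Old condition: every state except SHUTDOWN_PENDING resets the count. */
static bool resets_error_count_old(enum assoc_state s)
{
	return s != ST_SHUTDOWN_PENDING;
}

/* New condition: only states before SHUTDOWN_PENDING reset the count. */
static bool resets_error_count_new(enum assoc_state s)
{
	assert(s >= ST_CLOSED && s <= ST_SHUTDOWN_ACK_SENT);
	return s < ST_SHUTDOWN_PENDING;
}
```

Only the states after SHUTDOWN_PENDING change behavior; ESTABLISHED and earlier still reset the count on a good HB-ACK.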


Re: [PATCH net v3] sctp: asconf's process should verify address parameter is in the beginning

2015-08-26 Thread Vlad Yasevich
On 08/26/2015 05:09 PM, lucien xin wrote:
 On Thu, Aug 27, 2015 at 4:59 AM, Marcelo Ricardo Leitner
 mleit...@redhat.com wrote:
 On Wed, Aug 26, 2015 at 04:42:21PM -0400, Vlad Yasevich wrote:
 On 08/26/2015 04:35 PM, Xin Long wrote:
 In sctp_process_asconf(), we get the address parameter from the beginning of
 the addip params, but we never check whether it is really there. If the addr
 param is missing, the chunk can still pass sctp_verify_asconf() and then be
 handled by sctp_process_asconf(), which is not safe.

 So add a check in sctp_verify_asconf() that the address parameter is at the
 beginning, and otherwise return false to send an abort.

 v2->v3:
  * put the check in the loop, add the check for multiple address parameters.


 Please split the multiple address detection from first address detection.
 They are 2 different bugs and each one deserves a separate commit and
 changelog.

 See below, thx.


 Thanks
 -vlad

 v1->v2:
  * put the check behind the params' length verify.

 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Vlad Yasevich vyase...@redhat.com
 ---
  net/sctp/sm_make_chunk.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..4068fe1 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3130,14 +3130,24 @@ bool sctp_verify_asconf(const struct 
 sctp_association *asoc,
 case SCTP_PARAM_ERR_CAUSE:
 break;
 case SCTP_PARAM_IPV4_ADDRESS:
 +   if (addr_param_seen) {
 +   /* peer placed multiple address parameters into
 +* the same asconf. reject it.
 +*/
 +   return false;
 +   }
 if (length != sizeof(sctp_ipv4addr_param_t))
 return false;
 -   addr_param_seen = true;
 +   if (param.v == addip->addip_hdr.params)
 +   addr_param_seen = true;
 break;

 I know I had suggested using addr_param_seen to check for multiple
 occurrences, but now realized we can simplify this with something like:

 +   if (param.v != addip->addip_hdr.params)
 +   return false;
 addr_param_seen = true;

 Then the check against addr_param_seen is not needed and we do both checks
 at once.

 looks nice, Vlad ?
 

yes.  This is fine too.  I think this kills 2 bugs with 1 patch...

If you go this route, make sure to document this well in the change log.
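
To spell out why the single position check covers both cases, here is a small userspace sketch. The parameter walk is reduced to an array of types, and the names (param_type, verify_asconf_params()) are invented for the example; requiring the address parameter to be present at all is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum param_type {
	PARAM_IPV4_ADDRESS,
	PARAM_IPV6_ADDRESS,
	PARAM_ADD_IP,
	PARAM_DEL_IP,
	PARAM_SET_PRIMARY
};

/* Accept the list only if an address parameter sits in the first slot and
 * nowhere else.
 */
static bool verify_asconf_params(const enum param_type *params, size_t n)
{
	bool addr_param_seen = false;
	size_t i;

	assert(params != NULL || n == 0);
	for (i = 0; i < n; i++) {
		switch (params[i]) {
		case PARAM_IPV4_ADDRESS:
		case PARAM_IPV6_ADDRESS:
			/* Any address past slot 0 is either misplaced or a
			 * repeat, so one test rejects both.
			 */
			if (i != 0)
				return false;
			addr_param_seen = true;
			break;
		default:
			break;
		}
	}
	return addr_param_seen;
}
```

A second address parameter can never be in the first slot, so it fails the same test as a misplaced one.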

-vlad



Re: [PATCH net v3] sctp: asconf's process should verify address parameter is in the beginning

2015-08-26 Thread Vlad Yasevich
On 08/26/2015 04:35 PM, Xin Long wrote:
 In sctp_process_asconf(), we get the address parameter from the beginning of
 the addip params, but we never check whether it is really there. If the addr
 param is missing, the chunk can still pass sctp_verify_asconf() and then be
 handled by sctp_process_asconf(), which is not safe.
 
 So add a check in sctp_verify_asconf() that the address parameter is at the
 beginning, and otherwise return false to send an abort.
 
 v2->v3:
  * put the check in the loop, add the check for multiple address parameters.


Please split the multiple address detection from first address detection.
They are 2 different bugs and each one deserves a separate commit and
changelog.

Thanks
-vlad

 v1->v2:
  * put the check behind the params' length verify.
 
 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Vlad Yasevich vyase...@redhat.com
 ---
  net/sctp/sm_make_chunk.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..4068fe1 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3130,14 +3130,24 @@ bool sctp_verify_asconf(const struct sctp_association 
 *asoc,
   case SCTP_PARAM_ERR_CAUSE:
   break;
   case SCTP_PARAM_IPV4_ADDRESS:
 + if (addr_param_seen) {
 + /* peer placed multiple address parameters into
 +  * the same asconf. reject it.
 +  */
 + return false;
 + }
   if (length != sizeof(sctp_ipv4addr_param_t))
   return false;
 - addr_param_seen = true;
 + if (param.v == addip->addip_hdr.params)
 + addr_param_seen = true;
   break;
   case SCTP_PARAM_IPV6_ADDRESS:
 + if (addr_param_seen)
 + return false;
   if (length != sizeof(sctp_ipv6addr_param_t))
   return false;
 - addr_param_seen = true;
 + if (param.v == addip->addip_hdr.params)
 + addr_param_seen = true;
   break;
   case SCTP_PARAM_ADD_IP:
   case SCTP_PARAM_DEL_IP:
 



Re: [PATCH net v4] sctp: asconf's process should verify address parameter is in the beginning

2015-08-26 Thread Vlad Yasevich
On 08/26/2015 05:03 PM, Xin Long wrote:
 in sctp_process_asconf(), we get address parameter from the beginning of
 the addip params. but we never check if it's really there. if the addr
 param is not there, it still can pass sctp_verify_asconf(), then to be
 handled by sctp_process_asconf(), it will not be safe.
 
 so add detection in sctp_verify_asconf() to check the address parameter is in
 the beginning, or return false to send abort.
 
 Signed-off-by: Xin Long lucien@gmail.com
 Signed-off-by: Vlad Yasevich vyase...@redhat.com

Acked-by: Vlad Yasevich vyasev...@gmail.com

-vlad

 ---
  net/sctp/sm_make_chunk.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..f3fc881 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3132,12 +3132,14 @@ bool sctp_verify_asconf(const struct sctp_association *asoc,
   case SCTP_PARAM_IPV4_ADDRESS:
   if (length != sizeof(sctp_ipv4addr_param_t))
   return false;
 - addr_param_seen = true;
 + if (param.v == addip->addip_hdr.params)
 + addr_param_seen = true;
   break;
   case SCTP_PARAM_IPV6_ADDRESS:
   if (length != sizeof(sctp_ipv6addr_param_t))
   return false;
 - addr_param_seen = true;
 + if (param.v == addip->addip_hdr.params)
 + addr_param_seen = true;
   break;
   case SCTP_PARAM_ADD_IP:
   case SCTP_PARAM_DEL_IP:
 



Re: [PATCH net-next] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-26 Thread Vlad Yasevich
On 08/24/2015 08:55 PM, Nikolay Aleksandrov wrote:
 From: Nikolay Aleksandrov niko...@cumulusnetworks.com
 
 This patch adds a new knob that, when enabled, allows to suppress the
 installation of local fdb entries in newly created vlans. This could
 pose a big scalability issue if we have a large number of ports and a
 large number of vlans, e.g. in a 48 port device with 2000 vlans these
 entries easily go up to 96000.
 Note that packets for these macs are still received properly because they
 are added in vlan 0 as own macs and referenced when fdb lookup by vlan
 results in a miss.
 Also note that vlan membership of ingress port and the bridge device
 as egress are still being correctly enforced.
 
 The default (0/off) is keeping the current behaviour.
 
 Based on a patch by Wilson Kok (w...@cumulusnetworks.com).
 
 Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com
 ---
 As usual I'll post iproute2 patch if this one gets accepted.
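
The fallback described in the patch notes above (a miss on the per-vlan lookup falls back to the own-mac entry kept in vlan 0) can be sketched roughly as follows. This is an illustrative, self-contained model and not the bridge code: the simplified `struct fdb_entry`, the linear scan standing in for the fdb hash, and the name `fdb_lookup_with_fallback` are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified fdb entry; the real bridge structure differs. */
struct fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	bool is_local;
};

/*
 * Look up (mac, vid). On a miss, fall back to a local entry for the same
 * mac stored under vid 0. A linear scan stands in for the fdb hash table.
 */
const struct fdb_entry *fdb_lookup_with_fallback(const struct fdb_entry *tbl,
						 size_t n,
						 const uint8_t mac[6],
						 uint16_t vid)
{
	const struct fdb_entry *fallback = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (memcmp(tbl[i].mac, mac, 6) != 0)
			continue;
		if (tbl[i].vid == vid)
			return &tbl[i];		/* exact per-vlan hit */
		if (tbl[i].vid == 0 && tbl[i].is_local)
			fallback = &tbl[i];	/* own mac kept once in vlan 0 */
	}
	return fallback;
}
```

Under this model a local mac needs only its vlan 0 entry, so the per-vlan copies that produce the 48 x 2000 = 96000 figure quoted above are not required for the lookup to succeed.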
 

... snip...

 diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
 index 3cef6892c0bb..f9efa1b07994 100644
 --- a/net/bridge/br_vlan.c
 +++ b/net/bridge/br_vlan.c
 @@ -98,11 +98,12 @@ static int __vlan_add(struct net_port_vlans *v, u16 vid, u16 flags)
   return err;
   }
  
 - err = br_fdb_insert(br, p, dev->dev_addr, vid);
 - if (err) {
 - br_err(br, "failed insert local address into bridge "
 - "forwarding table\n");
 - goto out_filt;
 + if (!br_vlan_ignore_local_fdb(br) || !v->port_idx) {
 + err = br_fdb_insert(br, p, dev->dev_addr, vid);
 + if (err) {
 + br_err(br, "failed insert local address into bridge forwarding table\n");
 + goto out_filt;
 + }
   }


One question.  Does it make sense to push this down into br_fdb_insert?
This patch prevents automatic entries from being added.  But what about
manual entries for a local fdb?  The code in br_fdb_add() will still add a
vid 0 entry as well as entries for all vlans currently configured on the port.

-vlad

   set_bit(vid, v->vlan_bitmap);
 @@ -492,6 +493,13 @@ int br_vlan_filter_toggle(struct net_bridge *br, unsigned long val)
   return 0;
  }
  
 +int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br, unsigned long val)
 +{
 + br->vlan_ignore_local_fdb = val ? true : false;
 +
 + return 0;
 +}
 +
  int br_vlan_set_proto(struct net_bridge *br, unsigned long val)
  {
   int err = 0;
 



Re: [PATCH net-next v2] bridge: vlan: allow to suppress local mac install for all vlans

2015-08-26 Thread Vlad Yasevich
On 08/26/2015 02:10 AM, B Viswanath wrote:

 I'd rather we fix the essence of the scalability problem than add
 more spaghetti code to the various bridge paths.

 Can we make the fdb entries smaller?

 Can we enhance how we store such local entries such that they live in
 a compact datastructure?  Perhaps the FDB can consist of a very dense
 lookup mechanism for local stuff sitting alongside the current table.

 Certainly, that should be done and I will look into it, but the essence of
 this patch is a bit different. The problem here is not the size of the fdb
 entries, it's more the number of them - having 96000 entries (even if they
 were 1 byte ones) is just way too much especially when the fdb hash size is
 small and static. We could work on making it dynamic though, but still these
 type of local entries per vlan per port can easily be avoided with this
 option.

 
 I was wondering if it is possible to assign a vlan bitmap for the FDB
 entry, instead of replicating the entry for each vlan. ( I believe
 Roopa has done something similar, but not so sure). This means that
 the number of FDB entries remain static for any number of vlans.
 
 I guess its more complicated than it sounds, but just wanted to know
 if its feasible at all.

I've actually had this done in one of the earlier attempts.  The issue was how
to compress it because there was absolutely no gain if you have a sparse vlan
bitmap.

I even tried doing something along the lines of vlan_group array, but that can
explode to full size almost as fast.

What actually worked better was a hash table of vlans where each entry in the
table contained a bunch of data, one of which was a list of fdbs for a given
vlan.  It didn't replicate fdbs but simply referenced the ones we cared about
and bumped the ref.

However, this made vlan look-ups slower since we now had a hash instead of a
bitmap lookup and Stephen rejected it.

-vlad
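
A minimal sketch of the layout described above: a hash table of vlans in which each vlan keeps a list of references to shared fdb entries and bumps a reference count instead of copying them. None of these names come from the bridge code; `struct vlan_table`, `vlan_find`, `vlan_attach_fdb` and the bucket count are hypothetical, and locking, lookup by mac and teardown are omitted.

```c
#include <stdint.h>
#include <stdlib.h>

#define VLAN_HASH_SIZE 64	/* illustrative bucket count */

struct fdb_entry {
	uint8_t mac[6];
	unsigned int refcnt;	/* number of vlans referencing this entry */
};

struct fdb_ref {
	struct fdb_entry *fdb;
	struct fdb_ref *next;
};

struct vlan_entry {
	uint16_t vid;
	struct fdb_ref *fdbs;	/* fdbs that apply to this vlan */
	struct vlan_entry *next;
};

struct vlan_table {
	struct vlan_entry *bucket[VLAN_HASH_SIZE];
};

/* Hash lookup; this replaces what used to be a single bitmap test. */
struct vlan_entry *vlan_find(const struct vlan_table *t, uint16_t vid)
{
	struct vlan_entry *v = t->bucket[vid % VLAN_HASH_SIZE];

	while (v && v->vid != vid)
		v = v->next;
	return v;
}

/* Reference an existing fdb from a vlan rather than replicating it. */
int vlan_attach_fdb(struct vlan_table *t, uint16_t vid, struct fdb_entry *fdb)
{
	struct vlan_entry *v = vlan_find(t, vid);
	struct fdb_ref *ref;

	if (!v) {
		v = calloc(1, sizeof(*v));
		if (!v)
			return -1;
		v->vid = vid;
		v->next = t->bucket[vid % VLAN_HASH_SIZE];
		t->bucket[vid % VLAN_HASH_SIZE] = v;
	}
	ref = malloc(sizeof(*ref));
	if (!ref)
		return -1;
	ref->fdb = fdb;
	ref->next = v->fdbs;
	v->fdbs = ref;
	fdb->refcnt++;
	return 0;
}
```

The trade-off mentioned in the message is visible in `vlan_find`: membership becomes a bucket walk rather than a bit test, which is the slowdown that led to the rejection.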

 
 Thanks
 Vissu
 



Re: [PATCH net v2] sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state

2015-08-26 Thread Vlad Yasevich
On 08/23/2015 07:30 AM, Xin Long wrote:
 commit f8d960524 fix the 0 peer.rwnd issue in SHUTDOWN_PENDING state through
 not reseting the overall_error_count when receive a heartbeat, but the same
 issue also exists in SHUTDOWN_RECEIVE state.
 
 so we change the condition to state < SCTP_STATE_SHUTDOWN_PENDING to reset the
 overall_error_count when receive a heartbeat, which can avoid the issue happen
 in SCTP_STATE_SHUTDOWN_RECEIVE.
 
 as to SCTP_STATE_SHUTDOWN_ACK_SENT and SCTP_STATE_SHUTDOWN_SENT state, with
 this patch, it will not be affected by the heartbeat, cause these two states
 have been taken charge of by t2 timer.
 
 Fixes: f8d960524 (sctp: Enforce retransmission limit during shutdown)
 Signed-off-by: Xin Long lucien@gmail.com


The code is OK, but the change log could use some help.

How is this for the explanation:

Commit f8d960524 (sctp: Enforce retransmission limit during shutdown) fixed a
problem with excessive retransmissions in the SHUTDOWN_PENDING state by not
resetting the association overall_error_count.  This allowed the association
to better enforce the assoc.max_retrans limit.

However, the same issue still exists when the association is in the
SHUTDOWN_RECEIVED state.  In this state, HB-ACKs will continue to reset the
overall_error_count for the association, which would extend the lifetime of
the association unnecessarily.

This patch solves this by resetting the overall_error_count whenever the
current state is smaller than SCTP_STATE_SHUTDOWN_PENDING.  As a small
side-effect, we end up also handling SCTP_STATE_SHUTDOWN_ACK_SENT and
SCTP_STATE_SHUTDOWN_SENT states, but they are not really impacted because we
disable Heartbeats in those states.


Thanks
-vlad
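
To make the ordered comparison concrete, here is a small stand-alone sketch. It assumes the association states are an enum ordered as listed below, which is how I believe the kernel's state enum is laid out, so treat the order as an assumption. The `ST_*` names, `struct assoc` and `on_heartbeat_ack` are hypothetical stand-ins, not kernel identifiers.

```c
/* Assumed ordering: every shutdown state sorts at or after SHUTDOWN_PENDING. */
enum assoc_state {
	ST_CLOSED,
	ST_COOKIE_WAIT,
	ST_COOKIE_ECHOED,
	ST_ESTABLISHED,
	ST_SHUTDOWN_PENDING,
	ST_SHUTDOWN_SENT,
	ST_SHUTDOWN_RECEIVED,
	ST_SHUTDOWN_ACK_SENT
};

struct assoc {
	enum assoc_state state;
	unsigned int overall_error_count;
};

/* Reset the error count on a heartbeat ack only before any shutdown state. */
void on_heartbeat_ack(struct assoc *a)
{
	if (a->state < ST_SHUTDOWN_PENDING)
		a->overall_error_count = 0;
}
```

With a `!=` test only SHUTDOWN_PENDING keeps its error count; with `<` all four shutdown states keep it, which is the behaviour the proposed changelog describes.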


 ---
  net/sctp/sm_sideeffect.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
 index fef2acd..85e6f03 100644
 --- a/net/sctp/sm_sideeffect.c
 +++ b/net/sctp/sm_sideeffect.c
 @@ -702,7 +702,7 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
* outstanding data and rely on the retransmission limit be reached
* to shutdown the association.
*/
 - if (t->asoc->state != SCTP_STATE_SHUTDOWN_PENDING)
 + if (t->asoc->state < SCTP_STATE_SHUTDOWN_PENDING)
   t->asoc->overall_error_count = 0;
  
   /* Clear the hb_sent flag to signal that we had a good
 



Re: [PATCH net v2] sctp: asconf's process should verify address parameter is in the beginning

2015-08-25 Thread Vlad Yasevich
On 08/25/2015 10:01 AM, Marcelo Ricardo Leitner wrote:
 On Tue, Aug 25, 2015 at 08:29:24PM +0800, Xin Long wrote:
 in sctp_process_asconf(), we get address parameter from the beginning of
 the addip params. but we never check if it's really there. if the addr
 param is not there, it still can pass sctp_verify_asconf(), then to be
 handled by sctp_process_asconf(), it will not be safe.

 so add a code in sctp_verify_asconf() to check the address parameter is in
 the beginning, or return false to send abort.

 v1->v2:
  * put the check behind the params' length verify.

 Signed-off-by: Xin Long lucien@gmail.com
 ---
  net/sctp/sm_make_chunk.c | 7 +++
  1 file changed, 7 insertions(+)

 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 06320c8..89a4d1c 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3166,6 +3166,13 @@ bool sctp_verify_asconf(const struct sctp_association *asoc,
  return false;
  if (!addr_param_needed && addr_param_seen)
  return false;
 +if (addr_param_needed && addr_param_seen) {
 +/* Ensure the address parameter is in the beginning */
 +param.v = chunk->skb->data + sizeof(sctp_addiphdr_t);
 
 Using param.v before the loop made sense but after the loop, it will
 cause all packets that hit here to be rejected due to the check below.
 
 +if (param.p->type != SCTP_PARAM_IPV4_ADDRESS &&
 +param.p->type != SCTP_PARAM_IPV6_ADDRESS)
 +return false;
 +}
  if (param.v != chunk->chunk_end)
this one-^
 
 Maybe it's easier if you put this check inside the loop for each ipv4/6,
 and check if it is the first parameter or not by mimicking the way
 sctp_walk_params() finds the first chunk, it's just a pointer
 dereference and that was already checked and performed to reach there.
 
 (You can have some logic with addr_param_seen so you don't catch the
 multiple parameters in there.)

Exactly!

something like this:
    case SCTP_PARAM_IPV4_ADDRESS:
        if (param.v == addip->addip_hdr.params)
            addr_param_seen = true;

Thus marking the parameter as seen only when it's at the beginning...

Then later we can do things like:
    case SCTP_PARAM_IPV4_ADDRESS:
        if (addr_param_seen) {
            /* peer placed multiple address parameters into the same
             * asconf. reject it.
             */
            return false;
        }

-vlad
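
Putting the two fragments above together, a stand-alone sketch of the walk might look like the following. It operates on a raw byte buffer rather than the kernel's `union sctp_params`, so everything here is an assumption for illustration: the helper `rd16`, the function `verify_asconf_params`, the `PARAM_*` constants (type codes 5 and 6 with parameter lengths 8 and 20) and the 4-byte type/length header with padding to a 4-byte boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants for this sketch only. */
enum {
	PARAM_IPV4 = 5,
	PARAM_IPV6 = 6,
	PARAM_HDR_LEN = 4,
	PARAM_IPV4_LEN = 8,
	PARAM_IPV6_LEN = 20
};

/* Read a big-endian 16-bit field. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Walk type/length/value parameters. An address parameter counts as "seen"
 * only when it is the very first parameter, and any address parameter met
 * after that one is rejected as a duplicate.
 */
bool verify_asconf_params(const uint8_t *buf, size_t len, bool addr_needed)
{
	bool addr_seen = false;
	size_t off = 0;

	while (off < len) {
		uint16_t type, plen;
		size_t step;

		if (len - off < PARAM_HDR_LEN)
			return false;		/* partial header */
		type = rd16(buf + off);
		plen = rd16(buf + off + 2);
		if (plen < PARAM_HDR_LEN || plen > len - off)
			return false;		/* partial parameter */

		if (type == PARAM_IPV4 || type == PARAM_IPV6) {
			if (addr_seen)
				return false;	/* multiple address params */
			if (plen != (type == PARAM_IPV4 ? PARAM_IPV4_LEN
							: PARAM_IPV6_LEN))
				return false;
			if (off == 0)
				addr_seen = true;	/* only at the start */
		}

		/* Advance to the next 4-byte aligned parameter. */
		step = ((size_t)plen + 3) & ~(size_t)3;
		off = (step > len - off) ? len : off + step;
	}

	if (addr_needed != addr_seen)
		return false;
	return true;
}
```

Because the flag is set only at offset 0, a buffer whose first parameter is something else never satisfies `addr_needed`, even if an address parameter appears later.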


 
   Marcelo
 
  return false;
  
 -- 
 2.1.0




Re: [PATCH net] sctp: asconf's process should verify address parameter is in the beginning

2015-08-24 Thread Vlad Yasevich
On 08/24/2015 06:07 AM, Xin Long wrote:
 in sctp_process_asconf(), we get address parameter from the beginning of the
 addip params. but we never check if it's really there. if the addr param is 
 not
 there, it still can pass sctp_verify_asconf(), then to be handled by
 sctp_process_asconf(), it will not be safe.
 
 so add a code in sctp_verify_asconf() to check the address parameter is in the
 beginning, or return false to send abort.
 
 Signed-off-by: Xin Long lucien@gmail.com
 ---
  net/sctp/sm_make_chunk.c | 8 
  1 file changed, 8 insertions(+)
 
 diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
 index 0ee5ca7..a2a72d5 100644
 --- a/net/sctp/sm_make_chunk.c
 +++ b/net/sctp/sm_make_chunk.c
 @@ -3122,6 +3122,14 @@ bool sctp_verify_asconf(const struct sctp_association *asoc,
   union sctp_params param;
   bool addr_param_seen = false;
  
 + if(addr_param_needed){
 + /* Ensure the address parameter is in the beginning */
 + param.v = chunk->skb->data + sizeof(sctp_addiphdr_t);
 + if (param.p->type != SCTP_PARAM_IPV4_ADDRESS &&
 + param.p->type != SCTP_PARAM_IPV6_ADDRESS)
 + return false;
 + }
 +

Sorry, you can't do that directly without a lot more checks.  The parameter
may be only partial, or may not be there at all.  You'd end up looking
at the wrong memory.

A better way would be to set the addr_param_seen only when looking at
the first parameter (addip_hdr.params).

-vlad
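
For illustration, these are roughly the checks a direct peek at the first parameter would need before its type field can be trusted. This is a sketch over a plain byte buffer, not kernel code: `first_param_is_address`, `TLV_HDR_LEN` and the literal type codes 5 and 6 are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLV_HDR_LEN 4u	/* 2-byte type + 2-byte length, both big-endian */

/* True only if a complete address parameter starts at buf[0]. */
bool first_param_is_address(const uint8_t *buf, size_t len)
{
	uint16_t type, plen;

	if (len < TLV_HDR_LEN)
		return false;	/* no parameter at all, or a cut-off header */

	type = (uint16_t)((buf[0] << 8) | buf[1]);
	plen = (uint16_t)((buf[2] << 8) | buf[3]);

	if (plen < TLV_HDR_LEN || plen > len)
		return false;	/* parameter is only partially present */

	return type == 5 || type == 6;
}
```

The parameter walk already performs these bounds checks on every iteration, which is why setting the flag inside the loop avoids duplicating them.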

   sctp_walk_params(param, addip, addip_hdr.params) {
   size_t length = ntohs(param.p->length);
  
 




Re: [PATCH net] sctp: partial chunk should be drop without sending abort packet

2015-08-24 Thread Vlad Yasevich
On 08/24/2015 06:08 AM, Xin Long wrote:
 as RFC 4960, 6.10 said, *if the receiver detects a partial chunk, it MUST drop
 the chunk*, we should not send the abort. but if we put this discard to inside
 state machine, it will send abort.
 

Actually, silently dropping this is _very_ bad.  The reason is that you've
already processed the leading chunks and may have potentially queued a
response...  Now, you reach the end of the packet and find that the last chunk
is partial.  You end up dropping the packet, but still handing out the
responses.  This actually led to some very interesting issues we were seeing.

It is better to terminate the association in this case.

-vlad
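
The ordering problem can be seen in a small model of the chunk walk. This is only a sketch: `walk_chunks`, `enum walk_result` and the `handled` counter are hypothetical, and incrementing the counter stands in for handing a chunk to the state machine, which may queue a response.

```c
#include <stddef.h>
#include <stdint.h>

enum walk_result { WALK_OK, WALK_PARTIAL };

/*
 * Walk the chunks of one packet. Each chunk starts with a 4-byte header
 * (type, flags, 16-bit big-endian length) and is padded to 4 bytes.
 * *handled counts the chunks processed before the walk stops.
 */
enum walk_result walk_chunks(const uint8_t *pkt, size_t len, size_t *handled)
{
	size_t off = 0;

	*handled = 0;
	while (off < len) {
		size_t clen, step;

		if (len - off < 4)
			return WALK_PARTIAL;
		clen = (size_t)((pkt[off + 2] << 8) | pkt[off + 3]);
		if (clen < 4 || clen > len - off)
			return WALK_PARTIAL;	/* earlier chunks already handled */

		(*handled)++;	/* stand-in for state machine processing */

		step = (clen + 3) & ~(size_t)3;
		off = (step > len - off) ? len : off + step;
	}
	return WALK_OK;
}
```

When the result is `WALK_PARTIAL` with a non-zero `handled`, part of the packet has already taken effect, so quietly discarding the rest leaves the two ends out of step; terminating the association, as suggested above, avoids that.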

 so we just drop the partial chunk there, never let this chunk go into the 
 state
 machine.
 
 Signed-off-by: Xin Long lucien@gmail.com
 ---
  net/sctp/inqueue.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
 index 7e8a16c..a22ca57 100644
 --- a/net/sctp/inqueue.c
 +++ b/net/sctp/inqueue.c
 @@ -183,9 +183,9 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
   /* This is not a singleton */
   chunk->singleton = 0;
   } else if (chunk->chunk_end > skb_tail_pointer(chunk->skb)) {
 - /* Discard inside state machine. */
 - chunk->pdiscard = 1;
 - chunk->chunk_end = skb_tail_pointer(chunk->skb);
 + sctp_chunk_free(chunk);
 + chunk = queue->in_progress = NULL;
 + return NULL;
   } else {
   /* We are at the end of the packet, so mark the chunk
* in case we need to send a SACK.
 



Re: [PATCH net v2] sctp: start t5 timer only when peer.rwnd is 0 and local.state is SHUTDOWN_PENDING

2015-08-24 Thread Vlad Yasevich
On 08/24/2015 02:31 PM, Marcelo Ricardo Leitner wrote:
 On Mon, Aug 24, 2015 at 02:13:38PM -0400, Vlad Yasevich wrote:
 On 08/23/2015 07:30 AM, Xin Long wrote:
 when A sends a data to B, then A close() and enter into SHUTDOWN_PENDING
 state, if B neither claim his rwnd is 0 nor send SACK for this data, A will
 keep retransmitting this data until t5 timeout, Max.Retrans times can't work
 anymore, which is bad.

 if B's rwnd is not 0, it should send abort after Max.Retrans times, only
 when B's rwnd == 0 and A's retransmitting goes beyond Max.Retrans times, A
 will start t5 timer, which is also what commit f8d960524 means, but it lacks
 the condition peer.rwnd == 0.

 Fixes: f8d960524 (sctp: Enforce retransmission limit during shutdown)
 Signed-off-by: Xin Long lucien@gmail.com
 ---
  net/sctp/sm_statefuns.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
 index 3ee27b7..deb9eab 100644
 --- a/net/sctp/sm_statefuns.c
 +++ b/net/sctp/sm_statefuns.c
 @@ -5412,7 +5412,8 @@ sctp_disposition_t sctp_sf_do_6_3_3_rtx(struct net *net,
 SCTP_INC_STATS(net, SCTP_MIB_T3_RTX_EXPIREDS);
  
  if (asoc->overall_error_count >= asoc->max_retrans) {
 -   if (asoc->state == SCTP_STATE_SHUTDOWN_PENDING) {
 +   if (!q->asoc->peer.rwnd &&
 +   asoc->state == SCTP_STATE_SHUTDOWN_PENDING) {
 /*
  * We are here likely because the receiver had its rwnd
  * closed for a while and we have not been able to


 This may not work as expected.  peer.rwnd is the calculated peer window, but
 it also gets updated when we receive sacks.  So there is no way to tell
 whether the current window is 0 because the peer told us, or because we sent
 data to make it 0 and the peer hasn't responded.
 
 I'm not sure I follow you, Vlad. I don't think we care on why we have
 zero-window in there, just that if we are at it on that stage. Either
 one, if it's zero window, we will go through T5 and give it more time to
 recover, but if it's not zero window, I don't see a reason to enable T5..

No, these are 2 distinct instances.  In one instance, the peer is reachable and
is able to communicate its 0 rwnd state to us.  Thus we are being nice and
granting the peer more time to exit the 0 window state.

In the other instance, the peer is unreachable and we just happen to hit the
0-window condition based on some estimations of the peer window.  In this case,
we should be subject to the Max.RTX and terminate the association sooner.

-vlad
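
One way to tell the two cases apart is to record whether the zero window was actually advertised by the peer, separately from the locally estimated value. The sketch below only illustrates that idea and is not what the patch under review does: `struct peer_view`, its `zero_window_announced` field, `enum rtx_action` and `on_max_retrans` are all hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical view of the peer, for this sketch only. */
struct peer_view {
	unsigned int rwnd;		/* locally estimated peer window */
	bool zero_window_announced;	/* set only when a SACK advertised 0 */
};

enum rtx_action { RTX_ABORT, RTX_START_T5 };

/* Decide what to do once the retransmission limit has been reached. */
enum rtx_action on_max_retrans(const struct peer_view *peer,
			       bool shutdown_pending)
{
	/* Reachable peer that told us its window is closed: give it time. */
	if (shutdown_pending && peer->rwnd == 0 && peer->zero_window_announced)
		return RTX_START_T5;

	/* Window only estimated as 0, or not closed at all: enforce limit. */
	return RTX_ABORT;
}
```

Testing `peer->rwnd == 0` alone cannot separate the cases, because the estimate also drops to zero when data is sent to a peer that has stopped responding.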

 
   Marcelo
 



Re: [PATCH v2 0/2] sctp: fix src address selection if using secondary address

2015-07-20 Thread Vlad Yasevich
On 07/17/2015 11:34 AM, Marcelo Ricardo Leitner wrote:
 This series improves the way SCTP chooses its src address so that the
 chosen one will always belong to the interface being used for output.
 
 v1->v2:
  - split out the refactoring from the fix itself
  - Doing a full reverse routing as in v1 is not necessary. Only looking
for the interface that has the address and comparing its number is
enough.
 
 Marcelo Ricardo Leitner (2):
   sctp: reduce indent level on sctp_v4_get_dst
   sctp: fix src address selection if using secondary addresses
 

For the series:

Acked-by: Vlad Yasevich vyasev...@gmail.com

Thanks
-vlad

  net/sctp/protocol.c | 42 +++---
  1 file changed, 27 insertions(+), 15 deletions(-)
 
