Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-09-13 Thread Michal Soltys
On 2018-07-19 18:20, Michal Soltys wrote:
> On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:
>> From: Mahesh Bandewar 
>>
>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on") changed the behavior
>> of how link-local-multicast packets are processed. The change in
>> the behavior broke some legacy use cases where these packets are
>> expected to arrive on bonding master device also.
>>
>> This patch passes the packet to the stack with the link it arrived
>> on as well as passes to the bonding-master device to preserve the
>> legacy use case.
>>
>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on")
>> Reported-by: Michal Soltys 
>> Signed-off-by: Mahesh Bandewar 
>> ---
>> v2: Added Fixes tag.
>> v1: Initial patch.
>>   drivers/net/bonding/bond_main.c | 17 +++--
>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c
>> index 9a2ea3c1f949..1d3b7d8448f2 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1177,9 +1177,22 @@ static rx_handler_result_t
>> bond_handle_frame(struct sk_buff **pskb)
>>   }
>>   }
>>   -    /* don't change skb->dev for link-local packets */
>> -    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>> +    /* Link-local multicast packets should be passed to the
>> + * stack on the link they arrive as well as pass them to the
>> + * bond-master device. These packets are mostly usable when
>> + * stack receives it with the link on which they arrive
>> + * (e.g. LLDP) but there may be some legacy behavior that
>> + * expects these packets to appear on bonding master too.
> 
> I'd really change the comment from:
> 
> "These packets are mostly usable when stack receives it with the link on
> which they arrive (e.g. LLDP) but there may be some legacy behavior that
> expects these packets to appear on bonding master too."
> 
> to something like:
> 
> "These packets are mostly usable when stack receives it with the link on
> which they arrive, but they also must be available on aggregations. Some
> of the use cases include (but are not limited to): LLDP agents that must
> be able to operate both on enslaved interfaces as well as on bonds
> themselves; linux bridges that must be able to process/pass BPDUs from
> attached bonds when any kind of stp version is enabled on the network."
> 
> It's a bit longer, but clarifies the reasons more precisely (without
> going too deep into features like group_fwd_mask).
> 

Anyway, any chance for that patch to get merged ? It would be great to
get the correct functionality back asap. As for the comment, I'll submit
a trivial patch expanding/clarifying it later (or I can resubmit
adjusted v3 if it's ok with Mahesh).


Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-07-19 Thread Michal Soltys

On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:

From: Mahesh Bandewar 

Commit b89f04c61efe ("bonding: deliver link-local packets with
skb->dev set to link that packets arrived on") changed the behavior
of how link-local-multicast packets are processed. The change in
the behavior broke some legacy use cases where these packets are
expected to arrive on bonding master device also.

This patch passes the packet to the stack with the link it arrived
on as well as passes to the bonding-master device to preserve the
legacy use case.

Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link 
that packets arrived on")
Reported-by: Michal Soltys 
Signed-off-by: Mahesh Bandewar 
---
v2: Added Fixes tag.
v1: Initial patch.
  drivers/net/bonding/bond_main.c | 17 +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9a2ea3c1f949..1d3b7d8448f2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct 
sk_buff **pskb)
}
}
  
-	/* don't change skb->dev for link-local packets */

-   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
+   /* Link-local multicast packets should be passed to the
+* stack on the link they arrive as well as pass them to the
+* bond-master device. These packets are mostly usable when
+* stack receives it with the link on which they arrive
+* (e.g. LLDP) but there may be some legacy behavior that
+* expects these packets to appear on bonding master too.


I'd really change the comment from:

"These packets are mostly usable when stack receives it with the link on which 
they arrive (e.g. LLDP) but there may be some legacy behavior that expects 
these packets to appear on bonding master too."


to something like:

"These packets are mostly usable when stack receives it with the link on which 
they arrive, but they also must be available on aggregations. Some of the use 
cases include (but are not limited to): LLDP agents that must be able to 
operate both on enslaved interfaces as well as on bonds themselves; linux 
bridges that must be able to process/pass BPDUs from attached bonds when any 
kind of stp version is enabled on the network."


It's a bit longer, but clarifies the reasons more precisely (without going too 
deep into features like group_fwd_mask).


Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-17 Thread Michal Soltys
On 2018-07-17 11:32, Michal Soltys wrote:
> On 07/17/2018 01:53 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
>> On Mon, Jul 16, 2018 at 2:24 PM, Jay Vosburgh
>>  wrote:
>>> Mahesh Bandewar  wrote:
>>>
>>>> From: Mahesh Bandewar 
>>>>
>>>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>>>> skb->dev set to link that packets arrived on") changed the behavior
>>>> of how link-local-multicast packets are processed. The change in
>>>> the behavior broke some legacy use cases where these packets are
>>>> expected to arrive on bonding master device also.
>>>>
>>>> This patch passes the packet to the stack with the link it arrived
>>>> on as well as passes to the bonding-master device to preserve the
>>>> legacy use case.
>>>
>>>  Michal, can you test this?  I'm travelling this week and won't
>>> be able to run the patch.
> 
> 
> Yes, will test today and report.
> 

The patch looks to be working fine - tested both passive bridge
(stp_state == 0) and with in-kernel implementation active (stp_state ==
1). No loops, no issues so far.



Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-17 Thread Michal Soltys

On 07/17/2018 01:57 AM, Mahesh Bandewar (महेश बंडेवार) wrote:

On Mon, Jul 16, 2018 at 4:33 PM, Stephen Hemminger
 wrote:

On Sun, 15 Jul 2018 18:12:46 -0700
Mahesh Bandewar  wrote:


From: Mahesh Bandewar 

Commit b89f04c61efe ("bonding: deliver link-local packets with
skb->dev set to link that packets arrived on") changed the behavior
of how link-local-multicast packets are processed. The change in
the behavior broke some legacy use cases where these packets are
expected to arrive on bonding master device also.

This patch passes the packet to the stack with the link it arrived
on as well as passes to the bonding-master device to preserve the
legacy use case.

Reported-by: Michal Soltys 
Signed-off-by: Mahesh Bandewar 


Thanks for fixing this.

Why not add a Fixes: tag instead of just talking about the commit?
That helps the stable maintainers know which versions of the kernel
need the patch.



Well, I thought about it. It's definitely 'related' but not sure it
'fixes' in true sense. It definitely fixes the broken legacy case
though. Is that sufficient to add 'fixes' tag?



It's __not__ broken legacy case. It's normal behavior, starting with 
specification covering LLDP itself (IEEE Std 802.1AB-2016, page 18, '6.8 LLDP 
and Link Aggregation') and ending with a linux bridge actively doing stp via 
in-kernel implementation or with userspace helper (or inactively passing) and 
being blind to bpdus. Not mentioning a very recent kernel feature like 
per-port group_fwd_mask rendered useless in this case.


Unless you also consider attaching a bond to a linux bridge as a broken legacy 
use case. Among other things mentioned in the other thread.


In this context, the comment in code/log message IMHO (of the attached patch), 
should be changed - as it will be just confusing for anyone reading it in the 
future.


(and I'd very much like the fix to hit relevant stable kernels as well)


Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-17 Thread Michal Soltys

On 07/17/2018 01:53 AM, Mahesh Bandewar (महेश बंडेवार) wrote:

On Mon, Jul 16, 2018 at 2:24 PM, Jay Vosburgh
 wrote:

Mahesh Bandewar  wrote:


From: Mahesh Bandewar 

Commit b89f04c61efe ("bonding: deliver link-local packets with
skb->dev set to link that packets arrived on") changed the behavior
of how link-local-multicast packets are processed. The change in
the behavior broke some legacy use cases where these packets are
expected to arrive on bonding master device also.

This patch passes the packet to the stack with the link it arrived
on as well as passes to the bonding-master device to preserve the
legacy use case.


 Michal, can you test this?  I'm travelling this week and won't
be able to run the patch.



Yes, will test today and report.


Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-14 Thread Michal Soltys

On 07/13/2018 02:15 AM, Mahesh Bandewar (महेश बंडेवार) wrote:

On Thu, Jul 12, 2018 at 4:14 PM, Michal Soltys  wrote:

On 2018-07-13 00:03, Jay Vosburgh wrote:

Mahesh Bandewar (महेश बंडेवार) wrote:


On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
 wrote:

Michal Soltys  wrote:


On 07/12/2018 04:51 PM, Jay Vosburgh wrote:

Mahesh Bandewar (महेश बंडेवार) wrote:


On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:


Hi,

As weird as that sounds, this is what I observed today after bumping
kernel version. I have a setup where 2 bonds are attached to linux
bridge and physically are connected to two switches doing MSTP (and
linux bridge is just passing them).

Initially I suspected some changes related to bridge code - but quick
peek at the code showed nothing suspicious - and the part of it that
explicitly passes stp frames if stp is not enabled has seen little
changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
regular non-bonded interfaces are attached everything works fine.

Just to be sure I detached the bond (802.3ad mode) and checked it with
simple tcpdump (ether proto \\stp) - and indeed no hello packets were
there (with them being present just fine on active enslaved interface,
or on the bond device in earlier kernels).

If time permits I'll bisect tommorow to pinpoint the commit, but from
quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
debian) and 4.17.3 (tested on archlinux) are failing.

Unless this is already a known issue (or you have any suggestions what
could be responsible).


I believe these are link-local-multicast messages and sometime back a
change went into to not pass those frames to the bonding master. This
could be the side effect of that.


  Mahesh, I suspect you're thinking of:

commit b89f04c61efe3b7756434d693b9203cc0cce002e
Author: Chonggang Li 
Date:   Sun Apr 16 12:02:18 2017 -0700

  bonding: deliver link-local packets with skb->dev set to link that 
packets arrived on

  Michal, are you able to revert this patch and test?

  -J

---
  -Jay Vosburgh, jay.vosbu...@canonical.com




Just tested - yes, reverting that patch solves the issues.


 Chonggang,

 Reading the changelog in your commit referenced above, I'm not
entirely sure what actual problem it is fixing.  Could you elaborate?

 As the patch appears to cause a regression, it needs to be
either fixed or reverted.

 Mahesh, you signed-off on it as well, perhaps you also have some
context?



I think the original idea behind it was to pass the LLDPDUs to the
stack on the interface that they came on since this is considered to
be link-local traffic and passing to bond-master would loose it's
"linklocal-ness". This is true for LLDP and if you change the skb->dev
of the packet, then you don't know which slave link it came on in
(from LLDP consumer's perspective).

I don't know much about STP but trunking two links and aggregating
this link info through bond-master seems wrong. Just like LLDP, you
are losing info specific to a link and the decision derived from that
info could be wrong.

Having said that, we determine "linklocal-ness" by looking at L2 and
bondmaster shares this with lts slaves. So it does seem fair to pass
those frames to the bonding-master but at the same time link-local
traffic is supposed to be limited to the physical link (LLDP/STP/LACP
etc). Your thoughts?


   I agree the whole thing sounds kind of weird, but I'm curious as
to what Michal's actual use case is; he presumably has some practical
use for this, since he noticed that the behavior changed.



The whole "link-local" term is a bit I don't know - at this point it
feels like too many things were thrown into single bag and it got
somewhat confusing (bpdu, lldp, pause frames, lacp, pae, qinq mulitcast
that afaik has its own address) - I added some examples in another reply
I did at the same time as you were typing this one =)


   Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how
does that combination work rationally given that the bond might send and
receive traffic across multiple slaves?  Or does the switch side bundle
the ports together into a single logical interface for MSTP purposes?
On the TX side, I think the bond will likely balance all STP frames to
just one slave.



The basic concept - two "main" switches with "important" machines
connected to those. One switch dies and everything keeps working. With
no unused ports and so on.

In more details:

Originally I was trying MSTP daemon (on "important" machines) which
seems quite well and completely coded, but cannot really work correctly
- as afaik you can't put port (in linux bridge conext) in different
forwarding/blocking/etc. state per-region - itow per group of vlans (or
mstpd didn't know how to do that, or it wasn't implemented - I didn't
look too deep back then, though my interest resurfaced in recent da

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Michal Soltys
On 2018-07-13 00:03, Jay Vosburgh wrote:
> Mahesh Bandewar (महेश बंडेवार) wrote:
> 
>>On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
>> wrote:
>>> Michal Soltys  wrote:
>>>
>>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>>
>>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>>>> linux bridge is just passing them).
>>>>>>>
>>>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>>>
>>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>>> there (with them being present just fine on active enslaved interface,
>>>>>>> or on the bond device in earlier kernels).
>>>>>>>
>>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>>
>>>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>>>> could be responsible).
>>>>>>>
>>>>>> I believe these are link-local-multicast messages and sometime back a
>>>>>> change went into to not pass those frames to the bonding master. This
>>>>>> could be the side effect of that.
>>>>>
>>>>>  Mahesh, I suspect you're thinking of:
>>>>>
>>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>>> Author: Chonggang Li 
>>>>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>>>>
>>>>>  bonding: deliver link-local packets with skb->dev set to link that 
>>>>> packets arrived on
>>>>>
>>>>>  Michal, are you able to revert this patch and test?
>>>>>
>>>>>  -J
>>>>>
>>>>> ---
>>>>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>>>>
>>>>
>>>>
>>>>Just tested - yes, reverting that patch solves the issues.
>>>
>>> Chonggang,
>>>
>>> Reading the changelog in your commit referenced above, I'm not
>>> entirely sure what actual problem it is fixing.  Could you elaborate?
>>>
>>> As the patch appears to cause a regression, it needs to be
>>> either fixed or reverted.
>>>
>>> Mahesh, you signed-off on it as well, perhaps you also have some
>>> context?
>>>
>>
>>I think the original idea behind it was to pass the LLDPDUs to the
>>stack on the interface that they came on since this is considered to
>>be link-local traffic and passing to bond-master would loose it's
>>"linklocal-ness". This is true for LLDP and if you change the skb->dev
>>of the packet, then you don't know which slave link it came on in
>>(from LLDP consumer's perspective).
>>
>>I don't know much about STP but trunking two links and aggregating
>>this link info through bond-master seems wrong. Just like LLDP, you
>>are losing info specific to a link and the decision derived from that
>>info could be wrong.
>>
>>Having said that, we determine "linklocal-ness" by looking at L2 and
>>bondmaster shares this with lts slaves. So it does seem fair to pass
>>those frames to the bonding-master but at the same time link-local
>>traffic is supposed to be limited to the physical link (LLDP/STP/LACP
>>etc). Your thoughts?
> 
>   I agree the whole thing sounds

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Michal Soltys
On 2018-07-12 23:26, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
>  wrote:
>> Michal Soltys  wrote:
>>
>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>
>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>>> linux bridge is just passing them).
>>>>>>
>>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>>
>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>> there (with them being present just fine on active enslaved interface,
>>>>>> or on the bond device in earlier kernels).
>>>>>>
>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>
>>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>>> could be responsible).
>>>>>>
>>>>> I believe these are link-local-multicast messages and sometime back a
>>>>> change went into to not pass those frames to the bonding master. This
>>>>> could be the side effect of that.
>>>>
>>>>  Mahesh, I suspect you're thinking of:
>>>>
>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>> Author: Chonggang Li 
>>>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>>>
>>>>  bonding: deliver link-local packets with skb->dev set to link that 
>>>> packets arrived on
>>>>
>>>>  Michal, are you able to revert this patch and test?
>>>>
>>>>  -J
>>>>
>>>> ---
>>>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>>>
>>>
>>>
>>>Just tested - yes, reverting that patch solves the issues.
>>
>> Chonggang,
>>
>> Reading the changelog in your commit referenced above, I'm not
>> entirely sure what actual problem it is fixing.  Could you elaborate?
>>
>> As the patch appears to cause a regression, it needs to be
>> either fixed or reverted.
>>
>> Mahesh, you signed-off on it as well, perhaps you also have some
>> context?
>>
> 
> I think the original idea behind it was to pass the LLDPDUs to the
> stack on the interface that they came on since this is considered to
> be link-local traffic and passing to bond-master would loose it's
> "linklocal-ness". This is true for LLDP and if you change the skb->dev
> of the packet, then you don't know which slave link it came on in
> (from LLDP consumer's perspective).
> 
> I don't know much about STP but trunking two links and aggregating
> this link info through bond-master seems wrong. Just like LLDP, you
> are losing info specific to a link and the decision derived from that
> info could be wrong.
> 
> Having said that, we determine "linklocal-ness" by looking at L2 and
> bondmaster shares this with lts slaves. So it does seem fair to pass
> those frames to the bonding-master but at the same time link-local
> traffic is supposed to be limited to the physical link (LLDP/STP/LACP
> etc). Your thoughts?
> 

But, isn't bond de-facto considered the "physical link" ? Not directly
of course, but say an LLDP daemon would likely be more interested in
getting LLDP data from a bond device (or a bridge device, if the bond is
attached to one), than from its enslaved interfaces (and enslaved
interfaces can be changed, not mentioning potentially complex setup
itself, even if usually it's just lacp 

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Michal Soltys

On 07/12/2018 04:51 PM, Jay Vosburgh wrote:

Mahesh Bandewar (महेश बंडेवार) wrote:


On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:


Hi,

As weird as that sounds, this is what I observed today after bumping
kernel version. I have a setup where 2 bonds are attached to linux
bridge and physically are connected to two switches doing MSTP (and
linux bridge is just passing them).

Initially I suspected some changes related to bridge code - but quick
peek at the code showed nothing suspicious - and the part of it that
explicitly passes stp frames if stp is not enabled has seen little
changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
regular non-bonded interfaces are attached everything works fine.

Just to be sure I detached the bond (802.3ad mode) and checked it with
simple tcpdump (ether proto \\stp) - and indeed no hello packets were
there (with them being present just fine on active enslaved interface,
or on the bond device in earlier kernels).

If time permits I'll bisect tommorow to pinpoint the commit, but from
quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
debian) and 4.17.3 (tested on archlinux) are failing.

Unless this is already a known issue (or you have any suggestions what
could be responsible).


I believe these are link-local-multicast messages and sometime back a
change went into to not pass those frames to the bonding master. This
could be the side effect of that.


Mahesh, I suspect you're thinking of:

commit b89f04c61efe3b7756434d693b9203cc0cce002e
Author: Chonggang Li 
Date:   Sun Apr 16 12:02:18 2017 -0700

 bonding: deliver link-local packets with skb->dev set to link that packets 
arrived on

Michal, are you able to revert this patch and test?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com




Just tested - yes, reverting that patch solves the issues.


[BUG] bonded interfaces drop bpdu (stp) frames

2018-07-11 Thread Michal Soltys
Hi,

As weird as that sounds, this is what I observed today after bumping
kernel version. I have a setup where 2 bonds are attached to linux
bridge and physically are connected to two switches doing MSTP (and
linux bridge is just passing them).

Initially I suspected some changes related to bridge code - but quick
peek at the code showed nothing suspicious - and the part of it that
explicitly passes stp frames if stp is not enabled has seen little
changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
regular non-bonded interfaces are attached everything works fine.

Just to be sure I detached the bond (802.3ad mode) and checked it with
simple tcpdump (ether proto \\stp) - and indeed no hello packets were
there (with them being present just fine on active enslaved interface,
or on the bond device in earlier kernels).

If time permits I'll bisect tommorow to pinpoint the commit, but from
quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
debian) and 4.17.3 (tested on archlinux) are failing.

Unless this is already a known issue (or you have any suggestions what
could be responsible).



Re: drr scheduler [mis]configuration question

2016-09-19 Thread Michal Soltys
> 
> At this point I'm a bit lost what I'm doing wrong.
> 

Just in case (erhm... for sake of completness and archiving) to answer
my own borderline silly question - handles and classids are in hex.


drr scheduler [mis]configuration question

2016-09-16 Thread Michal Soltys
Hi,

I have hit some weird (probably missing some detail) issue with drr.
Originally it was tested between two machines, then I quickly double
checked between namespaces (same behaviour) - the configuration follows:

# setup namespace

ip netns add drrtest
ip li add name left type veth peer name right
ip li set right netns drrtest

# setup 'left' interface with drr

tc qdisc add dev left handle 1:0 root drr
for x in {1..16}; do tc class add dev left classid 1:$x parent 1:0 drr; done
for x in {1..16}; do tc qdisc add dev left handle $((256+x)):0 parent 1:$x 
pfifo_fast; done
tc filter add dev left proto all pref 1 parent 1:0 handle 1 flow map key dst 
divisor 16 baseclass 1:1
ip add add 10.15.0.255/16 dev left
ip li set left up

The above creates simple drr setup with 16 classes, counted from 1:1 to 1:16
The flow filter should simply distribute it across those classes,
relevant piece of code would suggest everything is ok:

if (f->divisor)
classid %= f->divisor;

res->class   = 0;
res->classid = TC_H_MAKE(f->baseclass, f->baseclass + classid);

So 'classid' should 0-15 and after TC_H_MAKE() 1:1-1:16

# setup 'right' interface

ip netns exec drrtest bash

ip add add 10.15.1.249/16 dev right
ip li set right up


But with this setup, any transfer from 'left' to 'right' is blackholed.
If I change ip address of 'right' to 10.15.1.248/16, everything works
again - which would suggest some issue with proper classification.

At this point I'm a bit lost what I'm doing wrong.




Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it

2016-09-08 Thread Michal Soltys
On 2016-09-08 04:03, Toshiaki Makita wrote:
> On 2016/09/08 3:22, Michal Soltys wrote:
> ...
>> 4.7.2
>> git describe on that commit suggests it's been available since 4.6.x
>> 
>> What I did in details:
>> 
>> ip li add name port1b type veth peer name port1e
>> ip li add br0 type bridge
>> ip li set dev br0 type bridge vlan_default_pvid 0
>> ip li set dev br0 type bridge vlan_filtering 1
>> bridge vlan add dev br0 vid 10 self
>> bridge vlan add dev br0 vid 250 untagged pvid self
>> ip li add link br0 name vlan10 type vlan id 10
>> ip li set port1b master br0
>> 
>> At this point br0.vlan10 had outdated mac after br0 took port1b's one as
>> its own.
> 
> If the mac address of lower device is changed while vlan device is down,
> the address will be synchronized when vlan device becomes up.
> Please try "ip li set vlan10 up".
> 

Gah, yea upping interface updates it correctly. Originally I found this
issue on older kernel, so I quickly verified the behaviour on the
current one - but didn't do ip up.

Sorry for the noise.


Re: XPS configuration question (on tg3)

2016-09-07 Thread Michal Soltys
On 2016-09-07 02:19, Eric Dumazet wrote:
> On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote:
>> On 2016-09-06 22:21, Alexander Duyck wrote:
>> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote:
>> >> Hi,
>> >>
>> >> I've been testing different configurations and I didn't manage to get XPS 
>> >> to "behave" correctly - so I'm probably misunderstanding or forgetting 
>> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 
>> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx 
>> >> and rx), 1 was unused (this got me scratching my head a bit) and the 
>> >> remaining one was for the last rx (though due to another bug recently 
>> >> fixed the 4th rx queue was inconfigurable on receive side). The names 
>> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4.
>> >>
>> >> The XPS was configured as:
>> >>
>> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus
>> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus
>> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus
>> >>
>> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue 
>> >> only, 4-7 tx-1 and 8-15 tx-2.
>> >>
>> >> Just in case rx side could get in the way as far as flows go, relevant 
>> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 
>> >> 10 - falling into groups defined by the above masks.
>> >>
>> >> I tested both with mq and multiq scheduler, essentially either this:
>> >>
>> >> qdisc mq 2: root
>> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >>
>> >> or this (for the record, skbaction queue_mapping was behaving correctly 
>> >> with the one below):
>> >>
>> >> qdisc multiq 3: root refcnt 6 bands 3/5
>> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >>
>> >> Now, do I understand correctly, that under the above setup - commands 
>> >> such as
>> >>
>> >> taskset 400 nc -p $prt host_ip 12345 > >> or
>> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10
>> >>
>> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that 
>> >> supports affinity by itself) and sending data to some other host on the 
>> >> net - should *always* use tx-2 queue ?
>> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 
>> >> > >>
>> >> In my case, what queue it used was basically random (on top of that it 
>> >> sometimes changed the used queue mid-transfer) what could be easily 
>> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a 
>> >> bit at loss now, as I though xps configuration should be absolute.
>> >>
>> >> Well, I'd be greatful for some pointers / hints.
>> > 
>> > So it sounds like you have everything configured correctly.  The one
>> > question I would have is if we are certain the CPU pinning is working
>> > for the application.  You might try using something like perf to
>> > verify what is running on CPU 10, and what is running on the CPUs that
>> > the queues are associated with.
>> > 
>> 
>> I did verify with 'top' in this case. I'll double check tommorow just to
>> be sure. Other than testing, there was nothing else running on the machine.
>> 
>> > Also after you have configured things you may want to double check and
>> > verify the xps_cpus value is still set.  I know under some
>> > circumstances the value can be reset by a device driver if the number
>> > of queues changes, or if the interface toggles between being
>> > administratively up/down.
>> 
>> Hmm, none of this was happening during tests.
>> 
>> Are there any other circumstances

Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it

2016-09-07 Thread Michal Soltys
On 2016-09-07 02:44, Toshiaki Makita wrote:
> On 2016/09/07 6:59, Michal Soltys wrote:
>> Consider following scenario:
>> 
>> - create vlan aware bridge (say br0)
>> - setup br0's vlans, e.g.
>> 
>> bridge vlan add dev br0 vid 10 self
>> 
>> This will add necessary fdb entries directing appropriate traffic to the
>> bridge itself.
>> 
>> - create appropriate vlan interfaces on top of it, for example:
>> 
>> ip li add link br0 name br0.10 type vlan id 10
>> ip add add 10.0.0.1/8 dev br0.10 
>> 
>> This will add vlan devices on top of br0 and *inherit br0's mac address*.
>> 
>> - now after all of the above is done
>> 
>> ip li set eth0 master br0
>> 
>> This will attach interface eth0 to the bridge. With this being the first
>> interface attached, br0 will take it's mac address as its own. Any
>> further changes to br0's ports may cause the same, with the lowest mac
>> address of some port becoming br0's mac.
>> 
>> This will update fdb entries as well, but all vlan interfaces on top of
>> br0 (e.g. br0.10) will be using old mac address from the time when vlan
>> was created.
>> 
>> The side effect of it is that any traffic addressed to such interface
>> will be flooded to all ports (and br0 itself).
>> 
>> The only workaround I found is to either manually update mac addresses
>> with 'ip' or recreate vlans (bridge fdb refused to update relevant entries).
>> 
>> But if br0's mac changes due to some port changes - shouldn't it be
>> somehow propagated automatically to vlans created on top of it ?
> 
> This should have been addressed at least in kernel 4.7...
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=308453aa9156a3b8ee382c0949befb507a32b0c1
> 
> Which kernel version do you use?
> 

4.7.2
git describe on that commit suggests it's been available since 4.6.x

What I did in details:

ip li add name port1b type veth peer name port1e
ip li add br0 type bridge
ip li set dev br0 type bridge vlan_default_pvid 0
ip li set dev br0 type bridge vlan_filtering 1
bridge vlan add dev br0 vid 10 self
bridge vlan add dev br0 vid 250 untagged pvid self
ip li add link br0 name vlan10 type vlan id 10
ip li set port1b master br0

At this point br0.vlan10 had outdated mac after br0 took port1b's one as
its own.




Re: XPS configuration question (on tg3)

2016-09-07 Thread Michal Soltys
On 2016-09-07 02:19, Eric Dumazet wrote:
> On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote:
>> On 2016-09-06 22:21, Alexander Duyck wrote:
>> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote:
>> >> Hi,
>> >>
>> >> I've been testing different configurations and I didn't manage to get XPS 
>> >> to "behave" correctly - so I'm probably misunderstanding or forgetting 
>> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 
>> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx 
>> >> and rx), 1 was unused (this got me scratching my head a bit) and the 
>> >> remaining one was for the last rx (though due to another bug recently 
>> >> fixed the 4th rx queue was inconfigurable on receive side). The names 
>> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4.
>> >>
>> >> The XPS was configured as:
>> >>
>> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus
>> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus
>> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus
>> >>
>> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue 
>> >> only, 4-7 tx-1 and 8-15 tx-2.
>> >>
>> >> Just in case rx side could get in the way as far as flows go, relevant 
>> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 
>> >> 10 - falling into groups defined by the above masks.
>> >>
>> >> I tested both with mq and multiq scheduler, essentially either this:
>> >>
>> >> qdisc mq 2: root
>> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
>> >> 1 1 1
>> >>
>> >> or this (for the record, skbaction queue_mapping was behaving correctly 
>> >> with the one below):
>> >>
>> >> qdisc multiq 3: root refcnt 6 bands 3/5
>> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
>> >> 1 1 1 1
>> >>
>> >> Now, do I understand correctly, that under the above setup - commands 
>> >> such as
>> >>
>> >> taskset 400 nc -p $prt host_ip 12345 > >> or
>> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10
>> >>
>> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that 
>> >> supports affinity by itself) and sending data to some other host on the 
>> >> net - should *always* use tx-2 queue ?
>> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 
>> >> > >>
>> >> In my case, what queue it used was basically random (on top of that it 
>> >> sometimes changed the used queue mid-transfer) what could be easily 
>> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a 
>> >> bit at loss now, as I though xps configuration should be absolute.
>> >>
>> >> Well, I'd be greatful for some pointers / hints.
>> > 
>> > So it sounds like you have everything configured correctly.  The one
>> > question I would have is if we are certain the CPU pinning is working
>> > for the application.  You might try using something like perf to
>> > verify what is running on CPU 10, and what is running on the CPUs that
>> > the queues are associated with.
>> > 
>> 
>> I did verify with 'top' in this case. I'll double check tommorow just to
>> be sure. Other than testing, there was nothing else running on the machine.
>> 
>> > Also after you have configured things you may want to double check and
>> > verify the xps_cpus value is still set.  I know under some
>> > circumstances the value can be reset by a device driver if the number
>> > of queues changes, or if the interface toggles between being
>> > administratively up/down.
>> 
>> Hmm, none of this was happening during tests.
>> 
>> Are there any other circu

vlan aware bridge doesn't propagate mac changes to vlans on top of it

2016-09-06 Thread Michal Soltys
Consider following scenario:

- create vlan aware bridge (say br0)
- setup br0's vlans, e.g.

bridge vlan add dev br0 vid 10 self

This will add necessary fdb entries directing appropriate traffic to the
bridge itself.

- create appropriate vlan interfaces on top of it, for example:

ip li add link br0 name br0.10 type vlan id 10
ip add add 10.0.0.1/8 dev br0.10 

This will add vlan devices on top of br0 and *inherit br0's mac address*.

- now after all of the above is done

ip li set eth0 master br0

This will attach interface eth0 to the bridge. With this being the first
interface attached, br0 will take it's mac address as its own. Any
further changes to br0's ports may cause the same, with the lowest mac
address of some port becoming br0's mac.

This will update fdb entries as well, but all vlan interfaces on top of
br0 (e.g. br0.10) will be using old mac address from the time when vlan
was created.

The side effect of it is that any traffic addressed to such interface
will be flooded to all ports (and br0 itself).

The only workaround I found is to either manually update mac addresses
with 'ip' or recreate vlans (bridge fdb refused to update relevant entries).

But if br0's mac changes due to some port changes - shouldn't it be
somehow propagated automatically to vlans created on top of it ?



Re: XPS configuration question (on tg3)

2016-09-06 Thread Michal Soltys
On 2016-09-06 22:21, Alexander Duyck wrote:
> On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote:
>> Hi,
>>
>> I've been testing different configurations and I didn't manage to get XPS to 
>> "behave" correctly - so I'm probably misunderstanding or forgetting 
>> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 
>> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx and 
>> rx), 1 was unused (this got me scratching my head a bit) and the remaining 
>> one was for the last rx (though due to another bug recently fixed the 4th rx 
>> queue was inconfigurable on receive side). The names were: eth1b-0, 
>> eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4.
>>
>> The XPS was configured as:
>>
>> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus
>> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus
>> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus
>>
>> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue 
>> only, 4-7 tx-1 and 8-15 tx-2.
>>
>> Just in case rx side could get in the way as far as flows go, relevant irqs 
>> were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 10 - 
>> falling into groups defined by the above masks.
>>
>> I tested both with mq and multiq scheduler, essentially either this:
>>
>> qdisc mq 2: root
>> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>>
>> or this (for the record, skbaction queue_mapping was behaving correctly with 
>> the one below):
>>
>> qdisc multiq 3: root refcnt 6 bands 3/5
>> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
>> 1 1
>>
>> Now, do I understand correctly, that under the above setup - commands such as
>>
>> taskset 400 nc -p $prt host_ip 12345 > or
>> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10
>>
>> ITOW - pinning simple nc command on cpu #10 (or using a tool that supports 
>> affinity by itself) and sending data to some other host on the net - should 
>> *always* use tx-2 queue ?
>> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 
>> >
>> In my case, what queue it used was basically random (on top of that it 
>> sometimes changed the used queue mid-transfer) what could be easily 
>> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a bit 
>> at loss now, as I though xps configuration should be absolute.
>>
>> Well, I'd be greatful for some pointers / hints.
> 
> So it sounds like you have everything configured correctly.  The one
> question I would have is if we are certain the CPU pinning is working
> for the application.  You might try using something like perf to
> verify what is running on CPU 10, and what is running on the CPUs that
> the queues are associated with.
> 

I did verify with 'top' in this case. I'll double check tommorow just to
be sure. Other than testing, there was nothing else running on the machine.

> Also after you have configured things you may want to double check and
> verify the xps_cpus value is still set.  I know under some
> circumstances the value can be reset by a device driver if the number
> of queues changes, or if the interface toggles between being
> administratively up/down.

Hmm, none of this was happening during tests.

Are there any other circumstances where xps settings could be ignored or
changed during the test (that is during the actual transfer, not between
separate attempts) ?

One thing I'm a bit afraid is that kernel was not exactly the newest
(3.16), maybe I'm missing some crucial fixes, though xps was added much
earlier than that. Either way, I'll try to redo tests with current
kernel tommorow.



XPS configuration question (on tg3)

2016-09-06 Thread Michal Soltys
Hi,

I've been testing different configurations and I didn't manage to get XPS to 
"behave" correctly - so I'm probably misunderstanding or forgetting something. 
The nic in question (under tg3 driver - BCM5720 and BCM5719 models) was 
configured to 3 tx and 4 rx queues. 3 irqs were shared (tx and rx), 1 was 
unused (this got me scratching my head a bit) and the remaining one was for the 
last rx (though due to another bug recently fixed the 4th rx queue was 
inconfigurable on receive side). The names were: eth1b-0, eth1b-txrx-1, 
eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4.

The XPS was configured as:

echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus
echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus
echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus

So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue only, 
4-7 tx-1 and 8-15 tx-2.

Just in case rx side could get in the way as far as flows go, relevant irqs 
were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 10 - falling 
into groups defined by the above masks.

I tested both with mx and multiq scheduler, essentially either this:

qdisc mq 2: root
qdisc pfifo_fast 0: parent 2:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent 2:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent 2:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 

or this (for the record, skbaction queue_mapping was behaving correctly with 
the one below):

qdisc multiq 3: root refcnt 6 bands 3/5
qdisc pfifo_fast 31: parent 3:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 32: parent 3:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 33: parent 3:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 
1 

Now, do I understand correctly, that under the above setup - commands such as

taskset 400 nc -p $prt host_ip 12345 

[PATCH net-next v2 1/2] net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug

2016-08-02 Thread Michal Soltys
This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.

The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.

Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).

Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:

1gbit interface
some 100kbit leaf, everything else idle

With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 1 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.

Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)

Now let's explain the subtle bug mentioned earlier.

The issue is related to how offsets are kept and how we calculate
virtual times and update fair service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch:

Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
in HFSC scheduler"

(available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)

Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).

But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:

if (cl->cl_virtual.x == vt) {
cl->cl_virtual.x -= cl->cl_vtoff;
cl->cl_vtoff = 0;
}

This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period.  Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).

Example:

tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit

start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()

At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.

The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 44 
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 3ddc7bd..6329d7d 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -151,11 +151,8 @@ struct hfsc_class {
   (monotonic within a period) */

[PATCH net-next v2 2/2] net/sched/sch_hfsc.c: remove unused cl_myfadj

2016-08-02 Thread Michal Soltys
The code using this variable has been commented out in the past as it
was causing issues in upperlimited link-sharing scenarios.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 6329d7d..000f1d3 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -142,8 +142,6 @@ struct hfsc_class {
   link-sharing, max(myf, cfmin) */
u64 cl_myf; /* my fit-time (calculated from this
   class's own upperlimit curve) */
-   u64 cl_myfadj;  /* my fit-time adjustment (to cancel
-  history dependence) */
u64 cl_cfmin;   /* earliest children's fit-time (used
   with cl_myf to obtain cl_f) */
u64 cl_cvtmin;  /* minimal virtual time among the
@@ -730,7 +728,6 @@ init_vf(struct hfsc_class *cl, unsigned int len)
/* compute myf */
cl->cl_myf = rtsc_y2x(>cl_ulimit,
  cl->cl_total);
-   cl->cl_myfadj = 0;
}
}
 
@@ -797,9 +794,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
 
/* update f */
if (cl->cl_flags & HFSC_USC) {
+   cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total);
+#if 0
cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit,
  cl->cl_total);
-#if 0
/*
 * This code causes classes to stay way under their
 * limit when multiple classes are used at gigabit
@@ -1471,7 +1469,6 @@ hfsc_reset_class(struct hfsc_class *cl)
cl->cl_parentperiod = 0;
cl->cl_f= 0;
cl->cl_myf  = 0;
-   cl->cl_myfadj   = 0;
cl->cl_cfmin= 0;
cl->cl_nactive  = 0;
 
-- 
2.1.3



[PATCH net-next v2 0/2] HFSC patches

2016-08-02 Thread Michal Soltys
Changes since v1:

- in the first patch's commit message, reference old patch using kernel.org's
  historical tree (instead of using gmane)
- in the second patch we just remove variable

Notes regarding patch #1/2:

This patch syncs virtual times with fair service curve and fixes a very old
subtle bug.

The detailed explanation is in the commit message. Additionally
I've made an illustration to help understand the issue better:

http://imgur.com/a/N8uMC

See the example at the bottom of the commit message - Am1_3 and Am2_3 is what
should happen with such queue setup, Am1_3real and Am2_3real is what actually
happens due to rtsc_min() calculating minimum from corrected and uncorrected
curves.

Michal Soltys (2):
  net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old
bug
  net/sched/sch_hfsc.c: remove unused cl_myfadj

 net/sched/sch_hfsc.c | 51 ++-
 1 file changed, 14 insertions(+), 37 deletions(-)

-- 
2.1.3



[PATCH 1/2] net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug

2016-07-31 Thread Michal Soltys
This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.

The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.

Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).

Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:

1gbit interface
some 100kbit leaf, everything else idle

With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 1 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.

Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)

Now let's explain the subtle bug mentioned earlier.

The issue is related to how offsets are kept and how we calculate
virtual times and update fit-time service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch (no history in current linux tree):

"O(1) children vtoff adjustment in HFSC scheduler"
http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281

Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).

But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:

if (cl->cl_virtual.x == vt) {
cl->cl_virtual.x -= cl->cl_vtoff;
cl->cl_vtoff = 0;
}

This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period. Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).

Example:

tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit

start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()

At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.

The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 44 
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 3ddc7bd..6329d7d 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -151,11 +151,8 @@ struct hfsc_class {
   (monotonic within a period) */
   

[PATCH 2/2] net/sched/sch_hfsc.c: comment out cl_myfadj

2016-07-31 Thread Michal Soltys
The part of the code using it is commented out, so the needless use
of this variable can be commented out as well.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 6329d7d..99ddc38 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -142,8 +142,11 @@ struct hfsc_class {
   link-sharing, max(myf, cfmin) */
u64 cl_myf; /* my fit-time (calculated from this
   class's own upperlimit curve) */
+#if 0
+/* code using cl_myfadj is disabled */
u64 cl_myfadj;  /* my fit-time adjustment (to cancel
   history dependence) */
+#endif
u64 cl_cfmin;   /* earliest children's fit-time (used
   with cl_myf to obtain cl_f) */
u64 cl_cvtmin;  /* minimal virtual time among the
@@ -730,7 +733,10 @@ init_vf(struct hfsc_class *cl, unsigned int len)
/* compute myf */
cl->cl_myf = rtsc_y2x(>cl_ulimit,
  cl->cl_total);
+#if 0
+/* code using cl_myfadj is disabled */
cl->cl_myfadj = 0;
+#endif
}
}
 
@@ -797,9 +803,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
 
/* update f */
if (cl->cl_flags & HFSC_USC) {
+   cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total);
+#if 0
cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit,
  cl->cl_total);
-#if 0
/*
 * This code causes classes to stay way under their
 * limit when multiple classes are used at gigabit
@@ -1471,7 +1478,10 @@ hfsc_reset_class(struct hfsc_class *cl)
cl->cl_parentperiod = 0;
cl->cl_f= 0;
cl->cl_myf  = 0;
+#if 0
+/* code using cl_myfadj is disabled */
cl->cl_myfadj   = 0;
+#endif
cl->cl_cfmin= 0;
cl->cl_nactive  = 0;
 
-- 
2.1.3



[PATCH 0/2] HFSC patches

2016-07-31 Thread Michal Soltys
- first patch:

This patch syncs virtual times with fair service curve and fixes a very old
subtle bug.

The detailed explanation is in the commit message. Additionally
I've made an illustration to help understand the issue better:

http://imgur.com/a/N8uMC

See the example at the bottom of the commit message - Am1_3 and Am2_3 is what
should happen with such queue setup, Am1_3real and Am2_3real is what actually
happens due to rtsc_min() calculating minimum from corrected and uncorrected
curves.

- second patch:

This is trivial patch that comments out one unused variable, related to another
commented out piece of code.


Michal Soltys (2):
  net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old
bug
  net/sched/sch_hfsc.c: comment out cl_myfadj

 net/sched/sch_hfsc.c | 56 +---
 1 file changed, 23 insertions(+), 33 deletions(-)

-- 
2.1.3



Re: question: tg3 driver/nics and inconsistent RX ring count

2016-07-26 Thread Michal Soltys
On 2016-07-26 22:06, Alexander Duyck wrote:
> On Tue, Jul 26, 2016 at 12:52 PM, Michal Soltys <sol...@ziu.info> wrote:
>> Hi,
>>
>> I have a few of BCM5720 and BCM5719 kinds sitting in Dell R320 and R520
>> servers - and all of them have certain peculiarity: they claim to have
>> up to 4 TX and RX rings (and this can be set/verified just fine through
>> ethtool -l/-L, with driver defaulting to 4 rings), indirection table
>> (ethtool -x) looks fine as well:
>>
>> RX flow hash indirection table for eth1b with 3 RX ring(s):
>> 0:  0 1 2 3 0 1 2 3
>> 8:  0 1 2 3 0 1 2 3
>> ..
>>
>> But this "3 RX ring(s)" is actually a real limit if I try to adjust
>> anything, for example all those commands would fail:
>>
>> ethtool -X eth1b equal 4
>> ethtool -X eth1b weight 1 1 1 1
>>
>> But would work fine for 3 and less rings. This was quickly tested with
>> different kernel/ethtool combinations, from old 3.16 to 4.6.y with same
>> exact results. Nothing fancier (-N/-U) is supported either.
>>
>> Any hints/comments about the cause of this and/or possible workarounds ?
> 
> Well a quick look at the driver code makes it seem the problem lies in
> tg3_get_rxnfc.  I suspect the bug is related to the following lines:
> 
> /* The first interrupt vector only
>  * handles link interrupts.
>  */
> info->data -= 1;
> 
> I'm not sure what the number of interrupt vectors has to do with the
> number of rings.  Perhaps someone more familiar with the driver can
> point out why you would subtract 1 from tp->rxq_cnt to get the number
> of queues when it seems like it should be tp->rxq_cnt.
> 
> Hope that helps.
> 
> - Alex
> 

Ah thanks, seems to be the case then. Quick git blame shows it's been
since the very introduction of RSS indirection configurability (ca.
2011) in this driver, 90415477bf1356f72acc34063ff52441fc10a754.

I've CCed the author, maybe he will be able to shed some light.


question: tg3 driver/nics and inconsistent RX ring count

2016-07-26 Thread Michal Soltys
Hi,

I have a few of BCM5720 and BCM5719 kinds sitting in Dell R320 and R520
servers - and all of them have certain peculiarity: they claim to have
up to 4 TX and RX rings (and this can be set/verified just fine through
ethtool -l/-L, with driver defaulting to 4 rings), indirection table
(ethtool -x) looks fine as well:

RX flow hash indirection table for eth1b with 3 RX ring(s):
0:  0 1 2 3 0 1 2 3
8:  0 1 2 3 0 1 2 3
..

But this "3 RX ring(s)" is actually a real limit if I try to adjust
anything, for example all those commands would fail:

ethtool -X eth1b equal 4
ethtool -X eth1b weight 1 1 1 1

But would work fine for 3 and less rings. This was quickly tested with
different kernel/ethtool combinations, from old 3.16 to 4.6.y with same
exact results. Nothing fancier (-N/-U) is supported either.

Any hints/comments about the cause of this and/or possible workarounds ?



[iproute PATCH 1/1] man/man8/tc-flow.8: minor corrections

2016-07-23 Thread Michal Soltys
- baseclass: major handle must match that of class's, Y defaults to 1
- flow map example: maps to 1-256, not 1-257

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 man/man8/tc-flow.8 | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/man/man8/tc-flow.8 b/man/man8/tc-flow.8
index f1b7e2a..54f6bf7 100644
--- a/man/man8/tc-flow.8
+++ b/man/man8/tc-flow.8
@@ -73,8 +73,10 @@ An offset for the resulting class ID.
 .I ID
 may be
 .BR root ", " none
-or a hexadecimal class ID in the form [\fIX\fB:\fR]\fIY\fR. If \fIX\fR is
-omitted, it is assumed to be zero.
+or a hexadecimal class ID in the form [\fIX\fB:\fR]\fIY\fR. \fIX\fR must
+match qdisc's/class's major handle (if omitted, the correct value is chosen
+automatically). If the whole \fBbaseclass\fR is omitted, \fIY\fR defaults
+to 1.
 .TP
 .BI divisor " NUM"
 Number of buckets to use for sorting into. Keys are calculated modulo
@@ -239,7 +241,7 @@ tc filter add ... flow hash \\
divisor 1024
 .EE
 .TP
-Map destination IPs of 192.168.0.0/24 to classids 1-257:
+Map destination IPs of 192.168.0.0/24 to classids 1-256:
 
 .EX
 tc filter add ... flow map \\
-- 
2.1.3



Re: [PATCH -next] hfsc: reduce hfsc_sched to 14 cachelines

2016-07-04 Thread Michal Soltys
On 2016-07-04 16:22, Florian Westphal wrote:
> hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
> cachelines by placing level after filter_cnt (covering 4 byte hole) and
> reducing period/nactive/flags to u32 (period is just a counter,
> incremented when class becomes active -- 2**32 is plenty for this
> purpose, also, long is only 32bit wide on 32bit platforms anyway).
> 
> cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
> member is already u32, so no precision is lost there either.
> 

It should be fine, even if it overflowed (which theoretically isn't that
hard: 1500 mtu, 1gbit interface, 900mbit transfer (meaning some process
throttling itself to 900mbit, not hfsc upperlimiting it) => ~16 hours to
overflow or ITOW 75000 period changes/s) - what really matters (in
init_vf()) is if the period is different.

For the record, I have 2 patches that will trim some stuff further.
Unfortunately I have another 2 that will near surely put it back at
[hopefully only] 16 (if they get accepted that is).

But there're some other candidates that might help (some not that tiny
functions defined as inline that are called in more than 1 place). E.g.
update_cfmin() is called from 3 places.


[PATCH 1/1] iproute2: unmangle netdev/my emails in man pages (hfsc, stab)

2016-07-02 Thread Michal Soltys
No other man pages do so, hiding netdev is kind of silly and I don't
mind having my own address normally visible.
---
 man/man7/tc-hfsc.7 | 4 ++--
 man/man8/tc-hfsc.8 | 4 ++--
 man/man8/tc-stab.8 | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/man/man7/tc-hfsc.7 b/man/man7/tc-hfsc.7
index ca04961..5ae5e6b 100644
--- a/man/man7/tc-hfsc.7
+++ b/man/man7/tc-hfsc.7
@@ -555,8 +555,8 @@ Please refer to \fBtc\-stab\fR(8)
 .
 \fBtc\fR(8), \fBtc\-hfsc\fR(8), \fBtc\-stab\fR(8)
 
-Please direct bugreports and patches to: <net...@vger.kernel.org>
+Please direct bugreports and patches to: <netdev@vger.kernel.org>
 .
 .SH "AUTHOR"
 .
-Manpage created by Michal Soltys (sol...@ziu.info)
+Manpage created by Michal Soltys (sol...@ziu.info)
diff --git a/man/man8/tc-hfsc.8 b/man/man8/tc-hfsc.8
index 5444118..fd0df8f 100644
--- a/man/man8/tc-hfsc.8
+++ b/man/man8/tc-hfsc.8
@@ -54,8 +54,8 @@ parameters, you will specify linear service curve.
 .
 \fBtc\fR(8), \fBtc\-hfsc\fR(7), \fBtc\-stab\fR(8)
 
-Please direct bugreports and patches to: <net...@vger.kernel.org>
+Please direct bugreports and patches to: <netdev@vger.kernel.org>
 .
 .SH "AUTHOR"
 .
-Manpage created by Michal Soltys (sol...@ziu.info)
+Manpage created by Michal Soltys (sol...@ziu.info)
diff --git a/man/man8/tc-stab.8 b/man/man8/tc-stab.8
index 02caa7d..03a0659 100644
--- a/man/man8/tc-stab.8
+++ b/man/man8/tc-stab.8
@@ -156,8 +156,8 @@ it's good to use \fBethtool\fR to turn off offloading 
features.
 .br
 \fB[2]\fR http://www.faqs.org/rfcs/rfc2684.html
 
-Please direct bugreports and patches to: <net...@vger.kernel.org>
+Please direct bugreports and patches to: <netdev@vger.kernel.org>
 .
 .SH "AUTHOR"
 .
-Manpage created by Michal Soltys (sol...@ziu.info)
+Manpage created by Michal Soltys (sol...@ziu.info)
-- 
2.1.3



[net-next PATCH 0/5] HFSC patches, part 1

2016-06-29 Thread Michal Soltys
Hi,

It's revised version of part of the patches I submitted really, really long
time ago (back then I asked Patrick to ignore them as I found some issues
shortly after submitting).

Anyway this is the first set with very simple fixes/changes though some of them
relatively subtle (I tried to do very exhaustive commit messages explaining what
and why with those).

The patches are against net-next tree.

The second set will be heavier - or rather with more complex explanations, 
among those I have:

- a fix to subtle issue introduced in
  http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281
  along with simplifying related stuff
- update times to 96 bits (which allows to "just" use 32 bit shifts and
  improves curve definition accuracy at more extreme low/high speeds)
- add curve "merging" instead of just selecting in convex case (computations
  mirror those from concave intersection)

But these are eventually for later.

Michal Soltys (5):
  net/sched/sch_hfsc.c: handle corner cases where head may change invalidating 
calculated deadline
  net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()
  net/sched/sch_hfsc.c: remove leftover dlist and droplist
  net/sched/sch_hfsc.c: go passive after vt update
  net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()

 net/sched/sch_hfsc.c | 54 ++--
 1 file changed, 27 insertions(+), 27 deletions(-)

-- 
2.1.3



[net-next PATCH 5/5] net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()

2016-06-29 Thread Michal Soltys
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 4eef857..dff92ea 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -940,7 +940,7 @@ static void
 hfsc_change_fsc(struct hfsc_class *cl, struct tc_service_curve *fsc)
 {
sc2isc(fsc, >cl_fsc);
-   rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vt, cl->cl_total);
+   rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vtoff + cl->cl_vt, 
cl->cl_total);
cl->cl_flags |= HFSC_FSC;
 }
 
-- 
2.1.3



[net-next PATCH 2/5] net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()

2016-06-29 Thread Michal Soltys
The condition can only succeed on wrong configurations.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 6d6df6b..e2244bb 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -882,7 +882,7 @@ qdisc_peek_len(struct Qdisc *sch)
unsigned int len;
 
skb = sch->ops->peek(sch);
-   if (skb == NULL) {
+   if (unlikely(skb == NULL)) {
qdisc_warn_nonwc("qdisc_peek_len", sch);
return 0;
}
-- 
2.1.3



[net-next PATCH 1/5] net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline

2016-06-29 Thread Michal Soltys
Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).

Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.

But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().

Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 8cb5eff..6d6df6b 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1594,8 +1594,17 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, 
struct sk_buff **to_free)
return err;
}
 
-   if (cl->qdisc->q.qlen == 1)
+   if (cl->qdisc->q.qlen == 1) {
set_active(cl, qdisc_pkt_len(skb));
+   /*
+* If this is the first packet, isolate the head so an eventual
+* head drop before the first dequeue operation has no chance
+* to invalidate the deadline.
+*/
+   if (cl->cl_flags & HFSC_RSC)
+   cl->qdisc->ops->peek(cl->qdisc);
+
+   }
 
qdisc_qstats_backlog_inc(sch, skb);
sch->q.qlen++;
-- 
2.1.3



[net-next PATCH 3/5] net/sched/sch_hfsc.c: remove leftover dlist and droplist

2016-06-29 Thread Michal Soltys
This is update to:
commit a09ceb0e08140a ("sched: remove qdisc->drop")

That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index e2244bb..df07f06 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -130,7 +130,6 @@ struct hfsc_class {
struct rb_node vt_node; /* parent's vt_tree member */
struct rb_root cf_tree; /* active children sorted by cl_f */
struct rb_node cf_node; /* parent's cf_heap member */
-   struct list_head dlist; /* drop list member */
 
u64 cl_total;   /* total work in bytes */
u64 cl_cumul;   /* cumulative work in bytes done by
@@ -177,8 +176,6 @@ struct hfsc_sched {
struct hfsc_class root; /* root class */
struct Qdisc_class_hash clhash; /* class hash */
struct rb_root eligible;/* eligible tree */
-   struct list_head droplist;  /* active leaf class list (for
-  dropping) */
struct qdisc_watchdog watchdog; /* watchdog timer */
 };
 
@@ -858,7 +855,6 @@ set_active(struct hfsc_class *cl, unsigned int len)
if (cl->cl_flags & HFSC_FSC)
init_vf(cl, len);
 
-   list_add_tail(>dlist, >sched->droplist);
 }
 
 static void
@@ -867,8 +863,6 @@ set_passive(struct hfsc_class *cl)
if (cl->cl_flags & HFSC_RSC)
eltree_remove(cl);
 
-   list_del(>dlist);
-
/*
 * vttree is now handled in update_vf() so that update_vf(cl, 0, 0)
 * needs to be called explicitly to remove a class from vttree.
@@ -1443,7 +1437,6 @@ hfsc_init_qdisc(struct Qdisc *sch, struct nlattr *opt)
if (err < 0)
return err;
q->eligible = RB_ROOT;
-   INIT_LIST_HEAD(>droplist);
 
q->root.cl_common.classid = sch->handle;
q->root.refcnt  = 1;
@@ -1527,7 +1520,6 @@ hfsc_reset_qdisc(struct Qdisc *sch)
hfsc_reset_class(cl);
}
q->eligible = RB_ROOT;
-   INIT_LIST_HEAD(>droplist);
qdisc_watchdog_cancel(>watchdog);
sch->qstats.backlog = 0;
sch->q.qlen = 0;
-- 
2.1.3



[net-next PATCH 4/5] net/sched/sch_hfsc.c: go passive after vt update

2016-06-29 Thread Michal Soltys
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.

Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.

One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).

Signed-off-by: Michal Soltys <sol...@ziu.info>
---
 net/sched/sch_hfsc.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index df07f06..4eef857 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -778,6 +778,20 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
else
go_passive = 0;
 
+   /* update vt */
+   cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total)
+   - cl->cl_vtoff + cl->cl_vtadj;
+
+   /*
+* if vt of the class is smaller than cvtmin,
+* the class was skipped in the past due to non-fit.
+* if so, we need to adjust vtadj.
+*/
+   if (cl->cl_vt < cl->cl_parent->cl_cvtmin) {
+   cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt;
+   cl->cl_vt = cl->cl_parent->cl_cvtmin;
+   }
+
if (go_passive) {
/* no more active child, going passive */
 
@@ -794,25 +808,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
continue;
}
 
-   /*
-* update vt and f
-*/
-   cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total)
-   - cl->cl_vtoff + cl->cl_vtadj;
-
-   /*
-* if vt of the class is smaller than cvtmin,
-* the class was skipped in the past due to non-fit.
-* if so, we need to adjust vtadj.
-*/
-   if (cl->cl_vt < cl->cl_parent->cl_cvtmin) {
-   cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt;
-   cl->cl_vt = cl->cl_parent->cl_cvtmin;
-   }
-
/* update the vt tree */
vttree_update(cl);
 
+   /* update f */
if (cl->cl_flags & HFSC_USC) {
cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit,
  cl->cl_total);
-- 
2.1.3



Re: switch / linux STP interoperation issues.

2016-06-24 Thread Michal Soltys

On 2016-06-24 23:09, Elad Raz wrote:

On Fri, Jun 24, 2016 at 10:14 PM, Michal Soltys <sol...@ziu.info> wrote:

Hi,
 
The switch respected BPDUs sent to it (if applicable) - for example it
complied properly if it's priority was less (numerically higher) than
linux's - showing linux box as root bridge, marking one port as root, the
other as alternate/blocking.

The linux box itself was completely deaf to any BPDUs arriving to it (e.g.
if it's priority was lower) and just keept pushing its own data units all
the time with little care (quickly leading to loops in some scenarios).
Whether it was builtin stp implementation, or whether it was mstpd's
stp/rstp/mstp - the behaviour was the same.



Linux is fresh stock archlinux (so vanilla 4.6.2 kernel and the most (or
almost) recent userland utils - iproute2, etc.), running on relatively
recent poweredge dell.

I'm kind of lost at this point - am I missing some basic
options/sysctls/sysfs ? Is there some known incompatibility here somewhere
between switch/linux/nic/versions/etc. ? Some by-default enabled BPDU
filtering maybe ?
 

Any suggestions / hints appreciated.


Please see Ido Schimmel's fix "[net] bridge: Fix incorrect
re-injection of STP packets",
https://patchwork.ozlabs.org/patch/629768/



Thanks !

Will test it on monday asap.



switch / linux STP interoperation issues.

2016-06-24 Thread Michal Soltys

Hi,

In the last week I've been trying to get STP on the linux side (both its 
builtin STP implementation as well as mstpd userspace daemon). Initially 
I started with more complex setups (vlan aware bridge, bonds, mst) and 
gradually (with identical problems on each step) ended with the most 
basic setup that can be summarized by:


brctl addbr br0
brctl addif br0 eno1
brctl addif br0 eno2
brctl stp br0 on
ip li set eno1 up
ip li set eno2 up
ip li set br0 up

The same config on switch's side (cisco 2960-x in its most basic 
incarnation) - in the other words two cables between linux machine and 
the switch, enabled stp, access ports in vlan1.


The end effect of this setup (and any of the more complex previous ones):

The switch respected BPDUs sent to it (if applicable) - for example it 
complied properly if it's priority was less (numerically higher) than 
linux's - showing linux box as root bridge, marking one port as root, 
the other as alternate/blocking.


The linux box itself was completely deaf to any BPDUs arriving to it 
(e.g. if it's priority was lower) and just keept pushing its own data 
units all the time with little care (quickly leading to loops in some 
scenarios). Whether it was builtin stp implementation, or whether it was 
mstpd's stp/rstp/mstp - the behaviour was the same.


With the bridge itself happily claiming to be the root (despite lower 
priority):


br0
 bridge id  a000.000af77cddc4
 designated roota000.000af77cddc4
 root port 0



enp8s0f0 (3)
 port id8003state 
forwarding
 designated roota000.000af77cddc4   path cost 
   4

 designated bridge  a000.000af77cddc4


(and analogous output from mstpctl tool)

tcpdump looked like:

17:33:28.701425 00:0a:f7:7c:dd:c4 > 01:80:c2:00:00:00, 802.3, length 52: 
LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03
: STP 802.1d, Config, Flags [none], bridge-id 
a000.00:0a:f7:7c:dd:c4.8003, length 35
message-age 0.00s, max-age 20.00s, hello-time 2.00s, 
forwarding-delay 15.00s

root-id a000.00:0a:f7:7c:dd:c4, root-pathcost 0
17:33:29.026185 18:8b:45:6f:38:86 > 01:80:c2:00:00:00, 802.3, length 60: 
LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03
: STP 802.1d, Config, Flags [none], bridge-id 
2001.18:8b:45:6f:38:80.8006, length 43
message-age 0.00s, max-age 20.00s, hello-time 2.00s, 
forwarding-delay 15.00s

root-id 2001.18:8b:45:6f:38:80, root-pathcost 0

The first sent by linux box, the second by the switch (the above from 
basic stp scenario on both sides).



The cards in question used:
Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit 
Ethernet PCIe (rev 01) handled by:


driver: tg3
version: 3.137
firmware-version: FFV7.10.17 bc 5719-v1.37


Linux is fresh stock archlinux (so vanilla 4.6.2 kernel and the most (or 
almost) recent userland utils - iproute2, etc.), running on relatively 
recent poweredge dell.


I'm kind of lost at this point - am I missing some basic 
options/sysctls/sysfs ? Is there some known incompatibility here 
somewhere between switch/linux/nic/versions/etc. ? Some by-default 
enabled BPDU filtering maybe ?



Any suggestions / hints appreciated.