Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.
On 2018-07-19 18:20, Michal Soltys wrote: > On 07/19/2018 01:41 AM, Mahesh Bandewar wrote: >> From: Mahesh Bandewar >> >> Commit b89f04c61efe ("bonding: deliver link-local packets with >> skb->dev set to link that packets arrived on") changed the behavior >> of how link-local-multicast packets are processed. The change in >> the behavior broke some legacy use cases where these packets are >> expected to arrive on bonding master device also. >> >> This patch passes the packet to the stack with the link it arrived >> on as well as passes to the bonding-master device to preserve the >> legacy use case. >> >> Fixes: b89f04c61efe ("bonding: deliver link-local packets with >> skb->dev set to link that packets arrived on") >> Reported-by: Michal Soltys >> Signed-off-by: Mahesh Bandewar >> --- >> v2: Added Fixes tag. >> v1: Initial patch. >> drivers/net/bonding/bond_main.c | 17 +++-- >> 1 file changed, 15 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index 9a2ea3c1f949..1d3b7d8448f2 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -1177,9 +1177,22 @@ static rx_handler_result_t >> bond_handle_frame(struct sk_buff **pskb) >> } >> } >> - /* don't change skb->dev for link-local packets */ >> - if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) >> + /* Link-local multicast packets should be passed to the >> + * stack on the link they arrive as well as pass them to the >> + * bond-master device. These packets are mostly usable when >> + * stack receives it with the link on which they arrive >> + * (e.g. LLDP) but there may be some legacy behavior that >> + * expects these packets to appear on bonding master too. > > I'd really change the comment from: > > "These packets are mostly usable when stack receives it with the link on > which they arrive (e.g. LLDP) but there may be some legacy behavior that > expects these packets to appear on bonding master too." > > to something like: > > "These packets are mostly usable when stack receives it with the link on > which they arrive, but they also must be available on aggregations. Some > of the use cases include (but are not limited to): LLDP agents that must > be able to operate both on enslaved interfaces as well as on bonds > themselves; linux bridges that must be able to process/pass BPDUs from > attached bonds when any kind of stp version is enabled on the network." > > It's a bit longer, but clarifies the reasons more precisely (without > going too deep into features like group_fwd_mask). > Anyway, any chance for that patch to get merged ? It would be great to get the correct functionality back asap. As for the comment, I'll submit a trivial patch expanding/clarifying it later (or I can resubmit adjusted v3 if it's ok with Mahesh).
Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.
On 07/19/2018 01:41 AM, Mahesh Bandewar wrote: From: Mahesh Bandewar Commit b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on") changed the behavior of how link-local-multicast packets are processed. The change in the behavior broke some legacy use cases where these packets are expected to arrive on bonding master device also. This patch passes the packet to the stack with the link it arrived on as well as passes to the bonding-master device to preserve the legacy use case. Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on") Reported-by: Michal Soltys Signed-off-by: Mahesh Bandewar --- v2: Added Fixes tag. v1: Initial patch. drivers/net/bonding/bond_main.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 9a2ea3c1f949..1d3b7d8448f2 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb) } } - /* don't change skb->dev for link-local packets */ - if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) + /* Link-local multicast packets should be passed to the +* stack on the link they arrive as well as pass them to the +* bond-master device. These packets are mostly usable when +* stack receives it with the link on which they arrive +* (e.g. LLDP) but there may be some legacy behavior that +* expects these packets to appear on bonding master too. I'd really change the comment from: "These packets are mostly usable when stack receives it with the link on which they arrive (e.g. LLDP) but there may be some legacy behavior that expects these packets to appear on bonding master too." to something like: "These packets are mostly usable when stack receives it with the link on which they arrive, but they also must be available on aggregations. Some of the use cases include (but are not limited to): LLDP agents that must be able to operate both on enslaved interfaces as well as on bonds themselves; linux bridges that must be able to process/pass BPDUs from attached bonds when any kind of stp version is enabled on the network." It's a bit longer, but clarifies the reasons more precisely (without going too deep into features like group_fwd_mask).
Re: [PATCH next] bonding: pass link-local packets to bonding master also.
On 2018-07-17 11:32, Michal Soltys wrote: > On 07/17/2018 01:53 AM, Mahesh Bandewar (महेश बंडेवार) wrote: >> On Mon, Jul 16, 2018 at 2:24 PM, Jay Vosburgh >> wrote: >>> Mahesh Bandewar wrote: >>> >>>> From: Mahesh Bandewar >>>> >>>> Commit b89f04c61efe ("bonding: deliver link-local packets with >>>> skb->dev set to link that packets arrived on") changed the behavior >>>> of how link-local-multicast packets are processed. The change in >>>> the behavior broke some legacy use cases where these packets are >>>> expected to arrive on bonding master device also. >>>> >>>> This patch passes the packet to the stack with the link it arrived >>>> on as well as passes to the bonding-master device to preserve the >>>> legacy use case. >>> >>> Michal, can you test this? I'm travelling this week and won't >>> be able to run the patch. > > > Yes, will test today and report. > The patch looks to be working fine - tested both passive bridge (stp_state == 0) and with in-kernel implementation active (stp_state == 1). No loops, no issues so far.
Re: [PATCH next] bonding: pass link-local packets to bonding master also.
On 07/17/2018 01:57 AM, Mahesh Bandewar (महेश बंडेवार) wrote: On Mon, Jul 16, 2018 at 4:33 PM, Stephen Hemminger wrote: On Sun, 15 Jul 2018 18:12:46 -0700 Mahesh Bandewar wrote: From: Mahesh Bandewar Commit b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on") changed the behavior of how link-local-multicast packets are processed. The change in the behavior broke some legacy use cases where these packets are expected to arrive on bonding master device also. This patch passes the packet to the stack with the link it arrived on as well as passes to the bonding-master device to preserve the legacy use case. Reported-by: Michal Soltys Signed-off-by: Mahesh Bandewar Thanks for fixing this. Why not add a Fixes: tag instead of just talking about the commit? That helps the stable maintainers know which versions of the kernel need the patch. Well, I thought about it. It's definitely 'related' but not sure it 'fixes' in true sense. It definitely fixes the broken legacy case though. Is that sufficient to add 'fixes' tag? It's __not__ broken legacy case. It's normal behavior, starting with specification covering LLDP itself (IEEE Std 802.1AB-2016, page 18, '6.8 LLDP and Link Aggregation') and ending with a linux bridge actively doing stp via in-kernel implementation or with userspace helper (or inactively passing) and being blind to bpdus. Not mentioning a very recent kernel feature like per-port group_fwd_mask rendered useless in this case. Unless you also consider attaching a bond to a linux bridge as a broken legacy use case. Among other things mentioned in the other thread. In this context, the comment in code/log message IMHO (of the attached patch), should be changed - as it will be just confusing for anyone reading it in the future. (and I'd very much like the fix to hit relevant stable kernels as well)
Re: [PATCH next] bonding: pass link-local packets to bonding master also.
On 07/17/2018 01:53 AM, Mahesh Bandewar (महेश बंडेवार) wrote: On Mon, Jul 16, 2018 at 2:24 PM, Jay Vosburgh wrote: Mahesh Bandewar wrote: From: Mahesh Bandewar Commit b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on") changed the behavior of how link-local-multicast packets are processed. The change in the behavior broke some legacy use cases where these packets are expected to arrive on bonding master device also. This patch passes the packet to the stack with the link it arrived on as well as passes to the bonding-master device to preserve the legacy use case. Michal, can you test this? I'm travelling this week and won't be able to run the patch. Yes, will test today and report.
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On 07/13/2018 02:15 AM, Mahesh Bandewar (महेश बंडेवार) wrote: On Thu, Jul 12, 2018 at 4:14 PM, Michal Soltys wrote: On 2018-07-13 00:03, Jay Vosburgh wrote: Mahesh Bandewar (महेश बंडेवार) wrote: On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh wrote: Michal Soltys wrote: On 07/12/2018 04:51 PM, Jay Vosburgh wrote: Mahesh Bandewar (महेश बंडेवार) wrote: On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: Hi, As weird as that sounds, this is what I observed today after bumping kernel version. I have a setup where 2 bonds are attached to linux bridge and physically are connected to two switches doing MSTP (and linux bridge is just passing them). Initially I suspected some changes related to bridge code - but quick peek at the code showed nothing suspicious - and the part of it that explicitly passes stp frames if stp is not enabled has seen little changes (e.g. per-port group_fwd_mask added recently). Furthermore - if regular non-bonded interfaces are attached everything works fine. Just to be sure I detached the bond (802.3ad mode) and checked it with simple tcpdump (ether proto \\stp) - and indeed no hello packets were there (with them being present just fine on active enslaved interface, or on the bond device in earlier kernels). If time permits I'll bisect tommorow to pinpoint the commit, but from quick todays test - 4.9.x is working fine, while 4.16.16 (tested on debian) and 4.17.3 (tested on archlinux) are failing. Unless this is already a known issue (or you have any suggestions what could be responsible). I believe these are link-local-multicast messages and sometime back a change went into to not pass those frames to the bonding master. This could be the side effect of that. Mahesh, I suspect you're thinking of: commit b89f04c61efe3b7756434d693b9203cc0cce002e Author: Chonggang Li Date: Sun Apr 16 12:02:18 2017 -0700 bonding: deliver link-local packets with skb->dev set to link that packets arrived on Michal, are you able to revert this patch and test? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com Just tested - yes, reverting that patch solves the issues. Chonggang, Reading the changelog in your commit referenced above, I'm not entirely sure what actual problem it is fixing. Could you elaborate? As the patch appears to cause a regression, it needs to be either fixed or reverted. Mahesh, you signed-off on it as well, perhaps you also have some context? I think the original idea behind it was to pass the LLDPDUs to the stack on the interface that they came on since this is considered to be link-local traffic and passing to bond-master would loose it's "linklocal-ness". This is true for LLDP and if you change the skb->dev of the packet, then you don't know which slave link it came on in (from LLDP consumer's perspective). I don't know much about STP but trunking two links and aggregating this link info through bond-master seems wrong. Just like LLDP, you are losing info specific to a link and the decision derived from that info could be wrong. Having said that, we determine "linklocal-ness" by looking at L2 and bondmaster shares this with lts slaves. So it does seem fair to pass those frames to the bonding-master but at the same time link-local traffic is supposed to be limited to the physical link (LLDP/STP/LACP etc). Your thoughts? I agree the whole thing sounds kind of weird, but I'm curious as to what Michal's actual use case is; he presumably has some practical use for this, since he noticed that the behavior changed. The whole "link-local" term is a bit I don't know - at this point it feels like too many things were thrown into single bag and it got somewhat confusing (bpdu, lldp, pause frames, lacp, pae, qinq mulitcast that afaik has its own address) - I added some examples in another reply I did at the same time as you were typing this one =) Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how does that combination work rationally given that the bond might send and receive traffic across multiple slaves? Or does the switch side bundle the ports together into a single logical interface for MSTP purposes? On the TX side, I think the bond will likely balance all STP frames to just one slave. The basic concept - two "main" switches with "important" machines connected to those. One switch dies and everything keeps working. With no unused ports and so on. In more details: Originally I was trying MSTP daemon (on "important" machines) which seems quite well and completely coded, but cannot really work correctly - as afaik you can't put port (in linux bridge conext) in different forwarding/blocking/etc. state per-region - itow per group of vlans (or mstpd didn't know how to do that, or it wasn't implemented - I didn't look too deep back then, though my interest resurfaced in recent da
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On 2018-07-13 00:03, Jay Vosburgh wrote: > Mahesh Bandewar (महेश बंडेवार) wrote: > >>On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh >> wrote: >>> Michal Soltys wrote: >>> >>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote: >>>>> Mahesh Bandewar (महेश बंडेवार) wrote: >>>>> >>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> As weird as that sounds, this is what I observed today after bumping >>>>>>> kernel version. I have a setup where 2 bonds are attached to linux >>>>>>> bridge and physically are connected to two switches doing MSTP (and >>>>>>> linux bridge is just passing them). >>>>>>> >>>>>>> Initially I suspected some changes related to bridge code - but quick >>>>>>> peek at the code showed nothing suspicious - and the part of it that >>>>>>> explicitly passes stp frames if stp is not enabled has seen little >>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if >>>>>>> regular non-bonded interfaces are attached everything works fine. >>>>>>> >>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with >>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were >>>>>>> there (with them being present just fine on active enslaved interface, >>>>>>> or on the bond device in earlier kernels). >>>>>>> >>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from >>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on >>>>>>> debian) and 4.17.3 (tested on archlinux) are failing. >>>>>>> >>>>>>> Unless this is already a known issue (or you have any suggestions what >>>>>>> could be responsible). >>>>>>> >>>>>> I believe these are link-local-multicast messages and sometime back a >>>>>> change went into to not pass those frames to the bonding master. This >>>>>> could be the side effect of that. >>>>> >>>>> Mahesh, I suspect you're thinking of: >>>>> >>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e >>>>> Author: Chonggang Li >>>>> Date: Sun Apr 16 12:02:18 2017 -0700 >>>>> >>>>> bonding: deliver link-local packets with skb->dev set to link that >>>>> packets arrived on >>>>> >>>>> Michal, are you able to revert this patch and test? >>>>> >>>>> -J >>>>> >>>>> --- >>>>> -Jay Vosburgh, jay.vosbu...@canonical.com >>>>> >>>> >>>> >>>>Just tested - yes, reverting that patch solves the issues. >>> >>> Chonggang, >>> >>> Reading the changelog in your commit referenced above, I'm not >>> entirely sure what actual problem it is fixing. Could you elaborate? >>> >>> As the patch appears to cause a regression, it needs to be >>> either fixed or reverted. >>> >>> Mahesh, you signed-off on it as well, perhaps you also have some >>> context? >>> >> >>I think the original idea behind it was to pass the LLDPDUs to the >>stack on the interface that they came on since this is considered to >>be link-local traffic and passing to bond-master would loose it's >>"linklocal-ness". This is true for LLDP and if you change the skb->dev >>of the packet, then you don't know which slave link it came on in >>(from LLDP consumer's perspective). >> >>I don't know much about STP but trunking two links and aggregating >>this link info through bond-master seems wrong. Just like LLDP, you >>are losing info specific to a link and the decision derived from that >>info could be wrong. >> >>Having said that, we determine "linklocal-ness" by looking at L2 and >>bondmaster shares this with lts slaves. So it does seem fair to pass >>those frames to the bonding-master but at the same time link-local >>traffic is supposed to be limited to the physical link (LLDP/STP/LACP >>etc). Your thoughts? > > I agree the whole thing sounds
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On 2018-07-12 23:26, Mahesh Bandewar (महेश बंडेवार) wrote: > On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh > wrote: >> Michal Soltys wrote: >> >>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote: >>>> Mahesh Bandewar (महेश बंडेवार) wrote: >>>> >>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> As weird as that sounds, this is what I observed today after bumping >>>>>> kernel version. I have a setup where 2 bonds are attached to linux >>>>>> bridge and physically are connected to two switches doing MSTP (and >>>>>> linux bridge is just passing them). >>>>>> >>>>>> Initially I suspected some changes related to bridge code - but quick >>>>>> peek at the code showed nothing suspicious - and the part of it that >>>>>> explicitly passes stp frames if stp is not enabled has seen little >>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if >>>>>> regular non-bonded interfaces are attached everything works fine. >>>>>> >>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with >>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were >>>>>> there (with them being present just fine on active enslaved interface, >>>>>> or on the bond device in earlier kernels). >>>>>> >>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from >>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on >>>>>> debian) and 4.17.3 (tested on archlinux) are failing. >>>>>> >>>>>> Unless this is already a known issue (or you have any suggestions what >>>>>> could be responsible). >>>>>> >>>>> I believe these are link-local-multicast messages and sometime back a >>>>> change went into to not pass those frames to the bonding master. This >>>>> could be the side effect of that. >>>> >>>> Mahesh, I suspect you're thinking of: >>>> >>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e >>>> Author: Chonggang Li >>>> Date: Sun Apr 16 12:02:18 2017 -0700 >>>> >>>> bonding: deliver link-local packets with skb->dev set to link that >>>> packets arrived on >>>> >>>> Michal, are you able to revert this patch and test? >>>> >>>> -J >>>> >>>> --- >>>> -Jay Vosburgh, jay.vosbu...@canonical.com >>>> >>> >>> >>>Just tested - yes, reverting that patch solves the issues. >> >> Chonggang, >> >> Reading the changelog in your commit referenced above, I'm not >> entirely sure what actual problem it is fixing. Could you elaborate? >> >> As the patch appears to cause a regression, it needs to be >> either fixed or reverted. >> >> Mahesh, you signed-off on it as well, perhaps you also have some >> context? >> > > I think the original idea behind it was to pass the LLDPDUs to the > stack on the interface that they came on since this is considered to > be link-local traffic and passing to bond-master would loose it's > "linklocal-ness". This is true for LLDP and if you change the skb->dev > of the packet, then you don't know which slave link it came on in > (from LLDP consumer's perspective). > > I don't know much about STP but trunking two links and aggregating > this link info through bond-master seems wrong. Just like LLDP, you > are losing info specific to a link and the decision derived from that > info could be wrong. > > Having said that, we determine "linklocal-ness" by looking at L2 and > bondmaster shares this with lts slaves. So it does seem fair to pass > those frames to the bonding-master but at the same time link-local > traffic is supposed to be limited to the physical link (LLDP/STP/LACP > etc). Your thoughts? > But, isn't bond de-facto considered the "physical link" ? Not directly of course, but say an LLDP daemon would likely be more interested in getting LLDP data from a bond device (or a bridge device, if the bond is attached to one), than from its enslaved interfaces (and enslaved interfaces can be changed, not mentioning potentially complex setup itself, even if usually it's just lacp
Re: [BUG] bonded interfaces drop bpdu (stp) frames
On 07/12/2018 04:51 PM, Jay Vosburgh wrote: Mahesh Bandewar (महेश बंडेवार) wrote: On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: Hi, As weird as that sounds, this is what I observed today after bumping kernel version. I have a setup where 2 bonds are attached to linux bridge and physically are connected to two switches doing MSTP (and linux bridge is just passing them). Initially I suspected some changes related to bridge code - but quick peek at the code showed nothing suspicious - and the part of it that explicitly passes stp frames if stp is not enabled has seen little changes (e.g. per-port group_fwd_mask added recently). Furthermore - if regular non-bonded interfaces are attached everything works fine. Just to be sure I detached the bond (802.3ad mode) and checked it with simple tcpdump (ether proto \\stp) - and indeed no hello packets were there (with them being present just fine on active enslaved interface, or on the bond device in earlier kernels). If time permits I'll bisect tommorow to pinpoint the commit, but from quick todays test - 4.9.x is working fine, while 4.16.16 (tested on debian) and 4.17.3 (tested on archlinux) are failing. Unless this is already a known issue (or you have any suggestions what could be responsible). I believe these are link-local-multicast messages and sometime back a change went into to not pass those frames to the bonding master. This could be the side effect of that. Mahesh, I suspect you're thinking of: commit b89f04c61efe3b7756434d693b9203cc0cce002e Author: Chonggang Li Date: Sun Apr 16 12:02:18 2017 -0700 bonding: deliver link-local packets with skb->dev set to link that packets arrived on Michal, are you able to revert this patch and test? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com Just tested - yes, reverting that patch solves the issues.
[BUG] bonded interfaces drop bpdu (stp) frames
Hi, As weird as that sounds, this is what I observed today after bumping kernel version. I have a setup where 2 bonds are attached to linux bridge and physically are connected to two switches doing MSTP (and linux bridge is just passing them). Initially I suspected some changes related to bridge code - but quick peek at the code showed nothing suspicious - and the part of it that explicitly passes stp frames if stp is not enabled has seen little changes (e.g. per-port group_fwd_mask added recently). Furthermore - if regular non-bonded interfaces are attached everything works fine. Just to be sure I detached the bond (802.3ad mode) and checked it with simple tcpdump (ether proto \\stp) - and indeed no hello packets were there (with them being present just fine on active enslaved interface, or on the bond device in earlier kernels). If time permits I'll bisect tommorow to pinpoint the commit, but from quick todays test - 4.9.x is working fine, while 4.16.16 (tested on debian) and 4.17.3 (tested on archlinux) are failing. Unless this is already a known issue (or you have any suggestions what could be responsible).
Re: drr scheduler [mis]configuration question
> > At this point I'm a bit lost what I'm doing wrong. > Just in case (erhm... for sake of completness and archiving) to answer my own borderline silly question - handles and classids are in hex.
drr scheduler [mis]configuration question
Hi, I have hit some weird (probably missing some detail) issue with drr. Originally it was tested between two machines, then I quickly double checked between namespaces (same behaviour) - the configuration follows: # setup namespace ip netns add drrtest ip li add name left type veth peer name right ip li set right netns drrtest # setup 'left' interface with drr tc qdisc add dev left handle 1:0 root drr for x in {1..16}; do tc class add dev left classid 1:$x parent 1:0 drr; done for x in {1..16}; do tc qdisc add dev left handle $((256+x)):0 parent 1:$x pfifo_fast; done tc filter add dev left proto all pref 1 parent 1:0 handle 1 flow map key dst divisor 16 baseclass 1:1 ip add add 10.15.0.255/16 dev left ip li set left up The above creates simple drr setup with 16 classes, counted from 1:1 to 1:16 The flow filter should simply distribute it across those classes, relevant piece of code would suggest everything is ok: if (f->divisor) classid %= f->divisor; res->class = 0; res->classid = TC_H_MAKE(f->baseclass, f->baseclass + classid); So 'classid' should 0-15 and after TC_H_MAKE() 1:1-1:16 # setup 'right' interface ip netns exec drrtest bash ip add add 10.15.1.249/16 dev right ip li set right up But with this setup, any transfer from 'left' to 'right' is blackholed. If I change ip address of 'right' to 10.15.1.248/16, everything works again - which would suggest some issue with proper classification. At this point I'm a bit lost what I'm doing wrong.
Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it
On 2016-09-08 04:03, Toshiaki Makita wrote: > On 2016/09/08 3:22, Michal Soltys wrote: > ... >> 4.7.2 >> git describe on that commit suggests it's been available since 4.6.x >> >> What I did in details: >> >> ip li add name port1b type veth peer name port1e >> ip li add br0 type bridge >> ip li set dev br0 type bridge vlan_default_pvid 0 >> ip li set dev br0 type bridge vlan_filtering 1 >> bridge vlan add dev br0 vid 10 self >> bridge vlan add dev br0 vid 250 untagged pvid self >> ip li add link br0 name vlan10 type vlan id 10 >> ip li set port1b master br0 >> >> At this point br0.vlan10 had outdated mac after br0 took port1b's one as >> its own. > > If the mac address of lower device is changed while vlan device is down, > the address will be synchronized when vlan device becomes up. > Please try "ip li set vlan10 up". > Gah, yea upping interface updates it correctly. Originally I found this issue on older kernel, so I quickly verified the behaviour on the current one - but didn't do ip up. Sorry for the noise.
Re: XPS configuration question (on tg3)
On 2016-09-07 02:19, Eric Dumazet wrote: > On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote: >> On 2016-09-06 22:21, Alexander Duyck wrote: >> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote: >> >> Hi, >> >> >> >> I've been testing different configurations and I didn't manage to get XPS >> >> to "behave" correctly - so I'm probably misunderstanding or forgetting >> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 >> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx >> >> and rx), 1 was unused (this got me scratching my head a bit) and the >> >> remaining one was for the last rx (though due to another bug recently >> >> fixed the 4th rx queue was inconfigurable on receive side). The names >> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4. >> >> >> >> The XPS was configured as: >> >> >> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus >> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus >> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus >> >> >> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue >> >> only, 4-7 tx-1 and 8-15 tx-2. >> >> >> >> Just in case rx side could get in the way as far as flows go, relevant >> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to >> >> 10 - falling into groups defined by the above masks. >> >> >> >> I tested both with mq and multiq scheduler, essentially either this: >> >> >> >> qdisc mq 2: root >> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> >> >> or this (for the record, skbaction queue_mapping was behaving correctly >> >> with the one below): >> >> >> >> qdisc multiq 3: root refcnt 6 bands 3/5 >> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> >> >> Now, do I understand correctly, that under the above setup - commands >> >> such as >> >> >> >> taskset 400 nc -p $prt host_ip 12345 > >> or >> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10 >> >> >> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that >> >> supports affinity by itself) and sending data to some other host on the >> >> net - should *always* use tx-2 queue ? >> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 >> >> > >> >> >> In my case, what queue it used was basically random (on top of that it >> >> sometimes changed the used queue mid-transfer) what could be easily >> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a >> >> bit at loss now, as I though xps configuration should be absolute. >> >> >> >> Well, I'd be greatful for some pointers / hints. >> > >> > So it sounds like you have everything configured correctly. The one >> > question I would have is if we are certain the CPU pinning is working >> > for the application. You might try using something like perf to >> > verify what is running on CPU 10, and what is running on the CPUs that >> > the queues are associated with. >> > >> >> I did verify with 'top' in this case. I'll double check tommorow just to >> be sure. Other than testing, there was nothing else running on the machine. >> >> > Also after you have configured things you may want to double check and >> > verify the xps_cpus value is still set. I know under some >> > circumstances the value can be reset by a device driver if the number >> > of queues changes, or if the interface toggles between being >> > administratively up/down. >> >> Hmm, none of this was happening during tests. >> >> Are there any other circumstances
Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it
On 2016-09-07 02:44, Toshiaki Makita wrote: > On 2016/09/07 6:59, Michal Soltys wrote: >> Consider following scenario: >> >> - create vlan aware bridge (say br0) >> - setup br0's vlans, e.g. >> >> bridge vlan add dev br0 vid 10 self >> >> This will add necessary fdb entries directing appropriate traffic to the >> bridge itself. >> >> - create appropriate vlan interfaces on top of it, for example: >> >> ip li add link br0 name br0.10 type vlan id 10 >> ip add add 10.0.0.1/8 dev br0.10 >> >> This will add vlan devices on top of br0 and *inherit br0's mac address*. >> >> - now after all of the above is done >> >> ip li set eth0 master br0 >> >> This will attach interface eth0 to the bridge. With this being the first >> interface attached, br0 will take it's mac address as its own. Any >> further changes to br0's ports may cause the same, with the lowest mac >> address of some port becoming br0's mac. >> >> This will update fdb entries as well, but all vlan interfaces on top of >> br0 (e.g. br0.10) will be using old mac address from the time when vlan >> was created. >> >> The side effect of it is that any traffic addressed to such interface >> will be flooded to all ports (and br0 itself). >> >> The only workaround I found is to either manually update mac addresses >> with 'ip' or recreate vlans (bridge fdb refused to update relevant entries). >> >> But if br0's mac changes due to some port changes - shouldn't it be >> somehow propagated automatically to vlans created on top of it ? > > This should have been addressed at least in kernel 4.7... > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=308453aa9156a3b8ee382c0949befb507a32b0c1 > > Which kernel version do you use? > 4.7.2 git describe on that commit suggests it's been available since 4.6.x What I did in details: ip li add name port1b type veth peer name port1e ip li add br0 type bridge ip li set dev br0 type bridge vlan_default_pvid 0 ip li set dev br0 type bridge vlan_filtering 1 bridge vlan add dev br0 vid 10 self bridge vlan add dev br0 vid 250 untagged pvid self ip li add link br0 name vlan10 type vlan id 10 ip li set port1b master br0 At this point br0.vlan10 had outdated mac after br0 took port1b's one as its own.
Re: XPS configuration question (on tg3)
On 2016-09-07 02:19, Eric Dumazet wrote: > On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote: >> On 2016-09-06 22:21, Alexander Duyck wrote: >> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote: >> >> Hi, >> >> >> >> I've been testing different configurations and I didn't manage to get XPS >> >> to "behave" correctly - so I'm probably misunderstanding or forgetting >> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 >> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx >> >> and rx), 1 was unused (this got me scratching my head a bit) and the >> >> remaining one was for the last rx (though due to another bug recently >> >> fixed the 4th rx queue was inconfigurable on receive side). The names >> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4. >> >> >> >> The XPS was configured as: >> >> >> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus >> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus >> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus >> >> >> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue >> >> only, 4-7 tx-1 and 8-15 tx-2. >> >> >> >> Just in case rx side could get in the way as far as flows go, relevant >> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to >> >> 10 - falling into groups defined by the above masks. >> >> >> >> I tested both with mq and multiq scheduler, essentially either this: >> >> >> >> qdisc mq 2: root >> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> >> >> or this (for the record, skbaction queue_mapping was behaving correctly >> >> with the one below): >> >> >> >> qdisc multiq 3: root refcnt 6 bands 3/5 >> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> >> >> Now, do I understand correctly, that under the above setup - commands >> >> such as >> >> >> >> taskset 400 nc -p $prt host_ip 12345 > >> or >> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10 >> >> >> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that >> >> supports affinity by itself) and sending data to some other host on the >> >> net - should *always* use tx-2 queue ? >> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 >> >> > >> >> >> In my case, what queue it used was basically random (on top of that it >> >> sometimes changed the used queue mid-transfer) what could be easily >> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a >> >> bit at loss now, as I though xps configuration should be absolute. >> >> >> >> Well, I'd be greatful for some pointers / hints. >> > >> > So it sounds like you have everything configured correctly. The one >> > question I would have is if we are certain the CPU pinning is working >> > for the application. You might try using something like perf to >> > verify what is running on CPU 10, and what is running on the CPUs that >> > the queues are associated with. >> > >> >> I did verify with 'top' in this case. I'll double check tommorow just to >> be sure. Other than testing, there was nothing else running on the machine. >> >> > Also after you have configured things you may want to double check and >> > verify the xps_cpus value is still set. I know under some >> > circumstances the value can be reset by a device driver if the number >> > of queues changes, or if the interface toggles between being >> > administratively up/down. >> >> Hmm, none of this was happening during tests. >> >> Are there any other circu
vlan aware bridge doesn't propagate mac changes to vlans on top of it
Consider following scenario: - create vlan aware bridge (say br0) - setup br0's vlans, e.g. bridge vlan add dev br0 vid 10 self This will add necessary fdb entries directing appropriate traffic to the bridge itself. - create appropriate vlan interfaces on top of it, for example: ip li add link br0 name br0.10 type vlan id 10 ip add add 10.0.0.1/8 dev br0.10 This will add vlan devices on top of br0 and *inherit br0's mac address*. - now after all of the above is done ip li set eth0 master br0 This will attach interface eth0 to the bridge. With this being the first interface attached, br0 will take it's mac address as its own. Any further changes to br0's ports may cause the same, with the lowest mac address of some port becoming br0's mac. This will update fdb entries as well, but all vlan interfaces on top of br0 (e.g. br0.10) will be using old mac address from the time when vlan was created. The side effect of it is that any traffic addressed to such interface will be flooded to all ports (and br0 itself). The only workaround I found is to either manually update mac addresses with 'ip' or recreate vlans (bridge fdb refused to update relevant entries). But if br0's mac changes due to some port changes - shouldn't it be somehow propagated automatically to vlans created on top of it ?
Re: XPS configuration question (on tg3)
On 2016-09-06 22:21, Alexander Duyck wrote: > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltys <sol...@ziu.info> wrote: >> Hi, >> >> I've been testing different configurations and I didn't manage to get XPS to >> "behave" correctly - so I'm probably misunderstanding or forgetting >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx and >> rx), 1 was unused (this got me scratching my head a bit) and the remaining >> one was for the last rx (though due to another bug recently fixed the 4th rx >> queue was inconfigurable on receive side). The names were: eth1b-0, >> eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4. >> >> The XPS was configured as: >> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus >> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue >> only, 4-7 tx-1 and 8-15 tx-2. >> >> Just in case rx side could get in the way as far as flows go, relevant irqs >> were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 10 - >> falling into groups defined by the above masks. >> >> I tested both with mq and multiq scheduler, essentially either this: >> >> qdisc mq 2: root >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> >> or this (for the record, skbaction queue_mapping was behaving correctly with >> the one below): >> >> qdisc multiq 3: root refcnt 6 bands 3/5 >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 >> 1 1 >> >> Now, do I understand correctly, that under the above setup - commands such as >> >> taskset 400 nc -p $prt host_ip 12345 > or >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10 >> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that supports >> affinity by itself) and sending data to some other host on the net - should >> *always* use tx-2 queue ? >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 >> > >> In my case, what queue it used was basically random (on top of that it >> sometimes changed the used queue mid-transfer) what could be easily >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a bit >> at loss now, as I though xps configuration should be absolute. >> >> Well, I'd be greatful for some pointers / hints. > > So it sounds like you have everything configured correctly. The one > question I would have is if we are certain the CPU pinning is working > for the application. You might try using something like perf to > verify what is running on CPU 10, and what is running on the CPUs that > the queues are associated with. > I did verify with 'top' in this case. I'll double check tommorow just to be sure. Other than testing, there was nothing else running on the machine. > Also after you have configured things you may want to double check and > verify the xps_cpus value is still set. I know under some > circumstances the value can be reset by a device driver if the number > of queues changes, or if the interface toggles between being > administratively up/down. Hmm, none of this was happening during tests. Are there any other circumstances where xps settings could be ignored or changed during the test (that is during the actual transfer, not between separate attempts) ? One thing I'm a bit afraid is that kernel was not exactly the newest (3.16), maybe I'm missing some crucial fixes, though xps was added much earlier than that. Either way, I'll try to redo tests with current kernel tommorow.
XPS configuration question (on tg3)
Hi, I've been testing different configurations and I didn't manage to get XPS to "behave" correctly - so I'm probably misunderstanding or forgetting something. The nic in question (under tg3 driver - BCM5720 and BCM5719 models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx and rx), 1 was unused (this got me scratching my head a bit) and the remaining one was for the last rx (though due to another bug recently fixed the 4th rx queue was inconfigurable on receive side). The names were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4. The XPS was configured as: echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue only, 4-7 tx-1 and 8-15 tx-2. Just in case rx side could get in the way as far as flows go, relevant irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to 10 - falling into groups defined by the above masks. I tested both with mx and multiq scheduler, essentially either this: qdisc mq 2: root qdisc pfifo_fast 0: parent 2:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent 2:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent 2:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 or this (for the record, skbaction queue_mapping was behaving correctly with the one below): qdisc multiq 3: root refcnt 6 bands 3/5 qdisc pfifo_fast 31: parent 3:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 32: parent 3:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 33: parent 3:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Now, do I understand correctly, that under the above setup - commands such as taskset 400 nc -p $prt host_ip 12345
[PATCH net-next v2 1/2] net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug
This patch simplifies how we update fsc and calculate vt from it - while keeping the expected functionality identical with how hfsc behaves curently. It also fixes a certain issue introduced with a very old patch. The idea is, that instead of correcting cl_vt before fsc curve update (rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep cl_vt local to the current period - we can simply rely on virtual times and curve values always being in sync - analogously to how rsc and usc function, except that we use virtual time here. Why hasn't it been done since the beginning this way ? The likely scenario (basing on the code trying to correct curves whenever possible) was to keep the virtual times as small as possible - as they have tendency to "gallop" forward whenever their siblings and other fair sharing subtrees are idling. On top of that, current code is subtly bugged, so cumulative time (without any corrections) is always kept and used in init_vf() when a new backlog period begins (using cl_cvtoff). Is cumulative value safe ? Generally yes, though corner cases are easy to create. For example consider: 1gbit interface some 100kbit leaf, everything else idle With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and it's virtual time, so in reality it's 1 times more. ITOW 38 bits are needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all logarithms rounded up). It's getting somewhat dangerous, but also requires setup excusing this kind of values not mentioning permanently backlogged class for a year. In near most extreme case (10gbit, 10kbit leaf), we have "enough" to hold ~13.6 days in 64 bits. Well, the issue remains mostly theoretical and cl_cvtoff has been working fine for all those years. Sensible configuration are de-facto immune to this issue, and not so sensible can solve it with a cronjob and its period inversely proportional to the insanity of such setup =) Now let's explain the subtle bug mentioned earlier. The issue is related to how offsets are kept and how we calculate virtual times and update fair service curve(s). The issue itself is subtle, but easy to observe with long m1 segments. It was introduced in rather old patch: Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment in HFSC scheduler" (available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git) Originally when a new backlog period was started, cl_vtoff of each sibling was updated with cl_cvtmax from past period - naturally moving all cl_vt to proper starting point. That patch adjusted it so cumulative offset is kept in the parent, and there is no need for traversing the list (as any subsequent child activation derives new vt from already active sibling(s)). But with this change, cl_vtoff (of each sibling) is no longer persistent across the inactivity periods, as it's calculated from parent's cl_cvtoff on a new backlog period, conflicting with the following curve correction from the previous period: if (cl->cl_virtual.x == vt) { cl->cl_virtual.x -= cl->cl_vtoff; cl->cl_vtoff = 0; } This essentially tries to keep curve as if it was local to the period and resets cl_vtoff (cumulative vt offset of the class) to 0 when possible (read: when we have an intersection or if a new curve is below the old one). But then it's recalculated from cl_cvtoff on next active period. Then rtsc_min() call preceding the above if() doesn't really do what we expect it to do in such scenario - as it calculates the minimum of corrected curve (from the previous backlog period) and the new uncorrected curve (with offset derived from cl_cvtoff). Example: tc class add dev $ife parent 1:0 classid 1:1 hfsc ls m2 100mbit ul m2 100mbit tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle) pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle) start A, let it run 10s pause A briefly to force rtsc_min() At this point we would expect A to continue at 20mbit after a brief moment of 80mbit. But instead A will use 80mbit for full 10s again. It's the effect of first correcting A (during 'start A'), and then - after unpausing - calculating rtsc_min() from old corrected and new uncorrected curve. The patch fixes this bug and keepis vt and fsc in sync (virtual times are cumulative, not local to the backlog period). Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 44 1 file changed, 12 insertions(+), 32 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 3ddc7bd..6329d7d 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -151,11 +151,8 @@ struct hfsc_class { (monotonic within a period) */
[PATCH net-next v2 2/2] net/sched/sch_hfsc.c: remove unused cl_myfadj
The code using this variable has been commented out in the past as it was causing issues in upperlimited link-sharing scenarios. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 6329d7d..000f1d3 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -142,8 +142,6 @@ struct hfsc_class { link-sharing, max(myf, cfmin) */ u64 cl_myf; /* my fit-time (calculated from this class's own upperlimit curve) */ - u64 cl_myfadj; /* my fit-time adjustment (to cancel - history dependence) */ u64 cl_cfmin; /* earliest children's fit-time (used with cl_myf to obtain cl_f) */ u64 cl_cvtmin; /* minimal virtual time among the @@ -730,7 +728,6 @@ init_vf(struct hfsc_class *cl, unsigned int len) /* compute myf */ cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total); - cl->cl_myfadj = 0; } } @@ -797,9 +794,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) /* update f */ if (cl->cl_flags & HFSC_USC) { + cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total); +#if 0 cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit, cl->cl_total); -#if 0 /* * This code causes classes to stay way under their * limit when multiple classes are used at gigabit @@ -1471,7 +1469,6 @@ hfsc_reset_class(struct hfsc_class *cl) cl->cl_parentperiod = 0; cl->cl_f= 0; cl->cl_myf = 0; - cl->cl_myfadj = 0; cl->cl_cfmin= 0; cl->cl_nactive = 0; -- 2.1.3
[PATCH net-next v2 0/2] HFSC patches
Changes since v1: - in the first patch's commit message, reference old patch using kernel.org's historical tree (instead of using gmane) - in the second patch we just remove variable Notes regarding patch #1/2: This patch syncs virtual times with fair service curve and fixes a very old subtle bug. The detailed explanation is in the commit message. Additionally I've made an illustration to help understand the issue better: http://imgur.com/a/N8uMC See the example at the bottom of the commit message - Am1_3 and Am2_3 is what should happen with such queue setup, Am1_3real and Am2_3real is what actually happens due to rtsc_min() calculating minimum from corrected and uncorrected curves. Michal Soltys (2): net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug net/sched/sch_hfsc.c: remove unused cl_myfadj net/sched/sch_hfsc.c | 51 ++- 1 file changed, 14 insertions(+), 37 deletions(-) -- 2.1.3
[PATCH 1/2] net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug
This patch simplifies how we update fsc and calculate vt from it - while keeping the expected functionality identical with how hfsc behaves curently. It also fixes a certain issue introduced with a very old patch. The idea is, that instead of correcting cl_vt before fsc curve update (rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep cl_vt local to the current period - we can simply rely on virtual times and curve values always being in sync - analogously to how rsc and usc function, except that we use virtual time here. Why hasn't it been done since the beginning this way ? The likely scenario (basing on the code trying to correct curves whenever possible) was to keep the virtual times as small as possible - as they have tendency to "gallop" forward whenever their siblings and other fair sharing subtrees are idling. On top of that, current code is subtly bugged, so cumulative time (without any corrections) is always kept and used in init_vf() when a new backlog period begins (using cl_cvtoff). Is cumulative value safe ? Generally yes, though corner cases are easy to create. For example consider: 1gbit interface some 100kbit leaf, everything else idle With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and it's virtual time, so in reality it's 1 times more. ITOW 38 bits are needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all logarithms rounded up). It's getting somewhat dangerous, but also requires setup excusing this kind of values not mentioning permanently backlogged class for a year. In near most extreme case (10gbit, 10kbit leaf), we have "enough" to hold ~13.6 days in 64 bits. Well, the issue remains mostly theoretical and cl_cvtoff has been working fine for all those years. Sensible configuration are de-facto immune to this issue, and not so sensible can solve it with a cronjob and its period inversely proportional to the insanity of such setup =) Now let's explain the subtle bug mentioned earlier. The issue is related to how offsets are kept and how we calculate virtual times and update fit-time service curve(s). The issue itself is subtle, but easy to observe with long m1 segments. It was introduced in rather old patch (no history in current linux tree): "O(1) children vtoff adjustment in HFSC scheduler" http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281 Originally when a new backlog period was started, cl_vtoff of each sibling was updated with cl_cvtmax from past period - naturally moving all cl_vt to proper starting point. That patch adjusted it so cumulative offset is kept in the parent, and there is no need for traversing the list (as any subsequent child activation derives new vt from already active sibling(s)). But with this change, cl_vtoff (of each sibling) is no longer persistent across the inactivity periods, as it's calculated from parent's cl_cvtoff on a new backlog period, conflicting with the following curve correction from the previous period: if (cl->cl_virtual.x == vt) { cl->cl_virtual.x -= cl->cl_vtoff; cl->cl_vtoff = 0; } This essentially tries to keep curve as if it was local to the period and resets cl_vtoff (cumulative vt offset of the class) to 0 when possible (read: when we have an intersection or if a new curve is below the old one). But then it's recalculated from cl_cvtoff on next active period. Then rtsc_min() call preceding the above if() doesn't really do what we expect it to do in such scenario - as it calculates the minimum of corrected curve (from the previous backlog period) and the new uncorrected curve (with offset derived from cl_cvtoff). Example: tc class add dev $ife parent 1:0 classid 1:1 hfsc ls m2 100mbit ul m2 100mbit tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle) pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle) start A, let it run 10s pause A briefly to force rtsc_min() At this point we would expect A to continue at 20mbit after a brief moment of 80mbit. But instead A will use 80mbit for full 10s again. It's the effect of first correcting A (during 'start A'), and then - after unpausing - calculating rtsc_min() from old corrected and new uncorrected curve. The patch fixes this bug and keepis vt and fsc in sync (virtual times are cumulative, not local to the backlog period). Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 44 1 file changed, 12 insertions(+), 32 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 3ddc7bd..6329d7d 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -151,11 +151,8 @@ struct hfsc_class { (monotonic within a period) */
[PATCH 2/2] net/sched/sch_hfsc.c: comment out cl_myfadj
The part of the code using it is commented out, so the needless use of this variable can be commented out as well. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 6329d7d..99ddc38 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -142,8 +142,11 @@ struct hfsc_class { link-sharing, max(myf, cfmin) */ u64 cl_myf; /* my fit-time (calculated from this class's own upperlimit curve) */ +#if 0 +/* code using cl_myfadj is disabled */ u64 cl_myfadj; /* my fit-time adjustment (to cancel history dependence) */ +#endif u64 cl_cfmin; /* earliest children's fit-time (used with cl_myf to obtain cl_f) */ u64 cl_cvtmin; /* minimal virtual time among the @@ -730,7 +733,10 @@ init_vf(struct hfsc_class *cl, unsigned int len) /* compute myf */ cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total); +#if 0 +/* code using cl_myfadj is disabled */ cl->cl_myfadj = 0; +#endif } } @@ -797,9 +803,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) /* update f */ if (cl->cl_flags & HFSC_USC) { + cl->cl_myf = rtsc_y2x(>cl_ulimit, cl->cl_total); +#if 0 cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit, cl->cl_total); -#if 0 /* * This code causes classes to stay way under their * limit when multiple classes are used at gigabit @@ -1471,7 +1478,10 @@ hfsc_reset_class(struct hfsc_class *cl) cl->cl_parentperiod = 0; cl->cl_f= 0; cl->cl_myf = 0; +#if 0 +/* code using cl_myfadj is disabled */ cl->cl_myfadj = 0; +#endif cl->cl_cfmin= 0; cl->cl_nactive = 0; -- 2.1.3
[PATCH 0/2] HFSC patches
- first patch: This patch syncs virtual times with fair service curve and fixes a very old subtle bug. The detailed explanation is in the commit message. Additionally I've made an illustration to help understand the issue better: http://imgur.com/a/N8uMC See the example at the bottom of the commit message - Am1_3 and Am2_3 is what should happen with such queue setup, Am1_3real and Am2_3real is what actually happens due to rtsc_min() calculating minimum from corrected and uncorrected curves. - second patch: This is trivial patch that comments out one unused variable, related to another commented out piece of code. Michal Soltys (2): net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug net/sched/sch_hfsc.c: comment out cl_myfadj net/sched/sch_hfsc.c | 56 +--- 1 file changed, 23 insertions(+), 33 deletions(-) -- 2.1.3
Re: question: tg3 driver/nics and inconsistent RX ring count
On 2016-07-26 22:06, Alexander Duyck wrote: > On Tue, Jul 26, 2016 at 12:52 PM, Michal Soltys <sol...@ziu.info> wrote: >> Hi, >> >> I have a few of BCM5720 and BCM5719 kinds sitting in Dell R320 and R520 >> servers - and all of them have certain peculiarity: they claim to have >> up to 4 TX and RX rings (and this can be set/verified just fine through >> ethtool -l/-L, with driver defaulting to 4 rings), indirection table >> (ethtool -x) looks fine as well: >> >> RX flow hash indirection table for eth1b with 3 RX ring(s): >> 0: 0 1 2 3 0 1 2 3 >> 8: 0 1 2 3 0 1 2 3 >> .. >> >> But this "3 RX ring(s)" is actually a real limit if I try to adjust >> anything, for example all those commands would fail: >> >> ethtool -X eth1b equal 4 >> ethtool -X eth1b weight 1 1 1 1 >> >> But would work fine for 3 and less rings. This was quickly tested with >> different kernel/ethtool combinations, from old 3.16 to 4.6.y with same >> exact results. Nothing fancier (-N/-U) is supported either. >> >> Any hints/comments about the cause of this and/or possible workarounds ? > > Well a quick look at the driver code makes it seem the problem lies in > tg3_get_rxnfc. I suspect the bug is related to the following lines: > > /* The first interrupt vector only > * handles link interrupts. > */ > info->data -= 1; > > I'm not sure what the number of interrupt vectors has to do with the > number of rings. Perhaps someone more familiar with the driver can > point out why you would subtract 1 from tp->rxq_cnt to get the number > of queues when it seems like it should be tp->rxq_cnt. > > Hope that helps. > > - Alex > Ah thanks, seems to be the case then. Quick git blame shows it's been since the very introduction of RSS indirection configurability (ca. 2011) in this driver, 90415477bf1356f72acc34063ff52441fc10a754. I've CCed the author, maybe he will be able to shed some light.
question: tg3 driver/nics and inconsistent RX ring count
Hi, I have a few of BCM5720 and BCM5719 kinds sitting in Dell R320 and R520 servers - and all of them have certain peculiarity: they claim to have up to 4 TX and RX rings (and this can be set/verified just fine through ethtool -l/-L, with driver defaulting to 4 rings), indirection table (ethtool -x) looks fine as well: RX flow hash indirection table for eth1b with 3 RX ring(s): 0: 0 1 2 3 0 1 2 3 8: 0 1 2 3 0 1 2 3 .. But this "3 RX ring(s)" is actually a real limit if I try to adjust anything, for example all those commands would fail: ethtool -X eth1b equal 4 ethtool -X eth1b weight 1 1 1 1 But would work fine for 3 and less rings. This was quickly tested with different kernel/ethtool combinations, from old 3.16 to 4.6.y with same exact results. Nothing fancier (-N/-U) is supported either. Any hints/comments about the cause of this and/or possible workarounds ?
[iproute PATCH 1/1] man/man8/tc-flow.8: minor corrections
- baseclass: major handle must match that of class's, Y defaults to 1 - flow map example: maps to 1-256, not 1-257 Signed-off-by: Michal Soltys <sol...@ziu.info> --- man/man8/tc-flow.8 | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/man/man8/tc-flow.8 b/man/man8/tc-flow.8 index f1b7e2a..54f6bf7 100644 --- a/man/man8/tc-flow.8 +++ b/man/man8/tc-flow.8 @@ -73,8 +73,10 @@ An offset for the resulting class ID. .I ID may be .BR root ", " none -or a hexadecimal class ID in the form [\fIX\fB:\fR]\fIY\fR. If \fIX\fR is -omitted, it is assumed to be zero. +or a hexadecimal class ID in the form [\fIX\fB:\fR]\fIY\fR. \fIX\fR must +match qdisc's/class's major handle (if omitted, the correct value is chosen +automatically). If the whole \fBbaseclass\fR is omitted, \fIY\fR defaults +to 1. .TP .BI divisor " NUM" Number of buckets to use for sorting into. Keys are calculated modulo @@ -239,7 +241,7 @@ tc filter add ... flow hash \\ divisor 1024 .EE .TP -Map destination IPs of 192.168.0.0/24 to classids 1-257: +Map destination IPs of 192.168.0.0/24 to classids 1-256: .EX tc filter add ... flow map \\ -- 2.1.3
Re: [PATCH -next] hfsc: reduce hfsc_sched to 14 cachelines
On 2016-07-04 16:22, Florian Westphal wrote: > hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14 > cachelines by placing level after filter_cnt (covering 4 byte hole) and > reducing period/nactive/flags to u32 (period is just a counter, > incremented when class becomes active -- 2**32 is plenty for this > purpose, also, long is only 32bit wide on 32bit platforms anyway). > > cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period > member is already u32, so no precision is lost there either. > It should be fine, even if it overflowed (which theoretically isn't that hard: 1500 mtu, 1gbit interface, 900mbit transfer (meaning some process throttling itself to 900mbit, not hfsc upperlimiting it) => ~16 hours to overflow or ITOW 75000 period changes/s) - what really matters (in init_vf()) is if the period is different. For the record, I have 2 patches that will trim some stuff further. Unfortunately I have another 2 that will near surely put it back at [hopefully only] 16 (if they get accepted that is). But there're some other candidates that might help (some not that tiny functions defined as inline that are called in more than 1 place). E.g. update_cfmin() is called from 3 places.
[PATCH 1/1] iproute2: unmangle netdev/my emails in man pages (hfsc, stab)
No other man pages do so, hiding netdev is kind of silly and I don't mind having my own address normally visible. --- man/man7/tc-hfsc.7 | 4 ++-- man/man8/tc-hfsc.8 | 4 ++-- man/man8/tc-stab.8 | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/man/man7/tc-hfsc.7 b/man/man7/tc-hfsc.7 index ca04961..5ae5e6b 100644 --- a/man/man7/tc-hfsc.7 +++ b/man/man7/tc-hfsc.7 @@ -555,8 +555,8 @@ Please refer to \fBtc\-stab\fR(8) . \fBtc\fR(8), \fBtc\-hfsc\fR(8), \fBtc\-stab\fR(8) -Please direct bugreports and patches to: <net...@vger.kernel.org> +Please direct bugreports and patches to: <netdev@vger.kernel.org> . .SH "AUTHOR" . -Manpage created by Michal Soltys (sol...@ziu.info) +Manpage created by Michal Soltys (sol...@ziu.info) diff --git a/man/man8/tc-hfsc.8 b/man/man8/tc-hfsc.8 index 5444118..fd0df8f 100644 --- a/man/man8/tc-hfsc.8 +++ b/man/man8/tc-hfsc.8 @@ -54,8 +54,8 @@ parameters, you will specify linear service curve. . \fBtc\fR(8), \fBtc\-hfsc\fR(7), \fBtc\-stab\fR(8) -Please direct bugreports and patches to: <net...@vger.kernel.org> +Please direct bugreports and patches to: <netdev@vger.kernel.org> . .SH "AUTHOR" . -Manpage created by Michal Soltys (sol...@ziu.info) +Manpage created by Michal Soltys (sol...@ziu.info) diff --git a/man/man8/tc-stab.8 b/man/man8/tc-stab.8 index 02caa7d..03a0659 100644 --- a/man/man8/tc-stab.8 +++ b/man/man8/tc-stab.8 @@ -156,8 +156,8 @@ it's good to use \fBethtool\fR to turn off offloading features. .br \fB[2]\fR http://www.faqs.org/rfcs/rfc2684.html -Please direct bugreports and patches to: <net...@vger.kernel.org> +Please direct bugreports and patches to: <netdev@vger.kernel.org> . .SH "AUTHOR" . -Manpage created by Michal Soltys (sol...@ziu.info) +Manpage created by Michal Soltys (sol...@ziu.info) -- 2.1.3
[net-next PATCH 0/5] HFSC patches, part 1
Hi, It's revised version of part of the patches I submitted really, really long time ago (back then I asked Patrick to ignore them as I found some issues shortly after submitting). Anyway this is the first set with very simple fixes/changes though some of them relatively subtle (I tried to do very exhaustive commit messages explaining what and why with those). The patches are against net-next tree. The second set will be heavier - or rather with more complex explanations, among those I have: - a fix to subtle issue introduced in http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281 along with simplifying related stuff - update times to 96 bits (which allows to "just" use 32 bit shifts and improves curve definition accuracy at more extreme low/high speeds) - add curve "merging" instead of just selecting in convex case (computations mirror those from concave intersection) But these are eventually for later. Michal Soltys (5): net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() net/sched/sch_hfsc.c: remove leftover dlist and droplist net/sched/sch_hfsc.c: go passive after vt update net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() net/sched/sch_hfsc.c | 54 ++-- 1 file changed, 27 insertions(+), 27 deletions(-) -- 2.1.3
[net-next PATCH 5/5] net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()
cl->cl_vt alone is relative only to the current backlog period, while the curve operates on cumulative virtual time. This patch adds missing cl->cl_vtoff. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 4eef857..dff92ea 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -940,7 +940,7 @@ static void hfsc_change_fsc(struct hfsc_class *cl, struct tc_service_curve *fsc) { sc2isc(fsc, >cl_fsc); - rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vt, cl->cl_total); + rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vtoff + cl->cl_vt, cl->cl_total); cl->cl_flags |= HFSC_FSC; } -- 2.1.3
[net-next PATCH 2/5] net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()
The condition can only succeed on wrong configurations. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 6d6df6b..e2244bb 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -882,7 +882,7 @@ qdisc_peek_len(struct Qdisc *sch) unsigned int len; skb = sch->ops->peek(sch); - if (skb == NULL) { + if (unlikely(skb == NULL)) { qdisc_warn_nonwc("qdisc_peek_len", sch); return 0; } -- 2.1.3
[net-next PATCH 1/5] net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline
Realtime scheduling implemented in HFSC uses head of the queue to make the decision about which packet to schedule next. But in case of any head drop, the deadline calculated for the previous head is not necessarily correct for the next head (unless both packets have the same length). Thanks to peek() function used during dequeue - which internally is a dequeue operation - hfsc is almost safe from this issue, as peek() dequeues and isolates the head storing it temporarily until the real dequeue happens. But there is one exception: if after the class activation a drop happens before the first dequeue operation, there's never a chance to do the peek(). Adding peek() call in enqueue - if this is the first packet in a new backlog period AND the scheduler has realtime curve defined - fixes that one corner case. The 1st hfsc_dequeue() will use that peeked packet, similarly as every subsequent hfsc_dequeue() call uses packet peeked by the previous call. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 8cb5eff..6d6df6b 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -1594,8 +1594,17 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) return err; } - if (cl->qdisc->q.qlen == 1) + if (cl->qdisc->q.qlen == 1) { set_active(cl, qdisc_pkt_len(skb)); + /* +* If this is the first packet, isolate the head so an eventual +* head drop before the first dequeue operation has no chance +* to invalidate the deadline. +*/ + if (cl->cl_flags & HFSC_RSC) + cl->qdisc->ops->peek(cl->qdisc); + + } qdisc_qstats_backlog_inc(sch, skb); sch->q.qlen++; -- 2.1.3
[net-next PATCH 3/5] net/sched/sch_hfsc.c: remove leftover dlist and droplist
This is update to: commit a09ceb0e08140a ("sched: remove qdisc->drop") That commit removed qdisc->drop, but left alone dlist and droplist that no longer serve any meaningful purpose. Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 8 1 file changed, 8 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index e2244bb..df07f06 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -130,7 +130,6 @@ struct hfsc_class { struct rb_node vt_node; /* parent's vt_tree member */ struct rb_root cf_tree; /* active children sorted by cl_f */ struct rb_node cf_node; /* parent's cf_heap member */ - struct list_head dlist; /* drop list member */ u64 cl_total; /* total work in bytes */ u64 cl_cumul; /* cumulative work in bytes done by @@ -177,8 +176,6 @@ struct hfsc_sched { struct hfsc_class root; /* root class */ struct Qdisc_class_hash clhash; /* class hash */ struct rb_root eligible;/* eligible tree */ - struct list_head droplist; /* active leaf class list (for - dropping) */ struct qdisc_watchdog watchdog; /* watchdog timer */ }; @@ -858,7 +855,6 @@ set_active(struct hfsc_class *cl, unsigned int len) if (cl->cl_flags & HFSC_FSC) init_vf(cl, len); - list_add_tail(>dlist, >sched->droplist); } static void @@ -867,8 +863,6 @@ set_passive(struct hfsc_class *cl) if (cl->cl_flags & HFSC_RSC) eltree_remove(cl); - list_del(>dlist); - /* * vttree is now handled in update_vf() so that update_vf(cl, 0, 0) * needs to be called explicitly to remove a class from vttree. @@ -1443,7 +1437,6 @@ hfsc_init_qdisc(struct Qdisc *sch, struct nlattr *opt) if (err < 0) return err; q->eligible = RB_ROOT; - INIT_LIST_HEAD(>droplist); q->root.cl_common.classid = sch->handle; q->root.refcnt = 1; @@ -1527,7 +1520,6 @@ hfsc_reset_qdisc(struct Qdisc *sch) hfsc_reset_class(cl); } q->eligible = RB_ROOT; - INIT_LIST_HEAD(>droplist); qdisc_watchdog_cancel(>watchdog); sch->qstats.backlog = 0; sch->q.qlen = 0; -- 2.1.3
[net-next PATCH 4/5] net/sched/sch_hfsc.c: go passive after vt update
When a class is going passive, it should update its cl_vt first to be consistent with the last dequeue operation. Otherwise its cl_vt will be one packet behind and parent's cvtmax might not be updated as well. One possible side effect is if some class goes passive and subsequently goes active /without/ its parent going passive - with cl_vt lagging one packet behind - comparison made in init_vf() will be affected (same period). Signed-off-by: Michal Soltys <sol...@ziu.info> --- net/sched/sch_hfsc.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index df07f06..4eef857 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -778,6 +778,20 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) else go_passive = 0; + /* update vt */ + cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total) + - cl->cl_vtoff + cl->cl_vtadj; + + /* +* if vt of the class is smaller than cvtmin, +* the class was skipped in the past due to non-fit. +* if so, we need to adjust vtadj. +*/ + if (cl->cl_vt < cl->cl_parent->cl_cvtmin) { + cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt; + cl->cl_vt = cl->cl_parent->cl_cvtmin; + } + if (go_passive) { /* no more active child, going passive */ @@ -794,25 +808,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) continue; } - /* -* update vt and f -*/ - cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total) - - cl->cl_vtoff + cl->cl_vtadj; - - /* -* if vt of the class is smaller than cvtmin, -* the class was skipped in the past due to non-fit. -* if so, we need to adjust vtadj. -*/ - if (cl->cl_vt < cl->cl_parent->cl_cvtmin) { - cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt; - cl->cl_vt = cl->cl_parent->cl_cvtmin; - } - /* update the vt tree */ vttree_update(cl); + /* update f */ if (cl->cl_flags & HFSC_USC) { cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit, cl->cl_total); -- 2.1.3
Re: switch / linux STP interoperation issues.
On 2016-06-24 23:09, Elad Raz wrote: On Fri, Jun 24, 2016 at 10:14 PM, Michal Soltys <sol...@ziu.info> wrote: Hi, The switch respected BPDUs sent to it (if applicable) - for example it complied properly if it's priority was less (numerically higher) than linux's - showing linux box as root bridge, marking one port as root, the other as alternate/blocking. The linux box itself was completely deaf to any BPDUs arriving to it (e.g. if it's priority was lower) and just keept pushing its own data units all the time with little care (quickly leading to loops in some scenarios). Whether it was builtin stp implementation, or whether it was mstpd's stp/rstp/mstp - the behaviour was the same. Linux is fresh stock archlinux (so vanilla 4.6.2 kernel and the most (or almost) recent userland utils - iproute2, etc.), running on relatively recent poweredge dell. I'm kind of lost at this point - am I missing some basic options/sysctls/sysfs ? Is there some known incompatibility here somewhere between switch/linux/nic/versions/etc. ? Some by-default enabled BPDU filtering maybe ? Any suggestions / hints appreciated. Please see Ido Schimmel's fix "[net] bridge: Fix incorrect re-injection of STP packets", https://patchwork.ozlabs.org/patch/629768/ Thanks ! Will test it on monday asap.
switch / linux STP interoperation issues.
Hi, In the last week I've been trying to get STP on the linux side (both its builtin STP implementation as well as mstpd userspace daemon). Initially I started with more complex setups (vlan aware bridge, bonds, mst) and gradually (with identical problems on each step) ended with the most basic setup that can be summarized by: brctl addbr br0 brctl addif br0 eno1 brctl addif br0 eno2 brctl stp br0 on ip li set eno1 up ip li set eno2 up ip li set br0 up The same config on switch's side (cisco 2960-x in its most basic incarnation) - in the other words two cables between linux machine and the switch, enabled stp, access ports in vlan1. The end effect of this setup (and any of the more complex previous ones): The switch respected BPDUs sent to it (if applicable) - for example it complied properly if it's priority was less (numerically higher) than linux's - showing linux box as root bridge, marking one port as root, the other as alternate/blocking. The linux box itself was completely deaf to any BPDUs arriving to it (e.g. if it's priority was lower) and just keept pushing its own data units all the time with little care (quickly leading to loops in some scenarios). Whether it was builtin stp implementation, or whether it was mstpd's stp/rstp/mstp - the behaviour was the same. With the bridge itself happily claiming to be the root (despite lower priority): br0 bridge id a000.000af77cddc4 designated roota000.000af77cddc4 root port 0 enp8s0f0 (3) port id8003state forwarding designated roota000.000af77cddc4 path cost 4 designated bridge a000.000af77cddc4 (and analogous output from mstpctl tool) tcpdump looked like: 17:33:28.701425 00:0a:f7:7c:dd:c4 > 01:80:c2:00:00:00, 802.3, length 52: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03 : STP 802.1d, Config, Flags [none], bridge-id a000.00:0a:f7:7c:dd:c4.8003, length 35 message-age 0.00s, max-age 20.00s, hello-time 2.00s, forwarding-delay 15.00s root-id a000.00:0a:f7:7c:dd:c4, root-pathcost 0 17:33:29.026185 18:8b:45:6f:38:86 > 01:80:c2:00:00:00, 802.3, length 60: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03 : STP 802.1d, Config, Flags [none], bridge-id 2001.18:8b:45:6f:38:80.8006, length 43 message-age 0.00s, max-age 20.00s, hello-time 2.00s, forwarding-delay 15.00s root-id 2001.18:8b:45:6f:38:80, root-pathcost 0 The first sent by linux box, the second by the switch (the above from basic stp scenario on both sides). The cards in question used: Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) handled by: driver: tg3 version: 3.137 firmware-version: FFV7.10.17 bc 5719-v1.37 Linux is fresh stock archlinux (so vanilla 4.6.2 kernel and the most (or almost) recent userland utils - iproute2, etc.), running on relatively recent poweredge dell. I'm kind of lost at this point - am I missing some basic options/sysctls/sysfs ? Is there some known incompatibility here somewhere between switch/linux/nic/versions/etc. ? Some by-default enabled BPDU filtering maybe ? Any suggestions / hints appreciated.