Re: [PATCH net-next] net: enetc: automatically select IERB module

2021-04-20 Thread Vladimir Oltean
On Tue, Apr 20, 2021 at 04:28:21PM +0200, Michael Walle wrote:
> Now that enetc supports flow control we have to make sure the settings in
> the IERB are correct. Therefore, we actually depend on the enetc-ierb
> module. Previously it was possible that this module was disabled while the
> enetc was enabled. Fix it by automatically selecting the enetc-ierb module.
> 
> Fixes: e7d48e5fbf30 ("net: enetc: add a mini driver for the Integrated Endpoint Register Block")
> Signed-off-by: Michael Walle 
> ---

Acked-by: Vladimir Oltean 


Re: [EXT] Re: [net-next] net: dsa: felix: disable always guard band bit for TAS config

2021-04-20 Thread Vladimir Oltean
On Tue, Apr 20, 2021 at 01:30:51PM +0300, Vladimir Oltean wrote:
> On Tue, Apr 20, 2021 at 10:28:45AM +0000, Xiaoliang Yang wrote:
> > Hi Vladimir,
> > 
> > On Tue, Apr 20, 2021 at 16:27:10AM +0800, Vladimir Oltean wrote:
> > >
> > > On Tue, Apr 20, 2021 at 03:06:40AM +0000, Xiaoliang Yang wrote:
> > >> Hi Vladimir.
> > >>
> > >> On Mon, Apr 19, 2021 at 20:38PM +0800, Vladimir Oltean wrote:
> > >> >
> > >> >What is a scheduled queue? When time-aware scheduling is enabled on 
> > >> >the port, why are some queues scheduled and some not?
> > >>
> > >> The felix vsc9959 device can set the SCH_TRAFFIC_QUEUES field bits to
> > >> define which queues are scheduled. Only the queues set there serve
> > >> scheduled traffic. In this driver we set all 8 queues to be scheduled
> > >> by default, so every transition is from a scheduled queue to a
> > >> scheduled queue.
> > >
> > > I understand this, what I don't really understand is the distinction
> > > that the switch makes between 'scheduled' and 'non-scheduled'
> > > traffic.  What else does this distinction affect, apart from the
> > > guard bands added implicitly here? The tc-taprio qdisc has no notion
> > > of 'scheduled' queues, all queues are 'scheduled'. Do we ever need
> > > to set the scheduled queues mask to something other than 0xff? If
> > > so, when and why?
> > 
> > Yes, it seems to affect only the guard band. If the always guard
> > band bit is disabled, we can use SCH_TRAFFIC_QUEUES to determine which
> > queues are non-scheduled. Only traffic from the non-scheduled queues
> > will reserve the guard band. But the tc-taprio qdisc cannot mark queues
> > as scheduled or non-scheduled for now. Adding this feature can be
> > discussed in the future.
> > 
> > It is not reasonable to add a guard band to the traffic of every queue
> > by default, so I disable the always guard band bit for the TAS config.
> 
> Ok, if true, then it makes sense to disable ALWAYS_GUARD_BAND_SCH_Q.

One question though. I know that Felix overruns the time gate, i.e. when
the time interval has any value larger than 32 ns, the switch port is
happy to send any packet of any size, regardless of whether the duration
of transmission exceeds the gate size or not. In doing so, it violates
this requirement from IEEE 802.1Q-2018 clause 8.6.8.4 Enhancements for
scheduled traffic:

-[ cut here ]-
In addition to the other checks carried out by the transmission selection
algorithm, a frame on a traffic class queue is not available for
transmission [as required for tests (a) and (b) in 8.6.8] if the
transmission gate is in the closed state or if there is insufficient time
available to transmit the entirety of that frame before the next gate-close
event (3.97) associated with that queue. A per-traffic class counter,
TransmissionOverrun (12.29.1.1.2), is incremented if the implementation
detects that a frame from a given queue is still being transmitted by the
MAC when the gate-close event for that queue occurs.

NOTE 1—It is assumed that the implementation has knowledge of the
transmission overheads that are involved in transmitting a frame on a given
Port and can therefore determine how long the transmission of a frame will
take. However, there can be reasons why the frame size, and therefore the
length of time needed for its transmission, is unknown; for example, where
cut-through is supported, or where frame preemption is supported and there
is no way of telling in advance how many times a given frame will be
preempted before its transmission is complete. It is desirable that the
schedule for such traffic is designed to accommodate the intended pattern of
transmission without overrunning the next gate-close event for the traffic
classes concerned.
-[ cut here ]-

Is this not the reason why the guard bands were added, to make the
scheduler stop sending any frame for 1 MAX_SDU in advance of the gate
close event, so that it doesn't overrun the gate?
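
The check that the quoted clause asks for can be sketched in plain C as
below. This is only an illustration of the rule, not of what the Felix
hardware does: the function names are hypothetical, and wire_len is assumed
to already include whatever per-frame overhead the implementation wants to
account for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Time in ns needed to put wire_len bytes on a link running at speed_mbps. */
static uint64_t frame_tx_time_ns(uint32_t wire_len, uint32_t speed_mbps)
{
	assert(speed_mbps > 0);

	/* bits / (Mbit/s) gives microseconds, so scale by 1000 to get ns */
	return (uint64_t)wire_len * 8 * 1000 / speed_mbps;
}

/* A frame is eligible only if it can be sent entirely before the
 * gate-close event of its queue.
 */
static bool frame_fits_before_gate_close(uint64_t now_ns,
					 uint64_t gate_close_ns,
					 uint32_t wire_len,
					 uint32_t speed_mbps)
{
	if (now_ns >= gate_close_ns)
		return false; /* gate already closed */

	return frame_tx_time_ns(wire_len, speed_mbps) <=
	       gate_close_ns - now_ns;
}
```

A scheduler that applies this test to every frame never overruns the gate; a
guard band of one maximum-size frame is the coarser alternative for hardware
that cannot evaluate it per frame.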


Re: [net-next] net: dsa: felix: disable always guard band bit for TAS config

2021-04-20 Thread Vladimir Oltean
On Mon, Apr 19, 2021 at 06:25:30PM +0800, Xiaoliang Yang wrote:
> The ALWAYS_GUARD_BAND_SCH_Q bit in the TAS config register is described
> as follows:
>   0: Guard band is implemented for nonschedule queues to schedule
>  queues transition.
>   1: Guard band is implemented for any queue to schedule queue
>  transition.
> 
> The driver previously set the guard band to be implemented for any queue
> to schedule queue transition, which makes each GCL time slot reserve a
> guard band time long enough to pass a max SDU frame. Because the guard
> band time cannot be set in tc-taprio for now, it takes about 12000ns to
> pass a 1500B max SDU. This limits each GCL time interval to more than
> 12000ns.
> 
> This patch changes the guard band to be implemented only for nonschedule
> queues to schedule queues transitions, so that there is no need to reserve
> a guard band on each GCL entry. Users can manually add guard band time for
> each schedule queue in their configuration if they want.
> 
> Signed-off-by: Xiaoliang Yang 
> ---

Reviewed-by: Vladimir Oltean 
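
The 12000ns figure in the commit message is consistent with a 1500 byte
frame on a 1 Gbps link, although the link speed is not stated in the patch
and is an assumption here. A small sketch of the arithmetic, with
hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Guard band long enough to pass one max SDU frame, in ns. */
static uint64_t guard_band_ns(uint32_t max_sdu_bytes, uint32_t speed_mbps)
{
	assert(speed_mbps > 0);

	return (uint64_t)max_sdu_bytes * 8 * 1000 / speed_mbps;
}

/* Part of a GCL interval that remains usable once the guard band is
 * reserved at its end. Returns 0 when the interval is too short.
 */
static uint64_t usable_interval_ns(uint64_t interval_ns, uint64_t gb_ns)
{
	return interval_ns > gb_ns ? interval_ns - gb_ns : 0;
}
```

With an implicit guard band on every entry, an interval of 12000ns or less
leaves no usable transmission time, which is the limitation the patch
removes.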


Re: [EXT] Re: [net-next] net: dsa: felix: disable always guard band bit for TAS config

2021-04-20 Thread Vladimir Oltean
On Tue, Apr 20, 2021 at 10:28:45AM +0000, Xiaoliang Yang wrote:
> Hi Vladimir,
> 
> On Tue, Apr 20, 2021 at 16:27:10AM +0800, Vladimir Oltean wrote:
> >
> > On Tue, Apr 20, 2021 at 03:06:40AM +0000, Xiaoliang Yang wrote:
> >> Hi Vladimir.
> >>
> >> On Mon, Apr 19, 2021 at 20:38PM +0800, Vladimir Oltean wrote:
> >> >
> >> >What is a scheduled queue? When time-aware scheduling is enabled on 
> >> >the port, why are some queues scheduled and some not?
> >>
> >> The felix vsc9959 device can set the SCH_TRAFFIC_QUEUES field bits to
> >> define which queues are scheduled. Only the queues set there serve
> >> scheduled traffic. In this driver we set all 8 queues to be scheduled
> >> by default, so every transition is from a scheduled queue to a
> >> scheduled queue.
> >
> > I understand this, what I don't really understand is the distinction
> > that the switch makes between 'scheduled' and 'non-scheduled'
> > traffic.  What else does this distinction affect, apart from the
> > guard bands added implicitly here? The tc-taprio qdisc has no notion
> > of 'scheduled' queues, all queues are 'scheduled'. Do we ever need
> > to set the scheduled queues mask to something other than 0xff? If
> > so, when and why?
> 
> Yes, it seems to affect only the guard band. If the always guard
> band bit is disabled, we can use SCH_TRAFFIC_QUEUES to determine which
> queues are non-scheduled. Only traffic from the non-scheduled queues
> will reserve the guard band. But the tc-taprio qdisc cannot mark queues
> as scheduled or non-scheduled for now. Adding this feature can be
> discussed in the future.
> 
> It is not reasonable to add a guard band to the traffic of every queue
> by default, so I disable the always guard band bit for the TAS config.

Ok, if true, then it makes sense to disable ALWAYS_GUARD_BAND_SCH_Q.


Re: [EXT] Re: [net-next] net: dsa: felix: disable always guard band bit for TAS config

2021-04-20 Thread Vladimir Oltean
On Tue, Apr 20, 2021 at 03:06:40AM +0000, Xiaoliang Yang wrote:
> Hi Vladimir.
> 
> On Mon, Apr 19, 2021 at 20:38PM +0800, Vladimir Oltean wrote:
> >
> >What is a scheduled queue? When time-aware scheduling is enabled on
> >the port, why are some queues scheduled and some not?
> 
> The felix vsc9959 device can set the SCH_TRAFFIC_QUEUES field bits to
> define which queues are scheduled. Only the queues set there serve
> scheduled traffic. In this driver we set all 8 queues to be scheduled
> by default, so every transition is from a scheduled queue to a
> scheduled queue.

I understand this, what I don't really understand is the distinction
that the switch makes between 'scheduled' and 'non-scheduled' traffic.
What else does this distinction affect, apart from the guard bands added
implicitly here? The tc-taprio qdisc has no notion of 'scheduled'
queues, all queues are 'scheduled'. Do we ever need to set the scheduled
queues mask to something other than 0xff? If so, when and why?


Re: [net-next 3/3] net: mscc: ocelot: support PTP Sync one-step timestamping

2021-04-20 Thread Vladimir Oltean
On Tue, Apr 20, 2021 at 07:33:39AM +0000, Y.b. Lu wrote:
> > > + /* For two-step timestamp, retrieve ptp_cmd in DSA_SKB_CB_PRIV
> > > +  * and timestamp ID in clone->cb[0].
> > > +  * For one-step timestamp, retrieve ptp_cmd in DSA_SKB_CB_PRIV.
> > > +  */
> > > + u8 *ptp_cmd = DSA_SKB_CB_PRIV(skb);
> >
> > This is fine in the sense that it works, but please consider creating
> > something similar to sja1105:
> >
> > struct ocelot_skb_cb {
> > 	u8 ptp_cmd; /* For both one-step and two-step timestamping */
> > 	u8 ts_id; /* Only for two-step timestamping */
> > };
> >
> > #define OCELOT_SKB_CB(skb) \
> > 	((struct ocelot_skb_cb *)DSA_SKB_CB_PRIV(skb))
> >
> > And then access as OCELOT_SKB_CB(skb)->ptp_cmd,
> > OCELOT_SKB_CB(clone)->ts_id.
> >
> > and put a comment to explain that this is done in order to have common
> > code between Felix DSA and Ocelot switchdev. Basically Ocelot will not
> > use the first 8 bytes of skb->cb, but there's enough space for this to
> > not make any difference. The original skb will hold only ptp_cmd, the
> > clone will only hold ts_id, but it helps to have the same structure in
> > place.
> >
> > If you create this ocelot_skb_cb structure, I expect the comment above
> > to be fairly redundant, you can consider removing it.
> >
>
> You're right to define the structure.
> Considering patch #1, we move skb cloning to the drivers, and populate
> DSA_SKB_CB(skb)->clone if needed (per your suggestion).
> Can we totally drop dsa_skb_cb in the dsa core? Its only usage is holding
> a skb clone pointer, and only for felix and sja1105.
> Actually we can move such a pointer into _skb_cb, instead of reserving
> skb space for every driver.
>
> What do you think?

The trouble with skb->cb is that it isn't zero-initialized. But somebody
needs to initialize the clone pointer to NULL, otherwise you don't know
if this is a valid pointer or not. Because dsa_skb_tx_timestamp() is
called before p->xmit(), the driver has no way to initialize the clone
pointer by itself. So this was done directly from dsa_slave_xmit(), and
not from any driver-specific hook. So this is why there is a
DSA_SKB_CB(skb)->clone and not SJA1105_SKB_CB(skb)->clone. The
alternative would be to memset(skb->cb, 0, 48) which is a bit
sub-optimal because it initializes more than it needs. Alternatively, it
might be possible to introduce a new property in struct dsa_device_ops
which holds sizeof(struct sja1105_skb_cb), and the generic code will
only zero-initialize this number of bytes.
I don't know, if you can get it to work in a way that does not incur a
noticeable performance penalty, I'm okay with whatever you come up with.
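
As a rough userspace sketch of that last alternative: the generic code
zero-initializes only as many bytes of skb->cb as the tagger declares. All
names below (mock_skb, mock_device_ops, cb_priv_size, mock_skb_cb_init) are
hypothetical stand-ins, not existing kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_SKB_CB_SIZE 48 /* size of skb->cb, as in the memset() above */

/* Stand-in for struct sk_buff; only the control buffer matters here. */
struct mock_skb {
	unsigned char cb[MOCK_SKB_CB_SIZE];
};

/* Hypothetical driver-private layout, standing in for sja1105_skb_cb. */
struct mock_drv_skb_cb {
	struct mock_skb *clone;
	unsigned char flags;
};

/* Hypothetical stand-in for struct dsa_device_ops with the new property. */
struct mock_device_ops {
	size_t cb_priv_size;
};

/* Generic xmit path: clear only what the driver says it will use, so that
 * a clone pointer stored there starts out as NULL.
 */
static void mock_skb_cb_init(struct mock_skb *skb,
			     const struct mock_device_ops *ops)
{
	assert(ops->cb_priv_size <= sizeof(skb->cb));
	memset(skb->cb, 0, ops->cb_priv_size);
}
```

The cost per packet is then a memset() of a handful of bytes rather than of
all 48, and drivers that declare a size of 0 pay nothing.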


Re: [net-next] net: dsa: felix: disable always guard band bit for TAS config

2021-04-19 Thread Vladimir Oltean
Hi Xiaoliang,

On Mon, Apr 19, 2021 at 06:25:30PM +0800, Xiaoliang Yang wrote:
> ALWAYS_GUARD_BAND_SCH_Q bit in TAS config register is descripted as
> this:
>   0: Guard band is implemented for nonschedule queues to schedule
>  queues transition.
>   1: Guard band is implemented for any queue to schedule queue
>  transition.
>
> The driver set guard band be implemented for any queue to schedule queue
> transition before, which will make each GCL time slot reserve a guard
> band time that can pass the max SDU frame. Because guard band time could
> not be set in tc-taprio now, it will use about 12000ns to pass 1500B max
> SDU. This limits each GCL time interval to be more than 12000ns.
>
> This patch change the guard band to be only implemented for nonschedule
> queues to schedule queues transition, so that there is no need to reserve
> guard band on each GCL. Users can manually add guard band time for each
> schedule queues in their configuration if they want.
>
> Signed-off-by: Xiaoliang Yang 
> ---

What is a scheduled queue? When time-aware scheduling is enabled on the
port, why are some queues scheduled and some not?


Re: [net-next 3/3] net: mscc: ocelot: support PTP Sync one-step timestamping

2021-04-18 Thread Vladimir Oltean
On Fri, Apr 16, 2021 at 08:36:55PM +0800, Yangbo Lu wrote:
> Although HWTSTAMP_TX_ONESTEP_SYNC existed in ioctl for hardware timestamp
> configuration, the PTP Sync one-step timestamping had never been supported.
> 
> This patch is to truly support it.

Actually the ocelot switchdev driver does support one-step timestamping,
just the felix DSA driver does not.

> The hardware timestamp request type is
> stored in DSA_SKB_CB_PRIV first byte per skb, so that corresponding
> configuration could be done during transmitting. Non-onestep-Sync packet
> with one-step timestamp request should fall back to use two-step timestamp.
> 
> Signed-off-by: Yangbo Lu 
> ---
>  drivers/net/ethernet/mscc/ocelot.c | 57 ++
>  drivers/net/ethernet/mscc/ocelot_net.c |  5 +--
>  include/soc/mscc/ocelot.h  |  1 +
>  net/dsa/tag_ocelot.c   | 25 ++-
>  net/dsa/tag_ocelot_8021q.c | 39 +-
>  5 files changed, 72 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
> index 541d3b4076be..69d36b6241ff 100644
> --- a/drivers/net/ethernet/mscc/ocelot.c
> +++ b/drivers/net/ethernet/mscc/ocelot.c
> @@ -6,6 +6,7 @@
>   */
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ocelot.h"
>  #include "ocelot_vcap.h"
> @@ -546,6 +547,50 @@ static void ocelot_port_add_txtstamp_skb(struct ocelot *ocelot, int port,
>   spin_unlock(&ocelot_port->ts_id_lock);
>  }
>  
> +bool ocelot_ptp_rew_op(struct sk_buff *skb, struct sk_buff *clone, u32 *rew_op)
> +{
> + /* For two-step timestamp, retrieve ptp_cmd in DSA_SKB_CB_PRIV
> +  * and timestamp ID in clone->cb[0].
> +  * For one-step timestamp, retrieve ptp_cmd in DSA_SKB_CB_PRIV.
> +  */
> + u8 *ptp_cmd = DSA_SKB_CB_PRIV(skb);

This is fine in the sense that it works, but please consider creating
something similar to sja1105:

struct ocelot_skb_cb {
u8 ptp_cmd; /* For both one-step and two-step timestamping */
u8 ts_id; /* Only for two-step timestamping */
};

#define OCELOT_SKB_CB(skb) \
((struct ocelot_skb_cb *)DSA_SKB_CB_PRIV(skb))

And then access as OCELOT_SKB_CB(skb)->ptp_cmd, OCELOT_SKB_CB(clone)->ts_id.

and put a comment to explain that this is done in order to have common
code between Felix DSA and Ocelot switchdev. Basically Ocelot will not
use the first 8 bytes of skb->cb, but there's enough space for this to
not make any difference. The original skb will hold only ptp_cmd, the
clone will only hold ts_id, but it helps to have the same structure in
place.

If you create this ocelot_skb_cb structure, I expect the comment above
to be fairly redundant, you can consider removing it.

> +
> + if (clone) {
> + *rew_op = *ptp_cmd;
> + *rew_op |= clone->cb[0] << 3;
> + } else if (*ptp_cmd) {
> + *rew_op = *ptp_cmd;
> + } else {
> + return false;
> + }
> +
> + return true;

Just make this function return a u32. If the packet isn't PTP, the
rew_op will be 0.
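
Put together, the two suggestions could look roughly like the userspace
sketch below. The mock_ names and the 8-byte reserved area are stand-ins for
struct sk_buff and DSA_SKB_CB_PRIV(), assumed here for illustration. Since
skb->cb is not zero-initialized, the sketch also assumes that the timestamp
request path has written ptp_cmd (0 for non-PTP packets) before this runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_DSA_CB_RESERVED 8 /* bytes of cb assumed reserved by DSA */

/* Stand-in for struct sk_buff; only the control buffer matters here. */
struct mock_skb {
	unsigned char cb[48];
};

#define MOCK_DSA_SKB_CB_PRIV(skb) \
	((void *)((skb)->cb + MOCK_DSA_CB_RESERVED))

struct ocelot_skb_cb {
	uint8_t ptp_cmd; /* For both one-step and two-step timestamping */
	uint8_t ts_id;   /* Only for two-step timestamping */
};

#define OCELOT_SKB_CB(skb) \
	((struct ocelot_skb_cb *)MOCK_DSA_SKB_CB_PRIV(skb))

/* Returns the rewriter operation for this packet, or 0 if it isn't PTP. */
static uint32_t mock_ocelot_ptp_rew_op(struct mock_skb *skb,
				       struct mock_skb *clone)
{
	uint32_t rew_op;

	assert(skb != NULL);

	rew_op = OCELOT_SKB_CB(skb)->ptp_cmd;
	/* Two-step: the timestamp ID lives in the clone, as in the patch */
	if (clone)
		rew_op |= (uint32_t)OCELOT_SKB_CB(clone)->ts_id << 3;

	return rew_op;
}
```

Callers can then test the return value directly instead of passing an output
pointer and checking a bool.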

> +}
> +EXPORT_SYMBOL(ocelot_ptp_rew_op);
> +
> +static bool ocelot_ptp_is_onestep_sync(struct sk_buff *skb)
> +{
> + struct ptp_header *hdr;
> + unsigned int ptp_class;
> + u8 msgtype, twostep;
> +
> + ptp_class = ptp_classify_raw(skb);
> + if (ptp_class == PTP_CLASS_NONE)
> + return false;
> +
> + hdr = ptp_parse_header(skb, ptp_class);
> + if (!hdr)
> + return false;
> +
> + msgtype = ptp_get_msgtype(hdr, ptp_class);
> + twostep = hdr->flag_field[0] & 0x2;
> +
> + if (msgtype == PTP_MSGTYPE_SYNC && twostep == 0)
> + return true;
> +
> + return false;
> +}
> +

This is generic, but if you were to move it to net/core/ptp_classifier.c,
I think you would have to pass the output of ptp_classify_raw() as an
"unsigned int type" argument. So I think I would leave it the way it is
for now - inside of ocelot - until somebody else needs something
similar, and we see what is the required prototype.
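
For reference, the test performed by the quoted function can be illustrated
on a raw PTPv2 header without any kernel helpers. This is a simplified
sketch with hypothetical names: it assumes the caller has already located
the PTP header (the job of ptp_classify_raw() and ptp_parse_header() in the
patch) and relies on the standard header layout, with messageType in the low
nibble of byte 0 and flagField starting at byte 6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PTP_HDR_LEN	34   /* PTPv2 common header length */
#define MOCK_PTP_MSGTYPE_SYNC	0x0
#define MOCK_PTP_FLAG_TWOSTEP	0x02 /* twoStepFlag, in flagField[0] */

/* True if hdr points to a Sync message with the twoStepFlag cleared. */
static bool mock_ptp_is_onestep_sync(const uint8_t *hdr, size_t len)
{
	uint8_t msgtype, twostep;

	assert(hdr != NULL);

	if (len < MOCK_PTP_HDR_LEN)
		return false;

	msgtype = hdr[0] & 0x0f;
	twostep = hdr[6] & MOCK_PTP_FLAG_TWOSTEP;

	return msgtype == MOCK_PTP_MSGTYPE_SYNC && !twostep;
}
```

Any other event message, or a Sync with twoStepFlag set, falls back to
two-step timestamping in the patch.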

>  int ocelot_port_txtstamp_request(struct ocelot *ocelot, int port,
>struct sk_buff *skb,
>struct sk_buff **clone)
> @@ -553,12 +598,24 @@ int ocelot_port_txtstamp_request(struct ocelot *ocelot, int port,
>   struct ocelot_port *ocelot_port = ocelot->ports[port];
>   u8 ptp_cmd = ocelot_port->ptp_cmd;
>  
> + /* Store ptp_cmd in first byte of DSA_SKB_CB_PRIV per skb */
> + if (ptp_cmd == IFH_REW_OP_ORIGIN_PTP) {
> + if (ocelot_ptp_is_onestep_sync(skb)) {
> + *(u8 *)DSA_SKB_CB_PRIV(skb) = ptp_cmd;
> + return 0;
> + }
> +
> + /* Fall back to two-step timestamping */
> + ptp_cmd = IFH_REW_OP_TWO_STEP_PTP;
> + }
> +
>   if (ptp_cmd == IFH_REW_OP_TWO_STEP_PTP) {
>   *clone = skb_clone_sk(skb);
>   if 

Re: [net-next 2/3] net: mscc: ocelot: convert to ocelot_port_txtstamp_request()

2021-04-18 Thread Vladimir Oltean
On Fri, Apr 16, 2021 at 08:36:54PM +0800, Yangbo Lu wrote:
> Convert to a common ocelot_port_txtstamp_request() for TX timestamp
> request handling.
> 
> Signed-off-by: Yangbo Lu 
> ---
>  drivers/net/dsa/ocelot/felix.c | 14 +-
>  drivers/net/ethernet/mscc/ocelot.c | 24 +---
>  drivers/net/ethernet/mscc/ocelot_net.c | 18 +++---
>  include/soc/mscc/ocelot.h  |  5 +++--
>  4 files changed, 36 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
> index cdec2f5e271c..5f2cf0f31253 100644
> --- a/drivers/net/dsa/ocelot/felix.c
> +++ b/drivers/net/dsa/ocelot/felix.c
> @@ -1399,18 +1399,14 @@ static bool felix_txtstamp(struct dsa_switch *ds, int port,
>  struct sk_buff *skb, struct sk_buff **clone)
>  {
>   struct ocelot *ocelot = ds->priv;
> - struct ocelot_port *ocelot_port = ocelot->ports[port];
>  
> - if (ocelot->ptp && ocelot_port->ptp_cmd == IFH_REW_OP_TWO_STEP_PTP) {
> - *clone = skb_clone_sk(skb);
> - if (!(*clone))
> - return false;
> + if (!ocelot->ptp)
> + return false;
>  
> - ocelot_port_add_txtstamp_skb(ocelot, port, *clone);
> - return true;
> - }
> + if (ocelot_port_txtstamp_request(ocelot, port, skb, clone))
> + return false;
>  
> - return false;
> + return true;

Considering the changes you'll have to make to patch 1 (changing the
return value and populating DSA_SKB_CB(skb)->clone at the end of this
function):

Reviewed-by: Vladimir Oltean 


Re: [net-next 1/3] net: dsa: optimize tx timestamp request handling

2021-04-18 Thread Vladimir Oltean
On Fri, Apr 16, 2021 at 08:36:53PM +0800, Yangbo Lu wrote:
> Optimization could be done on dsa_skb_tx_timestamp(), and dsa device
> drivers should adapt to it.
> 
> - Check SKBTX_HW_TSTAMP request flag at the very beginning, instead of in
>   port_txtstamp, so that most skbs not requiring tx timestamp just return.

Agree that this is a trivial performance optimization with no downside
that we should be making.

> - No longer to identify PTP packets, and limit tx timestamping only for PTP
>   packets. If device driver likes, let device driver do.

Agree that DSA has a way too heavy hand in imposing upon the driver
which packets should be timestampable and which ones shouldn't.

For example, I have a latency measurement tool called isochron which is
based on hardware timestamping of non-PTP packets (in order to not
disturb the PTP state machines):
https://github.com/vladimiroltean/tsn-scripts

I can't use it on DSA interfaces, for rather artificial reasons.

> - It is a waste to clone skb directly in dsa_skb_tx_timestamp().
>   For one-step timestamping, a clone is not needed. For any failure of
>   port_txtstamp (this may usually happen), the skb clone has to be freed.
>   So put skb cloning into port_txtstamp where it really needs.

Mostly agree. For two-step timestamping, it is an operation which all
drivers need to do, so it is in the common portion. If we want to support
one-step, we need to avoid cloning the PTP packets.

> Signed-off-by: Yangbo Lu 
> ---
>  Documentation/networking/timestamping.rst |  7 +--
>  .../net/dsa/hirschmann/hellcreek_hwtstamp.c   | 20 --
>  .../net/dsa/hirschmann/hellcreek_hwtstamp.h   |  2 +-
>  drivers/net/dsa/mv88e6xxx/hwtstamp.c  | 21 +--
>  drivers/net/dsa/mv88e6xxx/hwtstamp.h  |  6 +++---
>  drivers/net/dsa/ocelot/felix.c| 11 ++
>  drivers/net/dsa/sja1105/sja1105_ptp.c |  6 +-
>  drivers/net/dsa/sja1105/sja1105_ptp.h |  2 +-
>  include/net/dsa.h |  2 +-
>  net/dsa/slave.c   | 20 +-
>  10 files changed, 57 insertions(+), 40 deletions(-)
> 
> diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> index f682e88fa87e..7f04a699a5d1 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -635,8 +635,8 @@ in generic code: a BPF classifier (``ptp_classify_raw``) is used to identify
>  PTP event messages (any other packets, including PTP general messages, are not
>  timestamped), and provides two hooks to drivers:
>  
> -- ``.port_txtstamp()``: The driver is passed a clone of the timestampable skb
> -  to be transmitted, before actually transmitting it. Typically, a switch will
> +- ``.port_txtstamp()``: A clone of the timestampable skb to be transmitted
> +  is needed, before actually transmitting it. Typically, a switch will
>have a PTP TX timestamp register (or sometimes a FIFO) where the timestamp
>becomes available. There may be an IRQ that is raised upon this timestamp's
>availability, or the driver might have to poll after invoking
> @@ -645,6 +645,9 @@ timestamped), and provides two hooks to drivers:
>later use (when the timestamp becomes available). Each skb is annotated with
>a pointer to its clone, in ``DSA_SKB_CB(skb)->clone``, to ease the driver's
>job of keeping track of which clone belongs to which skb.
> +  But one-step timestamping request is handled differently with above two-step
> +  timestamping. The skb clone is no longer needed since hardware will insert
> +  TX time information on packet during egress.

Bonus points for updating the documentation, but I don't quite like the
end result. Please feel free to restructure more, in order to have a
clearer and more coherent explanation.

Also, this paragraph from right above is no longer true:

In code, DSA provides for most of the infrastructure for timestamping already,
in generic code: a BPF classifier (``ptp_classify_raw``) is used to identify
PTP event messages (any other packets, including PTP general messages, are not
timestamped), and provides two hooks to drivers:

It's nothing like that anymore. It's more of a passthrough now with your
changes, the BPF classifier is not run by the DSA core but optionally by
individual taggers.

Here is my attempt at rewriting this documentation paragraph, feel free
to take which parts you consider relevant:

-[cut here]-

In the generic layer, DSA provides the following infrastructure for PTP
timestamping:

- ``.port_txtstamp()``: a hook called prior to the transmission of
  packets with a hardware TX timestamping request from user space.
  This is required for two-step timestamping, since the hardware
  timestamp becomes available after the actual MAC transmission, so 

Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-14 Thread Vladimir Oltean
On Wed, Apr 14, 2021 at 08:39:53PM +0200, Tobias Waldekranz wrote:
> In order to have two entries for the same destination, they must belong
> to different FIDs. But that FID is also used for automatic learning. So
> if all ports use their own FID, all the switched traffic will have to be
> flooded instead, since any address learned on lan0 will be invisible to
> lan1,2,3 and vice versa.

Can you explain a bit more what you mean when you say that the FID
for the CPU port is also used for automatic learning? Since when does
mv88e6xxx learn frames sent by tag_dsa.c?

The way Ocelot switches work, and this is also the mechanism that I plan
to build on top of, is explained in include/soc/mscc/ocelot.h (copied
here for your convenience):

/* Port Group IDs (PGID) are masks of destination ports.
 *
 * For L2 forwarding, the switch performs 3 lookups in the PGID table for each
 * frame, and forwards the frame to the ports that are present in the logical
 * AND of all 3 PGIDs.
 *
 * These PGID lookups are:
 * - In one of PGID[0-63]: for the destination masks. There are 2 paths by
 *   which the switch selects a destination PGID:
 * - The {DMAC, VID} is present in the MAC table. In that case, the
 *   destination PGID is given by the DEST_IDX field of the MAC table entry
 *   that matched.
 * - The {DMAC, VID} is not present in the MAC table (it is unknown). The
 *   frame is disseminated as being either unicast, multicast or broadcast,
 *   and according to that, the destination PGID is chosen as being the
 *   value contained by ANA_FLOODING_FLD_UNICAST,
 *   ANA_FLOODING_FLD_MULTICAST or ANA_FLOODING_FLD_BROADCAST.
 *   The destination PGID can be an unicast set: the first PGIDs, 0 to
 *   ocelot->num_phys_ports - 1, or a multicast set: the PGIDs from
 *   ocelot->num_phys_ports to 63. By convention, a unicast PGID corresponds to
 *   a physical port and has a single bit set in the destination ports mask:
 *   that corresponding to the port number itself. In contrast, a multicast
 *   PGID will have potentially more than one single bit set in the destination
 *   ports mask.
 * - In one of PGID[64-79]: for the aggregation mask. The switch classifier
 *   dissects each frame and generates a 4-bit Link Aggregation Code which is
 *   used for this second PGID table lookup. The goal of link aggregation is to
 *   hash multiple flows within the same LAG on to different destination ports.
 *   The first lookup will result in a PGID with all the LAG members present in
 *   the destination ports mask, and the second lookup, by Link Aggregation
 *   Code, will ensure that each flow gets forwarded only to a single port out
 *   of that mask (there are no duplicates).
 * - In one of PGID[80-90]: for the source mask. The third time, the PGID table
 *   is indexed with the ingress port (plus 80). These PGIDs answer the
 *   question "is port i allowed to forward traffic to port j?" If yes, then
 *   BIT(j) of PGID 80+i will be found set. The third PGID lookup can be used
 *   to enforce the L2 forwarding matrix imposed by e.g. a Linux bridge.
 */

/* Reserve some destination PGIDs at the end of the range:
 * PGID_BLACKHOLE: used for not forwarding the frames
 * PGID_CPU: used for whitelisting certain MAC addresses, such as the addresses
 *   of the switch port net devices, towards the CPU port module.
 * PGID_UC: the flooding destinations for unknown unicast traffic.
 * PGID_MC: the flooding destinations for non-IP multicast traffic.
 * PGID_MCIPV4: the flooding destinations for IPv4 multicast traffic.
 * PGID_MCIPV6: the flooding destinations for IPv6 multicast traffic.
 * PGID_BC: the flooding destinations for broadcast traffic.
 */

Basically the frame is forwarded towards:

PGID_DST[MAC table -> destination] & PGID_AGGR[aggregation code] & PGID_SRC[source port]

This is also how we set up LAGs in ocelot_set_aggr_pgids: as far as
PGID_DST is concerned, all traffic towards a LAG is 'sort of multicast'
(even for unicast traffic), in the sense that the destination port mask
is all ones for the physical ports in that LAG. We then reduce the
destination port mask through PGID_AGGR, in the sense that every
aggregation code (of which there can be 16) has a single bit set,
corresponding to either one of the physical ports in the LAG. So every
packet does indeed see no more than one destination port in the end.
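
The three-way AND and the LAG setup described above can be modelled with
plain bitmasks. The sketch below is a simplified userspace model with
hypothetical names, not the code of ocelot_set_aggr_pgids(): for every
aggregation code it keeps all non-LAG ports and exactly one LAG member.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_BIT(n)		((uint32_t)1 << (n))
#define MOCK_NUM_AGGR_CODES	16

/* Final destination ports: logical AND of the three PGID lookups. */
static uint32_t mock_forward_mask(uint32_t pgid_dst, uint32_t pgid_aggr,
				  uint32_t pgid_src)
{
	return pgid_dst & pgid_aggr & pgid_src;
}

/* Fill the aggregation masks for a single LAG: each aggregation code
 * selects one LAG member, spread round-robin over the codes.
 */
static void mock_set_aggr_pgids(uint32_t aggr[MOCK_NUM_AGGR_CODES],
				uint32_t all_ports,
				const int *lag_ports, int num_lag_ports)
{
	uint32_t lag_mask = 0;
	int i;

	assert(num_lag_ports > 0);

	for (i = 0; i < num_lag_ports; i++)
		lag_mask |= MOCK_BIT(lag_ports[i]);

	for (i = 0; i < MOCK_NUM_AGGR_CODES; i++)
		aggr[i] = (all_ports & ~lag_mask) |
			  MOCK_BIT(lag_ports[i % num_lag_ports]);
}
```

For a LAG made of ports 2 and 3, a MAC table entry pointing at both ports is
narrowed down to port 2 for even aggregation codes and to port 3 for odd
ones, while traffic towards non-LAG ports is unaffected.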

For multiple CPU ports with static assignment to user ports, it would be
sufficient, given the Ocelot architecture, to install a single 'multicast'
entry per address in the MAC table, with a DEST_IDX having two bits set,
one for each CPU port. Then, we would let the third lookup (PGID_SRC,
equivalent to the Marvell's port VLANs, AFAIU) enforce the bounding box
for every packet such that it goes to one CPU port or to another.
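
A minimal model of that idea, again with hypothetical names and plain
bitmasks: the host address entry points at both CPU ports, and the source
mask of each user port keeps only the CPU port it is statically assigned to.
The aggregation mask is assumed to be all ones and is left out, and all user
ports are assumed to be in the same bridge.

```c
#include <stdint.h>

#define MOCK_BIT(n)	((uint32_t)1 << (n))

/* Source mask (third lookup) for one user port: it may forward to the
 * other user ports and to its assigned CPU port, but not to itself and
 * not to any other CPU port.
 */
static uint32_t mock_pgid_src(uint32_t user_ports, int ingress_port,
			      int assigned_cpu_port)
{
	return (user_ports & ~MOCK_BIT(ingress_port)) |
	       MOCK_BIT(assigned_cpu_port);
}

/* Ports that finally receive a frame addressed to a host MAC address
 * whose destination PGID has one bit set per CPU port.
 */
static uint32_t mock_host_forward(uint32_t cpu_ports, uint32_t pgid_src)
{
	return cpu_ports & pgid_src;
}
```

With user ports 0-3 and CPU ports 4 and 5, a frame for the host received on
a port assigned to CPU port 4 reaches only port 4, and likewise for port 5,
using a single MAC table entry for both.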

This, however, has implications upon the DSA API. In my current attempts
for the 'RX filtering in DSA' series, host addresses are reference-counted
by DSA, 

Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Tue, Apr 13, 2021 at 12:26:52AM +0200, Tobias Waldekranz wrote:
> On Tue, Apr 13, 2021 at 01:06, Vladimir Oltean  wrote:
> > On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:
> >> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean  wrote:
> >> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> >> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun  wrote:
> >> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> >> > Tobias Waldekranz  wrote:
> >> >> >
> >> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> >> typically do a great job with balancing. This will happen without the
> >> >> >> user having to do any configuration at all. It would also perform well
> >> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> >> the same.
> >> >> >
> >> >> > TLDR: The problem with LAGs how they are currently implemented is that
> >> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> >> > go via one CPU port anyway.
> >> >> >
> >> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> >> > packet: only the src and dst MAC address is used for the hash that
> >> >> > chooses the port.
> >> >> >
> >> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> >> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> >> > fill the LAG table as it is done now, then there is a 2 * 0.5^5 = 1/16
> >> >> > chance that all packets would go through one CPU port.
> >> >> >
> >> >> > In order to have real load balancing in this scenario, we would either
> >> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> >> > switches of the mv88e6xxx family, but not for Omnia.)
> >> >> 
> >> >> I thought that the option to associate each port netdev with a DSA
> >> >> master would only be used on transmit. Are you saying that there is a
> >> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> >> ports depending on the incoming port?
> >> >> 
> >> >> The reason that the traffic is directed towards the CPU is that some
> >> >> kind of entry in the ATU says so, and the destination of that entry will
> >> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> >> any kind of balancing. What am I missing?
> >> >> 
> >> >> Transmit is easy; you are already in the CPU, so you can use an
> >> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> >> >> in that case.
> >> >
> >> > Say a user port receives a broadcast frame. Based on your understanding
> >> > where user-to-CPU port assignments are used only for TX, which CPU port
> >> > should be selected by the switch for this broadcast packet, and by which
> >> > mechanism?
> >> 
> >> AFAIK, the only option available to you (again, if there is no LAG set
> >> up) is to statically choose one CPU port per entry. But hopefully Marek
> >> can teach me some new tricks!
> >> 
> >> So for any known (since the broadcast address is loaded in the ATU it is
> >> known) destination (b/m/u-cast), you can only "load balance" based on
> >> the DA. You would also have to make sure that unknown unicast and
> >> unknown multicast is only allowed to egress one of the CPU ports.
> >> 
> >> If you have a LAG OTOH, you could include all CPU ports in the port
> >> vectors of 

Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Tue, Apr 13, 2021 at 12:04:57AM +0200, Marek Behun wrote:
> On Mon, 12 Apr 2021 19:32:11 +0300
> Vladimir Oltean  wrote:
>
> > On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:
> > > On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> > > >
> > > > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > > > the ones that are are assigned statically to the same CPU port. It's a
> > > > compromise between flexibility and simplicity, and I would go for
> > > > simplicity here. That's the most you can achieve with static assignment,
> > > > just put the CPU ports in a LAG if you want better dynamic load balancing
> > > > (for details read on below).
> > > >
> > >
> > > Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> > > which make it not ideal in router application (Router WAN <--> ISP BRAS
> > > traffic will always have the same DA/SA and thus use only one port).
> >
> > Is this supposed to make a difference? Choose a better switch vendor!
>
> :-) Are you saying that we shall abandon trying to make the DSA
> subsystem work with better performance for our routers, in order to
> punish ourselves for our bad decision to use Marvell switches?

No, not at all, I just don't understand what is the point you and
Qingfang are trying to make. LAG is useful in general for load balancing.
With the particular case of point-to-point links with Marvell Linkstreet,
not so much. Okay. With a different workload, maybe it is useful with
Marvell Linkstreet too. Again okay. Same for static assignment,
sometimes it is what is needed and sometimes it just isn't.
It was proposed that you write up a user space program that picks the
CPU port assignment based on your favorite metric and just tells DSA to
reconfigure itself, either using a custom fancy static assignment based
on traffic rate (read MIB counters every minute) or simply based on LAG.
All the data laid out so far would indicate that this would give you the
flexibility you need, however you didn't leave any comment on that,
either acknowledging or explaining why it wouldn't be what you want.
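
A minimal sketch of the policy half of such a user space program, assuming
the per-port rates have already been sampled (for example by reading
/sys/class/net/<iface>/statistics/rx_bytes a minute apart). The function
name, the port names and the greedy heuristic are illustrative assumptions,
not something proposed in this thread, and how the result would be applied
depends on whatever interface the series ends up exposing.

```python
def assign_cpu_ports(rates_bps, cpu_ports):
    """Greedily map each user port to the currently least-loaded CPU port.

    rates_bps: dict of user port name -> measured rate in bits per second.
    cpu_ports: list of CPU port (DSA master) names.
    Names and heuristic are hypothetical, for illustration only.
    """
    load = {cpu: 0 for cpu in cpu_ports}
    assignment = {}
    # Place the busiest user ports first so that they end up spread out.
    for port in sorted(rates_bps, key=lambda p: (-rates_bps[p], p)):
        target = min(cpu_ports, key=lambda cpu: (load[cpu], cpu))
        assignment[port] = target
        load[target] += rates_bps[port]
    return assignment, load


rates = {"lan0": 900, "lan1": 0, "lan2": 800, "lan3": 0, "lan4": 0}
assignment, load = assign_cpu_ports(rates, ["eth0", "eth1"])
print(assignment)
print(load)
```

With the sample rates above, the two busy ports end up on different CPU
ports, which is the outcome the static round-robin mapping cannot guarantee.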


Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:
> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean  wrote:
> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun  wrote:
> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> > Tobias Waldekranz  wrote:
> >> >
> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> typically do a great job with balancing. This will happen without the
> >> >> user having to do any configuration at all. It would also perform well
> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> the same.
> >> >
> >> > TLDR: The problem with LAGs how they are currently implemented is that
> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> > go via one CPU port anyway.
> >> >
> >> >
> >> >
> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> > packet: only the src and dst MAC address is used for the hash that
> >> > chooses the port.
> >> >
> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> > plugs 5 simple devices (no switches, so only one peer MAC address for
> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> >> > chance that all packets would go through one CPU port.
> >> >
> >> > In order to have real load balancing in this scenario, we would either
> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> > switches of the mv88e6xxx family, but not for Omnia.)
> >> 
> >> I thought that the option to associate each port netdev with a DSA
> >> master would only be used on transmit. Are you saying that there is a
> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> ports depending on the incoming port?
> >> 
> >> The reason that the traffic is directed towards the CPU is that some
> >> kind of entry in the ATU says so, and the destination of that entry will
> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> any kind of balancing. What am I missing?
> >> 
> >> Transmit is easy; you are already in the CPU, so you can use an
> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> >> in that case.
> >
> > Say a user port receives a broadcast frame. Based on your understanding
> > where user-to-CPU port assignments are used only for TX, which CPU port
> > should be selected by the switch for this broadcast packet, and by which
> > mechanism?
> 
> AFAIK, the only option available to you (again, if there is no LAG set
> up) is to statically choose one CPU port per entry. But hopefully Marek
> can teach me some new tricks!
> 
> So for any known (since the broadcast address is loaded in the ATU it is
> known) destination (b/m/u-cast), you can only "load balance" based on
> the DA. You would also have to make sure that unknown unicast and
> unknown multicast is only allowed to egress one of the CPU ports.
> 
> If you have a LAG OTOH, you could include all CPU ports in the port
> vectors of those same entries. The LAG mask would then do the final
> filtering so that you only send a single copy to the CPU.

I forgot that mv88e6xxx keeps the broadcast address in the ATU. I wanted
to know what is done in the flooding case, therefore I should have asked
about unknown destination traffic. It is sent to one CPU but not the
other based on what information?

And for destinations loaded into the ATU, how is user port isolation
performed? Say lan0 and lan1 have the same MAC address of 00:01:02:03:04:05,
but lan0 goes to the eth0 DSA master and lan1 goes to eth1. How many ATU
entries would there be for host addresses, and towards which port would
they point? Are they isolated by a port private VLAN or something along
those lines?


Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> On Mon, Apr 12, 2021 at 21:30, Marek Behun  wrote:
> > On Mon, 12 Apr 2021 14:46:11 +0200
> > Tobias Waldekranz  wrote:
> >
> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> typically do a great job with balancing. This will happen without the
> >> user having to do any configuration at all. It would also perform well
> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> the same.
> >
> > TLDR: The problem with LAGs how they are currently implemented is that
> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> > go via one CPU port anyway.
> >
> >
> >
> > One potential problem that I see with using LAGs for aggregating CPU
> > ports on mv88e6xxx is how these switches determine the port for a
> > packet: only the src and dst MAC address is used for the hash that
> > chooses the port.
> >
> > The most common scenario for Turris Omnia, for example, where we have 2
> > CPU ports and 5 user ports, is that into these 5 user ports the user
> > plugs 5 simple devices (no switches, so only one peer MAC address for
> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> > chance that all packets would go through one CPU port.
> >
> > In order to have real load balancing in this scenario, we would either
> > have to recompute the LAG mask table depending on the MAC addresses, or
> > rewrite the LAG mask table somewhat randomly periodically. (This could
> > be in theory offloaded onto the Z80 internal CPU for some of the
> > switches of the mv88e6xxx family, but not for Omnia.)
> 
> I thought that the option to associate each port netdev with a DSA
> master would only be used on transmit. Are you saying that there is a
> way to configure an mv88e6xxx chip to steer packets to different CPU
> ports depending on the incoming port?
> 
> The reason that the traffic is directed towards the CPU is that some
> kind of entry in the ATU says so, and the destination of that entry will
> either be a port vector or a LAG. Of those two, only the LAG will offer
> any kind of balancing. What am I missing?
> 
> Transmit is easy; you are already in the CPU, so you can use an
> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> in that case.

Say a user port receives a broadcast frame. Based on your understanding
where user-to-CPU port assignments are used only for TX, which CPU port
should be selected by the switch for this broadcast packet, and by which
mechanism?
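
As an aside on the 1/16 figure in the quoted text above: it can be
reproduced by brute-force enumeration, under the assumption (implicit in
the quoted estimate) that each of the 5 src + dst MAC pairs hashes to one of
the 2 CPU ports independently and uniformly. The helper name is
hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_all_on_one_port(flows, cpu_ports):
    """Probability that every flow hashes to the same CPU port.

    Assumes each flow lands on each port independently and uniformly.
    """
    outcomes = list(product(range(cpu_ports), repeat=flows))
    same = sum(1 for outcome in outcomes if len(set(outcome)) == 1)
    return Fraction(same, len(outcomes))


print(prob_all_on_one_port(5, 2))  # 1/16
```

Of the 2^5 = 32 equally likely outcomes, exactly 2 put all five flows on a
single port, which gives 2 * 0.5^5 = 1/16.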


Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:
> On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> >
> > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > the ones that are are assigned statically to the same CPU port. It's a
> > compromise between flexibility and simplicity, and I would go for
> > simplicity here. That's the most you can achieve with static assignment,
> > just put the CPU ports in a LAG if you want better dynamic load balancing
> > (for details read on below).
> >
>
> Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> which make it not ideal in router application (Router WAN <--> ISP BRAS
> traffic will always have the same DA/SA and thus use only one port).

Is this supposed to make a difference? Choose a better switch vendor!


Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-12 Thread Vladimir Oltean
On Mon, Apr 12, 2021 at 02:46:11PM +0200, Tobias Waldekranz wrote:
> On Sun, Apr 11, 2021 at 21:50, Vladimir Oltean  wrote:
> > On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> >> On Sat, 10 Apr 2021 15:34:46 +0200
> >> Ansuel Smith  wrote:
> >> 
> >> > Hi,
> >> > this is a respin of the Marek series in hope that this time we can
> >> > finally make some progress with dsa supporting multi-cpu port.
> >> > 
> >> > This implementation is similar to the Marek series but with some tweaks.
> >> > This adds support for multiple-cpu port but leave the driver the
> >> > decision of the type of logic to use about assigning a CPU port to the
> >> > various port. The driver can also provide no preference and the CPU port
> >> > is decided using a round-robin way.
> >> 
> >> In the last couple of months I have been giving some thought to this
> >> problem, and came up with one important thing: if there are multiple
> >> upstream ports, it would make a lot of sense to dynamically reallocate
> >> them to each user port, based on which user port is actually used, and
> >> at what speed.
> >> 
> >> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> >> ports support at most 1 Gbps. Round-robin would assign:
> >>   CPU port 0 - Port 0
> >>   CPU port 1 - Port 1
> >>   CPU port 0 - Port 2
> >>   CPU port 1 - Port 3
> >>   CPU port 0 - Port 4
> >> 
> >> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> >> with 1, 3 and 4 free:
> >>   CPU port 0 - Port 0 (plugged)
> >>   CPU port 1 - Port 1 (free)
> >>   CPU port 0 - Port 2 (plugged)
> >>   CPU port 1 - Port 3 (free)
> >>   CPU port 0 - Port 4 (free)
> >> 
> >> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> >> CPU, and the second CPU port is not used at all.
> >> 
> >> A mechanism for automatic reassignment of CPU ports would be ideal here.
> >> 
> >> What do you guys think?
> >
> > The reason why I don't think this is such a great idea is because the
> > CPU port assignment is a major reconfiguration step which should at the
> > very least be done while the network is down, to avoid races with the
> > data path (something which this series does not appear to handle).
> > And if you allow the static user-port-to-CPU-port assignment to change
> > every time a link goes up/down, I don't think you really want to force
> > the network down through the entire switch basically.
> >
> > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > the ones that are are assigned statically to the same CPU port. It's a
> > compromise between flexibility and simplicity, and I would go for
> > simplicity here. That's the most you can achieve with static assignment,
> > just put the CPU ports in a LAG if you want better dynamic load balancing
> > (for details read on below).
> 
> I agree. Unless you only have a few really wideband flows, a LAG will
> typically do a great job with balancing. This will happen without the
> user having to do any configuration at all. It would also perform well
> in "router-on-a-stick"-setups where the incoming and outgoing port is
> the same.
> 
> ...
> 
> > But there is something which is even more interesting about Felix with
> > the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
> > posted the follow-up, things have happened, and now both Felix and the
> > Marvell driver support LAG offload via the bonding and/or team drivers.
> > > At least for Felix, when using the ocelot-8021q tagger, it should be
> > possible to put the two CPU ports in a hardware LAG, and the two DSA
> > masters in a software LAG, and let the bond/team upper of the DSA
> > masters be the CPU port.
> >
> > I would like us to keep the door open for both alternatives, and to have
> > a way to switch between static user-to-CPU port assignment, and LAG.
> > I think that if there are multiple 'ethernet = ' phandles present in the
> > device tree, DSA should populate a list of valid DSA masters, and then
> > call into the driver to allow it to select which master it prefers for
> > > each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
> > except that I chose "DSA master" and not "CPU port" for a specific reason.
> > For LAG, the DSA master would be bond0.
> 
> I do not see why we wo

Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-11 Thread Vladimir Oltean
On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> > On Sat, 10 Apr 2021 15:34:46 +0200
> > Ansuel Smith  wrote:
> > 
> > > Hi,
> > > this is a respin of the Marek series in hope that this time we can
> > > finally make some progress with dsa supporting multi-cpu port.
> > > 
> > > This implementation is similar to the Marek series but with some tweaks.
> > > This adds support for multiple-cpu port but leave the driver the
> > > decision of the type of logic to use about assigning a CPU port to the
> > > various port. The driver can also provide no preference and the CPU port
> > > is decided using a round-robin way.
> > 
> > In the last couple of months I have been giving some thought to this
> > problem, and came up with one important thing: if there are multiple
> > upstream ports, it would make a lot of sense to dynamically reallocate
> > them to each user port, based on which user port is actually used, and
> > at what speed.
> > 
> > For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> > ports support at most 1 Gbps. Round-robin would assign:
> >   CPU port 0 - Port 0
> >   CPU port 1 - Port 1
> >   CPU port 0 - Port 2
> >   CPU port 1 - Port 3
> >   CPU port 0 - Port 4
> > 
> > Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> > with 1, 3 and 4 free:
> >   CPU port 0 - Port 0 (plugged)
> >   CPU port 1 - Port 1 (free)
> >   CPU port 0 - Port 2 (plugged)
> >   CPU port 1 - Port 3 (free)
> >   CPU port 0 - Port 4 (free)
> > 
> > We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> > CPU, and the second CPU port is not used at all.
> > 
> > A mechanism for automatic reassignment of CPU ports would be ideal here.
> > 
> > What do you guys think?
> 
> The reason why I don't think this is such a great idea is because the
> CPU port assignment is a major reconfiguration step which should at the
> very least be done while the network is down, to avoid races with the
> data path (something which this series does not appear to handle).
> And if you allow the static user-port-to-CPU-port assignment to change
> every time a link goes up/down, I don't think you really want to force
> the network down through the entire switch basically.
> 
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).

Just one more small comment, because I got so carried away with
describing what I already had in mind, that I forgot to completely
address your idea.

I think that DSA should provide the means to do what you want but not
the policy. Meaning that you can always write a user space program that
monitors the NETLINK_ROUTE rtnetlink through a socket and listens for
link state change events on it with poll(), then does whatever (like
moves the static user-to-CPU port mapping in the way that is adequate to
your network's requirements). The link up/down events are already
emitted, and the patch set here gives user space the rope to hang itself.

If you need inspiration, one user of the rtnetlink socket that I know of
is ptp4l:
https://github.com/richardcochran/linuxptp/blob/master/rtnl.c
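
A minimal sketch of such a listener, assuming Linux and the standard
rtnetlink message layout (struct nlmsghdr followed by struct ifinfomsg).
The function names and the callback are hypothetical; what the callback
does with a link event (for example, moving a user port to another CPU
port) is deliberately left out.

```python
import select
import socket
import struct

RTMGRP_LINK = 0x1   # multicast group carrying link state notifications
RTM_NEWLINK = 16
IFF_RUNNING = 0x40
NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # family, pad, type, index, flags, change


def parse_link_events(data):
    """Return (ifindex, running) for every RTM_NEWLINK message in data."""
    events = []
    offset = 0
    while offset + NLMSGHDR.size <= len(data):
        length, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(data, offset)
        if length < NLMSGHDR.size or offset + length > len(data):
            break  # truncated or malformed message
        if msg_type == RTM_NEWLINK and length >= NLMSGHDR.size + IFINFOMSG.size:
            _family, _type, index, flags, _change = IFINFOMSG.unpack_from(
                data, offset + NLMSGHDR.size)
            events.append((index, bool(flags & IFF_RUNNING)))
        offset += (length + 3) & ~3  # netlink messages are 4-byte aligned
    return events


def monitor(on_event):
    """Block forever, calling on_event(ifindex, running) on link changes."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, RTMGRP_LINK))
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    while True:
        for _fd, _mask in poller.poll():
            for index, running in parse_link_events(sock.recv(65536)):
                on_event(index, running)
```

Calling monitor(print) would simply log the interface index and running
state on every link change; the policy belongs in the callback.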


Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

2021-04-11 Thread Vladimir Oltean
On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> On Sat, 10 Apr 2021 15:34:46 +0200
> Ansuel Smith  wrote:
> 
> > Hi,
> > this is a respin of the Marek series in hope that this time we can
> > finally make some progress with dsa supporting multi-cpu port.
> > 
> > This implementation is similar to the Marek series but with some tweaks.
> > This adds support for multiple-cpu port but leave the driver the
> > decision of the type of logic to use about assigning a CPU port to the
> > various port. The driver can also provide no preference and the CPU port
> > is decided using a round-robin way.
> 
> In the last couple of months I have been giving some thought to this
> problem, and came up with one important thing: if there are multiple
> upstream ports, it would make a lot of sense to dynamically reallocate
> them to each user port, based on which user port is actually used, and
> at what speed.
> 
> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> ports support at most 1 Gbps. Round-robin would assign:
>   CPU port 0 - Port 0
>   CPU port 1 - Port 1
>   CPU port 0 - Port 2
>   CPU port 1 - Port 3
>   CPU port 0 - Port 4
> 
> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> with 1, 3 and 4 free:
>   CPU port 0 - Port 0 (plugged)
>   CPU port 1 - Port 1 (free)
>   CPU port 0 - Port 2 (plugged)
>   CPU port 1 - Port 3 (free)
>   CPU port 0 - Port 4 (free)
> 
> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> CPU, and the second CPU port is not used at all.
> 
> A mechanism for automatic reassignment of CPU ports would be ideal here.
> 
> What do you guys think?

The reason why I don't think this is such a great idea is because the
CPU port assignment is a major reconfiguration step which should at the
very least be done while the network is down, to avoid races with the
data path (something which this series does not appear to handle).
And if you allow the static user-port-to-CPU-port assignment to change
every time a link goes up/down, I don't think you really want to force
the network down through the entire switch basically.

So I'd be tempted to say 'tough luck' if all your ports are not up, and
the ones that are are assigned statically to the same CPU port. It's a
compromise between flexibility and simplicity, and I would go for
simplicity here. That's the most you can achieve with static assignment,
just put the CPU ports in a LAG if you want better dynamic load balancing
(for details read on below).
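
To make the trade-off concrete, here is a small sketch of the round-robin
mapping from the Turris Omnia example and of the load it produces when only
ports 0 and 2 have link. The names are hypothetical and the model only
counts ports with link; it says nothing about how the series implements
the assignment.

```python
def round_robin(user_ports, cpu_ports):
    """Statically assign user ports to CPU ports in round-robin order."""
    return {port: cpu_ports[i % len(cpu_ports)]
            for i, port in enumerate(user_ports)}


def active_load(assignment, link_up):
    """Count how many user ports with link share each CPU port."""
    load = {cpu: 0 for cpu in assignment.values()}
    for port, cpu in assignment.items():
        if port in link_up:
            load[cpu] += 1
    return load


mapping = round_robin([0, 1, 2, 3, 4], ["cpu0", "cpu1"])
print(mapping)
print(active_load(mapping, link_up={0, 2}))
```

Both plugged ports land on the first CPU port and the second one carries
nothing, which is the cost of keeping the assignment static.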

But this brings us to another topic, which I've been discussing with
Florian. I am also interested in the multi CPU ports topic for the
NXP LS1028A SoC, which uses the felix driver for its embedded switch.
I need to explain some of the complexities there, in order to lay out
what are the aspects which should ideally be supported.

The Ocelot switch family (which Felix is a part of) doesn't actually
support more than one "NPI" port as it's called (when the CPU port
module's queues are linked to an Ethernet port, which is what DSA calls
the "CPU port"). So you'd be tempted to say that a DSA setup with
multiple CPU ports is not realizable for this SoC.

But in fact, there are 2 Ethernet ports connecting the embedded switch
and the CPU, one port is at 2.5Gbps and the other is at 1Gbps. We can
dynamically choose which one is the NPI port through device tree
(arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi), and at the moment, we
choose the 2.5Gbps port as DSA CPU port, and we disable the 1Gbps
internal port. If we wanted to, we could enable the second internal port
as an internally-facing user port, but that's a bit awkward due to
multi-homing. Nonetheless, this is all that's achievable using the NPI
port functionality.

However, due to some unrelated issues, the Felix switch has ended up
supporting two tagging protocols in fact. So there is now an option
through which the user can run this command:

  echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

(where eno2 is the DSA master)
and the switch will disable the NPI port and set up some VLAN
pushing/popping rules through which DSA gets everything it needs to
offer absolutely the same services towards the upper network stack
layer, but without enabling the hardware functionality for a CPU port
(as far as the switch hardware is aware, it is unmanaged).

This opens up some possibilities because we no longer have the
limitation that there can be only 1 NPI port in the system. As you'd
have it, I believe that any DSA switch with a fully programmable "port
forwarding matrix" (aka a bitmap which answers the question "can port i
send packets to port j?") is capable to some degree of supporting
multiple DSA CPU ports, in the statically-assigned fashion that this
patch series attempts to achieve. Namely, you just configure the port
forwarding matrix to allow user port i to flood traffic to one CPU port
but not to the other, and you disable communication between the CPU
ports.
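
A minimal sketch of that configuration, modelling the forwarding matrix as
one bitmask per source port. The port numbering (user ports 0-4, CPU ports
5 and 6) and the choice to let a CPU port reach only the user ports assigned
to it are assumptions made for illustration; a real driver would program the
equivalent hardware registers.

```python
def build_forwarding_matrix(user_to_cpu, bridged=True):
    """Return {port: bitmask of ports it may send to}.

    user_to_cpu maps each user port number to its statically assigned
    CPU port number. Port numbering is hypothetical.
    """
    user_ports = set(user_to_cpu)
    cpu_ports = set(user_to_cpu.values())
    matrix = {}
    for port, cpu in user_to_cpu.items():
        mask = 1 << cpu  # only the assigned CPU port, never the other one
        if bridged:
            for other in user_ports - {port}:
                mask |= 1 << other
        matrix[port] = mask
    for cpu in cpu_ports:
        # A CPU port reaches the user ports assigned to it and no CPU port.
        matrix[cpu] = sum(1 << p for p, c in user_to_cpu.items() if c == cpu)
    return matrix


def can_send(matrix, src, dst):
    """Answer the question "can port src send packets to port dst?"."""
    return bool(matrix[src] & (1 << dst))


matrix = build_forwarding_matrix({0: 5, 1: 6, 2: 5, 3: 6, 4: 5})
print(can_send(matrix, 0, 5), can_send(matrix, 0, 6), can_send(matrix, 5, 6))
```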

But 

Re: [PATCH RFC iproute2-next] iplink: allow to change iplink value

2021-04-11 Thread Vladimir Oltean
On Sun, Apr 11, 2021 at 10:04:11AM -0700, Stephen Hemminger wrote:
> On Sat, 10 Apr 2021 15:34:50 +0200
> Ansuel Smith  wrote:
> 
> > Allow to change the interface to which a given interface is linked to.
> > This is useful in the case of multi-CPU port DSA, for changing the CPU
> > port of a given user port.
> > 
> > Signed-off-by: Marek Behún 
> > Cc: David Ahern 
> > Cc: Stephen Hemminger 
> 
> This may work for DSA but it won't work for all the device types vlan/macsec/... that
> now use the link attribute.  It looks like the change link handling for those
> device types just ignores the link attribute (maybe ok). But before supporting this
> as an API, it would be better if all the other drivers that use IFLA_LINK
> had error checks in their change link handling.
> 
> Please add error checks in kernel first.

Would it be better to expose this as a netlink attribute specific to
DSA, instead of iflink which as you point out has uses for other virtual
interfaces like veth, and the semantics there are not quite the same?


Re: [PATCH net v2 1/2] net: dsa: lantiq_gswip: Don't use PHY auto polling

2021-04-08 Thread Vladimir Oltean
On Thu, Apr 08, 2021 at 08:38:27PM +0200, Martin Blumenstingl wrote:
> PHY auto polling on the GSWIP hardware can be used so link changes
> (speed, link up/down, etc.) can be detected automatically. Internally
> GSWIP reads the PHY's registers for this functionality. Based on this
> automatic detection GSWIP can also automatically re-configure its port
> settings. Unfortunately this auto polling (and configuration) mechanism
> seems to cause various issues observed by different people on different
> devices:
> - FritzBox 7360v2: the two Gbit/s ports (connected to the two internal
>   PHY11G instances) are working fine but the two Fast Ethernet ports
>   (using an AR8030 RMII PHY) are completely dead (neither RX nor TX are
>   received). It turns out that the AR8030 PHY sets the BMSR_ESTATEN bit
>   as well as the ESTATUS_1000_TFULL and ESTATUS_1000_XFULL bits. This
>   makes the PHY auto polling state machine (rightfully?) think that the
>   established link speed (when the other side is Gbit/s capable) is
>   1Gbit/s.

Why do you say "rightfully"? The PHY is gigabit capable, and it reports
that via the Extended Status register. This is one of the reasons why
the "advertising" and "supported" link modes are separate concepts,
because even though you support gigabit, you don't advertise it because
you are in RMII mode.

How does turning off the auto polling feature help circumvent the
Atheros PHY reporting "issue"? Do we even know that is the problem, or
is it simply a guess on your part based on something that looked strange?

> - None of the Ethernet ports on the Zyxel P-2812HNU-F1 (two are
>   connected to the internal PHY11G GPHYs while the other three are
>   external RGMII PHYs) are working. Neither RX nor TX traffic was
>   observed. It is not clear which part of the PHY auto polling state-
>   machine caused this.

Great.

> - FritzBox 7412 (only one LAN port which is connected to one of the
>   internal GPHYs running in PHY22F / Fast Ethernet mode) was seeing
>   random disconnects (link down events could be seen). Sometimes all
>   traffic would stop after such disconnect. It is not clear which part
>   of the PHY auto polling state-machine caused this.
> - TP-Link TD-W9980 (two ports are connected to the internal GPHYs
>   running in PHY11G / Gbit/s mode, the other two are external RGMII
>   PHYs) was affected by similar issues as the FritzBox 7412 just without
>   the "link down" events
> 
> Switch to software based configuration instead of PHY auto polling (and
> letting the GSWIP hardware configure the ports automatically) for the
> following link parameters:
> - link up/down
> - link speed
> - full/half duplex
> - flow control (RX / TX pause)

What does the auto polling feature consist of, exactly? Is there some
sort of microcontroller accessing the MDIO bus simultaneously with
Linux?

> After a big round of manual testing by various people (who helped test
> this on OpenWrt) it turns out that this fixes all reported issues.
> 
> Additionally it can be considered more future proof because any
> "quirk" which is implemented for a PHY on the driver side can now be
> used with the GSWIP hardware as well because Linux is in control of the
> link parameters.
> 
> As a nice side-effect this also solves a problem where fixed-links were
> not supported previously because we were relying on the PHY auto polling
> mechanism, which cannot work for fixed-links as there's no PHY from
> where it can read the registers. Configuring the link settings on the
> GSWIP ports means that we now use the settings from device-tree also for
> ports with fixed-links.
> 
> Fixes: 14fceff4771e51 ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
> Fixes: 3e6fdeb28f4c33 ("net: dsa: lantiq_gswip: Let GSWIP automatically set the xMII clock")
> Cc: sta...@vger.kernel.org
> Acked-by: Hauke Mehrtens 
> Reviewed-by: Andrew Lunn 
> Signed-off-by: Martin Blumenstingl 
> ---


Re: [PATCH net-next v1 2/9] net: dsa: tag_ar9331: detect IGMP and MLD packets

2021-04-04 Thread Vladimir Oltean
On Sun, Apr 04, 2021 at 07:35:26AM +0200, Oleksij Rempel wrote:
> Am 04.04.21 um 02:02 schrieb Vladimir Oltean:
> > On Sat, Apr 03, 2021 at 07:14:56PM +0200, Oleksij Rempel wrote:
> >> Am 03.04.21 um 16:49 schrieb Andrew Lunn:
> >>>> @@ -31,6 +96,13 @@ static struct sk_buff *ar9331_tag_xmit(struct sk_buff *skb,
> >>>>  	__le16 *phdr;
> >>>>  	u16 hdr;
> >>>>
> >>>> +	if (dp->stp_state == BR_STATE_BLOCKING) {
> >>>> +		/* TODO: should we reflect it in the stats? */
> >>>> +		netdev_warn_once(dev, "%s:%i dropping blocking packet\n",
> >>>> +				 __func__, __LINE__);
> >>>> +		return NULL;
> >>>> +	}
> >>>> +
> >>>>  	phdr = skb_push(skb, AR9331_HDR_LEN);
> >>>>
> >>>>  	hdr = FIELD_PREP(AR9331_HDR_VERSION_MASK, AR9331_HDR_VERSION);
> >>>
> >>> Hi Oleksij
> >>>
> >>> This change does not seem to fit with what this patch is doing.
> >>
> >> done
> >>
> >>> I also think it is wrong. You still need BPDU to pass through a
> >>> blocked port, otherwise spanning tree protocol will be unstable.
> >>
> >> We need a better filter, otherwise, in case of software based STP, we are
> >> leaking packets on blocked ports. For example DHCP does trigger lots of
> >> spam in the kernel log.
> >
> > I have no idea whatsoever what 'software based STP' is, if you have
> > hardware-accelerated forwarding.
> 
> I do not mean hardware-accelerated forwarding, i mean
> hardware-accelerated STP port state helpers.

Still no clue what you mean, sorry.

> >> I'll drop STP patch for now, it will be better to make a generic soft STP
> >> for all switches without HW offloading. For example ksz9477 is doing SW
> >> based STP in similar way.
> >
> > How about we discuss first about what your switch is not doing properly?
> > Have you debugged more than just watching the bridge change port states?
> > As Andrew said, a port needs to accept and send link-local frames
> > regardless of the STP state. In the BLOCKING state it must send no other
> > frames and have address learning disabled. Is this what's happening, is
> > the switch forwarding frames towards a BLOCKING port?
> 
> The switch is not forwarding BPDU frames to the CPU port. So the Linux
> bridge gets stuck cycling through the different states of the port where
> the loop was detected.

The switch should not be 'forwarding' BPDU frames to the CPU port, it
should be 'trapping' them. The difference is subtle but important. Often
times switches have an Access Control List which allows them to steal
packets from the normal FDB-based forwarding path. It is probably the
case that your switch needs to be told to treat STP BPDUs specially and
not just 'forward' them.
To confirm whether I'm right or wrong, if you disable STP and send a
packet with MAC DA 01:80:c2:00:00:00 to the switch, will it flood it
towards all ports or will it only send them to the CPU?
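
A minimal userspace sketch of the trap-versus-forward distinction described
above, assuming a hypothetical switch that consults an ACL before the FDB.
The names trap_rule and egress_mask are made up for illustration and do not
come from any driver or from the ar9331 hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ACL entry: match (dmac & mask) == (mac & mask) */
struct trap_rule {
	uint8_t mac[6];
	uint8_t mask[6];
};

static bool trap_rule_match(const struct trap_rule *rule,
			    const uint8_t dmac[6])
{
	for (int i = 0; i < 6; i++) {
		if ((dmac[i] & rule->mask[i]) != (rule->mac[i] & rule->mask[i]))
			return false;
	}

	return true;
}

/* Returns the egress port mask for one received frame.
 * fdb_mask is what the normal FDB/flooding path would have chosen.
 */
static unsigned int egress_mask(const struct trap_rule *acl, size_t n_rules,
				const uint8_t dmac[6], unsigned int fdb_mask,
				unsigned int cpu_mask)
{
	for (size_t i = 0; i < n_rules; i++) {
		/* Trapped: the FDB result is ignored, only the CPU gets it */
		if (trap_rule_match(&acl[i], dmac))
			return cpu_mask;
	}

	return fdb_mask;
}
```

With a rule for 01:80:c2:00:00:00 installed, the frame reaches only the CPU
even though the normal path would have flooded it. Without the rule it is
flooded like any other multicast, which is the behaviour the experiment
above is meant to detect.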


Re: [PATCH net-next v1 1/9] net: dsa: add rcv_post call back

2021-04-04 Thread Vladimir Oltean
On Sun, Apr 04, 2021 at 07:49:03AM +0200, Oleksij Rempel wrote:
> Am 04.04.21 um 01:21 schrieb Vladimir Oltean:
> > On Sat, Apr 03, 2021 at 05:05:34PM +0300, Vladimir Oltean wrote:
> >> On Sat, Apr 03, 2021 at 01:48:40PM +0200, Oleksij Rempel wrote:
> >>> Some switches (for example ar9331) do not provide enough information
> >>> about forwarded packets. If the switch decision was made based on IPv4
> >>> or IPv6 header, we need to analyze it and set proper flag.
> >>>
> >>> Potentially we can do it in existing rcv path, on other hand we can
> >>> avoid part of duplicated work and let the dsa framework set skb header
> >>> pointers and then use preprocessed skb one step later within the rcv_post
> >>> call back.
> >>>
> >>> This patch is needed for ar9331 switch.
> >>>
> >>> Signed-off-by: Oleksij Rempel 
> >>> ---
> >>
> >> I don't necessarily disagree with this, perhaps we can even move
> >> Florian's dsa_untag_bridge_pvid() call inside a rcv_post() method
> >> implemented by the DSA_TAG_PROTO_BRCM_LEGACY, DSA_TAG_PROTO_BRCM_PREPEND
> >> and DSA_TAG_PROTO_BRCM taggers. Or even better, because Oleksij's
> >> rcv_post is already prototype-compatible with dsa_untag_bridge_pvid, we
> >> can already do:
> >>
> >>.rcv_post = dsa_untag_bridge_pvid,
> >>
> >> This should be generally useful for stuff that DSA taggers need to do
> >> which is easiest done after eth_type_trans() was called.
> >
> > I had some fun with an alternative method of parsing the frame for IGMP
> > so that you can clear skb->offload_fwd_mark, which doesn't rely on the
> > introduction of a new method in DSA. It should also have several other
> > advantages compared to your solution such as the fact that it should
> > work with VLAN-tagged packets.
> >
> > Background: we made Receive Packet Steering work on DSA master interfaces
> > (echo 3 > /sys/class/net/eth0/queues/rx-1/rps_cpus) even when the DSA
> > tag shifts to the right the IP headers and everything that comes
> > afterwards. The flow dissector had to be patched for that, just grep for
> > DSA in net/core/flow_dissector.c.
> >
> > The problem you're facing is that you can't parse the IP and IGMP
> > headers in the tagger's rcv() method, since the network header,
> > transport header offsets and skb->protocol are all messed up, since
> > eth_type_trans hasn't been called yet.
> >
> > And that's the trick right there, you're between a rock and a hard
> > place: too early because eth_type_trans wasn't called yet, and too late
> > because skb->dev was changed and no longer points to the DSA master, so
> > the flow dissector adjustment we made doesn't apply.
> >
> > But if you call the flow dissector _before_ you call "skb->dev =
> > dsa_master_find_slave" (and yes, while the DSA tag is still there), then
> > it's virtually as if you had called that while the skb belonged to the
> > DSA master, so it should work with __skb_flow_dissect.
> >
> > In fact I prototyped this idea below. I wanted to check whether I can
> > match something as fine-grained as an IGMPv2 Membership Report message,
> > and I could.
> >
> > I prototyped it inside the ocelot tagging protocol driver because that's
> > what I had handy. I used __skb_flow_dissect with my own flow dissector
> > which had to be initialized at the tagger module_init time, even though
> > I think I could have probably just called skb_flow_dissect_flow_keys
> > with a standard dissector, and that would have removed the need for the
> > custom module_init in tag_ocelot.c. One thing that is interesting is
> > that I had to add the bits for IGMP parsing to the flow dissector
> > myself (based on the existing ICMP code). I was too lazy to do that for
> > MLD as well, but it is really not hard. Or even better, if you don't
> > need to look at all inside the IGMP/MLD header, I think you can even
> > omit adding this parsing code to the flow dissector and just look at
> > basic.n_proto and basic.ip_proto.
> >
> > See the snippet below. Hope it helps.
> >
> > -[ cut here ]-
> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > index ffd386ea0dbb..4c25fa47637a 100644
> > --- a/include/net/flow_dissector.h
> > +++ b/include/net/flow_dissector.h
> > @@ -190,6 +190,20 @@ struct flow_dissector_key_icmp {
> > u16 id;
> >  };
> >

Re: [PATCH net-next v1 9/9] net: dsa: qca: ar9331: add vlan support

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 01:48:48PM +0200, Oleksij Rempel wrote:
> This switch provides simple VLAN resolution database for 16 entries (VLANs).
> With this database we can cover typical functionalities as port based
> VLANs, untagged and tagged egress. Port based ingress filtering.
> 
> The VLAN database is working on top of forwarding database. So,

Define 'on top'.

> potentially, we can have multiple VLANs on top of multiple bridges.
> Hawing one VLAN on top of multiple bridges will fail on different

s/Hawing/Having/

> levels, most probably DSA framework should warn if some one wont to make

s/wont/wants/
s/some one/someone/

> something likes this.

Finally, why should the DSA framework warn?
Even in the default configuration of two bridges, the default_pvid (1)
will be the same. What problems do you have with that?

In commit 0ee2af4ebbe3 ("net: dsa: set configure_vlan_while_not_filtering
to true by default"), I did not notice that ar9331 does not have VLAN
operations, and I mistakenly set ds->configure_vlan_while_not_filtering
= false for your driver. Could you please delete that line and ensure the
following works?

ip link add br0 type bridge
ip link set lan0 master br0
bridge vlan add dev lan0 vid 100
ip link set br0 type bridge vlan_filtering 1
# make sure you can receive traffic with VLAN 100

> 
> Signed-off-by: Oleksij Rempel 
> ---
>  drivers/net/dsa/qca/ar9331.c | 255 +++
>  1 file changed, 255 insertions(+)
> 
> +static int ar9331_sw_vt_wait(struct ar9331_sw_priv *priv, u32 *f0)
> +{
> + struct regmap *regmap = priv->regmap;
> +
> + return regmap_read_poll_timeout(regmap,
> + AR9331_SW_REG_VLAN_TABLE_FUNCTION0,
> + *f0, !(*f0 & AR9331_SW_VT0_BUSY),
> + 100, 2000);
> +}
> +
> +static int ar9331_sw_port_vt_rmw(struct ar9331_sw_priv *priv, u16 vid,
> +  u8 port_mask_set, u8 port_mask_clr)
> +{
> + struct regmap *regmap = priv->regmap;
> + u32 f0, f1, port_mask = 0, port_mask_new, func;
> + struct ar9331_sw_vlan_db *vdb = NULL;
> + int ret, i;
> +
> + if (!vid)
> + return 0;
> +
> + ret = ar9331_sw_vt_wait(priv, &f0);
> + if (ret)
> + return ret;
> +
> + ret = regmap_write(regmap, AR9331_SW_REG_VLAN_TABLE_FUNCTION0, 0);
> + if (ret)
> + goto error;
> +
> + ret = regmap_write(regmap, AR9331_SW_REG_VLAN_TABLE_FUNCTION1, 0);
> + if (ret)
> + goto error;
> +
> + for (i = 0; i < ARRAY_SIZE(priv->vdb); i++) {
> + if (priv->vdb[i].vid == vid) {
> + vdb = &priv->vdb[i];
> + break;
> + }
> + }
> +
> + ret = regmap_read(regmap, AR9331_SW_REG_VLAN_TABLE_FUNCTION1, );
> + if (ret)
> + return ret;
> +
> + if (vdb) {
> + port_mask = vdb->port_mask;
> + }
> +
> + port_mask_new = port_mask & ~port_mask_clr;
> + port_mask_new |= port_mask_set;
> +
> + if (port_mask_new && port_mask_new == port_mask) {
> + dev_info(priv->dev, "%s: no need to overwrite existing valid entry on %x\n",
> + __func__, port_mask_new);

With VLANs, the bridge is indeed much less strict compared to FDBs, due
to the old API having ranges baked in (which were never used).

That being said, is there actually any value in this message? Would you
mind deleting it (I see how it could annoy a user)?

You might want to look at devlink regions if you want to debug the VLAN
table of the hardware.

> + return 0;
> + }
> +
> + if (port_mask_new) {
> + func = AR9331_SW_VT0_FUNC_LOAD_ENTRY;
> + } else {
> + func = AR9331_SW_VT0_FUNC_PURGE_ENTRY;
> + port_mask_new = port_mask;
> + }
> +
> + if (vdb) {
> + vdb->port_mask = port_mask_new;
> +
> + if (!port_mask_new)
> + vdb->vid = 0;
> + } else {
> + for (i = 0; i < ARRAY_SIZE(priv->vdb); i++) {
> + if (!priv->vdb[i].vid) {
> + vdb = &priv->vdb[i];
> + break;
> + }
> + }
> +
> + if (!vdb) {
> + dev_err_ratelimited(priv->dev, "Local VDB is full\n");

You have a netlink extack at your disposal, use it.

> + return -ENOMEM;
> + }
> + vdb->vid = vid;
> + vdb->port_mask = port_mask_new;
> + }
> +
> + f0 = FIELD_PREP(AR9331_SW_VT0_VID, vid) |
> +  FIELD_PREP(AR9331_SW_VT0_FUNC, func) |
> +  AR9331_SW_VT0_BUSY;
> + f1 = FIELD_PREP(AR9331_SW_VT1_VID_MEM, port_mask_new) |
> + AR9331_SW_VT1_VALID;
> +
> + ret = regmap_write(regmap, AR9331_SW_REG_VLAN_TABLE_FUNCTION1, f1);
> + if (ret)
> + return ret;
> +
> + ret = 

Re: [PATCH net-next v1 4/9] net: dsa: qca: ar9331: make proper initial port defaults

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 01:48:43PM +0200, Oleksij Rempel wrote:
> Make sure that all external port are actually isolated from each other,
> so no packets are leaked.
> 
> Signed-off-by: Oleksij Rempel 
> ---
>  drivers/net/dsa/qca/ar9331.c | 145 ++-
>  1 file changed, 143 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/dsa/qca/ar9331.c b/drivers/net/dsa/qca/ar9331.c
> index 9a5035b2f0ff..a3de3598fbf5 100644
> --- a/drivers/net/dsa/qca/ar9331.c
> +++ b/drivers/net/dsa/qca/ar9331.c
> @@ -60,10 +60,19 @@
>  
>  /* MIB registers */
>  #define AR9331_MIB_COUNTER(x) (0x2 + ((x) * 0x100))
>  
> @@ -229,6 +278,7 @@ struct ar9331_sw_priv {
>   struct regmap *regmap;
>   struct reset_control *sw_reset;
>   struct ar9331_sw_port port[AR9331_SW_PORTS];
> + int cpu_port;
>  };
>  
>  static struct ar9331_sw_priv *ar9331_sw_port_to_priv(struct ar9331_sw_port *port)
> @@ -371,12 +421,72 @@ static int ar9331_sw_mbus_init(struct ar9331_sw_priv *priv)
>   return 0;
>  }
>  
> -static int ar9331_sw_setup(struct dsa_switch *ds)
> +static int ar9331_sw_setup_port(struct dsa_switch *ds, int port)
>  {
>   struct ar9331_sw_priv *priv = (struct ar9331_sw_priv *)ds->priv;
>   struct regmap *regmap = priv->regmap;
> + u32 port_mask, port_ctrl, val;
>   int ret;
>  
> + /* Generate default port settings */
> + port_ctrl = FIELD_PREP(AR9331_SW_PORT_CTRL_PORT_STATE,
> +AR9331_SW_PORT_CTRL_PORT_STATE_DISABLED);
> +
> + if (dsa_is_cpu_port(ds, port)) {
> + /*
> +  * CPU port should be allowed to communicate with all user
> +  * ports.
> +  */
> + //port_mask = dsa_user_ports(ds);

Code commented out should ideally not be part of a submitted patch.
And the networking comment style is:

/* CPU port should be allowed to communicate with all user
 * ports.
 */

> + port_mask = 0;
> + /*
> +  * Enable atheros header on CPU port. This will allow us
> +  * communicate with each port separately
> +  */
> + port_ctrl |= AR9331_SW_PORT_CTRL_HEAD_EN;
> + port_ctrl |= AR9331_SW_PORT_CTRL_LEARN_EN;
> + } else if (dsa_is_user_port(ds, port)) {
> + /*
> +  * User ports should communicate only with the CPU port.
> +  */
> + port_mask = BIT(priv->cpu_port);

For all you care, the CPU port here is dsa_to_port(ds, port)->cpu_dp->index,
no need to go to those lengths in order to find it. DSA does not have a
fixed number for the CPU port but a CPU port pointer per port in order
to not close the door for the future support of multiple CPU ports.
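
The per-port CPU port lookup mentioned above could be modelled like this
outside the kernel. struct port and struct sw are simplified, hypothetical
stand-ins for the real DSA structures; only the cpu_dp pointer and its index
field mirror the expression quoted in the review.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the DSA structures */
struct port {
	int index;
	const struct port *cpu_dp;	/* CPU port serving this port, or NULL */
};

struct sw {
	struct port ports[6];
};

/* Returns a bit mask with only the CPU port of 'port' set, or 0 if none */
static unsigned int upstream_mask(const struct sw *sw, int port)
{
	const struct port *cpu_dp = sw->ports[port].cpu_dp;

	return cpu_dp ? 1u << cpu_dp->index : 0;
}
```

Because the mask is derived from each port's own pointer, two user ports
served by different CPU ports get different masks without any switch-wide
cpu_port field.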

> + /* Enable unicast address learning by default */
> + port_ctrl |= AR9331_SW_PORT_CTRL_LEARN_EN
> + /* IGMP snooping seems to work correctly, let's use it */
> +   | AR9331_SW_PORT_CTRL_IGMP_MLD_EN

I don't really like this ad-hoc enablement of IGMP/MLD snooping from the driver,
please add the pass-through in DSA for SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED
(dsa_slave_port_attr_set, dsa_port_switchdev_sync, dsa_port_switchdev_unsync
should all call a dsa_switch_ops :: port_snoop_igmp_mld function) and then
toggle this bit from there.

> +   | AR9331_SW_PORT_CTRL_SINGLE_VLAN_EN;
> + } else {
> + /* Other ports do not need to communicate at all */
> + port_mask = 0;
> + }
> +
> + val = FIELD_PREP(AR9331_SW_PORT_VLAN_8021Q_MODE,
> +  AR9331_SW_8021Q_MODE_NONE) |
> + FIELD_PREP(AR9331_SW_PORT_VLAN_PORT_VID_MEMBER, port_mask) |
> + FIELD_PREP(AR9331_SW_PORT_VLAN_PORT_VID,
> +AR9331_SW_PORT_VLAN_PORT_VID_DEF);
> +
> + ret = regmap_write(regmap, AR9331_SW_REG_PORT_VLAN(port), val);
> + if (ret)
> + goto error;
> +
> + ret = regmap_write(regmap, AR9331_SW_REG_PORT_CTRL(port), port_ctrl);
> + if (ret)
> + goto error;
> +
> + return 0;
> +error:
> + dev_err_ratelimited(priv->dev, "%s: error: %i\n", __func__, ret);
> +
> + return ret;
> +}
> +
> +static int ar9331_sw_setup(struct dsa_switch *ds)
> +{
> + struct ar9331_sw_priv *priv = (struct ar9331_sw_priv *)ds->priv;
> + struct regmap *regmap = priv->regmap;
> + int ret, i;
> +
>   ret = ar9331_sw_reset(priv);
>   if (ret)
>   return ret;
> @@ -390,7 +500,8 @@ static int ar9331_sw_setup(struct dsa_switch *ds)
>  
>   /* Do not drop broadcast frames */
>   ret = regmap_write_bits(regmap, AR9331_SW_REG_FLOOD_MASK,
> - AR9331_SW_FLOOD_MASK_BROAD_TO_CPU,
> + AR9331_SW_FLOOD_MASK_BROAD_TO_CPU
> + | AR9331_SW_FLOOD_MASK_MULTI_FLOOD_DP,
>

Re: [PATCH net-next v1 2/9] net: dsa: tag_ar9331: detect IGMP and MLD packets

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 07:14:56PM +0200, Oleksij Rempel wrote:
> Am 03.04.21 um 16:49 schrieb Andrew Lunn:
> >> @@ -31,6 +96,13 @@ static struct sk_buff *ar9331_tag_xmit(struct sk_buff *skb,
> >>__le16 *phdr;
> >>u16 hdr;
> >>
> >> +  if (dp->stp_state == BR_STATE_BLOCKING) {
> >> +  /* TODO: should we reflect it in the stats? */
> >> +  netdev_warn_once(dev, "%s:%i dropping blocking packet\n",
> >> +   __func__, __LINE__);
> >> +  return NULL;
> >> +  }
> >> +
> >>phdr = skb_push(skb, AR9331_HDR_LEN);
> >>
> >>hdr = FIELD_PREP(AR9331_HDR_VERSION_MASK, AR9331_HDR_VERSION);
> >
> > Hi Oleksij
> >
> > This change does not seem to fit with what this patch is doing.
> 
> done
> 
> > I also think it is wrong. You still need BPDU to pass through a
> > blocked port, otherwise spanning tree protocol will be unstable.
> 
> We need a better filter, otherwise, in case of software based STP, we are
> leaking packets on blocked port. For example DHCP does trigger lots of spam
> in the kernel log.

I have no idea whatsoever what 'software based STP' is, if you have
hardware-accelerated forwarding.

> I'll drop STP patch for now, it will be better to make a generic soft STP for
> all switches without HW offloading. For example ksz9477 is doing SW based STP
> in similar way.

How about we discuss first about what your switch is not doing properly?
Have you debugged more than just watching the bridge change port states?
As Andrew said, a port needs to accept and send link-local frames
regardless of the STP state. In the BLOCKING state it must send no other
frames and have address learning disabled. Is this what's happening, is
the switch forwarding frames towards a BLOCKING port?
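
The rule stated above can be written down as a small userspace sketch. It is
only an illustration: the enum and function names are hypothetical, the
values are not the kernel's BR_STATE_* constants, and treating LISTENING and
LEARNING like BLOCKING for transmission is an assumption taken from the usual
802.1D state description rather than from this thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state names; the values are not the kernel's BR_STATE_* */
enum stp_state {
	STP_BLOCKING,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
};

/* True for 01:80:c2:00:00:00 .. 01:80:c2:00:00:0f (BPDUs use ...:00) */
static bool is_link_local(const uint8_t dmac[6])
{
	static const uint8_t prefix[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	return memcmp(dmac, prefix, sizeof(prefix)) == 0 &&
	       (dmac[5] & 0xf0) == 0;
}

/* Link-local frames always pass; everything else needs FORWARDING */
static bool may_transmit(enum stp_state state, const uint8_t dmac[6])
{
	if (is_link_local(dmac))
		return true;

	return state == STP_FORWARDING;
}

static bool may_learn(enum stp_state state)
{
	return state == STP_LEARNING || state == STP_FORWARDING;
}
```

Compared with the quoted hunk, which returns NULL for every frame on a
BLOCKING port, this filter still lets BPDUs out, which is what spanning tree
needs in order to converge.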


Re: [PATCH net-next v1 5/9] net: dsa: qca: ar9331: add forwarding database support

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 05:25:16PM +0200, Andrew Lunn wrote:
> > +static int ar9331_sw_port_fdb_rmw(struct ar9331_sw_priv *priv,
> > + const unsigned char *mac,
> > + u8 port_mask_set,
> > + u8 port_mask_clr)
> > +{
> > +   port_mask = FIELD_GET(AR9331_SW_AT_DES_PORT, f2);
> > +   status = FIELD_GET(AR9331_SW_AT_STATUS, f2);
> > +   if (status > 0 && status < AR9331_SW_AT_STATUS_STATIC) {
> > +   dev_err_ratelimited(priv->dev, "%s: found existing dynamic entry on %x\n",
> > +   __func__, port_mask);
> > +
> > +   if (port_mask_set && port_mask_set != port_mask)
> > +   dev_err_ratelimited(priv->dev, "%s: found existing dynamic entry on %x, replacing it with static on %x\n",
> > +   __func__, port_mask, port_mask_set);
> > +   port_mask = 0;
> > +   } else if (!status && !port_mask_set) {
> > +   return 0;
> > +   }
> 
> As a general rule of thumb, use rate limiting where you have no
> control of the number of prints, e.g. it is triggered by packet
> processing, and there is potentially a lot of them, which could DOS
> the box by a remote or unprivileged attacker.
> 
> FDB changes should not happen often. Yes, root may be able to DOS the
> box by doing bridge fdb add commands in a loop, but only root should
> be able to do that.

+1
The way I see it, rate limiting should only be done when the user can't
stop the printing from spamming the console, and they just go "argh,
kill it with fire!!!" and throw the box away. As a side note, most of
the time when I can't stop the kernel from printing in a loop, the rate
limit isn't enough to stop me from wanting to throw the box out the
window, but I digress.

> Plus, i'm not actually sure we should be issuing warnings here. What
> does the bridge code do in this case? Is it silent and just does it,
> or does it issue a warning?

:D

What Oleksij doesn't know, I bet, is that he's using the bridge bypass
commands:

bridge fdb add dev lan0 00:01:02:03:04:05

That is the deprecated way of managing FDB entries, and has several
disadvantages such as:
- the bridge software FDB never gets updated with this entry, so other
  drivers which might be subscribed to DSA's FDB (imagine a non-DSA
  driver having the same logic as our ds->assisted_learning_on_cpu_port)
  will never see these FDB entries
- you have to manage duplicates yourself

The correct way to install a static bridge FDB entry is:

bridge fdb add dev lan0 00:01:02:03:04:05 master static

That will error out on duplicates for you.
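
To illustrate what "manage duplicates yourself" would mean for a driver, here
is a hedged userspace sketch of a static table that rejects a second add of
the same (MAC, VLAN) pair. The table layout, FDB_SIZE and the function name
are hypothetical, and returning -EEXIST is an assumption about a sensible
error code, not a statement about what the bridge does internally.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 8

struct fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;
	int in_use;
};

struct fdb {
	struct fdb_entry entries[FDB_SIZE];
};

/* Adds a static entry; refuses to overwrite an existing (mac, vid) pair */
static int fdb_add_static(struct fdb *fdb, const uint8_t mac[6],
			  uint16_t vid, int port)
{
	struct fdb_entry *free_slot = NULL;

	for (size_t i = 0; i < FDB_SIZE; i++) {
		struct fdb_entry *e = &fdb->entries[i];

		if (!e->in_use) {
			/* Remember the first free slot, keep scanning */
			if (!free_slot)
				free_slot = e;
			continue;
		}

		if (e->vid == vid && memcmp(e->mac, mac, 6) == 0)
			return -EEXIST;
	}

	if (!free_slot)
		return -ENOSPC;

	memcpy(free_slot->mac, mac, 6);
	free_slot->vid = vid;
	free_slot->port = port;
	free_slot->in_use = 1;

	return 0;
}
```

With the duplicate check done once in a common layer, a driver never sees
the "entry already exists" case and needs no overwrite logic of its own.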

From my side I would even go as far as deleting the bridge bypass
operations (.ndo_fdb_add and .ndo_fdb_del). The more integration we add
between DSA and the bridge/switchdev, the worse it will be when users
just keep using the bridge bypass. To start that process, I guess we
should start emitting a deprecation warning and then pull the trigger
after a few kernel release cycles.

> > +
> > +   port_mask_new = port_mask & ~port_mask_clr;
> > +   port_mask_new |= port_mask_set;
> > +
> > +   if (port_mask_new == port_mask &&
> > +   status == AR9331_SW_AT_STATUS_STATIC) {
> > +   dev_info(priv->dev, "%s: no need to overwrite existing valid entry on %x\n",
> > +   __func__, port_mask_new);
> 
> This one should probably be dev_dbg().

Or deleted, along with the overwrite logic.


Re: [PATCH net-next v1 1/9] net: dsa: add rcv_post call back

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 05:05:34PM +0300, Vladimir Oltean wrote:
> On Sat, Apr 03, 2021 at 01:48:40PM +0200, Oleksij Rempel wrote:
> > Some switches (for example ar9331) do not provide enough information
> > about forwarded packets. If the switch decision was made based on IPv4
> > or IPv6 header, we need to analyze it and set proper flag.
> > 
> > Potentially we can do it in existing rcv path, on other hand we can
> > avoid part of duplicated work and let the dsa framework set skb header
> > pointers and then use preprocessed skb one step later within the rcv_post
> > call back.
> > 
> > This patch is needed for ar9331 switch.
> > 
> > Signed-off-by: Oleksij Rempel 
> > ---
> 
> I don't necessarily disagree with this, perhaps we can even move
> Florian's dsa_untag_bridge_pvid() call inside a rcv_post() method
> implemented by the DSA_TAG_PROTO_BRCM_LEGACY, DSA_TAG_PROTO_BRCM_PREPEND
> and DSA_TAG_PROTO_BRCM taggers. Or even better, because Oleksij's
> rcv_post is already prototype-compatible with dsa_untag_bridge_pvid, we
> can already do:
> 
>   .rcv_post = dsa_untag_bridge_pvid,
> 
> This should be generally useful for stuff that DSA taggers need to do
> which is easiest done after eth_type_trans() was called.

I had some fun with an alternative method of parsing the frame for IGMP
so that you can clear skb->offload_fwd_mark, which doesn't rely on the
introduction of a new method in DSA. It should also have several other
advantages compared to your solution such as the fact that it should
work with VLAN-tagged packets.

Background: we made Receive Packet Steering work on DSA master interfaces
(echo 3 > /sys/class/net/eth0/queues/rx-1/rps_cpus) even when the DSA
tag shifts to the right the IP headers and everything that comes
afterwards. The flow dissector had to be patched for that, just grep for
DSA in net/core/flow_dissector.c.

The problem you're facing is that you can't parse the IP and IGMP
headers in the tagger's rcv() method, since the network header,
transport header offsets and skb->protocol are all messed up, since
eth_type_trans hasn't been called yet.

And that's the trick right there, you're between a rock and a hard
place: too early because eth_type_trans wasn't called yet, and too late
because skb->dev was changed and no longer points to the DSA master, so
the flow dissector adjustment we made doesn't apply.

But if you call the flow dissector _before_ you call "skb->dev =
dsa_master_find_slave" (and yes, while the DSA tag is still there), then
it's virtually as if you had called that while the skb belonged to the
DSA master, so it should work with __skb_flow_dissect.

In fact I prototyped this idea below. I wanted to check whether I can
match something as fine-grained as an IGMPv2 Membership Report message,
and I could.

I prototyped it inside the ocelot tagging protocol driver because that's
what I had handy. I used __skb_flow_dissect with my own flow dissector
which had to be initialized at the tagger module_init time, even though
I think I could have probably just called skb_flow_dissect_flow_keys
with a standard dissector, and that would have removed the need for the
custom module_init in tag_ocelot.c. One thing that is interesting is
that I had to add the bits for IGMP parsing to the flow dissector
myself (based on the existing ICMP code). I was too lazy to do that for
MLD as well, but it is really not hard. Or even better, if you don't
need to look at all inside the IGMP/MLD header, I think you can even
omit adding this parsing code to the flow dissector and just look at
basic.n_proto and basic.ip_proto.

See the snippet below. Hope it helps.

-[ cut here ]-
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ffd386ea0dbb..4c25fa47637a 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -190,6 +190,20 @@ struct flow_dissector_key_icmp {
u16 id;
 };
 
+/**
+ * flow_dissector_key_igmp:
+ * type: indicates the message type, see include/uapi/linux/igmp.h
+ * code: Max Resp Code, the maximum time in 1/10 second
+ *   increments before sending a responding report
+ * group: the multicast address being queried when sending a
+ *Group-Specific or Group-and-Source-Specific Query.
+ */
+struct flow_dissector_key_igmp {
+   u8 type;
+   u8 code; /* Max Resp Time in IGMPv2 */
+   __be32 group;
+};
+
 /**
  * struct flow_dissector_key_eth_addrs:
  * @src: source Ethernet address
@@ -259,6 +273,7 @@ enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
FLOW_DISSECTOR_KEY_PORTS_RANGE, /* struct flow_dissector_key_ports */
FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
+   
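
The diff is cut off at this point. What follows is not the prototype from the
email but a standalone userspace sketch of the simpler variant suggested
above: look only at the EtherType and the IP protocol while the switch tag is
still in place. It assumes a hypothetical tag of tag_len bytes prepended
before the destination MAC and at most one 802.1Q tag; all names are made up,
and MLD is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4		0x0800
#define ETHERTYPE_8021Q		0x8100
#define IP_PROTO_IGMP		2

/* Hypothetical equivalent of the "basic" flow keys */
struct basic_keys {
	uint16_t n_proto;	/* EtherType, host byte order */
	uint8_t ip_proto;	/* IPv4 protocol, 0 if not IPv4 */
};

static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] << 8 | p[1]);
}

/* Parses buf, which starts with tag_len bytes of switch tag followed by
 * a regular Ethernet frame. Returns false if the frame is too short.
 */
static bool dissect_basic(const uint8_t *buf, size_t len, size_t tag_len,
			  struct basic_keys *keys)
{
	size_t off = tag_len + 12;	/* skip tag, dst MAC and src MAC */

	if (len < off + 2)
		return false;

	keys->n_proto = rd16(buf + off);
	keys->ip_proto = 0;
	off += 2;

	if (keys->n_proto == ETHERTYPE_8021Q) {
		/* 2 bytes of TCI, then the encapsulated EtherType */
		if (len < off + 4)
			return false;
		keys->n_proto = rd16(buf + off + 2);
		off += 4;
	}

	if (keys->n_proto == ETHERTYPE_IPV4) {
		if (len < off + 20)
			return false;
		keys->ip_proto = buf[off + 9];	/* IPv4 protocol field */
	}

	return true;
}

static bool is_igmp(const struct basic_keys *keys)
{
	return keys->n_proto == ETHERTYPE_IPV4 &&
	       keys->ip_proto == IP_PROTO_IGMP;
}
```

The VLAN branch is the point of the exercise: a parser that assumes the IP
header directly follows the MAC addresses misses VLAN-tagged IGMP, which is
the advantage claimed above for going through the flow dissector.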

Re: [PATCH net-next v1 2/9] net: dsa: tag_ar9331: detect IGMP and MLD packets

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 05:22:24PM +0200, Oleksij Rempel wrote:
> Off-topic question, this patch set stops to work after rebasing against
> latest netdev. I get following warning:
> ip l s lan0 master test
> RTNETLINK answers: Invalid argument
> 
> Are there some API changes?

Yes, it's likely that you are returning -EINVAL to some of the functions
with which DSA calls you at .port_bridge_join time, see dsa_port_switchdev_sync.


Re: [PATCH net-next v1 1/9] net: dsa: add rcv_post call back

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 01:48:40PM +0200, Oleksij Rempel wrote:
> Some switches (for example ar9331) do not provide enough information
> about forwarded packets. If the switch decision was made based on IPv4
> or IPv6 header, we need to analyze it and set proper flag.
> 
> Potentially we can do it in existing rcv path, on other hand we can
> avoid part of duplicated work and let the dsa framework set skb header
> pointers and then use preprocessed skb one step later within the rcv_post
> call back.
> 
> This patch is needed for ar9331 switch.
> 
> Signed-off-by: Oleksij Rempel 
> ---

I don't necessarily disagree with this, perhaps we can even move
Florian's dsa_untag_bridge_pvid() call inside a rcv_post() method
implemented by the DSA_TAG_PROTO_BRCM_LEGACY, DSA_TAG_PROTO_BRCM_PREPEND
and DSA_TAG_PROTO_BRCM taggers. Or even better, because Oleksij's
rcv_post is already prototype-compatible with dsa_untag_bridge_pvid, we
can already do:

.rcv_post = dsa_untag_bridge_pvid,

This should be generally useful for stuff that DSA taggers need to do
which is easiest done after eth_type_trans() was called.


Re: [PATCH net-next v1 2/9] net: dsa: tag_ar9331: detect IGMP and MLD packets

2021-04-03 Thread Vladimir Oltean
On Sat, Apr 03, 2021 at 03:26:36PM +0200, Oleksij Rempel wrote:
> On Sat, Apr 03, 2021 at 04:03:18PM +0300, Vladimir Oltean wrote:
> > Hi Oleksij,
> > 
> > On Sat, Apr 03, 2021 at 01:48:41PM +0200, Oleksij Rempel wrote:
> > > The ar9331 switch is not forwarding IGMP and MLD packets if IGMP
> > > > snooping is enabled. This patch is trying to mimic the HW heuristic to take
> > > > same decisions as this switch would do to be able to tell the linux
> > > > bridge if some packet was probably forwarded or not.
> > > 
> > > Signed-off-by: Oleksij Rempel 
> > > ---
> > 
> > I am not familiar with IGMP/MLD, therefore I don't really understand
> > what problem you are trying to solve.
> > 
> > Your switch has packet traps for IGMP and MLD, ok. So it doesn't forward
> > them. Must the IGMP/MLD packets be forwarded by an IGMP/MLD snooping
> > bridge? Which ones and under what circumstances?
> 
> I'll better refer to the rfc:
> https://tools.ietf.org/html/rfc4541

Ok, the question might have been a little bit dumb.
I found this PDF:
https://www.alliedtelesis.com/sites/default/files/documents/how-alliedware/howto_config_igmp1.pdf
and it explains that:
- a snooper floods the Membership Query messages from the network's
  querier towards all ports that are not blocked by STP
- a snooper forwards all Membership Report messages from a client
  towards the All Groups port (which is how it reaches the querier).

I'm asking this because I just want to understand what the bridge code
does. Does the code path for IGMP_HOST_MEMBERSHIP_REPORT (for example)
for a snooper go through should_deliver -> nbp_switchdev_allowed_egress,
which is what you are affecting here?
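
The two snooper rules listed above can be sketched as a port mask
computation. This is a userspace illustration, not the bridge code: the
function name and the bit mask representation are hypothetical, the "All
Groups port" is modelled as a router_ports mask, and flooding unrecognised
message types is an assumption.

```c
#include <stdint.h>

/* IGMP message types (first byte of the IGMP header) */
#define IGMP_TYPE_QUERY		0x11
#define IGMP_TYPE_V1_REPORT	0x12
#define IGMP_TYPE_V2_REPORT	0x16
#define IGMP_TYPE_V2_LEAVE	0x17
#define IGMP_TYPE_V3_REPORT	0x22

/* Returns the ports a snooper would send an IGMP message to.
 * bridge_ports: all ports in the bridge
 * stp_blocked:  ports currently blocked by STP
 * router_ports: ports leading towards the querier
 */
static unsigned int igmp_egress_mask(uint8_t type, int in_port,
				     unsigned int bridge_ports,
				     unsigned int stp_blocked,
				     unsigned int router_ports)
{
	/* Never send back to the source port or to a blocked port */
	unsigned int allowed = bridge_ports & ~stp_blocked & ~(1u << in_port);

	switch (type) {
	case IGMP_TYPE_V1_REPORT:
	case IGMP_TYPE_V2_REPORT:
	case IGMP_TYPE_V2_LEAVE:
	case IGMP_TYPE_V3_REPORT:
		/* Reports only travel towards the querier */
		return allowed & router_ports;
	case IGMP_TYPE_QUERY:
	default:
		/* Queries (and, by assumption, unknown types) are flooded */
		return allowed;
	}
}
```

A tagger that wants to predict whether the hardware already forwarded such a
frame would need the same inputs, which is why the per-packet heuristic in
the patch is hard to get right from the tag alone.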


Re: [PATCH net-next v1 2/9] net: dsa: tag_ar9331: detect IGMP and MLD packets

2021-04-03 Thread Vladimir Oltean
Hi Oleksij,

On Sat, Apr 03, 2021 at 01:48:41PM +0200, Oleksij Rempel wrote:
> The ar9331 switch is not forwarding IGMP and MLD packets if IGMP
> snooping is enabled. This patch is trying to mimic the HW heuristic to take
> same decisions as this switch would do to be able to tell the linux
> bridge if some packet was probably forwarded or not.
> 
> Signed-off-by: Oleksij Rempel 
> ---

I am not familiar with IGMP/MLD, therefore I don't really understand
what problem you are trying to solve.

Your switch has packet traps for IGMP and MLD, ok. So it doesn't forward
them. Must the IGMP/MLD packets be forwarded by an IGMP/MLD snooping
bridge? Which ones and under what circumstances?


Re: [EXT] Re: [PATCH] dt-bindings: spi: Convert Freescale DSPI to json schema

2021-03-24 Thread Vladimir Oltean
On Wed, Mar 24, 2021 at 01:20:33PM -0600, Rob Herring wrote:
> In addition, "fsl,ls1088a-dspi" is not known by the Linux driver, so a
> fallback is needed.

This is a good point, the LS1088A went completely off of my radar,
thanks for pointing it out.


Re: [EXT] Re: [PATCH] dt-bindings: spi: Convert Freescale DSPI to json schema

2021-03-24 Thread Vladimir Oltean
On Wed, Mar 24, 2021 at 12:14:03PM -0600, Rob Herring wrote:
> On Tue, Mar 16, 2021 at 12:15:06PM +0200, Vladimir Oltean wrote:
> > On Tue, Mar 16, 2021 at 06:08:17AM +, Kuldeep Singh wrote:
> > > > Compatible entries in conjunction require enum and const pair.
> > > > For example, ls1012a.dtsi uses
> > > > compatible = "fsl,ls1012a-dspi","fsl,ls1021a-v1.0-dspi";
> > > Same goes for LS1028 as well.
> > >
> > > Therefore, can't mention the compatible entry as single entity otherwise
> > > it may fail "make dt_binding_check" and "make dtbs_check".
> > >
> > > >
> > > > > +examples:
> > > > > +  - |
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +soc {
> > > > > +#address-cells = <2>;
> > > > > +#size-cells = <2>;
> > > > > +
> > > > > +spi@210 {
> > > > > +compatible = "fsl,ls1028a-dspi", "fsl,ls1021a-v1.0-dspi";
> > > >
> > > > > This doesn't need the "fsl,ls1021a-v1.0-dspi" compatible, can you
> > > > > please remove it?
> > >
> > > > I have taken this example from LS1028a.dtsi and it uses these compatibles
> > > > in conjunction.
> > > > If "fsl,ls1021a-v1.0-dspi" is not required, then it should also be
> > > > removed from the device tree as well as from the bindings.
> >
> > Yes, the second compatible is never required by the driver and should be
> > removed from existing device trees if that makes "make dtbs_check" fail.
>
> Can you say that is true for every possible driver implementation?
> u-boot, *BSD, etc.?

I don't think other systems are required to follow Linux conventions, so
I'm not sure why it matters.


Re: [PATCH][next] net: bridge: Fix missing return assignment from br_vlan_replay_one call

2021-03-24 Thread Vladimir Oltean
On Wed, Mar 24, 2021 at 03:09:50PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The call to br_vlan_replay_one is returning an error return value but
> this is not being assigned to err and the following check on err is
> currently always false because err was initialized to zero. Fix this
> by assigning err.
> 
> Addresses-Coverity: ("'Constant' variable guards dead code")
> Fixes: 22f67cdfae6a ("net: bridge: add helper to replay VLANs installed on port")
> Signed-off-by: Colin Ian King 
> ---

Reviewed-by: Vladimir Oltean 


Re: [PATCH v4 net-next 04/11] net: bridge: add helper to replay port and local fdb entries

2021-03-23 Thread Vladimir Oltean
On Tue, Mar 23, 2021 at 01:12:33PM +0200, Nikolay Aleksandrov wrote:
> On 23/03/2021 01:51, Vladimir Oltean wrote:
> > From: Vladimir Oltean 
> > 
> > When a switchdev port starts offloading a LAG that is already in a
> > bridge and has an FDB entry pointing to it:
> > 
> > ip link set bond0 master br0
> > bridge fdb add dev bond0 00:01:02:03:04:05 master static
> > ip link set swp0 master bond0
> > 
> > the switchdev driver will have no idea that this FDB entry is there,
> > because it missed the switchdev event emitted at its creation.
> > 
> > Ido Schimmel pointed this out during a discussion about challenges with
> > switchdev offloading of stacked interfaces between the physical port and
> > the bridge, and recommended to just catch that condition and deny the
> > CHANGEUPPER event:
> > https://lore.kernel.org/netdev/20210210105949.gb287...@shredder.lan/
> > 
> > But in fact, we might need to deal with the hard thing anyway, which is
> > to replay all FDB addresses relevant to this port, because it isn't just
> > static FDB entries, but also local addresses (ones that are not
> > forwarded but terminated by the bridge). There, we can't just say 'oh
> > yeah, there was an upper already so I'm not joining that'.
> > 
> > So, similar to the logic for replaying MDB entries, add a function that
> > must be called by individual switchdev drivers and replays local FDB
> > entries as well as ones pointing towards a bridge port. This time, we
> > use the atomic switchdev notifier block, since that's what FDB entries
> > expect for some reason.
> > 
> 
> I get the reason to have both bridge and bridge port devices (although the
> bridge is really unnecessary as it can be inferred from the port), but it
> looks kind of weird at first glance, I mean we get all of the port's fdbs
> and all of the bridge fdbs every time (dst == NULL). The code itself is
> correct and the alternative to take only 1 net_device and act based on its
> type would add another step to the process per-port which also doesn't
> sound good...
> There are a few minor const nits below too, again if there is another version
> please take care of them, for the patch:
> 
> Acked-by: Nikolay Aleksandrov 

Thanks for the review. For host MDB entries, those are already offloaded
to every bridge port (which yes, is still giving me headaches), so
replaying them for every port that calls br_mdb_replay is at least
consistent with that. For br_fdb_replay, honestly I am not yet sure
because mainline DSA does not yet handle local FDBs, I might end up
touching things up a little when I come back to the "RX filtering in
DSA" series (I need to address Ido's feedback by then too).  I would
just like to get something started. It's even possible that by the end
of the kernel development cycle, the end result might not even look
anything remotely similar to what we have here - this is just what I
deemed as "good enough as a small first step".

If nobody has objections or sees problems with the current series, I
think I'd prefer to send a follow-up with the const conversions, so I
can spam less people with another 11 emails.


Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-23 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 10:00:33PM +0200, Vladimir Oltean wrote:
> Hi Yunsheng,
> 
> On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> > Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> > flag set, but queue discipline by-pass does not work for lockless
> > qdisc because skb is always enqueued to qdisc even when the qdisc
> > is empty, see __dev_xmit_skb().
> > 
> > This patch calls sch_direct_xmit() to transmit the skb directly
> > > to the driver for empty lockless qdisc too, which avoids the enqueuing
> > > and dequeuing operations. qdisc->empty is set to false whenever a
> > skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
> > skb dequeuing return NULL, see pfifo_fast_dequeue().
> > 
> > There is a data race between enqueue/dequeue and qdisc->empty
> > setting, qdisc->empty is only used as a hint, so we need to call
> > sch_may_need_requeuing() to see if the queue is really empty and if
> > there is requeued skb, which has higher priority than the current
> > skb.
> > 
> > The performance for ip_forward test increases about 10% with this
> > patch.
> > 
> > Signed-off-by: Yunsheng Lin 
> > ---
> > Hi, Vladimir and Ahmad
> > Please give it a test to see if there is any out of order
> > packet for this patch, which has removed the priv->lock added in
> > RFC v2.
> > 
> > There is a data race as below:
> > 
> > >       CPU1                                   CPU2
> > > qdisc_run_begin(q)                            .
> > >         .                                q->enqueue()
> > > sch_may_need_requeuing()                      .
> > >     return true                               .
> > >         .                                     .
> > >         .                                     .
> > >     q->enqueue()                              .
> > 
> > > When the above happens, the skb enqueued by CPU1 is dequeued after the
> > > skb enqueued by CPU2 because sch_may_need_requeuing() returns true.
> > > If there is no qdisc bypass, CPU1 has a better chance to queue
> > > the skb quicker than CPU2.
> > 
> > This patch does not take care of the above data race, because I
> > view this as similar as below:
> > 
> > > Even if CPU1 and CPU2 write the skb to two sockets at the same time,
> > > both heading to the same qdisc, there is no guarantee
> > > which skb will hit the qdisc first, because there are a lot of
> > > factors like interrupt/softirq/cache miss/scheduling affecting
> > > that.
> > 
> > So I hope the above data race will not cause problem for Vladimir
> > and Ahmad.
> > ---
> 
> Preliminary results on my test setup look fine, but please allow me to
> run the canfdtest overnight, since as you say, races are still
> theoretically possible.

I haven't found any issues during the overnight test and until now.

Tested-by: Vladimir Oltean  # flexcan
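
The bypass idea from the quoted patch description can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel code: struct qdisc, struct wire, qdisc_xmit() and the other names below are hypothetical stand-ins, the sketch is single-threaded, and the "really empty" recheck simply compares ring indices instead of calling sch_may_need_requeuing().

```c
#include <assert.h>
#include <stdbool.h>

#define QLEN     8
#define WIRE_MAX 16

/* Hypothetical queue: tail - head is the number of queued packets. */
struct qdisc {
	int ring[QLEN];
	unsigned int head, tail;
	bool empty;	/* hint only, must be rechecked before bypassing */
};

/* Hypothetical "driver": records packets in the order they were sent. */
struct wire {
	int pkts[WIRE_MAX];
	unsigned int n;
};

static void wire_send(struct wire *w, int pkt)
{
	if (w->n < WIRE_MAX)
		w->pkts[w->n++] = pkt;
}

static bool qdisc_enqueue(struct qdisc *q, int pkt)
{
	if (q->tail - q->head == QLEN)
		return false;	/* queue full, packet dropped */
	q->ring[q->tail++ % QLEN] = pkt;
	q->empty = false;
	return true;
}

/* Drain the queue to the wire and set the empty hint again. */
static void qdisc_run(struct qdisc *q, struct wire *w)
{
	while (q->head != q->tail)
		wire_send(w, q->ring[q->head++ % QLEN]);
	q->empty = true;
}

/* Returns true if the packet bypassed the queue. */
static bool qdisc_xmit(struct qdisc *q, struct wire *w, int pkt)
{
	/* Bypass only if the hint says empty AND the queue really is empty,
	 * otherwise an older queued packet would be overtaken.
	 */
	if (q->empty && q->head == q->tail) {
		wire_send(w, pkt);
		return true;
	}
	qdisc_enqueue(q, pkt);
	return false;
}
```

In this sketch a packet that finds older packets still queued is enqueued behind them, so ordering on the wire is kept; only a packet that finds the queue truly empty goes straight to the driver.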


[PATCH v4 net-next 11/11] net: ocelot: replay switchdev events when joining bridge

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

The premise of this change is that the switchdev port attributes and
objects offloaded by ocelot might have been missed when we are joining
an already existing bridge port, such as a bonding interface.

The patch pulls these switchdev attributes and objects from the bridge,
on behalf of the 'bridge port' net device which might be either the
ocelot switch interface, or the bonding upper interface.

The ocelot_net.c belongs strictly to the switchdev ocelot driver, while
ocelot.c is part of a library shared with the DSA felix driver.
The ocelot_port_bridge_leave function (part of the common library) used
to call ocelot_port_vlan_filtering(false), something which is not
necessary for DSA, since the framework deals with that already there.
So we move this function to ocelot_switchdev_unsync, which is specific
to the switchdev driver.

The code movement described above makes ocelot_port_bridge_leave no
longer return an error code, so we change its type from int to void.

Signed-off-by: Vladimir Oltean 
---
 drivers/net/dsa/ocelot/felix.c |   4 +-
 drivers/net/ethernet/mscc/Kconfig  |   3 +-
 drivers/net/ethernet/mscc/ocelot.c |  18 ++--
 drivers/net/ethernet/mscc/ocelot_net.c | 117 +
 include/soc/mscc/ocelot.h  |   6 +-
 5 files changed, 113 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index 628afb47b579..6b5442be0230 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -719,7 +719,9 @@ static int felix_bridge_join(struct dsa_switch *ds, int port,
 {
struct ocelot *ocelot = ds->priv;
 
-   return ocelot_port_bridge_join(ocelot, port, br);
+   ocelot_port_bridge_join(ocelot, port, br);
+
+   return 0;
 }
 
 static void felix_bridge_leave(struct dsa_switch *ds, int port,
diff --git a/drivers/net/ethernet/mscc/Kconfig b/drivers/net/ethernet/mscc/Kconfig
index 05cb040c2677..2d3157e4d081 100644
--- a/drivers/net/ethernet/mscc/Kconfig
+++ b/drivers/net/ethernet/mscc/Kconfig
@@ -11,7 +11,7 @@ config NET_VENDOR_MICROSEMI
 
 if NET_VENDOR_MICROSEMI
 
-# Users should depend on NET_SWITCHDEV, HAS_IOMEM
+# Users should depend on NET_SWITCHDEV, HAS_IOMEM, BRIDGE
 config MSCC_OCELOT_SWITCH_LIB
select NET_DEVLINK
select REGMAP_MMIO
@@ -24,6 +24,7 @@ config MSCC_OCELOT_SWITCH_LIB
 
 config MSCC_OCELOT_SWITCH
tristate "Ocelot switch driver"
+   depends on BRIDGE || BRIDGE=n
depends on NET_SWITCHDEV
depends on HAS_IOMEM
depends on OF_NET
diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index ce57929ba3d1..1a36b416fd9b 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1514,34 +1514,28 @@ int ocelot_port_mdb_del(struct ocelot *ocelot, int port,
 }
 EXPORT_SYMBOL(ocelot_port_mdb_del);
 
-int ocelot_port_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+void ocelot_port_bridge_join(struct ocelot *ocelot, int port,
+struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
 
ocelot_port->bridge = bridge;
 
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_join);
 
-int ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
-struct net_device *bridge)
+void ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
+ struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
struct ocelot_vlan pvid = {0}, native_vlan = {0};
-   int ret;
 
ocelot_port->bridge = NULL;
 
-   ret = ocelot_port_vlan_filtering(ocelot, port, false);
-   if (ret)
-   return ret;
-
ocelot_port_set_pvid(ocelot, port, pvid);
ocelot_port_set_native_vlan(ocelot, port, native_vlan);
-
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_leave);
 
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index d1376f7b34fd..36f32a4d9b0f 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,47 +1117,126 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
+static void ocelot_inherit_brport_flags(struct ocelot *ocelot, int port,
+   struct net_device *brport_dev)
+{
+   struct switchdev_brport_flags flags = {0};
+   int flag;
+
+   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+
+   for_each_set_bit(flag, &flags.mask, 32)
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val |= BIT(flag);
+
+   ocelot_port_bridge_flags(ocelot, port, 

[PATCH v4 net-next 07/11] net: dsa: pass extack to dsa_port_{bridge,lag}_join

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

This is a pretty noisy change that was broken out of the larger change
for replaying switchdev attributes and objects at bridge join time,
which is when these extack objects are actually used.

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
Reviewed-by: Tobias Waldekranz 
---
 net/dsa/dsa_priv.h | 6 --
 net/dsa/port.c | 8 +---
 net/dsa/slave.c| 7 +--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 4c43c5406834..b8778c5d8529 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -181,12 +181,14 @@ int dsa_port_enable_rt(struct dsa_port *dp, struct phy_device *phy);
 int dsa_port_enable(struct dsa_port *dp, struct phy_device *phy);
 void dsa_port_disable_rt(struct dsa_port *dp);
 void dsa_port_disable(struct dsa_port *dp);
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br);
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
 int dsa_port_lag_change(struct dsa_port *dp,
struct netdev_lag_lower_state_info *linfo);
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev,
- struct netdev_lag_upper_info *uinfo);
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack);
 void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev);
 int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering,
struct netlink_ext_ack *extack);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index d39262a9fe0e..fcbe5b1545b8 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -144,7 +144,8 @@ static void dsa_port_change_brport_flags(struct dsa_port *dp,
}
 }
 
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br)
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack)
 {
struct dsa_notifier_bridge_info info = {
.tree_index = dp->ds->dst->index,
@@ -241,7 +242,8 @@ int dsa_port_lag_change(struct dsa_port *dp,
 }
 
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
- struct netdev_lag_upper_info *uinfo)
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack)
 {
struct dsa_notifier_lag_info info = {
.sw_index = dp->ds->index,
@@ -263,7 +265,7 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
return 0;
 
-   err = dsa_port_bridge_join(dp, bridge_dev);
+   err = dsa_port_bridge_join(dp, bridge_dev, extack);
if (err)
goto err_bridge_join;
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 992fcab4b552..1ff48be476bb 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1976,11 +1976,14 @@ static int dsa_slave_changeupper(struct net_device *dev,
 struct netdev_notifier_changeupper_info *info)
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
+   struct netlink_ext_ack *extack;
int err = NOTIFY_DONE;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
if (info->linking) {
-   err = dsa_port_bridge_join(dp, info->upper_dev);
+   err = dsa_port_bridge_join(dp, info->upper_dev, extack);
if (!err)
dsa_bridge_mtu_normalization(dp);
err = notifier_from_errno(err);
@@ -1991,7 +1994,7 @@ static int dsa_slave_changeupper(struct net_device *dev,
} else if (netif_is_lag_master(info->upper_dev)) {
if (info->linking) {
err = dsa_port_lag_join(dp, info->upper_dev,
-   info->upper_info);
+   info->upper_info, extack);
if (err == -EOPNOTSUPP) {
NL_SET_ERR_MSG_MOD(info->info.extack,
   "Offloading not supported");
-- 
2.25.1



[PATCH v4 net-next 10/11] net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

Similar to the DSA situation, ocelot supports LAG offload but treats
this scenario improperly:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

We do the same thing as we do there, which is to simulate a 'bridge join'
on 'lag join', if we detect that the bonding upper has a bridge upper.

Again, same as DSA, ocelot supports software fallback for LAG, and in
that case, we should avoid calling ocelot_netdevice_changeupper.
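
A minimal, self-contained sketch of the control flow described above, in plain C. All types and helpers here (struct netdev, struct port, port_lag_join() and friends) are hypothetical stand-ins for illustration and are not the ocelot API. The sketch also returns early on any LAG offload error other than -EOPNOTSUPP, which is an assumption of the sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical upper device: a bond whose master may be a bridge. */
struct netdev {
	int is_bridge;
	struct netdev *master;
};

/* Hypothetical port state, with knobs to simulate failures. */
struct port {
	int lag_offloaded;
	int bridge_joined;
	int hw_supports_lag;
	int bridge_join_err;	/* error to return from bridge join, or 0 */
};

static int port_lag_join_hw(struct port *p)
{
	if (!p->hw_supports_lag)
		return -EOPNOTSUPP;
	p->lag_offloaded = 1;
	return 0;
}

static void port_lag_leave_hw(struct port *p)
{
	p->lag_offloaded = 0;
}

static int port_bridge_join(struct port *p)
{
	if (p->bridge_join_err)
		return p->bridge_join_err;
	p->bridge_joined = 1;
	return 0;
}

static int port_lag_join(struct port *p, struct netdev *bond)
{
	int err;

	err = port_lag_join_hw(p);
	if (err == -EOPNOTSUPP)
		return 0;	/* software fallback: nothing else to offload */
	if (err)
		return err;

	/* The bond is not (yet) a bridge port: plain LAG join, done. */
	if (!bond->master || !bond->master->is_bridge)
		return 0;

	/* The bond already sits under a bridge: simulate a bridge join. */
	err = port_bridge_join(p);
	if (err) {
		port_lag_leave_hw(p);	/* unwind the LAG join */
		return err;
	}

	return 0;
}
```

The point of the unwind step is that a failed bridge join does not leave the port half-configured as a LAG member.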

Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/mscc/ocelot_net.c | 111 +++--
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index c08164cd88f4..d1376f7b34fd 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,10 +1117,15 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
-static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+static int ocelot_netdevice_bridge_join(struct net_device *dev,
+   struct net_device *bridge,
+   struct netlink_ext_ack *extack)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1135,10 +1140,14 @@ static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
return 0;
 }
 
-static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
+static int ocelot_netdevice_bridge_leave(struct net_device *dev,
 struct net_device *bridge)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1151,43 +1160,89 @@ static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
return err;
 }
 
-static int ocelot_netdevice_changeupper(struct net_device *dev,
-   struct netdev_notifier_changeupper_info *info)
+static int ocelot_netdevice_lag_join(struct net_device *dev,
+struct net_device *bond,
+struct netdev_lag_upper_info *info,
+struct netlink_ext_ack *extack)
 {
struct ocelot_port_private *priv = netdev_priv(dev);
struct ocelot_port *ocelot_port = &priv->port;
struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
int port = priv->chip_port;
+   int err;
+
+   err = ocelot_port_lag_join(ocelot, port, bond, info);
+   if (err == -EOPNOTSUPP) {
+   NL_SET_ERR_MSG_MOD(extack, "Offloading not supported");
+   return 0;
+   }
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = ocelot_netdevice_bridge_join(dev, bridge_dev, extack);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   ocelot_port_lag_leave(ocelot, port, bond);
+   return err;
+}
+
+static int ocelot_netdevice_lag_leave(struct net_device *dev,
+ struct net_device *bond)
+{
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
+   int port = priv->chip_port;
+
+   ocelot_port_lag_leave(ocelot, port, bond);
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   return ocelot_netdevice_bridge_leave(dev, bridge_dev);
+}
+
+static int ocelot_netdevice_changeupper(struct net_device *dev,
+   struct netdev_notifier_changeupper_info *info)
+{
+   struct netlink_ext_ack *extack;
int err = 0;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
-   if (info->linking) {
-   err = ocelot_netdevice_bridge_join(ocelot, port,
-  info->upper_

[PATCH v4 net-next 09/11] net: dsa: sync up switchdev objects and port attributes when joining the bridge

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

If we join an already-created bridge port, such as a bond master
interface, then we can miss the initial switchdev notifications emitted
by the bridge for this port, while it wasn't offloaded by anybody.
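
The sync logic in the diff below follows one pattern throughout: pull a piece of state from the bridge, apply it, and treat -EOPNOTSUPP as "this driver does not offload that" instead of as a failure. A small C sketch of that pattern, assuming hypothetical names (sync_step, switchdev_sync() and the example steps are not part of DSA):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One step of the sync: returns 0, -EOPNOTSUPP, or a real error. */
typedef int (*sync_step)(void *port);

/* Run every step in order. A step the hardware does not support is
 * skipped; any other error aborts the whole sync.
 */
static int switchdev_sync(void *port, const sync_step *steps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = steps[i](port);

		if (err && err != -EOPNOTSUPP)
			return err;
	}

	return 0;
}

/* Example steps; "port" is just a counter of applied steps here. */
static int step_ok(void *port)
{
	++*(int *)port;
	return 0;
}

static int step_unsupported(void *port)
{
	(void)port;
	return -EOPNOTSUPP;
}

static int step_fail(void *port)
{
	(void)port;
	return -EBUSY;
}
```

With the steps { step_ok, step_unsupported, step_ok } the sync succeeds and applies both supported steps, while { step_ok, step_fail, step_ok } stops at the failing step and returns its error.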

Signed-off-by: Vladimir Oltean 
---
 net/dsa/dsa_priv.h |  3 +++
 net/dsa/port.c | 46 ++
 net/dsa/slave.c|  4 ++--
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index b8778c5d8529..92282de54230 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -262,6 +262,9 @@ static inline bool dsa_tree_offloads_bridge_port(struct dsa_switch_tree *dst,
 
 /* slave.c */
 extern const struct dsa_device_ops notag_netdev_ops;
+extern struct notifier_block dsa_slave_switchdev_notifier;
+extern struct notifier_block dsa_slave_switchdev_blocking_notifier;
+
 void dsa_slave_mii_bus_init(struct dsa_switch *ds);
 int dsa_slave_create(struct dsa_port *dp);
 void dsa_slave_destroy(struct net_device *slave_dev);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index c712bf3da0a0..01e30264b25b 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -170,12 +170,46 @@ static void dsa_port_clear_brport_flags(struct dsa_port *dp)
 static int dsa_port_switchdev_sync(struct dsa_port *dp,
   struct netlink_ext_ack *extack)
 {
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   struct net_device *br = dp->bridge_dev;
int err;
 
err = dsa_port_inherit_brport_flags(dp, extack);
if (err)
return err;
 
+   err = dsa_port_set_state(dp, br_port_get_stp_state(brport_dev));
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = dsa_port_vlan_filtering(dp, br_vlan_enabled(br), extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = dsa_port_mrouter(dp->cpu_dp, br_multicast_router(br), extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = dsa_port_ageing_time(dp, br_get_ageing_time(br));
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = br_mdb_replay(br, brport_dev,
+   &dsa_slave_switchdev_blocking_notifier,
+   extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = br_fdb_replay(br, brport_dev, &dsa_slave_switchdev_notifier);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
+   err = br_vlan_replay(br, brport_dev,
+&dsa_slave_switchdev_blocking_notifier,
+extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -198,6 +232,18 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
 * so allow it to be in BR_STATE_FORWARDING to be kept functional
 */
dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
+
+   /* VLAN filtering is handled by dsa_switch_bridge_leave */
+
+   /* Some drivers treat the notification for having a local multicast
+* router by allowing multicast to be flooded to the CPU, so we should
+* allow this in standalone mode too.
+*/
+   dsa_port_mrouter(dp->cpu_dp, true, NULL);
+
+   /* Ageing time may be global to the switch chip, so don't change it
+* here because we have no good reason (or value) to change it to.
+*/
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 1ff48be476bb..c51e52418a62 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2392,11 +2392,11 @@ static struct notifier_block dsa_slave_nb __read_mostly = {
.notifier_call  = dsa_slave_netdevice_event,
 };
 
-static struct notifier_block dsa_slave_switchdev_notifier = {
+struct notifier_block dsa_slave_switchdev_notifier = {
.notifier_call = dsa_slave_switchdev_event,
 };
 
-static struct notifier_block dsa_slave_switchdev_blocking_notifier = {
+struct notifier_block dsa_slave_switchdev_blocking_notifier = {
.notifier_call = dsa_slave_switchdev_blocking_event,
 };
 
-- 
2.25.1



[PATCH v4 net-next 08/11] net: dsa: inherit the actual bridge port flags at join time

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA currently assumes that the bridge port starts off with this
constellation of bridge port flags:

- learning on
- unicast flooding on
- multicast flooding on
- broadcast flooding on

just by virtue of code copy-pasta from the bridge layer (new_nbp).
This was a simple enough strategy thus far, because the 'bridge join'
moment always coincided with the 'bridge port creation' moment.

But with sandwiched interfaces, such as:

 br0
  |
bond0
  |
 swp0

it may happen that the user has had time to change the bridge port flags
of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
assume that the bridge port flags are those determined by new_nbp, when
in fact this can happen:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set bond0 type bridge_slave learning off
ip link set swp0 master bond0

Now swp0 has learning enabled, bond0 has learning disabled. Not nice.

Fix this by "dumpster diving" through the actual bridge port flags with
br_port_flag_is_set, at bridge join time.

We use this opportunity to split dsa_port_change_brport_flags into two
distinct functions called dsa_port_inherit_brport_flags and
dsa_port_clear_brport_flags, now that the implementation for the two
cases is no longer similar. This patch also creates two functions called
dsa_port_switchdev_sync and dsa_port_switchdev_unsync which collect what
we have so far, even if that's asymmetrical. More is going to be added
in the next patch.
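
As a rough illustration of the per-flag inheritance, here is a standalone C sketch. The FLAG_* constants, struct brport_flags and struct hw_port are hypothetical stand-ins for BR_LEARNING and friends, struct switchdev_brport_flags and the driver state, and a plain loop replaces for_each_set_bit so that the example builds outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for BR_LEARNING, BR_FLOOD, etc. */
#define FLAG_LEARNING    (1u << 0)
#define FLAG_FLOOD       (1u << 1)
#define FLAG_MCAST_FLOOD (1u << 2)
#define FLAG_BCAST_FLOOD (1u << 3)

struct brport_flags {
	uint32_t mask;	/* which flags this request changes */
	uint32_t val;	/* new value of those flags */
};

/* Hypothetical hardware state: the flags currently applied to the port. */
struct hw_port {
	uint32_t flags;
};

/* Apply one request; only the bits selected by the mask are touched. */
static int hw_port_set_flags(struct hw_port *hw, struct brport_flags req)
{
	hw->flags = (hw->flags & ~req.mask) | (req.val & req.mask);
	return 0;
}

/* Copy the flags the bridge port actually has, one request per flag,
 * instead of assuming the defaults of a freshly created bridge port.
 */
static int port_inherit_brport_flags(struct hw_port *hw, uint32_t brport_flags)
{
	const uint32_t mask = FLAG_LEARNING | FLAG_FLOOD |
			      FLAG_MCAST_FLOOD | FLAG_BCAST_FLOOD;
	int bit, err;

	for (bit = 0; bit < 32; bit++) {
		struct brport_flags req = {0};

		if (!(mask & (1u << bit)))
			continue;

		req.mask = 1u << bit;
		if (brport_flags & (1u << bit))
			req.val = 1u << bit;

		err = hw_port_set_flags(hw, req);
		if (err)
			return err;
	}

	return 0;
}
```

If the bridge port has learning turned off, a port that starts with all four flags set ends up with learning cleared and the three flooding flags still set.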

Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 123 -
 1 file changed, 82 insertions(+), 41 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index fcbe5b1545b8..c712bf3da0a0 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -122,28 +122,84 @@ void dsa_port_disable(struct dsa_port *dp)
rtnl_unlock();
 }
 
-static void dsa_port_change_brport_flags(struct dsa_port *dp,
-bool bridge_offload)
+static int dsa_port_inherit_brport_flags(struct dsa_port *dp,
+struct netlink_ext_ack *extack)
 {
-   struct switchdev_brport_flags flags;
-   int flag;
+   const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
+  BR_BCAST_FLOOD;
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   int flag, err;
 
-   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
-   if (bridge_offload)
-   flags.val = flags.mask;
-   else
-   flags.val = flags.mask & ~BR_LEARNING;
+   for_each_set_bit(flag, &mask, 32) {
+   struct switchdev_brport_flags flags = {0};
+
+   flags.mask = BIT(flag);
 
-   for_each_set_bit(flag, &flags.mask, 32) {
-   struct switchdev_brport_flags tmp;
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val = BIT(flag);
+
+   err = dsa_port_bridge_flags(dp, flags, extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+   }
 
-   tmp.val = flags.val & BIT(flag);
-   tmp.mask = BIT(flag);
+   return 0;
+}
 
-   dsa_port_bridge_flags(dp, tmp, NULL);
+static void dsa_port_clear_brport_flags(struct dsa_port *dp)
+{
+   const unsigned long val = BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+   const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
+  BR_BCAST_FLOOD;
+   int flag, err;
+
+   for_each_set_bit(flag, &mask, 32) {
+   struct switchdev_brport_flags flags = {0};
+
+   flags.mask = BIT(flag);
+   flags.val = val & BIT(flag);
+
+   err = dsa_port_bridge_flags(dp, flags, NULL);
+   if (err && err != -EOPNOTSUPP)
+   dev_err(dp->ds->dev,
+   "failed to clear bridge port flag %lu: %pe\n",
+   flags.val, ERR_PTR(err));
}
 }
 
+static int dsa_port_switchdev_sync(struct dsa_port *dp,
+  struct netlink_ext_ack *extack)
+{
+   int err;
+
+   err = dsa_port_inherit_brport_flags(dp, extack);
+   if (err)
+   return err;
+
+   return 0;
+}
+
+static void dsa_port_switchdev_unsync(struct dsa_port *dp)
+{
+   /* Configure the port for standalone mode (no address learning,
+* flood everything).
+* The bridge only emits SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS events
+* when the user requests it through netlink or sysfs, but not
+* automatically at port join or leave, so we need to handle resetting
+* the brport flags ourselves. But we even prefer it that way, because
+* otherwise, some setups might never get the notification they need,
+   

[PATCH v4 net-next 03/11] net: bridge: add helper to replay port and host-joined mdb entries

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

I have a system with DSA ports, and udhcpcd is configured to bring
interfaces up as soon as they are created.

I create a bridge as follows:

ip link add br0 type bridge

As soon as I create the bridge and udhcpcd brings it up, I also have
avahi which automatically starts sending IPv6 packets to advertise some
local services, and because of that, the br0 bridge joins the following
IPv6 groups due to the code path detailed below:

33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0

br_dev_xmit
-> br_multicast_rcv
   -> br_ip6_multicast_add_group
      -> __br_multicast_add_group
         -> br_multicast_host_join
            -> br_mdb_notify

This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.

Then when we add a port to br0:

ip link set swp0 master br0

the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.

The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.

Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what is the new guy's notifier block, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.
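
The difference between notifying the whole chain and replaying to a single notifier block can be sketched in standalone C. struct notifier, chain_notify() and replay_to() are hypothetical simplifications for illustration, not the kernel's struct notifier_block API.

```c
#include <assert.h>
#include <stddef.h>

struct notifier;
typedef int (*notifier_fn)(struct notifier *nb, unsigned long event,
			   void *data);

/* Hypothetical, simplified notifier block. */
struct notifier {
	notifier_fn call;
	struct notifier *next;
	int seen;	/* number of events this listener has received */
};

/* Example listener: just counts what it is told. */
static int count_event(struct notifier *nb, unsigned long event, void *data)
{
	(void)event;
	(void)data;
	nb->seen++;
	return 0;
}

/* Live path: every registered listener gets the event. */
static void chain_notify(struct notifier *head, unsigned long event,
			 void *data)
{
	for (struct notifier *nb = head; nb; nb = nb->next)
		nb->call(nb, event, data);
}

/* Replay path: only the late joiner's block is called, so listeners
 * that already saw the original events do not get duplicates.
 */
static int replay_to(struct notifier *nb, const unsigned long *events,
		     size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = nb->call(nb, events[i], NULL);

		if (err)
			return err;
	}

	return 0;
}
```

In this model the caller of replay_to() has to supply its own block, which is why the pull has to be initiated by the driver: only the driver knows which block is its own.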

To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.
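
A sketch of that two-phase approach in standalone C, assuming hypothetical names throughout (struct mdb_entry, mdb_replay(), record_group()). The first loop stands in for the RCU read-side section and only copies entries; the second loop delivers them once it is safe to sleep. There is no real RCU here, and malloc() takes the place of an atomic allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical bridge-side entry and its queued copy. */
struct mdb_entry {
	int group;
	struct mdb_entry *next;
};

struct queued {
	int group;
	struct queued *next;
};

typedef int (*deliver_fn)(int group, void *ctx);

static int mdb_replay(const struct mdb_entry *entries, deliver_fn deliver,
		      void *ctx)
{
	struct queued *head = NULL, **tail = &head, *q;
	int err = 0;

	/* Phase 1: walk the list and copy. Nothing here may sleep or call
	 * into the receiver.
	 */
	for (const struct mdb_entry *e = entries; e; e = e->next) {
		q = malloc(sizeof(*q));
		if (!q) {
			err = -ENOMEM;
			break;
		}
		q->group = e->group;
		q->next = NULL;
		*tail = q;
		tail = &q->next;
	}

	/* Phase 2: deliver in original order and free the copies. After
	 * the first error the remaining copies are only freed.
	 */
	while (head) {
		q = head;
		head = q->next;
		if (!err)
			err = deliver(q->group, ctx);
		free(q);
	}

	return err;
}

/* Example receiver: records the groups it was given. */
struct replay_log {
	int groups[8];
	int n;
};

static int record_group(int group, void *ctx)
{
	struct replay_log *log = ctx;

	if (log->n >= 8)
		return -ENOSPC;
	log->groups[log->n++] = group;
	return 0;
}
```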

There was some opportunity for reuse between br_mdb_switchdev_host_port,
br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev
mdb object is created, so a helper was created.

Suggested-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |   9 +++
 include/net/switchdev.h   |   1 +
 net/bridge/br_mdb.c   | 148 +-
 3 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index ebd16495459c..f6472969bb44 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);
 bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);
 bool br_multicast_enabled(const struct net_device *dev);
 bool br_multicast_router(const struct net_device *dev);
+int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline int br_multicast_list_adjacent(struct net_device *dev,
 struct list_head *br_ip_list)
@@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)
 {
return false;
 }
+static inline int br_mdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb,
+   struct netlink_ext_ack *extack)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index b7fc7d0f54e2..8c3218177136 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -68,6 +68,7 @@ enum switchdev_obj_id {
 };
 
 struct switchdev_obj {
+   struct list_head list;
struct net_device *orig_dev;
enum switchdev_obj_id id;
u32 flags;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 8846c5bcd075..95fa4af0e8dd 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -506,6 +506,134 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)
kfree(priv);
 }
 
+static void br_switchdev_mdb_populate(struct switchdev_obj_port_mdb *mdb,
+ const struct n

[PATCH v4 net-next 05/11] net: bridge: add helper to replay VLANs installed on port

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

Currently this simple setup with DSA:

ip link add br0 type bridge vlan_filtering 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

will not work because the bridge has created the PVID in br_add_if ->
nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
but that was too early, since swp0 was not yet a lower of bond0, so it
had no reason to act upon that notification.

We need a helper in the bridge to replay the switchdev VLAN objects that
were notified since the bridge port creation, because some of them may
have been missed.

As opposed to the br_mdb_replay function, the vg->vlan_list write side
protection is offered by the rtnl_mutex which is sleepable, so we don't
need to queue up the objects in atomic context, we can replay them right
away.
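
Because nothing has to be queued, the VLAN replay reduces to a direct walk over the port's VLAN list. A standalone C sketch under stated assumptions: struct vlan_entry, struct vlan_obj, VLAN_FLAG_PVID and vlan_replay() are hypothetical simplifications, and the sketch stores the callback's return value before testing it so that the first error stops the walk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_FLAG_PVID 0x1u	/* hypothetical flag value */

/* Hypothetical bridge-side VLAN entry. */
struct vlan_entry {
	uint16_t vid;
	bool in_use;	/* entries not in use are not replayed */
};

/* Hypothetical object handed to the driver. */
struct vlan_obj {
	uint16_t vid;
	unsigned int flags;
};

typedef int (*vlan_add_fn)(const struct vlan_obj *obj, void *ctx);

static int vlan_replay(const struct vlan_entry *vlans, size_t n,
		       uint16_t pvid, vlan_add_fn add, void *ctx)
{
	for (size_t i = 0; i < n; i++) {
		struct vlan_obj obj = {
			.vid = vlans[i].vid,
			.flags = vlans[i].vid == pvid ? VLAN_FLAG_PVID : 0,
		};
		int err;

		if (!vlans[i].in_use)
			continue;

		err = add(&obj, ctx);
		if (err)
			return err;
	}

	return 0;
}

/* Example receiver: a table that rejects VIDs above a limit. */
struct vlan_table {
	uint16_t max_vid;
	unsigned int count;
	uint16_t pvid;
};

static int table_add(const struct vlan_obj *obj, void *ctx)
{
	struct vlan_table *t = ctx;

	if (obj->vid > t->max_vid)
		return -ERANGE;
	t->count++;
	if (obj->flags & VLAN_FLAG_PVID)
		t->pvid = obj->vid;
	return 0;
}
```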

Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h | 10 ++
 net/bridge/br_vlan.c  | 73 +++
 2 files changed, 83 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b564c4486a45..2cc35038a8ca 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -111,6 +111,8 @@ int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid);
 int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto);
 int br_vlan_get_info(const struct net_device *dev, u16 vid,
 struct bridge_vlan_info *p_vinfo);
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline bool br_vlan_enabled(const struct net_device *dev)
 {
@@ -137,6 +139,14 @@ static inline int br_vlan_get_info(const struct net_device *dev, u16 vid,
 {
return -EINVAL;
 }
+
+static inline int br_vlan_replay(struct net_device *br_dev,
+struct net_device *dev,
+struct notifier_block *nb,
+struct netlink_ext_ack *extack)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 8829f621b8ec..ca8daccff217 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -1751,6 +1751,79 @@ void br_vlan_notify(const struct net_bridge *br,
kfree_skb(skb);
 }
 
+static int br_vlan_replay_one(struct notifier_block *nb,
+ struct net_device *dev,
+ struct switchdev_obj_port_vlan *vlan,
+ struct netlink_ext_ack *extack)
+{
+   struct switchdev_notifier_port_obj_info obj_info = {
+   .info = {
+   .dev = dev,
+   .extack = extack,
+   },
+   .obj = &vlan->obj,
+   };
+   int err;
+
+   err = nb->notifier_call(nb, SWITCHDEV_PORT_OBJ_ADD, &obj_info);
+   return notifier_to_errno(err);
+}
+
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack)
+{
+   struct net_bridge_vlan_group *vg;
+   struct net_bridge_vlan *v;
+   struct net_bridge_port *p;
+   struct net_bridge *br;
+   int err = 0;
+   u16 pvid;
+
+   ASSERT_RTNL();
+
+   if (!netif_is_bridge_master(br_dev))
+   return -EINVAL;
+
+   if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   if (netif_is_bridge_master(dev)) {
+   br = netdev_priv(dev);
+   vg = br_vlan_group(br);
+   p = NULL;
+   } else {
+   p = br_port_get_rtnl(dev);
+   if (WARN_ON(!p))
+   return -EINVAL;
+   vg = nbp_vlan_group(p);
+   br = p->br;
+   }
+
+   if (!vg)
+   return 0;
+
+   pvid = br_get_pvid(vg);
+
+   list_for_each_entry(v, &vg->vlan_list, vlist) {
+   struct switchdev_obj_port_vlan vlan = {
+   .obj.orig_dev = dev,
+   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
+   .flags = br_vlan_flags(v, pvid),
+   .vid = v->vid,
+   };
+
+   if (!br_vlan_should_use(v))
+   continue;
+
+   err = br_vlan_replay_one(nb, dev, &vlan, extack);
+   if (err)
+   return err;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL_GPL(br_vlan_replay);
+
 /* check if v_curr can enter a range ending in range_end */
 bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr,
 const struct net_bridge_vlan *range_end)
-- 
2.25.1



[PATCH v4 net-next 06/11] net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA can properly detect and offload this sequence of operations:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set swp0 master bond0
ip link set bond0 master br0

But not this one:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

Actually the second one is more complicated, due to the elapsed time
between the enslavement of bond0 and the offloading of it via swp0, a
lot of things could have happened to the bond0 bridge port in terms of
switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
a bit of a can of worms, and making sure that the DSA port's state is in
sync with this already existing bridge port is handled in the next
patches.
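
To make the unwinding order in this patch easier to follow, here is a
userspace sketch of the same "join the LAG, then maybe join the bridge"
sequence. It is only a model: struct port, the *_hw helpers and the
fail_bridge_join knob are invented for the example and do not exist in DSA.

```c
#include <assert.h>

/* Hypothetical model of a port; this is not struct dsa_port. */
struct port {
	int in_lag;
	int in_bridge;
	int fail_bridge_join;	/* test knob: force the second step to fail */
};

int lag_join_hw(struct port *p)
{
	p->in_lag = 1;
	return 0;
}

void lag_leave_hw(struct port *p)
{
	p->in_lag = 0;
}

int bridge_join_hw(struct port *p)
{
	if (p->fail_bridge_join)
		return -1;

	p->in_bridge = 1;
	return 0;
}

/* Join the LAG; if the LAG already has a bridge upper, join that too.
 * On failure, undo the completed steps in reverse order.
 */
int port_lag_join(struct port *p, int lag_is_bridged)
{
	int err;

	err = lag_join_hw(p);
	if (err)
		return err;

	if (!lag_is_bridged)
		return 0;

	err = bridge_join_hw(p);
	if (err)
		goto err_bridge_join;

	return 0;

err_bridge_join:
	lag_leave_hw(p);
	return err;
}
```

The point of the labels is that a failed bridge join leaves the port exactly
as it was before the LAG join was attempted.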

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
Reviewed-by: Tobias Waldekranz 
---
 net/dsa/port.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index c9c6d7ab3f47..d39262a9fe0e 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -249,17 +249,31 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
.lag = lag,
.info = uinfo,
};
+   struct net_device *bridge_dev;
int err;
 
dsa_lag_map(dp->ds->dst, lag);
dp->lag_dev = lag;
 
err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info);
-   if (err) {
-   dp->lag_dev = NULL;
-   dsa_lag_unmap(dp->ds->dst, lag);
-   }
+   if (err)
+   goto err_lag_join;
 
+   bridge_dev = netdev_master_upper_dev_get(lag);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = dsa_port_bridge_join(dp, bridge_dev);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info);
+err_lag_join:
+   dp->lag_dev = NULL;
+   dsa_lag_unmap(dp->ds->dst, lag);
return err;
 }
 
-- 
2.25.1



[PATCH v4 net-next 04/11] net: bridge: add helper to replay port and local fdb entries

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

When a switchdev port starts offloading a LAG that is already in a
bridge and has an FDB entry pointing to it:

ip link set bond0 master br0
bridge fdb add dev bond0 00:01:02:03:04:05 master static
ip link set swp0 master bond0

the switchdev driver will have no idea that this FDB entry is there,
because it missed the switchdev event emitted at its creation.

Ido Schimmel pointed this out during a discussion about challenges with
switchdev offloading of stacked interfaces between the physical port and
the bridge, and recommended to just catch that condition and deny the
CHANGEUPPER event:
https://lore.kernel.org/netdev/20210210105949.gb287...@shredder.lan/

But in fact, we might need to deal with the hard thing anyway, which is
to replay all FDB addresses relevant to this port, because it isn't just
static FDB entries, but also local addresses (ones that are not
forwarded but terminated by the bridge). There, we can't just say 'oh
yeah, there was an upper already so I'm not joining that'.

So, similar to the logic for replaying MDB entries, add a function that
must be called by individual switchdev drivers and replays local FDB
entries as well as ones pointing towards a bridge port. This time, we
use the atomic switchdev notifier block, since that's what FDB entries
expect for some reason.
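
The selection rule used by the replay (entries pointing at the port itself,
plus local entries that belong to the bridge) can be sketched in userspace
as below. The types and names are hypothetical; in particular, BRIDGE_ID is
just the sketch's way of saying "this entry has no destination port".

```c
#include <assert.h>
#include <stddef.h>

#define BRIDGE_ID 0	/* entries with no destination port belong to the bridge */

/* Hypothetical model of an FDB entry; not struct net_bridge_fdb_entry. */
struct fdb_entry {
	int dst;		/* port id, or BRIDGE_ID for a local entry */
	unsigned long long mac;
};

/* Copy into out[] the addresses that a driver offloading `port` should be
 * told about: those on the port itself and those on the bridge. Entries
 * pointing at other ports are skipped. Returns the number emitted.
 */
size_t fdb_replay(const struct fdb_entry *fdb, size_t n, int port,
		  unsigned long long *out, size_t out_len)
{
	size_t emitted = 0;

	for (size_t i = 0; i < n; i++) {
		if (fdb[i].dst != BRIDGE_ID && fdb[i].dst != port)
			continue;
		if (emitted == out_len)
			break;
		out[emitted++] = fdb[i].mac;
	}

	return emitted;
}
```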

Reported-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |  9 +++
 net/bridge/br_fdb.c   | 50 +++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index f6472969bb44..b564c4486a45 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -147,6 +147,8 @@ void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
 clock_t br_get_ageing_time(struct net_device *br_dev);
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -175,6 +177,13 @@ static inline clock_t br_get_ageing_time(struct net_device *br_dev)
 {
return 0;
 }
+
+static inline int br_fdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index b7490237f3fc..698b79747d32 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -726,6 +726,56 @@ static inline size_t fdb_nlmsg_size(void)
+ nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */
 }
 
+static int br_fdb_replay_one(struct notifier_block *nb,
+struct net_bridge_fdb_entry *fdb,
+struct net_device *dev)
+{
+   struct switchdev_notifier_fdb_info item;
+   int err;
+
+   item.addr = fdb->key.addr.addr;
+   item.vid = fdb->key.vlan_id;
+   item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
+   item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
+   item.info.dev = dev;
+
+   err = nb->notifier_call(nb, SWITCHDEV_FDB_ADD_TO_DEVICE, &item);
+   return notifier_to_errno(err);
+}
+
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb)
+{
+   struct net_bridge_fdb_entry *fdb;
+   struct net_bridge *br;
+   int err = 0;
+
+   if (!netif_is_bridge_master(br_dev) || !netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   br = netdev_priv(br_dev);
+
+   rcu_read_lock();
+
+   hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+   struct net_bridge_port *dst = READ_ONCE(fdb->dst);
+   struct net_device *dst_dev;
+
+   dst_dev = dst ? dst->dev : br->dev;
+   if (dst_dev != br_dev && dst_dev != dev)
+   continue;
+
+   err = br_fdb_replay_one(nb, fdb, dst_dev);
+   if (err)
+   break;
+   }
+
+   rcu_read_unlock();
+
+   return err;
+}
+EXPORT_SYMBOL_GPL(br_fdb_replay);
+
 static void fdb_notify(struct net_bridge *br,
   const struct net_bridge_fdb_entry *fdb, int type,
   bool swdev_notify)
-- 
2.25.1



[PATCH v4 net-next 02/11] net: bridge: add helper to retrieve the current ageing time

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:

sysfs/ioctl/netlink
-> br_set_ageing_time
   -> __set_ageing_time

therefore not at bridge port creation time, so:
(a) switchdev drivers have to hardcode the initial value for the address
ageing time, because they didn't get any notification
(b) that hardcoded value can be out of sync, if the user changes the
ageing time before enslaving the port to the bridge

We need a helper in the bridge, such that switchdev drivers can query
the current value of the bridge ageing time when they start offloading
it.
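
For readers not used to the units involved: the helper returns the ageing
time in clock_t ticks, converted from the jiffies value the bridge stores.
A userspace model of that arithmetic could look like the sketch below. It
assumes 100 clock_t ticks per second and a tick rate that is a multiple of
that; the model_* names are invented and are not the kernel helpers.

```c
#include <assert.h>

#define MODEL_USER_HZ 100UL	/* assumed clock_t ticks per second */

/* Model of a jiffies -> clock_t conversion for the simple case where hz
 * is a multiple of MODEL_USER_HZ.
 */
unsigned long model_jiffies_to_clock_t(unsigned long jiffies, unsigned long hz)
{
	return jiffies / (hz / MODEL_USER_HZ);
}

/* clock_t ticks -> milliseconds, a unit that hardware may want instead. */
unsigned long model_clock_t_to_msecs(unsigned long ticks)
{
	return ticks * (1000UL / MODEL_USER_HZ);
}
```

With hz = 1000, an ageing time of 300 seconds is 300000 jiffies, which this
model turns into 30000 clock_t ticks, i.e. 300000 ms.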

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
Reviewed-by: Tobias Waldekranz 
---
 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 13 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 920d3a02cc68..ebd16495459c 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -137,6 +137,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
+clock_t br_get_ageing_time(struct net_device *br_dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -160,6 +161,11 @@ static inline u8 br_port_get_stp_state(const struct net_device *dev)
 {
return BR_STATE_DISABLED;
 }
+
+static inline clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   return 0;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 86b5e05d3f21..3dafb6143cff 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -639,6 +639,19 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
return 0;
 }
 
+clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   struct net_bridge *br;
+
+   if (!netif_is_bridge_master(br_dev))
+   return 0;
+
+   br = netdev_priv(br_dev);
+
+   return jiffies_to_clock_t(br->ageing_time);
+}
+EXPORT_SYMBOL_GPL(br_get_ageing_time);
+
 /* called under bridge lock */
 void __br_set_topology_change(struct net_bridge *br, unsigned char val)
 {
-- 
2.25.1



[PATCH v4 net-next 01/11] net: bridge: add helper for retrieving the current bridge port STP state

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

It may happen that we have the following topology with DSA or any other
switchdev driver with LAG offload:

ip link add br0 type bridge stp_state 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
ip link set swp1 master bond0

STP decides that it should put bond0 into the BLOCKING state, and
that's that. The ports that are actively listening for the switchdev
port attributes emitted for the bond0 bridge port (because they are
offloading it) and have the honor of seeing that switchdev port
attribute can react to it, so we can program swp0 and swp1 into the
BLOCKING state.

But if then we do:

ip link set swp2 master bond0

then as far as the bridge is concerned, nothing has changed: it still
has one bridge port. But this new bridge port will not see any STP state
change notification and will remain FORWARDING, which is how the
standalone code leaves it in.

We need a function in the bridge driver which retrieves the current STP
state, such that drivers can synchronize to it when they may have missed
switchdev events.
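
The scenario above can be modelled in a few lines of userspace C. This is
only a sketch with invented names (struct brport, struct hwport and the
ST_* states are not kernel types): brport_set_state() plays the role of the
notification that existing LAG members receive, and hwport_join() shows a
late joiner querying the current state instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not kernel types. */
enum stp_state {
	ST_DISABLED,
	ST_LISTENING,
	ST_LEARNING,
	ST_FORWARDING,
	ST_BLOCKING,
};

struct brport {
	enum stp_state state;
};

struct hwport {
	enum stp_state state;
	const struct brport *upper;
};

/* Standalone ports forward by default. */
void hwport_init(struct hwport *hw)
{
	hw->state = ST_FORWARDING;
	hw->upper = NULL;
}

/* The bridge notifies only the ports that offload the bridge port now. */
void brport_set_state(struct brport *bp, struct hwport **members, size_t n,
		      enum stp_state state)
{
	bp->state = state;
	for (size_t i = 0; i < n; i++)
		members[i]->state = state;
}

/* A late joiner reads the current state rather than waiting for a
 * notification that will never come.
 */
void hwport_join(struct hwport *hw, const struct brport *bp)
{
	hw->upper = bp;
	hw->state = bp->state;
}
```

Without the read in hwport_join(), the third port would keep the state set
by hwport_init().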

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
Reviewed-by: Tobias Waldekranz 
---
 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 14 ++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b979005ea39c..920d3a02cc68 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -136,6 +136,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
__u16 vid);
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
+u8 br_port_get_stp_state(const struct net_device *dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -154,6 +155,11 @@ br_port_flag_is_set(const struct net_device *dev, unsigned long flag)
 {
return false;
 }
+
+static inline u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   return BR_STATE_DISABLED;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 21c6781906aa..86b5e05d3f21 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -64,6 +64,20 @@ void br_set_state(struct net_bridge_port *p, unsigned int state)
}
 }
 
+u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   struct net_bridge_port *p;
+
+   ASSERT_RTNL();
+
+   p = br_port_get_rtnl(dev);
+   if (!p)
+   return BR_STATE_DISABLED;
+
+   return p->state;
+}
+EXPORT_SYMBOL_GPL(br_port_get_stp_state);
+
 /* called under bridge lock */
 struct net_bridge_port *br_get_port(struct net_bridge *br, u16 port_no)
 {
-- 
2.25.1



[PATCH v4 net-next 00/11] Better support for sandwiched LAGs with bridge and DSA

2021-03-22 Thread Vladimir Oltean
From: Vladimir Oltean 

Changes in v4:
- Added missing EXPORT_SYMBOL_GPL
- Using READ_ONCE(fdb->dst)
- Split patches into (a) adding the bridge helpers (b) making DSA use them
- br_mdb_replay went back to the v1 approach where it allocated memory
  in atomic context
- Created a br_switchdev_mdb_populate which reduces some of the code
  duplication
- Fixed the error message in dsa_port_clear_brport_flags
- Replaced "dsa_port_vlan_filtering(dp, br, extack)" with
  "dsa_port_vlan_filtering(dp, br_vlan_enabled(br), extack)" (duh)
- Added review tags (sorry if I missed any)

The objective of this series is to make LAG uppers on top of switchdev
ports work regardless of which order we link interfaces to their masters
(first make the port join the LAG, then the LAG join the bridge, or the
other way around).

There was a design decision to be made in patches 2-4 on whether we
should adopt the "push" model (which attempts to solve the problem
centrally, in the bridge layer) where the driver just calls:

  switchdev_bridge_port_offloaded(brport_dev,
                                  &atomic_notifier_block,
                                  &blocking_notifier_block,
                                  extack);

and the bridge just replays the entire collection of switchdev port
attributes and objects that it has, in some predefined order and with
some predefined error handling logic;


or the "pull" model (which attempts to solve the problem by giving the
driver the rope to hang itself), where the driver, apart from calling:

  switchdev_bridge_port_offloaded(brport_dev, extack);

has the task of "dumpster diving" (as Tobias puts it) through the bridge
attributes and objects by itself, by calling:

  - br_vlan_replay
  - br_fdb_replay
  - br_mdb_replay
  - br_vlan_enabled
  - br_port_flag_is_set
  - br_port_get_stp_state
  - br_multicast_router
  - br_get_ageing_time

(not necessarily all of them, and not necessarily in this order, and
with driver-defined error handling).

Even though I'm not in love myself with the "pull" model, I chose it
because there is a fundamental trick with replaying switchdev events
like this:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0 <- this will replay the objects once for
 the bond0 bridge port, and the swp0
 switchdev port will process them
ip link set swp1 master bond0 <- this will replay the objects again for
 the bond0 bridge port, and the swp1
 switchdev port will see them, but swp0
 will see them for the second time now

Basically I believe that it is implementation defined whether the driver
wants to error out on switchdev objects seen twice on a port, and the
bridge should not enforce a certain model for that. For example, for FDB
entries added to a bonding interface, the underlying switchdev driver
might have an abstraction for just that: an FDB entry pointing towards a
logical (as opposed to physical) port. So when the second port joins the
bridge, it doesn't really need to replay FDB entries, since there is
already at least one hardware port which has been receiving those
events, and the FDB entries don't need to be added a second time to the
same logical port.
In the other corner, we have the drivers that handle switchdev port
attributes on a LAG as individual switchdev port attributes on physical
ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
helper facilitates this: it is a fan-out from a single orig_dev towards
multiple lowers that pass the check_cb().
But that's the point: switchdev_handle_port_attr_set is just a helper
which the driver _opts_ to use. The bridge can't enforce the "push"
model, because that would assume that all drivers handle port attributes
in the same way, which is probably false.
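
To illustrate the first kind of driver mentioned above (one that keeps FDB
entries per logical port), here is a userspace sketch of an "add" operation
that tolerates seeing the same object twice. The table layout, its size and
the lag_fdb_add name are assumptions made for the example, not taken from
any real driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAG_FDB_MAX 8	/* arbitrary size for the sketch */

/* Hypothetical hardware table keyed by a logical (LAG) port. */
struct lag_fdb {
	uint64_t mac[LAG_FDB_MAX];
	uint16_t vid[LAG_FDB_MAX];
	size_t n;
};

/* Returns 1 if the entry was newly installed, 0 if it was already present
 * (for example a replayed duplicate), -1 if the table is full.
 */
int lag_fdb_add(struct lag_fdb *t, uint64_t mac, uint16_t vid)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->mac[i] == mac && t->vid[i] == vid)
			return 0;

	if (t->n == LAG_FDB_MAX)
		return -1;

	t->mac[t->n] = mac;
	t->vid[t->n] = vid;
	t->n++;

	return 1;
}
```

A driver built like this can ignore the second replay; a driver that fans
objects out to physical ports would make a different choice, which is why
the policy is left to the driver.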

For this reason, I preferred to go with the "pull" model for this patch
set. Just to see how bad it is for other switchdev drivers to copy-paste
this logic, I added the pull support to ocelot too, and I think it's
pretty manageable.

Vladimir Oltean (11):
  net: bridge: add helper for retrieving the current bridge port STP
state
  net: bridge: add helper to retrieve the current ageing time
  net: bridge: add helper to replay port and host-joined mdb entries
  net: bridge: add helper to replay port and local fdb entries
  net: bridge: add helper to replay VLANs installed on port
  net: dsa: call dsa_port_bridge_join when joining a LAG that is already
in a bridge
  net: dsa: pass extack to dsa_port_{bridge,lag}_join
  net: dsa: inherit the actual bridge port flags at join time
  net: dsa: sync up switchdev objects and port attributes when joining
the bridge
  net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged
LAG
  

Re: [RFC PATCH v2 net-next 14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 04:04:01PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 6:49 PM Vladimir Oltean  wrote:
> > Why would you even want to look at the source net device for forwarding?
> > I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly
> > want to bypass address learning if you can. Maybe also for link-local 
> > traffic.
> 
> Also for trapped traffic (snooping, tc-flower trap action) if the CPU
> sends them back.

This sounds like an interesting use case, please tell me more about what
commands I could run to reinject trapped packets into the hardware data
path.


Re: [PATCH net] net: dsa: don't assign an error value to tag_ops

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 03:26:50PM -0500, George McCollister wrote:
> Use a temporary variable to hold the return value from
> dsa_tag_driver_get() instead of assigning it to dst->tag_ops. Leaving
> an error value in dst->tag_ops can result in dereferencing an invalid
> pointer when a deferred switch configuration happens later.
> 
> Fixes: 357f203bb3b5 ("net: dsa: keep a copy of the tagging protocol in the DSA switch tree")
> 
> Signed-off-by: George McCollister 
> ---

Reviewed-by: Vladimir Oltean 

Just FYI, new lines aren't typically added between the various tags.


Re: [PATCH net] net: dsa: don't assign an error value to tag_ops

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 03:26:50PM -0500, George McCollister wrote:
> Use a temporary variable to hold the return value from
> dsa_tag_driver_get() instead of assigning it to dst->tag_ops. Leaving
> an error value in dst->tag_ops can result in dereferencing an invalid
> pointer when a deferred switch configuration happens later.
> 
> Fixes: 357f203bb3b5 ("net: dsa: keep a copy of the tagging protocol in the DSA switch tree")
> 
> Signed-off-by: George McCollister 
> ---

Who dereferences the invalid pointer? dsa_tree_free I guess?


Re: [RFC v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-22 Thread Vladimir Oltean
Hi Yunsheng,

On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flag set, but queue discipline by-pass does not work for lockless
> qdisc because skb is always enqueued to qdisc even when the qdisc
> is empty, see __dev_xmit_skb().
> 
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for empty lockless qdisc too, which avoids enqueuing
> and dequeuing operation. qdisc->empty is set to false whenever a
> skb is enqueued, see pfifo_fast_enqueue(), and is set to true when
> skb dequeuing return NULL, see pfifo_fast_dequeue().
> 
> There is a data race between enqueue/dequeue and qdisc->empty
> setting, qdisc->empty is only used as a hint, so we need to call
> sch_may_need_requeuing() to see if the queue is really empty and if
> there is requeued skb, which has higher priority than the current
> skb.
> 
> The performance for ip_forward test increases about 10% with this
> patch.
> 
> Signed-off-by: Yunsheng Lin 
> ---
> Hi, Vladimir and Ahmad
>   Please give it a test to see if there is any out of order
> packet for this patch, which has removed the priv->lock added in
> RFC v2.
> 
> There is a data race as below:
> 
>        CPU1                          CPU2
> qdisc_run_begin(q)                    .
>         .                       q->enqueue()
> sch_may_need_requeuing()              .
>     return true                       .
>         .                             .
>         .                             .
>    q->enqueue()                       .
> 
> When above happen, the skb enqueued by CPU1 is dequeued after the
> skb enqueued by CPU2 because sch_may_need_requeuing() return true.
> If there is not qdisc bypass, the CPU1 has better chance to queue
> the skb quicker than CPU2.
> 
> This patch does not take care of the above data race, because I
> view this as similar as below:
> 
> Even at the same time CPU1 and CPU2 write the skb to two socket
> which both heading to the same qdisc, there is no guarantee that
> which skb will hit the qdisc first, because there are a lot of
> factors like interrupt/softirq/cache miss/scheduling affecting
> that.
> 
> So I hope the above data race will not cause problem for Vladimir
> and Ahmad.
> ---

Preliminary results on my test setup look fine, but please allow me to
run the canfdtest overnight, since as you say, races are still
theoretically possible.


Re: [RFC PATCH v2 net-next 16/16] net: bridge: switchdev: let drivers inform which bridge ports are offloaded

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 05:30:52PM +0100, Tobias Waldekranz wrote:
> > ---
> >  .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  4 +-
> >  .../marvell/prestera/prestera_switchdev.c |  7 ++
> >  .../mellanox/mlxsw/spectrum_switchdev.c   |  4 +-
> >  drivers/net/ethernet/mscc/ocelot_net.c|  4 +-
> >  drivers/net/ethernet/rocker/rocker_ofdpa.c|  8 +-
> >  drivers/net/ethernet/ti/am65-cpsw-nuss.c  |  7 +-
> >  drivers/net/ethernet/ti/cpsw_new.c|  6 +-
> 
> Why is not net/dsa included in this change?

I don't know, must have went shopping somewhere?
I'll make sure DSA is included in this change when I resend.


Re: [RFC PATCH v2 net-next 09/16] net: dsa: replay port and local fdb entries when joining the bridge

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 06:07:51PM +0100, Tobias Waldekranz wrote:
> On Mon, Mar 22, 2021 at 18:19, Vladimir Oltean  wrote:
> > On Mon, Mar 22, 2021 at 04:44:41PM +0100, Tobias Waldekranz wrote:
> >> I do not know if it is a problem or not, more of an observation: This is
> >> not guaranteed to be an exact replay of the events that the bridge port
> >> (i.e. bond0 or whatever) has received since, in fdb_insert, we exit
> >> early when adding local entries if that address is already in the
> >> database.
> >> 
> >> Do we have to guard against this somehow? Or maybe we should consider
> >> the current behavior a bug and make sure to always send the event in the
> >> first place?
> >
> > I don't really understand what you're saying.
> > fdb_insert has:
> >
> > fdb = br_fdb_find(br, addr, vid);
> > if (fdb) {
> > /* it is okay to have multiple ports with same
> >  * address, just use the first one.
> >  */
> > if (test_bit(BR_FDB_LOCAL, &fdb->flags))
> > return 0;
> > br_warn(br, "adding interface %s with same address as a received packet (addr:%pM, vlan:%u)\n",
> >source ? source->dev->name : br->dev->name, addr, vid);
> > fdb_delete(br, fdb, true);
> > }
> >
> > fdb = fdb_create(br, source, addr, vid,
> >  BIT(BR_FDB_LOCAL) | BIT(BR_FDB_STATIC));
> >
> > Basically, if the {addr, vid} pair already exists in the fdb, and it
> > points to a local entry, fdb_create is bypassed.
> >
> > Whereas my br_fdb_replay() function iterates over br->fdb_list, which is
> > exactly where fdb_create() also lays its eggs. That is to say, unless
> > I'm missing something, that duplicate local FDB entries that skipped the
> > fdb_create() call in fdb_insert() because they were for already-existing
> > local FDB entries will also be skipped by br_fdb_replay(), because it
> > iterates over a br->fdb_list which contains unique local addresses.
> > Where am I wrong?
> 
> No you are right. I was thinking back to my attempt of offloading local
> addresses and I distinctly remembered that local addresses could be
> added without a notification being sent.
> 
> But that is not what is happening. It is just already inserted on
> another port. So the notification would reach DSA, or not, depending on
> the ordering of events. But there will be no discrepancy between that
> and the replay.

I'm not saying that the bridge isn't broken, because it is, but for
different reasons, as explained here:
https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-9-olte...@gmail.com/

What I can do is I can make br_switchdev_fdb_notify() skip fdb entries
with the BR_FDB_LOCAL bit set, and target that patch against "net", with
a Fixes: tag of 6b26b51b1d13 ("net: bridge: Add support for notifying
devices about FDB add/del").
Then I can also skip the entries with BR_FDB_LOCAL from br_fdb_replay.
Then, when I return to the "RX filtering for DSA" series, I can add the
"is_local" bit to switchdev FDB objects, and make all drivers reject
"is_local" entries (which is what the linked patch does) unless more
specific treatment is applied to those (trap to CPU).
Nikolay?


Re: [PATCH v3 net-next 08/12] net: dsa: replay port and host-joined mdb entries when joining the bridge

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 06:35:10PM +0200, Nikolay Aleksandrov wrote:
> > +   hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
> 
> You cannot walk over these lists without the multicast lock or RCU. RTNL is
> not enough because of various timers and leave messages that can alter both
> the mdb_list and the port group lists. I'd prefer RCU to avoid blocking the
> bridge mcast.

The trouble is that I need to emulate the calling context that is
provided to SWITCHDEV_OBJ_ID_HOST_MDB and SWITCHDEV_OBJ_ID_PORT_MDB, and
that means blocking context.

So if I hold rcu_read_lock(), I need to queue up the mdb entries, and
notify the driver only after I leave the RCU critical section. The
memory footprint may temporarily blow up.
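
In userspace terms, the two-phase approach described here looks roughly like
the sketch below. It is a simplification with invented names: the
in_atomic_section flag only stands in for holding rcu_read_lock(), and the
queue is allocated up front, whereas the kernel code cannot know the list
length in advance and would have to allocate inside the critical section.

```c
#include <assert.h>
#include <stdlib.h>

int in_atomic_section;	/* stands in for rcu_read_lock() being held */

typedef int (*mdb_cb)(unsigned int group);

/* Phase 1 copies the entries while "atomic"; phase 2 notifies the
 * listener afterwards, when it is allowed to block.
 */
int mdb_replay(const unsigned int *groups, size_t n, mdb_cb cb)
{
	unsigned int *queue;
	size_t queued = 0;
	int err = 0;

	queue = malloc(n ? n * sizeof(*queue) : 1);
	if (!queue)
		return -1;

	in_atomic_section = 1;		/* phase 1: only copy, never sleep */
	for (size_t i = 0; i < n; i++)
		queue[queued++] = groups[i];
	in_atomic_section = 0;

	for (size_t i = 0; i < queued && !err; i++)	/* phase 2: may block */
		err = cb(queue[i]);

	free(queue);
	return err;
}

int calls_outside_atomic;

/* Example listener that refuses to run inside the atomic section. */
int sleepy_cb(unsigned int group)
{
	(void)group;

	if (in_atomic_section)
		return -1;	/* would be a sleeping-in-atomic bug */

	calls_outside_atomic++;
	return 0;
}
```

The temporary queue is exactly the extra memory footprint mentioned above:
it grows with the number of entries and lives until the last notification
has been delivered.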

In fact this is what I did in v1:
https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-15-olte...@gmail.com/

I just figured I could get away with rtnl_mutex protection, but it looks
like I can't. So I guess you prefer my v1?


Re: [RFC PATCH v2 net-next 09/16] net: dsa: replay port and local fdb entries when joining the bridge

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 04:44:41PM +0100, Tobias Waldekranz wrote:
> I do not know if it is a problem or not, more of an observation: This is
> not guaranteed to be an exact replay of the events that the bridge port
> (i.e. bond0 or whatever) has received since, in fdb_insert, we exit
> early when adding local entries if that address is already in the
> database.
> 
> Do we have to guard against this somehow? Or maybe we should consider
> the current behavior a bug and make sure to always send the event in the
> first place?

I don't really understand what you're saying.
fdb_insert has:

fdb = br_fdb_find(br, addr, vid);
if (fdb) {
/* it is okay to have multiple ports with same
 * address, just use the first one.
 */
if (test_bit(BR_FDB_LOCAL, &fdb->flags))
return 0;
br_warn(br, "adding interface %s with same address as a received packet (addr:%pM, vlan:%u)\n",
   source ? source->dev->name : br->dev->name, addr, vid);
fdb_delete(br, fdb, true);
}

fdb = fdb_create(br, source, addr, vid,
 BIT(BR_FDB_LOCAL) | BIT(BR_FDB_STATIC));

Basically, if the {addr, vid} pair already exists in the fdb, and it
points to a local entry, fdb_create is bypassed.

Whereas my br_fdb_replay() function iterates over br->fdb_list, which is
exactly where fdb_create() also lays its eggs. That is to say, unless
I'm missing something, that duplicate local FDB entries that skipped the
fdb_create() call in fdb_insert() because they were for already-existing
local FDB entries will also be skipped by br_fdb_replay(), because it
iterates over a br->fdb_list which contains unique local addresses.
Where am I wrong?


Re: [RFC PATCH v2 net-next 06/16] net: dsa: sync multicast router state when joining the bridge

2021-03-22 Thread Vladimir Oltean
On Mon, Mar 22, 2021 at 12:17:33PM +0100, Tobias Waldekranz wrote:
> On Fri, Mar 19, 2021 at 01:18, Vladimir Oltean  wrote:
> > From: Vladimir Oltean 
> >
> > Make sure that the multicast router setting of the bridge is picked up
> > correctly by DSA when joining, regardless of whether there are
> > sandwiched interfaces or not. The SWITCHDEV_ATTR_ID_BRIDGE_MROUTER port
> > attribute is only emitted from br_mc_router_state_change.
> >
> > Signed-off-by: Vladimir Oltean 
> > ---
> >  net/dsa/port.c | 10 ++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/net/dsa/port.c b/net/dsa/port.c
> > index ac1afe182c3b..8380509ee47c 100644
> > --- a/net/dsa/port.c
> > +++ b/net/dsa/port.c
> > @@ -189,6 +189,10 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
> > if (err && err != -EOPNOTSUPP)
> > return err;
> >  
> > +   err = dsa_port_mrouter(dp->cpu_dp, br_multicast_router(br), extack);
> > +   if (err && err != -EOPNOTSUPP)
> > +   return err;
> > +
> > return 0;
> >  }
> >  
> > @@ -212,6 +216,12 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
> > dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
> >  
> > /* VLAN filtering is handled by dsa_switch_bridge_leave */
> > +
> > +   /* Some drivers treat the notification for having a local multicast
> > +* router by allowing multicast to be flooded to the CPU, so we should
> > +* allow this in standalone mode too.
> > +*/
> > +   dsa_port_mrouter(dp->cpu_dp, true, NULL);
> 
> Is this really for the DSA layer to decide? The driver has already been
> notified that at least one port is now in standalone mode. So if that
> particular driver then requires all multicast to be flooded towards the
> CPU, it can make that decision on its own.
> 
> E.g. say that you implement standalone mode using a matchall TCAM rule
> that maps all frames coming in on a particular port to the CPU. You
> could still leave flooding of unknown multicast off in that case. Now
> that driver has to figure out if the notification about a multicast
> router on the CPU is a real router, or the DSA layer telling it
> something that it can safely ignore.
> 
> Today I think that most (all?) DSA drivers treat mrouter in the same
> way as the multicast flooding bridge flag. But AFAIK, the semantic
> meaning of the setting is "flood IP multicast to this port because there
> is a router behind it somewhere". This means unknown _IP_ multicast, but
> also all known (IGMP/MLD) groups. As most smaller devices cannot
> separate IP multicast from the non-IP variety, we flood everything. But
> we should also make sure that the port in question receives all known
> groups for the _bridge_ in question. Because this is really a bridge
> setting, though that information is not carried over to the driver
> today. So reusing it in this way feels like it could be problematic down
> the road.

I agree with your objections in principle, but I would like to make
progress with this patch series, which is not really about how we deal
with IP multicast flooding to the CPU port in standalone mode, so I
would rather not get bogged down in this for now.
Don't forget that up until recent commit a8b659e7ff75 ("net: dsa: act as
passthrough for bridge port flags"), DSA drivers had no real idea
whether multicast flooding was meant for IP or not. And in standalone
mode, the way things work now is that the CPU port should see all
traffic, so it isn't wrong to do what this patch does.
Unless you see a breaking change introduced by this patch, we can
revisit this discussion for the "RX filtering on DSA" series, where it
is more relevant.


Re: enetc: fix bitfields, we are clearing wrong bits

2021-03-21 Thread Vladimir Oltean
On Sun, Mar 21, 2021 at 07:44:19PM +0200, Vladimir Oltean wrote:
> On Sun, Mar 21, 2021 at 05:25:00PM +0100, Pavel Machek wrote:
> > Bitfield manipulation in enetc_mac_config() looks wrong. Fix
> > it. Untested.
> > 
> > Signed-off-by: Pavel Machek (CIP) 
> > 
> > diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c 
> > b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > index 224fc37a6757..b85079493933 100644
> > --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > @@ -505,7 +505,7 @@ static void enetc_mac_config(struct enetc_hw *hw, 
> > phy_interface_t phy_mode)
> > if (phy_interface_mode_is_rgmii(phy_mode)) {
> > val = enetc_port_rd(hw, ENETC_PM0_IF_MODE);
> > val &= ~ENETC_PM0_IFM_EN_AUTO;
> > -   val &= ENETC_PM0_IFM_IFMODE_MASK;
> > +   val &= ~ENETC_PM0_IFM_IFMODE_MASK;
> > val |= ENETC_PM0_IFM_IFMODE_GMII | ENETC_PM0_IFM_RG;
> > enetc_port_wr(hw, ENETC_PM0_IF_MODE, val);
> > }
> > 
> > -- 
> > DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
> > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> 
> Fixes: c76a97218dcb ("net: enetc: force the RGMII speed and duplex instead of 
> operating in inband mode")
> Reviewed-by: Vladimir Oltean 
> 
> Note that for normal operation, the bug was inconsequential, due to the
> fact that we write the IF_MODE register in two stages, first in
> .phylink_mac_config (which incorrectly cleared out a bunch of stuff),
> then we update the speed and duplex to the correct values in
> .phylink_mac_link_up. Maybe loopback mode was broken.
> 
> Thanks!

I forgot to mention, target tree should be "net" and patch should be
queued up for stable.


Re: enetc: fix bitfields, we are clearing wrong bits

2021-03-21 Thread Vladimir Oltean
On Sun, Mar 21, 2021 at 05:25:00PM +0100, Pavel Machek wrote:
> Bitfield manipulation in enetc_mac_config() looks wrong. Fix
> it. Untested.
> 
> Signed-off-by: Pavel Machek (CIP) 
> 
> diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c 
> b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> index 224fc37a6757..b85079493933 100644
> --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> @@ -505,7 +505,7 @@ static void enetc_mac_config(struct enetc_hw *hw, 
> phy_interface_t phy_mode)
>   if (phy_interface_mode_is_rgmii(phy_mode)) {
>   val = enetc_port_rd(hw, ENETC_PM0_IF_MODE);
>   val &= ~ENETC_PM0_IFM_EN_AUTO;
> - val &= ENETC_PM0_IFM_IFMODE_MASK;
> + val &= ~ENETC_PM0_IFM_IFMODE_MASK;
>   val |= ENETC_PM0_IFM_IFMODE_GMII | ENETC_PM0_IFM_RG;
>   enetc_port_wr(hw, ENETC_PM0_IF_MODE, val);
>   }
> 
> -- 
> DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

Fixes: c76a97218dcb ("net: enetc: force the RGMII speed and duplex instead of operating in inband mode")
Reviewed-by: Vladimir Oltean 

Note that for normal operation, the bug was inconsequential, due to the
fact that we write the IF_MODE register in two stages, first in
.phylink_mac_config (which incorrectly cleared out a bunch of stuff),
then we update the speed and duplex to the correct values in
.phylink_mac_link_up. Maybe loopback mode was broken.
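
To make the effect of the missing '~' concrete, here is a minimal
userspace sketch. The bit positions and the speed field below are
illustrative assumptions, not the real IF_MODE register layout; only the
three masking statements mirror the patch.

```c
#include <stdint.h>

/* Illustrative layout only; these positions are assumptions. */
#define IFM_IFMODE_MASK  0x3u
#define IFM_IFMODE_GMII  0x2u
#define IFM_RG           (1u << 2)
#define IFM_SPEED_MASK   (0x3u << 13)   /* hypothetical speed field */
#define IFM_EN_AUTO      (1u << 15)

/* Before the fix: "val &= MASK" keeps only the IFMODE bits. */
static uint32_t mac_config_buggy(uint32_t val)
{
	val &= ~IFM_EN_AUTO;
	val &= IFM_IFMODE_MASK;         /* wipes every other field */
	val |= IFM_IFMODE_GMII | IFM_RG;
	return val;
}

/* After the fix: "val &= ~MASK" clears only the IFMODE bits. */
static uint32_t mac_config_fixed(uint32_t val)
{
	val &= ~IFM_EN_AUTO;
	val &= ~IFM_IFMODE_MASK;        /* other fields survive */
	val |= IFM_IFMODE_GMII | IFM_RG;
	return val;
}
```

In the buggy variant the old IFMODE bits are also OR-ed together with
the new ones, so the mode field itself can end up wrong, on top of the
unrelated fields being zeroed.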

Thanks!


Re: [PATCH net-next] dsa: simplify Kconfig symbols and dependencies

2021-03-20 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 03:46:30PM +, Alexander Lobakin wrote:
> 1. Remove CONFIG_HAVE_NET_DSA.
>
> CONFIG_HAVE_NET_DSA is a legacy leftover from the times when drivers
> should have selected CONFIG_NET_DSA manually.
> Currently, all drivers have an explicit 'depends on NET_DSA', so this is
> no longer needed.
>
> 2. CONFIG_HAVE_NET_DSA dependencies became CONFIG_NET_DSA's ones.
>
>  - dropped !S390 dependency which was introduced to be sure NET_DSA
>    can select CONFIG_PHYLIB. DSA migrated to Phylink almost 3 years
>    ago and the PHY library itself doesn't depend on !S390 since
>    commit 870a2b5e4fcd ("phylib: remove !S390 dependeny from Kconfig");
>  - INET dependency is kept to be sure we can select NET_SWITCHDEV;
>  - NETDEVICES dependency is kept to be sure we can select PHYLINK.
>
> 3. DSA drivers menu now depends on NET_DSA.
>
> Instead of 'depends on NET_DSA' on every single driver, the entire
> menu now depends on it. This eliminates a lot of duplicated lines
> from Kconfig with no loss (when CONFIG_NET_DSA=m, drivers also can
> be only m or n).
> This also has a nice side effect that there's no more empty menu on
> configurations without DSA.
>
> 4. Kbuild will now descend into 'drivers/net/dsa' only when
>    CONFIG_NET_DSA is y or m.
>
> This is safe since no objects inside this folder can be built without
> DSA core, as well as when CONFIG_NET_DSA=m, no objects can be
> built-in.
>
> Signed-off-by: Alexander Lobakin 
> ---

Thanks!

Reviewed-by: Vladimir Oltean 


[PATCH v3 net-next 12/12] net: ocelot: replay switchdev events when joining bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

The premise of this change is that the switchdev port attributes and
objects offloaded by ocelot might have been missed when we are joining
an already existing bridge port, such as a bonding interface.

The patch pulls these switchdev attributes and objects from the bridge,
on behalf of the 'bridge port' net device which might be either the
ocelot switch interface, or the bonding upper interface.

The ocelot_net.c belongs strictly to the switchdev ocelot driver, while
ocelot.c is part of a library shared with the DSA felix driver.
The ocelot_port_bridge_leave function (part of the common library) used
to call ocelot_port_vlan_filtering(false), something which is not
necessary for DSA, since the framework deals with that already there.
So we move this function to ocelot_switchdev_unsync, which is specific
to the switchdev driver.

The code movement described above makes ocelot_port_bridge_leave no
longer return an error code, so we change its type from int to void.

Signed-off-by: Vladimir Oltean 
---
Changes in v3:
Added -EOPNOTSUPP to br_mdb_replay and br_vlan_replay, which can be
compiled out.

 drivers/net/dsa/ocelot/felix.c |   4 +-
 drivers/net/ethernet/mscc/ocelot.c |  18 ++--
 drivers/net/ethernet/mscc/ocelot_net.c | 117 +
 include/soc/mscc/ocelot.h  |   6 +-
 4 files changed, 111 insertions(+), 34 deletions(-)

diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index 628afb47b579..6b5442be0230 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -719,7 +719,9 @@ static int felix_bridge_join(struct dsa_switch *ds, int port,
 {
struct ocelot *ocelot = ds->priv;
 
-   return ocelot_port_bridge_join(ocelot, port, br);
+   ocelot_port_bridge_join(ocelot, port, br);
+
+   return 0;
 }
 
 static void felix_bridge_leave(struct dsa_switch *ds, int port,
diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index ce57929ba3d1..1a36b416fd9b 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1514,34 +1514,28 @@ int ocelot_port_mdb_del(struct ocelot *ocelot, int port,
 }
 EXPORT_SYMBOL(ocelot_port_mdb_del);
 
-int ocelot_port_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+void ocelot_port_bridge_join(struct ocelot *ocelot, int port,
+struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
 
ocelot_port->bridge = bridge;
 
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_join);
 
-int ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
-struct net_device *bridge)
+void ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
+ struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
struct ocelot_vlan pvid = {0}, native_vlan = {0};
-   int ret;
 
ocelot_port->bridge = NULL;
 
-   ret = ocelot_port_vlan_filtering(ocelot, port, false);
-   if (ret)
-   return ret;
-
ocelot_port_set_pvid(ocelot, port, pvid);
ocelot_port_set_native_vlan(ocelot, port, native_vlan);
-
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_leave);
 
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index d1376f7b34fd..36f32a4d9b0f 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,47 +1117,126 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
+static void ocelot_inherit_brport_flags(struct ocelot *ocelot, int port,
+   struct net_device *brport_dev)
+{
+   struct switchdev_brport_flags flags = {0};
+   int flag;
+
+   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+
+   for_each_set_bit(flag, &flags.mask, 32)
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val |= BIT(flag);
+
+   ocelot_port_bridge_flags(ocelot, port, flags);
+}
+
+static void ocelot_clear_brport_flags(struct ocelot *ocelot, int port)
+{
+   struct switchdev_brport_flags flags;
+
+   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+   flags.val = flags.mask & ~BR_LEARNING;
+
+   ocelot_port_bridge_flags(ocelot, port, flags);
+}
+
+static int ocelot_switchdev_sync(struct ocelot *ocelot, int port,
+struct net_device *brport_dev,
+struct net_device *bridge_dev,
+struct netlink_ext_ack *extack)
+{
+   clock_t ageing_time;
+   u8 stp_state;
+   int err;
+
+   ocelot_i

[PATCH v3 net-next 11/12] net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

Similar to the DSA situation, ocelot supports LAG offload but treats
this scenario improperly:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

We do the same thing as we do there, which is to simulate a 'bridge join'
on 'lag join', if we detect that the bonding upper has a bridge upper.

Again, same as DSA, ocelot supports software fallback for LAG, and in
that case, we should avoid calling ocelot_netdevice_changeupper.
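
The control flow described above can be sketched in plain C as follows.
This is a toy model under stated assumptions: every name is hypothetical,
and hardware failures are injected through the lag_err and bridge_err
parameters instead of coming from a real driver. Returning non-EOPNOTSUPP
LAG errors early is a choice made by this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for net devices and switch ports (hypothetical names). */
struct toy_netdev {
	int is_bridge;                  /* 1 if this is a bridge master */
	struct toy_netdev *master;      /* upper device, or NULL */
};

struct toy_port {
	int lag_offloaded;
	int bridge_offloaded;
};

static int toy_lag_join(struct toy_port *p, const struct toy_netdev *bond,
			int lag_err, int bridge_err)
{
	/* Software fallback: nothing is offloaded, and that is not an error. */
	if (lag_err == -EOPNOTSUPP)
		return 0;
	if (lag_err)
		return lag_err;
	p->lag_offloaded = 1;

	/* Simulate a bridge join only if the bond already has a bridge upper. */
	if (!bond->master || !bond->master->is_bridge)
		return 0;

	if (bridge_err) {
		p->lag_offloaded = 0;   /* roll back the LAG join */
		return bridge_err;
	}
	p->bridge_offloaded = 1;
	return 0;
}
```

The point of the rollback branch is that a failed bridge join must not
leave the port half-configured as a LAG member.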

Signed-off-by: Vladimir Oltean 
---
Changes in v3:
None.

 drivers/net/ethernet/mscc/ocelot_net.c | 111 +++--
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index c08164cd88f4..d1376f7b34fd 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,10 +1117,15 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
-static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+static int ocelot_netdevice_bridge_join(struct net_device *dev,
+   struct net_device *bridge,
+   struct netlink_ext_ack *extack)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1135,10 +1140,14 @@ static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
return 0;
 }
 
-static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
+static int ocelot_netdevice_bridge_leave(struct net_device *dev,
 struct net_device *bridge)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1151,43 +1160,89 @@ static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
return err;
 }
 
-static int ocelot_netdevice_changeupper(struct net_device *dev,
-   struct netdev_notifier_changeupper_info *info)
+static int ocelot_netdevice_lag_join(struct net_device *dev,
+struct net_device *bond,
+struct netdev_lag_upper_info *info,
+struct netlink_ext_ack *extack)
 {
struct ocelot_port_private *priv = netdev_priv(dev);
    struct ocelot_port *ocelot_port = &priv->port;
struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
int port = priv->chip_port;
+   int err;
+
+   err = ocelot_port_lag_join(ocelot, port, bond, info);
+   if (err == -EOPNOTSUPP) {
+   NL_SET_ERR_MSG_MOD(extack, "Offloading not supported");
+   return 0;
+   }
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = ocelot_netdevice_bridge_join(dev, bridge_dev, extack);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   ocelot_port_lag_leave(ocelot, port, bond);
+   return err;
+}
+
+static int ocelot_netdevice_lag_leave(struct net_device *dev,
+ struct net_device *bond)
+{
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
+   int port = priv->chip_port;
+
+   ocelot_port_lag_leave(ocelot, port, bond);
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   return ocelot_netdevice_bridge_leave(dev, bridge_dev);
+}
+
+static int ocelot_netdevice_changeupper(struct net_device *dev,
+   struct netdev_notifier_changeupper_info *info)
+{
+   struct netlink_ext_ack *extack;
int err = 0;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
-   if (info->linking) {
-   err = ocelot_netdevice_bridge_join(ocelot, port,
-   

[PATCH v3 net-next 10/12] net: dsa: replay VLANs installed on port when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

Currently this simple setup:

ip link add br0 type bridge vlan_filtering 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

will not work because the bridge has created the PVID in br_add_if ->
nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
but that was too early, since swp0 was not yet a lower of bond0, so it
had no reason to act upon that notification.
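
As a rough illustration of the replay idea (not the kernel
implementation), the sketch below lets a late-joining listener pull the
VLANs that were announced before it existed. The callback type and the
toy_* structures are hypothetical stand-ins for the notifier block and
the bridge VLAN group.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback standing in for a switchdev notifier block. */
typedef int (*toy_vlan_cb)(uint16_t vid, int is_pvid, void *ctx);

struct toy_vlan_db {
	const uint16_t *vids;
	size_t count;
	uint16_t pvid;
};

/* Replay every VLAN to one listener; stop at and return the first error. */
static int toy_vlan_replay(const struct toy_vlan_db *db, toy_vlan_cb cb,
			   void *ctx)
{
	for (size_t i = 0; i < db->count; i++) {
		int err = cb(db->vids[i], db->vids[i] == db->pvid, ctx);

		if (err)
			return err;
	}
	return 0;
}

/* Example listener that records what it was told. */
struct toy_seen {
	size_t n;
	uint16_t pvid;
};

static int toy_record(uint16_t vid, int is_pvid, void *ctx)
{
	struct toy_seen *seen = ctx;

	seen->n++;
	if (is_pvid)
		seen->pvid = vid;
	return 0;
}
```

Only the listener passed in gets called, so ports that already saw the
original notification are not disturbed.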

Signed-off-by: Vladimir Oltean 
---
Changes in v3:
Made the br_vlan_replay shim return -EOPNOTSUPP.

 include/linux/if_bridge.h | 10 ++
 net/bridge/br_vlan.c  | 71 +++
 net/dsa/port.c|  6 
 3 files changed, 87 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b564c4486a45..2cc35038a8ca 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -111,6 +111,8 @@ int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid);
 int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto);
 int br_vlan_get_info(const struct net_device *dev, u16 vid,
 struct bridge_vlan_info *p_vinfo);
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline bool br_vlan_enabled(const struct net_device *dev)
 {
@@ -137,6 +139,14 @@ static inline int br_vlan_get_info(const struct net_device *dev, u16 vid,
 {
return -EINVAL;
 }
+
+static inline int br_vlan_replay(struct net_device *br_dev,
+struct net_device *dev,
+struct notifier_block *nb,
+struct netlink_ext_ack *extack)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 8829f621b8ec..45a4eac1b217 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -1751,6 +1751,77 @@ void br_vlan_notify(const struct net_bridge *br,
kfree_skb(skb);
 }
 
+static int br_vlan_replay_one(struct notifier_block *nb,
+ struct net_device *dev,
+ struct switchdev_obj_port_vlan *vlan,
+ struct netlink_ext_ack *extack)
+{
+   struct switchdev_notifier_port_obj_info obj_info = {
+   .info = {
+   .dev = dev,
+   .extack = extack,
+   },
+   .obj = &vlan->obj,
+   };
+   int err;
+
+   err = nb->notifier_call(nb, SWITCHDEV_PORT_OBJ_ADD, &obj_info);
+   return notifier_to_errno(err);
+}
+
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack)
+{
+   struct net_bridge_vlan_group *vg;
+   struct net_bridge_vlan *v;
+   struct net_bridge_port *p;
+   struct net_bridge *br;
+   int err = 0;
+   u16 pvid;
+
+   ASSERT_RTNL();
+
+   if (!netif_is_bridge_master(br_dev))
+   return -EINVAL;
+
+   if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   if (netif_is_bridge_master(dev)) {
+   br = netdev_priv(dev);
+   vg = br_vlan_group(br);
+   p = NULL;
+   } else {
+   p = br_port_get_rtnl(dev);
+   if (WARN_ON(!p))
+   return -EINVAL;
+   vg = nbp_vlan_group(p);
+   br = p->br;
+   }
+
+   if (!vg)
+   return 0;
+
+   pvid = br_get_pvid(vg);
+
+   list_for_each_entry(v, &vg->vlan_list, vlist) {
+   struct switchdev_obj_port_vlan vlan = {
+   .obj.orig_dev = dev,
+   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
+   .flags = br_vlan_flags(v, pvid),
+   .vid = v->vid,
+   };
+
+   if (!br_vlan_should_use(v))
+   continue;
+
+   err = br_vlan_replay_one(nb, dev, &vlan, extack);
+   if (err)
+   return err;
+   }
+
+   return err;
+}
 /* check if v_curr can enter a range ending in range_end */
 bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr,
 const struct net_bridge_vlan *range_end)
diff --git a/net/dsa/port.c b/net/dsa/port.c
index d21a511f1e16..84775e253ee8 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -209,6 +209,12 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = br_vlan_replay(br, brport_dev,
+&dsa_slave_switchdev_blocking_notifier,
+extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
-- 
2.25.1



[PATCH v3 net-next 09/12] net: dsa: replay port and local fdb entries when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

When a DSA port joins a LAG that already had an FDB entry pointing to it:

ip link set bond0 master br0
bridge fdb add dev bond0 00:01:02:03:04:05 master static
ip link set swp0 master bond0

the DSA port will have no idea that this FDB entry is there, because it
missed the switchdev event emitted at its creation.

Ido Schimmel pointed this out during a discussion about challenges with
switchdev offloading of stacked interfaces between the physical port and
the bridge, and recommended to just catch that condition and deny the
CHANGEUPPER event:
https://lore.kernel.org/netdev/20210210105949.gb287...@shredder.lan/

But in fact, we might need to deal with the hard thing anyway, which is
to replay all FDB addresses relevant to this port, because it isn't just
static FDB entries, but also local addresses (ones that are not
forwarded but terminated by the bridge). There, we can't just say 'oh
yeah, there was an upper already so I'm not joining that'.

So, similar to the logic for replaying MDB entries, add a function that
must be called by individual switchdev drivers and replays local FDB
entries as well as ones pointing towards a bridge port. This time, we
use the atomic switchdev notifier block, since that's what FDB entries
expect for some reason.
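
The selection rule for which entries get replayed can be sketched like
this. It is a simplified userspace model with hypothetical toy_* types:
an entry with no destination is treated as local (terminated by the
bridge), and everything pointing at some other bridge port is skipped.

```c
#include <stddef.h>

struct toy_dev {
	const char *name;
};

struct toy_fdb {
	const struct toy_dev *dst;      /* NULL means a local entry */
};

/* Count entries relevant to `port`: local ones plus ones pointing at it. */
static size_t toy_fdb_relevant(const struct toy_fdb *fdb, size_t n,
			       const struct toy_dev *br,
			       const struct toy_dev *port)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		const struct toy_dev *dst = fdb[i].dst ? fdb[i].dst : br;

		if (dst != br && dst != port)
			continue;       /* belongs to another bridge port */
		hits++;
	}
	return hits;
}
```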

Reported-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
Changes in v3:
Made the br_fdb_replay shim return -EOPNOTSUPP.

 include/linux/if_bridge.h |  9 +++
 include/net/switchdev.h   |  1 +
 net/bridge/br_fdb.c   | 52 +++
 net/dsa/dsa_priv.h|  1 +
 net/dsa/port.c|  4 +++
 net/dsa/slave.c   |  2 +-
 6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index f6472969bb44..b564c4486a45 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -147,6 +147,8 @@ void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
 clock_t br_get_ageing_time(struct net_device *br_dev);
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -175,6 +177,13 @@ static inline clock_t br_get_ageing_time(struct net_device *br_dev)
 {
return 0;
 }
+
+static inline int br_fdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #endif
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index b7fc7d0f54e2..7688ec572757 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -205,6 +205,7 @@ struct switchdev_notifier_info {
 
 struct switchdev_notifier_fdb_info {
struct switchdev_notifier_info info; /* must be first */
+   struct list_head list;
const unsigned char *addr;
u16 vid;
u8 added_by_user:1,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index b7490237f3fc..49125cc196ac 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -726,6 +726,58 @@ static inline size_t fdb_nlmsg_size(void)
+ nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */
 }
 
+static int br_fdb_replay_one(struct notifier_block *nb,
+struct net_bridge_fdb_entry *fdb,
+struct net_device *dev)
+{
+   struct switchdev_notifier_fdb_info item;
+   int err;
+
+   item.addr = fdb->key.addr.addr;
+   item.vid = fdb->key.vlan_id;
+   item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
+   item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
+   item.info.dev = dev;
+
+   err = nb->notifier_call(nb, SWITCHDEV_FDB_ADD_TO_DEVICE, &item);
+   return notifier_to_errno(err);
+}
+
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb)
+{
+   struct net_bridge_fdb_entry *fdb;
+   struct net_bridge *br;
+   int err = 0;
+
+   if (!netif_is_bridge_master(br_dev))
+   return -EINVAL;
+
+   if (!netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   br = netdev_priv(br_dev);
+
+   rcu_read_lock();
+
+   hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+   struct net_device *dst_dev;
+
+   dst_dev = fdb->dst ? fdb->dst->dev : br->dev;
+   if (dst_dev != br_dev && dst_dev != dev)
+   continue;
+
+   err = br_fdb_replay_one(nb, fdb, dst_dev);
+   if (err)
+   break;
+   }
+
+   rcu_read_unlock();
+
+   return err;
+}
+EXPORT_SYMBOL(br_fdb_replay);
+
 static

[PATCH v3 net-next 08/12] net: dsa: replay port and host-joined mdb entries when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

I have udhcpcd in my system and this is configured to bring interfaces
up as soon as they are created.

I create a bridge as follows:

ip link add br0 type bridge

As soon as I create the bridge and udhcpcd brings it up, I also have
avahi which automatically starts sending IPv6 packets to advertise some
local services, and because of that, the br0 bridge joins the following
IPv6 groups due to the code path detailed below:

33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0

br_dev_xmit
-> br_multicast_rcv
   -> br_ip6_multicast_add_group
  -> __br_multicast_add_group
 -> br_multicast_host_join
-> br_mdb_notify

This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.

Then when we add a port to br0:

ip link set swp0 master br0

the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.

The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.

Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what is the new guy's notifier block, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.

To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.
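
The two-phase shape (collect while inside the read-side section, call
out only afterwards) might look roughly like the sketch below. This is
not real RCU: toy_in_read_section is just a flag that models "we are
inside rcu_read_lock()", and all toy_* names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Models "inside rcu_read_lock()"; a plain flag, not real RCU. */
static int toy_in_read_section;

typedef int (*toy_mdb_cb)(unsigned int group, void *ctx);

/* Phase 1 copies entries while "locked"; phase 2 calls out once unlocked. */
static int toy_mdb_replay(const unsigned int *groups, size_t n,
			  toy_mdb_cb cb, void *ctx)
{
	unsigned int *queue;
	size_t queued = 0;
	int err = 0;

	if (n == 0)
		return 0;

	queue = malloc(n * sizeof(*queue));
	if (!queue)
		return -ENOMEM;

	toy_in_read_section = 1;
	for (size_t i = 0; i < n; i++)
		queue[queued++] = groups[i];
	toy_in_read_section = 0;

	for (size_t i = 0; i < queued && !err; i++)
		err = cb(queue[i], ctx);

	free(queue);
	return err;
}

/* A listener that may block, so it refuses to run inside the section. */
static int toy_blocking_listener(unsigned int group, void *ctx)
{
	size_t *count = ctx;

	(void)group;
	if (toy_in_read_section)
		return -EBUSY;
	(*count)++;
	return 0;
}
```

Passing the callback as an argument is what makes this a pull: the
caller decides who receives the replayed entries, and nobody else does.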

Suggested-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
Changes in v3:
- Removed the implication that avahi is crap from the commit message.
- Made the br_mdb_replay shim return -EOPNOTSUPP.

 include/linux/if_bridge.h |  9 +
 net/bridge/br_mdb.c   | 84 +++
 net/dsa/dsa_priv.h|  2 +
 net/dsa/port.c|  6 +++
 net/dsa/slave.c   |  2 +-
 5 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index ebd16495459c..f6472969bb44 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);
 bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);
 bool br_multicast_enabled(const struct net_device *dev);
 bool br_multicast_router(const struct net_device *dev);
+int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline int br_multicast_list_adjacent(struct net_device *dev,
 struct list_head *br_ip_list)
@@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)
 {
return false;
 }
+static inline int br_mdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb,
+   struct netlink_ext_ack *extack)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 8846c5bcd075..23973186094c 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv)
kfree(priv);
 }
 
+static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
+struct net_bridge_mdb_entry *mp, int obj_id,
+struct net_device *orig_dev,
+struct netlink_ext_ack *extack)
+{
+   struct switchdev_notifier_port_obj_info obj_info = {
+   .info = {
+   .dev = dev,
+   .extack = extack,
+   },
+ 

[PATCH v3 net-next 07/12] net: dsa: sync ageing time when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:

sysfs/ioctl/netlink
-> br_set_ageing_time
   -> __set_ageing_time

therefore not at bridge port creation time, so:
(a) drivers had to hardcode the initial value for the address ageing time,
because they didn't get any notification
(b) that hardcoded value can be out of sync, if the user changes the
ageing time before enslaving the port to the bridge
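
A small model of cases (a) and (b), assuming hypothetical toy_* types
and a made-up default of 300 seconds:

```c
/* Hypothetical model; values are in seconds and the default is made up. */
#define TOY_DEFAULT_AGEING 300u

struct toy_bridge {
	unsigned int ageing_time;
};

struct toy_switch {
	unsigned int ageing_time;
};

/* Case (a): with no notification, the driver hardcodes a value at init. */
static void toy_switch_init(struct toy_switch *sw)
{
	sw->ageing_time = TOY_DEFAULT_AGEING;
}

/* Case (b): joining without a sync keeps the stale hardcoded value. */
static void toy_join_no_sync(struct toy_switch *sw,
			     const struct toy_bridge *br)
{
	(void)sw;
	(void)br;
}

/* With a sync step, the port pulls the bridge's current value on join. */
static void toy_join_sync(struct toy_switch *sw, const struct toy_bridge *br)
{
	sw->ageing_time = br->ageing_time;
}
```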

Signed-off-by: Vladimir Oltean 
---
Changes in v3:
None.

 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 13 +
 net/dsa/port.c| 10 ++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 920d3a02cc68..ebd16495459c 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -137,6 +137,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
+clock_t br_get_ageing_time(struct net_device *br_dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -160,6 +161,11 @@ static inline u8 br_port_get_stp_state(const struct net_device *dev)
 {
return BR_STATE_DISABLED;
 }
+
+static inline clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   return 0;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 86b5e05d3f21..3dafb6143cff 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -639,6 +639,19 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
return 0;
 }
 
+clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   struct net_bridge *br;
+
+   if (!netif_is_bridge_master(br_dev))
+   return 0;
+
+   br = netdev_priv(br_dev);
+
+   return jiffies_to_clock_t(br->ageing_time);
+}
+EXPORT_SYMBOL_GPL(br_get_ageing_time);
+
 /* called under bridge lock */
 void __br_set_topology_change(struct net_bridge *br, unsigned char val)
 {
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 124f8bb21204..95e6f2861290 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -173,6 +173,7 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
 {
struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
struct net_device *br = dp->bridge_dev;
+   clock_t ageing_time;
u8 stp_state;
int err;
 
@@ -193,6 +194,11 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   ageing_time = br_get_ageing_time(br);
+   err = dsa_port_ageing_time(dp, ageing_time);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -222,6 +228,10 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
 * allow this in standalone mode too.
 */
dsa_port_mrouter(dp->cpu_dp, true, NULL);
+
+   /* Ageing time may be global to the switch chip, so don't change it
+* here because we have no good reason (or value) to change it to.
+*/
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[PATCH v3 net-next 06/12] net: dsa: sync multicast router state when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

Make sure that the multicast router setting of the bridge is picked up
correctly by DSA when joining, regardless of whether there are
sandwiched interfaces or not. The SWITCHDEV_ATTR_ID_BRIDGE_MROUTER port
attribute is only emitted from br_mc_router_state_change.

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
Changes in v3:
None.

 net/dsa/port.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 3f938c253c99..124f8bb21204 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -189,6 +189,10 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = dsa_port_mrouter(dp->cpu_dp, br_multicast_router(br), extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -212,6 +216,12 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
 
/* VLAN filtering is handled by dsa_switch_bridge_leave */
+
+   /* Some drivers treat the notification for having a local multicast
+* router by allowing multicast to be flooded to the CPU, so we should
+* allow this in standalone mode too.
+*/
+   dsa_port_mrouter(dp->cpu_dp, true, NULL);
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[PATCH v3 net-next 05/12] net: dsa: sync up VLAN filtering state when joining the bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

This is the same situation as for other switchdev port attributes: if we
join an already-created bridge port, such as a bond master interface,
then we can miss the initial switchdev notification emitted by the
bridge for this port.

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
Changes in v3:
None.

 net/dsa/port.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 2ecdc824ea66..3f938c253c99 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -172,6 +172,7 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
   struct netlink_ext_ack *extack)
 {
struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   struct net_device *br = dp->bridge_dev;
u8 stp_state;
int err;
 
@@ -184,6 +185,10 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = dsa_port_vlan_filtering(dp, br, extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -205,6 +210,8 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
 * so allow it to be in BR_STATE_FORWARDING to be kept functional
 */
dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
+
+   /* VLAN filtering is handled by dsa_switch_bridge_leave */
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[PATCH v3 net-next 04/12] net: dsa: sync up with bridge port's STP state when joining

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

It may happen that we have the following topology:

ip link add br0 type bridge stp_state 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
ip link set swp1 master bond0

STP decides that it should put bond0 into the BLOCKING state, and
that's that. The ports that are actively listening for the switchdev
port attributes emitted for the bond0 bridge port (because they are
offloading it) and have the honor of seeing that switchdev port
attribute can react to it, so we can program swp0 and swp1 into the
BLOCKING state.

But if then we do:

ip link set swp2 master bond0

then as far as the bridge is concerned, nothing has changed: it still
has one bridge port. But this new bridge port will not see any STP state
change notification and will remain FORWARDING, which is how the
standalone code leaves it in.

Add a function to the bridge which retrieves the current STP state, such
that drivers can synchronize to it when they may have missed switchdev
events.
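
As a rough user-space illustration of the scenario above (not kernel code:
toy_brport, toy_swport and the toy_* functions are hypothetical stand-ins,
and only the BR_STATE_* numbering is meant to mirror the bridge UAPI
header), a port that joins late has to read the current state from the
bridge port, because no notification will be emitted for it:

```c
#include <assert.h>
#include <stddef.h>

/* Same ordering as the BR_STATE_* defines in the bridge UAPI header */
enum { BR_STATE_DISABLED, BR_STATE_LISTENING, BR_STATE_LEARNING,
       BR_STATE_FORWARDING, BR_STATE_BLOCKING };

#define TOY_MAX_LOWERS 4

struct toy_swport {                 /* hypothetical switchdev port */
	int hw_stp_state;
};

struct toy_brport {                 /* hypothetical bridge port, e.g. bond0 */
	int stp_state;
	struct toy_swport *lowers[TOY_MAX_LOWERS];
	size_t num_lowers;
};

/* Bridge side: change state and notify only the lowers present right now. */
static void toy_br_set_state(struct toy_brport *p, int state)
{
	size_t i;

	p->stp_state = state;
	for (i = 0; i < p->num_lowers; i++)
		p->lowers[i]->hw_stp_state = state;
}

/* Stand-in for br_port_get_stp_state(): NULL means "not a bridge port". */
static int toy_br_port_get_stp_state(const struct toy_brport *p)
{
	return p ? p->stp_state : BR_STATE_DISABLED;
}

/* Driver side: join the LAG; with sync != 0, also pull the current state. */
static int toy_swport_join(struct toy_swport *sw, struct toy_brport *p,
			   int sync)
{
	assert(sw && p);

	if (p->num_lowers == TOY_MAX_LOWERS)
		return -1;

	sw->hw_stp_state = BR_STATE_FORWARDING;  /* standalone default */
	p->lowers[p->num_lowers++] = sw;
	if (sync)
		sw->hw_stp_state = toy_br_port_get_stp_state(p);
	return 0;
}
```

Joining with sync == 0 models the current behavior (the new port stays
FORWARDING), while sync != 0 models what this patch does at join time.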

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
Changes in v3:
None.

 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 14 ++
 net/dsa/port.c|  7 +++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b979005ea39c..920d3a02cc68 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -136,6 +136,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
__u16 vid);
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
+u8 br_port_get_stp_state(const struct net_device *dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -154,6 +155,11 @@ br_port_flag_is_set(const struct net_device *dev, unsigned long flag)
 {
return false;
 }
+
+static inline u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   return BR_STATE_DISABLED;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 21c6781906aa..86b5e05d3f21 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -64,6 +64,20 @@ void br_set_state(struct net_bridge_port *p, unsigned int state)
}
 }
 
+u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   struct net_bridge_port *p;
+
+   ASSERT_RTNL();
+
+   p = br_port_get_rtnl(dev);
+   if (!p)
+   return BR_STATE_DISABLED;
+
+   return p->state;
+}
+EXPORT_SYMBOL_GPL(br_port_get_stp_state);
+
 /* called under bridge lock */
 struct net_bridge_port *br_get_port(struct net_bridge *br, u16 port_no)
 {
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 8dbc6e0db30c..2ecdc824ea66 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -171,12 +171,19 @@ static void dsa_port_clear_brport_flags(struct dsa_port *dp,
 static int dsa_port_switchdev_sync(struct dsa_port *dp,
   struct netlink_ext_ack *extack)
 {
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   u8 stp_state;
int err;
 
err = dsa_port_inherit_brport_flags(dp, extack);
if (err)
return err;
 
+   stp_state = br_port_get_stp_state(brport_dev);
+   err = dsa_port_set_state(dp, stp_state);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
-- 
2.25.1



[PATCH v3 net-next 01/12] net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA can properly detect and offload this sequence of operations:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set swp0 master bond0
ip link set bond0 master br0

But not this one:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

Actually the second one is more complicated: in the time elapsed between
the enslavement of bond0 and the offloading of it via swp0, a lot of
things could have happened to the bond0 bridge port in terms of
switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
a bit of a can of worms, and making sure that the DSA port's state is in
sync with this already existing bridge port is handled in the next
patches.
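
The control flow this patch adds to the LAG join path can be sketched in
user space roughly as follows. This is only an illustration under
assumptions: toy_upper, toy_port and the toy_* helpers are made up, and
the fail_bridge_join knob exists only to exercise the unwind path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_upper {                  /* hypothetical net_device stand-in */
	bool is_bridge;
	struct toy_upper *master;   /* upper of this device, or NULL */
};

struct toy_port {
	struct toy_upper *lag;
	struct toy_upper *bridge;
	bool fail_bridge_join;      /* test knob: simulate a hardware error */
};

static int toy_bridge_join(struct toy_port *p, struct toy_upper *br)
{
	if (p->fail_bridge_join)
		return -1;
	p->bridge = br;
	return 0;
}

/* Join a LAG and, if the LAG is already a bridge port, the bridge too. */
static int toy_lag_join(struct toy_port *p, struct toy_upper *lag)
{
	struct toy_upper *br;
	int err;

	assert(p && lag);
	p->lag = lag;

	br = lag->master;
	if (!br || !br->is_bridge)
		return 0;           /* plain LAG, nothing else to offload */

	err = toy_bridge_join(p, br);
	if (err)
		goto err_bridge_join;

	return 0;

err_bridge_join:
	p->lag = NULL;              /* undo the LAG join as well */
	return err;
}
```

The point of the goto label is that a failed bridge join leaves the port
exactly as it was before the LAG join was attempted.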

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
Changes in v3:
None.

 net/dsa/port.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index c9c6d7ab3f47..d39262a9fe0e 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -249,17 +249,31 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
.lag = lag,
.info = uinfo,
};
+   struct net_device *bridge_dev;
int err;
 
dsa_lag_map(dp->ds->dst, lag);
dp->lag_dev = lag;
 
err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info);
-   if (err) {
-   dp->lag_dev = NULL;
-   dsa_lag_unmap(dp->ds->dst, lag);
-   }
+   if (err)
+   goto err_lag_join;
 
+   bridge_dev = netdev_master_upper_dev_get(lag);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = dsa_port_bridge_join(dp, bridge_dev);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info);
+err_lag_join:
+   dp->lag_dev = NULL;
+   dsa_lag_unmap(dp->ds->dst, lag);
return err;
 }
 
-- 
2.25.1



[PATCH v3 net-next 03/12] net: dsa: inherit the actual bridge port flags at join time

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA currently assumes that the bridge port starts off with this
constellation of bridge port flags:

- learning on
- unicast flooding on
- multicast flooding on
- broadcast flooding on

just by virtue of code copy-pasta from the bridge layer (new_nbp).
This was a simple enough strategy thus far, because the 'bridge join'
moment always coincided with the 'bridge port creation' moment.

But with sandwiched interfaces, such as:

 br0
  |
bond0
  |
 swp0

it may happen that the user has had time to change the bridge port flags
of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
assume that the bridge port flags are those determined by new_nbp, when
in fact this can happen:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set bond0 type bridge_slave learning off
ip link set swp0 master bond0

Now swp0 has learning enabled, bond0 has learning disabled. Not nice.

Fix this by "dumpster diving" through the actual bridge port flags with
br_port_flag_is_set, at bridge join time.

We use this opportunity to split dsa_port_change_brport_flags into two
distinct functions called dsa_port_inherit_brport_flags and
dsa_port_clear_brport_flags, now that the implementation for the two
cases is no longer similar.
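
A minimal user-space sketch of the "inherit" idea, assuming a made-up
toy_port with its own flag bits (the TOY_* values are illustrative and are
not the kernel's BR_* bit positions): each flag is programmed individually
from the bridge port's actual value, and -EOPNOTSUPP from the hardware is
tolerated.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bit positions only, not the kernel's BR_* values. */
#define TOY_LEARNING    (1UL << 0)
#define TOY_FLOOD       (1UL << 1)
#define TOY_MCAST_FLOOD (1UL << 2)
#define TOY_BCAST_FLOOD (1UL << 3)
#define TOY_ALL (TOY_LEARNING | TOY_FLOOD | TOY_MCAST_FLOOD | TOY_BCAST_FLOOD)

struct toy_port {
	unsigned long hw_flags;     /* what the hardware is programmed with */
	unsigned long unsupported;  /* flags this hardware cannot offload */
};

/* Program one flag; mask has exactly one bit set, val is 0 or that bit. */
static int toy_port_set_flag(struct toy_port *p, unsigned long mask,
			     unsigned long val)
{
	assert(mask && !(mask & (mask - 1)));

	if (p->unsupported & mask)
		return -EOPNOTSUPP;
	p->hw_flags = (p->hw_flags & ~mask) | (val & mask);
	return 0;
}

/* Copy the bridge port's actual flags, one bit at a time. */
static int toy_port_inherit_flags(struct toy_port *p,
				  unsigned long brport_flags)
{
	unsigned long bit;
	int err;

	for (bit = 1; bit && bit <= TOY_ALL; bit <<= 1) {
		if (!(TOY_ALL & bit))
			continue;
		err = toy_port_set_flag(p, bit, brport_flags & bit);
		if (err && err != -EOPNOTSUPP)
			return err;
	}
	return 0;
}
```

With learning turned off on the bridge port, toy_port_inherit_flags()
leaves the port with the three flooding flags set and learning cleared,
which is the outcome the example above is missing today.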

Signed-off-by: Vladimir Oltean 
---
Changes in v3:
Rewrote dsa_port_clear_brport_flags to at least catch errors, and to use
the same "for" loop structure as dsa_port_inherit_brport_flags.

 net/dsa/port.c | 125 -
 1 file changed, 83 insertions(+), 42 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index fcbe5b1545b8..8dbc6e0db30c 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -122,26 +122,82 @@ void dsa_port_disable(struct dsa_port *dp)
rtnl_unlock();
 }
 
-static void dsa_port_change_brport_flags(struct dsa_port *dp,
-bool bridge_offload)
+static int dsa_port_inherit_brport_flags(struct dsa_port *dp,
+struct netlink_ext_ack *extack)
 {
-   struct switchdev_brport_flags flags;
-   int flag;
+   const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
+  BR_BCAST_FLOOD;
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   int flag, err;
 
-   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
-   if (bridge_offload)
-   flags.val = flags.mask;
-   else
-   flags.val = flags.mask & ~BR_LEARNING;
+   for_each_set_bit(flag, &mask, 32) {
+   struct switchdev_brport_flags flags = {0};
 
-   for_each_set_bit(flag, &flags.mask, 32) {
-   struct switchdev_brport_flags tmp;
+   flags.mask = BIT(flag);
 
-   tmp.val = flags.val & BIT(flag);
-   tmp.mask = BIT(flag);
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val = BIT(flag);
 
-   dsa_port_bridge_flags(dp, tmp, NULL);
+   err = dsa_port_bridge_flags(dp, flags, extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
}
+
+   return 0;
+}
+
+static void dsa_port_clear_brport_flags(struct dsa_port *dp,
+   struct netlink_ext_ack *extack)
+{
+   const unsigned long val = BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+   const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
+  BR_BCAST_FLOOD;
+   int flag, err;
+
+   for_each_set_bit(flag, &mask, 32) {
+   struct switchdev_brport_flags flags = {0};
+
+   flags.mask = BIT(flag);
+   flags.val = val & BIT(flag);
+
+   err = dsa_port_bridge_flags(dp, flags, extack);
+   if (err && err != -EOPNOTSUPP)
+   dev_err(dp->ds->dev,
+   "failed to clear bridge port flag %d: %d (%pe)\n",
+   flags.val, err, ERR_PTR(err));
+   }
+}
+
+static int dsa_port_switchdev_sync(struct dsa_port *dp,
+  struct netlink_ext_ack *extack)
+{
+   int err;
+
+   err = dsa_port_inherit_brport_flags(dp, extack);
+   if (err)
+   return err;
+
+   return 0;
+}
+
+/* Configure the port for standalone mode (no address learning, flood
+ * everything, BR_STATE_FORWARDING, etc).
+ * The bridge only emits SWITCHDEV_ATTR_ID_PORT_* events when the user
+ * requests it through netlink or sysfs, but not automatically at port
+ * join or leave, so we need to handle resetting the brport flags ourselves.
+ * But we even prefer it that way, because otherwise, some setups might never
+ * get the notification they need, for example, when a port leaves a LAG that
+ * offloads the bridge, 

[PATCH v3 net-next 00/12] Better support for sandwiched LAGs with bridge and DSA

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

The objective of this series is to make LAG uppers on top of switchdev
ports work regardless of which order we link interfaces to their masters
(first make the port join the LAG, then the LAG join the bridge, or the
other way around).

There was a design decision to be made in patches 2-4 on whether we
should adopt the "push" model (which attempts to solve the problem
centrally, in the bridge layer) where the driver just calls:

  switchdev_bridge_port_offloaded(brport_dev,
  _notifier_block,
  _notifier_block,
  extack);

and the bridge just replays the entire collection of switchdev port
attributes and objects that it has, in some predefined order and with
some predefined error handling logic;


or the "pull" model (which attempts to solve the problem by giving the
driver the rope to hang itself), where the driver, apart from calling:

  switchdev_bridge_port_offloaded(brport_dev, extack);

has the task of "dumpster diving" (as Tobias puts it) through the bridge
attributes and objects by itself, by calling:

  - br_vlan_replay
  - br_fdb_replay
  - br_mdb_replay
  - br_vlan_enabled
  - br_port_flag_is_set
  - br_port_get_stp_state
  - br_multicast_router
  - br_get_ageing_time

(not necessarily all of them, and not necessarily in this order, and
with driver-defined error handling).

Even though I'm not in love myself with the "pull" model, I chose it
because there is a fundamental trick with replaying switchdev events
like this:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0 <- this will replay the objects once for
 the bond0 bridge port, and the swp0
 switchdev port will process them
ip link set swp1 master bond0 <- this will replay the objects again for
 the bond0 bridge port, and the swp1
 switchdev port will see them, but swp0
 will see them for the second time now

Basically I believe that it is implementation defined whether the driver
wants to error out on switchdev objects seen twice on a port, and the
bridge should not enforce a certain model for that. For example, for FDB
entries added to a bonding interface, the underlying switchdev driver
might have an abstraction for just that: an FDB entry pointing towards a
logical (as opposed to physical) port. So when the second port joins the
bridge, it doesn't really need to replay FDB entries, since there is
already at least one hardware port which has been receiving those
events, and the FDB entries don't need to be added a second time to the
same logical port.
In the other corner, we have the drivers that handle switchdev port
attributes on a LAG as individual switchdev port attributes on physical
ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
helper facilitates this: it is a fan-out from a single orig_dev towards
multiple lowers that pass the check_cb().
But that's the point: switchdev_handle_port_attr_set is just a helper
which the driver _opts_ to use. The bridge can't enforce the "push"
model, because that would assume that all drivers handle port attributes
in the same way, which is probably false.

For this reason, I preferred to go with the "pull" model for this patch
set. Just to see how bad it is for other switchdev drivers to copy-paste
this logic, I added the pull support to ocelot too, and I think it's
pretty manageable.
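
To make the "objects seen twice" argument a bit more concrete, here is a
user-space sketch of a driver-side FDB table keyed on a logical (LAG)
port. Everything in it is hypothetical (toy_switch, toy_fdb_add, the
strict knob); it only illustrates that whether a duplicate is an error is
a policy the driver can pick for itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE 8

struct toy_fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int lag_id;                 /* logical port the entry points to */
};

struct toy_switch {                 /* hypothetical driver state */
	struct toy_fdb_entry fdb[TOY_FDB_SIZE];
	size_t num_fdb;
	int strict;                 /* non-zero: reject duplicates */
};

/* Add an FDB entry towards a LAG; what to do on a duplicate is the
 * driver's own policy, which is the point of the "pull" model.
 */
static int toy_fdb_add(struct toy_switch *sw, const uint8_t mac[6],
		       uint16_t vid, int lag_id)
{
	size_t i;

	assert(sw && mac);

	for (i = 0; i < sw->num_fdb; i++) {
		const struct toy_fdb_entry *e = &sw->fdb[i];

		if (!memcmp(e->mac, mac, 6) && e->vid == vid &&
		    e->lag_id == lag_id)
			return sw->strict ? -EEXIST : 0;
	}

	if (sw->num_fdb == TOY_FDB_SIZE)
		return -ENOSPC;

	memcpy(sw->fdb[sw->num_fdb].mac, mac, 6);
	sw->fdb[sw->num_fdb].vid = vid;
	sw->fdb[sw->num_fdb].lag_id = lag_id;
	sw->num_fdb++;
	return 0;
}
```

A tolerant driver accepts the second replay for bond0 silently, a strict
one returns -EEXIST; in both cases the table ends up with a single entry.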

Vladimir Oltean (12):
  net: dsa: call dsa_port_bridge_join when joining a LAG that is already
in a bridge
  net: dsa: pass extack to dsa_port_{bridge,lag}_join
  net: dsa: inherit the actual bridge port flags at join time
  net: dsa: sync up with bridge port's STP state when joining
  net: dsa: sync up VLAN filtering state when joining the bridge
  net: dsa: sync multicast router state when joining the bridge
  net: dsa: sync ageing time when joining the bridge
  net: dsa: replay port and host-joined mdb entries when joining the
bridge
  net: dsa: replay port and local fdb entries when joining the bridge
  net: dsa: replay VLANs installed on port when joining the bridge
  net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged
LAG
  net: ocelot: replay switchdev events when joining bridge

 drivers/net/dsa/ocelot/felix.c |   4 +-
 drivers/net/ethernet/mscc/ocelot.c |  18 +--
 drivers/net/ethernet/mscc/ocelot_net.c | 208 +
 include/linux/if_bridge.h  |  40 +
 include/net/switchdev.h|   1 +
 include/soc/mscc/ocelot.h  |   6 +-
 net/bridge/br_fdb.c|  52 +++
 net/bridge/br_mdb.c|  84 ++
 net/bridge/br_stp.c|  27

[PATCH v3 net-next 02/12] net: dsa: pass extack to dsa_port_{bridge,lag}_join

2021-03-20 Thread Vladimir Oltean
From: Vladimir Oltean 

This is a pretty noisy change that was broken out of the larger change
for replaying switchdev attributes and objects at bridge join time,
which is when these extack objects are actually used.

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
Changes in v3:
None.

 net/dsa/dsa_priv.h | 6 --
 net/dsa/port.c | 8 +---
 net/dsa/slave.c| 7 +--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 4c43c5406834..b8778c5d8529 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -181,12 +181,14 @@ int dsa_port_enable_rt(struct dsa_port *dp, struct phy_device *phy);
 int dsa_port_enable(struct dsa_port *dp, struct phy_device *phy);
 void dsa_port_disable_rt(struct dsa_port *dp);
 void dsa_port_disable(struct dsa_port *dp);
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br);
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
 int dsa_port_lag_change(struct dsa_port *dp,
struct netdev_lag_lower_state_info *linfo);
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev,
- struct netdev_lag_upper_info *uinfo);
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack);
 void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev);
 int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering,
struct netlink_ext_ack *extack);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index d39262a9fe0e..fcbe5b1545b8 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -144,7 +144,8 @@ static void dsa_port_change_brport_flags(struct dsa_port *dp,
}
 }
 
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br)
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack)
 {
struct dsa_notifier_bridge_info info = {
.tree_index = dp->ds->dst->index,
@@ -241,7 +242,8 @@ int dsa_port_lag_change(struct dsa_port *dp,
 }
 
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
- struct netdev_lag_upper_info *uinfo)
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack)
 {
struct dsa_notifier_lag_info info = {
.sw_index = dp->ds->index,
@@ -263,7 +265,7 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
return 0;
 
-   err = dsa_port_bridge_join(dp, bridge_dev);
+   err = dsa_port_bridge_join(dp, bridge_dev, extack);
if (err)
goto err_bridge_join;
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 992fcab4b552..1ff48be476bb 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1976,11 +1976,14 @@ static int dsa_slave_changeupper(struct net_device *dev,
 struct netdev_notifier_changeupper_info *info)
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
+   struct netlink_ext_ack *extack;
int err = NOTIFY_DONE;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
if (info->linking) {
-   err = dsa_port_bridge_join(dp, info->upper_dev);
+   err = dsa_port_bridge_join(dp, info->upper_dev, extack);
if (!err)
dsa_bridge_mtu_normalization(dp);
err = notifier_from_errno(err);
@@ -1991,7 +1994,7 @@ static int dsa_slave_changeupper(struct net_device *dev,
} else if (netif_is_lag_master(info->upper_dev)) {
if (info->linking) {
err = dsa_port_lag_join(dp, info->upper_dev,
-   info->upper_info);
+   info->upper_info, extack);
if (err == -EOPNOTSUPP) {
NL_SET_ERR_MSG_MOD(info->info.extack,
   "Offloading not supported");
-- 
2.25.1



Re: [RFC PATCH v2 net-next 08/16] net: dsa: replay port and host-joined mdb entries when joining the bridge

2021-03-20 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 03:20:38PM -0700, Florian Fainelli wrote:
>
>
> On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> > From: Vladimir Oltean 
> >
> > I have udhcpcd in my system and this is configured to bring interfaces
> > up as soon as they are created.
> >
> > I create a bridge as follows:
> >
> > ip link add br0 type bridge
> >
> > As soon as I create the bridge and udhcpcd brings it up, I have some
> > other crap (avahi)
>
> How dare you ;)

Well, it comes preinstalled on my system, I don't need it, and it has
caused me nothing but trouble. So I think it has earned its title :D

> > that starts sending some random IPv6 packets to
> > advertise some local services, and from there, the br0 bridge joins the
> > following IPv6 groups:
> >
> > 33:33:ff:6d:c1:9c vid 0
> > 33:33:00:00:00:6a vid 0
> > 33:33:00:00:00:fb vid 0
> >
> > br_dev_xmit
> > -> br_multicast_rcv
> >-> br_ip6_multicast_add_group
> >   -> __br_multicast_add_group
> >  -> br_multicast_host_join
> > -> br_mdb_notify
> >
> > This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
> > hooked up, and switchdev will attempt to offload the host joined groups
> > to an empty list of ports. Of course nobody offloads them.
> >
> > Then when we add a port to br0:
> >
> > ip link set swp0 master br0
> >
> > the bridge doesn't replay the host-joined MDB entries from br_add_if,
> > and eventually the host joined addresses expire, and a switchdev
> > notification for deleting it is emitted, but surprise, the original
> > addition was already completely missed.
> >
> > The strategy to address this problem is to replay the MDB entries (both
> > the port ones and the host joined ones) when the new port joins the
> > bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
> > be populated and only then attached to a bridge that you offload).
> > However there are 2 possibilities: the addresses can be 'pushed' by the
> > bridge into the port, or the port can 'pull' them from the bridge.
> >
> > Considering that in the general case, the new port can be really late to
> > the party, and there may have been many other switchdev ports that
> > already received the initial notification, we would like to avoid
> > delivering duplicate events to them, since they might misbehave. And
> > currently, the bridge calls the entire switchdev notifier chain, whereas
> > for replaying it should just call the notifier block of the new guy.
> > But the bridge doesn't know what is the new guy's notifier block, it
> > just knows where the switchdev notifier chain is. So for simplification,
> > we make this a driver-initiated pull for now, and the notifier block is
> > passed as an argument.
> >
> > To emulate the calling context for mdb objects (deferred and put on the
> > blocking notifier chain), we must iterate under RCU protection through
> > the bridge's mdb entries, queue them, and only call them once we're out
> > of the RCU read-side critical section.
> >
> > Suggested-by: Ido Schimmel 
> > Signed-off-by: Vladimir Oltean 
> > ---
> >  include/linux/if_bridge.h |  9 +
> >  net/bridge/br_mdb.c   | 84 +++
> >  net/dsa/dsa_priv.h|  2 +
> >  net/dsa/port.c|  6 +++
> >  net/dsa/slave.c   |  2 +-
> >  5 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
> > index ebd16495459c..4c25dafb013d 100644
> > --- a/include/linux/if_bridge.h
> > +++ b/include/linux/if_bridge.h
> > @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto);
> >  bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);
> >  bool br_multicast_enabled(const struct net_device *dev);
> >  bool br_multicast_router(const struct net_device *dev);
> > +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
> > + struct notifier_block *nb, struct netlink_ext_ack *extack);
> >  #else
> >  static inline int br_multicast_list_adjacent(struct net_device *dev,
> >  struct list_head *br_ip_list)
> > @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev)
> >  {
> > return false;
> >  }
> > +static inline int br_mdb_replay(struct net_device *br_dev,
> > + 

Re: [RFC PATCH v2 net-next 07/16] net: dsa: sync ageing time when joining the bridge

2021-03-20 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 03:13:03PM -0700, Florian Fainelli wrote:
> > diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> > index 86b5e05d3f21..3dafb6143cff 100644
> > --- a/net/bridge/br_stp.c
> > +++ b/net/bridge/br_stp.c
> > @@ -639,6 +639,19 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
> > return 0;
> >  }
> >  
> > +clock_t br_get_ageing_time(struct net_device *br_dev)
> > +{
> > +   struct net_bridge *br;
> > +
> > +   if (!netif_is_bridge_master(br_dev))
> > +   return 0;
> > +
> > +   br = netdev_priv(br_dev);
> > +
> > +   return jiffies_to_clock_t(br->ageing_time);
> 
> Don't you want an ASSERT_RTNL() in this function as well?

Hmm, I'm not sure. I don't think I'm accessing anything that is under
the protection of the rtnl_mutex. If anything, the ageing time is
protected by the "bridge lock", but I don't think there's much of an
issue if I read an unsigned int while not holding it.
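
For reference, what the function returns can be sketched in user space
like this. It is a simplification under stated assumptions: TOY_HZ and
TOY_USER_HZ are made-up constants, and the conversion only covers the case
where the tick rate is an integer multiple of the user-visible one (the
kernel's jiffies_to_clock_t handles other ratios too).

```c
#include <assert.h>

#define TOY_HZ      1000UL   /* assumed kernel tick rate */
#define TOY_USER_HZ 100UL    /* assumed user-visible tick rate */

/* Simplified jiffies -> clock_t conversion, valid only under the
 * assumption that TOY_HZ is a multiple of TOY_USER_HZ.
 */
static unsigned long toy_jiffies_to_clock_t(unsigned long j)
{
	assert(TOY_HZ % TOY_USER_HZ == 0);

	return j / (TOY_HZ / TOY_USER_HZ);
}

/* Stand-in for br_get_ageing_time(): 0 when the device is not a bridge. */
static unsigned long toy_get_ageing_time(int is_bridge,
					 unsigned long ageing_jiffies)
{
	if (!is_bridge)
		return 0;

	return toy_jiffies_to_clock_t(ageing_jiffies);
}
```

With these constants, an ageing time of 300 seconds (300 * TOY_HZ jiffies)
comes out as 30000 clock_t units. The read is a single word-sized load,
which is why I don't think it needs the bridge lock.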


Re: [RFC PATCH v2 net-next 03/16] net: dsa: inherit the actual bridge port flags at join time

2021-03-20 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 03:08:46PM -0700, Florian Fainelli wrote:
> 
> 
> On 3/18/2021 4:18 PM, Vladimir Oltean wrote:
> > From: Vladimir Oltean 
> > 
> > DSA currently assumes that the bridge port starts off with this
> > constellation of bridge port flags:
> > 
> > - learning on
> > - unicast flooding on
> > - multicast flooding on
> > - broadcast flooding on
> > 
> > just by virtue of code copy-pasta from the bridge layer (new_nbp).
> > This was a simple enough strategy thus far, because the 'bridge join'
> > moment always coincided with the 'bridge port creation' moment.
> > 
> > But with sandwiched interfaces, such as:
> > 
> >  br0
> >   |
> > bond0
> >   |
> >  swp0
> > 
> > it may happen that the user has had time to change the bridge port flags
> > of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
> > assume that the bridge port flags are those determined by new_nbp, when
> > in fact this can happen:
> > 
> > ip link add br0 type bridge
> > ip link add bond0 type bond
> > ip link set bond0 master br0
> > ip link set bond0 type bridge_slave learning off
> > ip link set swp0 master br0
> > 
> > Now swp0 has learning enabled, bond0 has learning disabled. Not nice.
> > 
> > Fix this by "dumpster diving" through the actual bridge port flags with
> > br_port_flag_is_set, at bridge join time.
> > 
> > We use this opportunity to split dsa_port_change_brport_flags into two
> > distinct functions called dsa_port_inherit_brport_flags and
> > dsa_port_clear_brport_flags, now that the implementation for the two
> > cases is no longer similar.
> > 
> > Signed-off-by: Vladimir Oltean 
> > ---
> >  net/dsa/port.c | 123 -
> >  1 file changed, 82 insertions(+), 41 deletions(-)
> > 
> > diff --git a/net/dsa/port.c b/net/dsa/port.c
> > index fcbe5b1545b8..346c50467810 100644
> > --- a/net/dsa/port.c
> > +++ b/net/dsa/port.c
> > @@ -122,26 +122,82 @@ void dsa_port_disable(struct dsa_port *dp)
> > rtnl_unlock();
> >  }
> >  
> > -static void dsa_port_change_brport_flags(struct dsa_port *dp,
> > -bool bridge_offload)
> > +static void dsa_port_clear_brport_flags(struct dsa_port *dp,
> > +   struct netlink_ext_ack *extack)
> >  {
> > struct switchdev_brport_flags flags;
> > -   int flag;
> >  
> > -   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
> > -   if (bridge_offload)
> > -   flags.val = flags.mask;
> > -   else
> > -   flags.val = flags.mask & ~BR_LEARNING;
> > +   flags.mask = BR_LEARNING;
> > +   flags.val = 0;
> > +   dsa_port_bridge_flags(dp, flags, extack);
> 
> Would not you want to use the same for_each_set_bit() loop that
> dsa_port_change_br_flags() uses, that would be a tad more compact.
> -- 
> Florian

The reworded version has an equal number of lines, but at least it
catches errors now:

static void dsa_port_clear_brport_flags(struct dsa_port *dp,
					struct netlink_ext_ack *extack)
{
	const unsigned long val = BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
	const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
				   BR_BCAST_FLOOD;
	int flag, err;

	for_each_set_bit(flag, &mask, 32) {
		struct switchdev_brport_flags flags = {0};

		flags.mask = BIT(flag);
		flags.val = val & BIT(flag);

		err = dsa_port_bridge_flags(dp, flags, extack);
		if (err && err != -EOPNOTSUPP)
			dev_err(dp->ds->dev,
				"failed to clear bridge port flag %d: %d (%pe)\n",
				flag, err, ERR_PTR(err));
	}
}


Re: [RFC PATCH v2 net-next 14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

2021-03-19 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 05:29:12PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 5:06 PM Vladimir Oltean  wrote:
> >
> > This is a good point actually, which I thought about, but did not give a
> > lot of importance to for the moment. Either we go full steam ahead with
> > assisted learning on the CPU port for everybody, and we selectively
> > learn the addresses relevant to the bridging function only, or we do
> > what you say, but then it will be a little bit more complicated IMO, and
> > have hardware dependencies, which isn't as nice.
> 
> Are skb->offload_fwd_mark and source DSA switch kept in dsa_slave_xmit?
> I think SA learning should be bypassed iff skb->offload_fwd_mark == 1 and
> source DSA switch == destination DSA switch.

Why would you even want to look at the source net device for forwarding?
I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly
want to bypass address learning if you can. Maybe also for link-local traffic.


Re: [RFC PATCH v2 net-next 14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

2021-03-19 Thread Vladimir Oltean
On Fri, Mar 19, 2021 at 04:52:31PM +0800, DENG Qingfang wrote:
> On Fri, Mar 19, 2021 at 01:18:27AM +0200, Vladimir Oltean wrote:
> > From: Vladimir Oltean 
> > 
> > DSA has gained the recent ability to deal gracefully with upper
> > interfaces it cannot offload, such as the bridge, bonding or team
> > drivers. When such uppers exist, the ports are still in standalone mode
> > as far as the hardware is concerned.
> > 
> > But when we deliver packets to the software bridge in order for that to
> > do the forwarding, there is an unpleasant surprise in that the bridge
> > will refuse to forward them. This is because we unconditionally set
> > skb->offload_fwd_mark = true, meaning that the bridge thinks the frames
> > were already forwarded in hardware by us.
> > 
> > Since dp->bridge_dev is populated only when there is hardware offload
> > for it, but not in the software fallback case, let's introduce a new
> > helper that can be called from the tagger data path which sets the
> > skb->offload_fwd_mark accordingly to zero when there is no hardware
> > offload for bridging. This lets the bridge forward packets back to other
> > interfaces of our switch, if needed.
> > 
> > Without this change, sending a packet to the CPU for an unoffloaded
> > interface triggers this WARN_ON:
> > 
> > > void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
> > > 			      struct sk_buff *skb)
> > > {
> > > 	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
> > > 		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
> > > }
> > 
> > Signed-off-by: Vladimir Oltean 
> > Reviewed-by: Tobias Waldekranz 
> > ---
> >  net/dsa/dsa_priv.h | 14 ++
> >  net/dsa/tag_brcm.c |  2 +-
> >  net/dsa/tag_dsa.c  | 15 +++
> >  net/dsa/tag_hellcreek.c|  2 +-
> >  net/dsa/tag_ksz.c  |  2 +-
> >  net/dsa/tag_lan9303.c  |  3 ++-
> >  net/dsa/tag_mtk.c  |  2 +-
> >  net/dsa/tag_ocelot.c   |  2 +-
> >  net/dsa/tag_ocelot_8021q.c |  2 +-
> >  net/dsa/tag_rtl4_a.c   |  2 +-
> >  net/dsa/tag_sja1105.c  |  4 ++--
> >  net/dsa/tag_xrs700x.c  |  2 +-
> >  12 files changed, 37 insertions(+), 15 deletions(-)
> > 
> > diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
> > index 92282de54230..b61bef79ce84 100644
> > --- a/net/dsa/dsa_priv.h
> > +++ b/net/dsa/dsa_priv.h
> > @@ -349,6 +349,20 @@ static inline struct sk_buff 
> > *dsa_untag_bridge_pvid(struct sk_buff *skb)
> > return skb;
> >  }
> >  
> > > +/* If the ingress port offloads the bridge, we mark the frame as autonomously
> > > + * forwarded by hardware, so the software bridge doesn't forward it twice, back
> > > + * to us, because we already did. However, if we're in fallback mode and we do
> > > + * software bridging, we are not offloading it, therefore the dp->bridge_dev
> > > + * pointer is not populated, and flooding needs to be done by software (we are
> > + * effectively operating in standalone ports mode).
> > + */
> > +static inline void dsa_default_offload_fwd_mark(struct sk_buff *skb)
> > +{
> > +   struct dsa_port *dp = dsa_slave_to_port(skb->dev);
> > +
> > +   skb->offload_fwd_mark = !!(dp->bridge_dev);
> > +}
> 
> So offload_fwd_mark is set iff the ingress port offloads the bridge.
> Consider this set up on a switch which does NOT support LAG offload:
> 
>         +----- br0 -----+
>         |               |
>       bond0             |
>         |               |     (Linux interfaces)
>     +---+---+       +---+---+
>     |       |       |       |
> +-------+-------+-------+-------+
> | sw0p0 | sw0p1 | sw0p2 | sw0p3 |
> +-------+-------+-------+-------+
>     |       |       |       |
>     +---A---+       B       C (LAN clients)
> 
> 
> sw0p0 and sw0p1 should be in standalone mode (offload_fwd_mark = 0),
> while sw0p2 and sw0p3 are offloaded (offload_fwd_mark = 1).
> 
> When a frame is sent into sw0p2 or sw0p3, can it be forwarded to sw0p0 or
> sw0p1?

bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
				  const struct sk_buff *skb)
{
	return !skb->offload_fwd_mark ||
	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
}

where p->offload_fwd_mark is the mark of the egress port, and
BR_INPUT_SKB_CB(skb)->offload_fwd_mark is the mark of the ingress port,
assigned here:

void nbp_switchdev_frame_mark(cons

[RFC PATCH v2 net-next 15/16] net: dsa: return -EOPNOTSUPP when driver does not implement .port_lag_join

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

The DSA core has a layered structure, and even though we end up
returning 0 (success) to user space when setting a bonding/team upper
that can't be offloaded, some parts of the framework actually need to
know that we couldn't offload that.

For example, if dsa_switch_lag_join returns 0 as it currently does,
dsa_port_lag_join has no way to tell a successful offload from a
software fallback, and it will call dsa_port_bridge_join afterwards.
Then we'll think we're offloading the bridge master of the LAG, when in
fact we're not even offloading the LAG. In turn, this will make us set
skb->offload_fwd_mark = true, which is incorrect and the bridge doesn't
like it.
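
As an illustration only (a standalone userspace sketch, not part of this
patch; every model_* name is made up, and the place where -EOPNOTSUPP is
turned back into 0 is an assumption about the upper layer), the layering
could be pictured like this:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a switch driver's ops table. */
struct model_switch_ops {
	int (*port_lag_join)(int port);
};

/* Hypothetical per-port bookkeeping kept by the upper layer. */
struct model_port {
	bool lag_offloaded;
	bool bridge_offloaded;
};

/* Lower layer: a missing callback now means "cannot offload". */
static int model_switch_lag_join(const struct model_switch_ops *ops, int port)
{
	if (ops->port_lag_join)
		return ops->port_lag_join(port);

	return -EOPNOTSUPP;
}

/* Upper layer: user space still sees success on software fallback,
 * but neither the LAG nor its bridge master is recorded as offloaded.
 */
static int model_port_lag_join(struct model_port *p,
			       const struct model_switch_ops *ops,
			       int port, bool lag_has_bridge_upper)
{
	int err;

	assert(p && ops);

	err = model_switch_lag_join(ops, port);
	if (err == -EOPNOTSUPP) {
		p->lag_offloaded = false;
		p->bridge_offloaded = false;
		return 0;
	}
	if (err)
		return err;

	p->lag_offloaded = true;
	p->bridge_offloaded = lag_has_bridge_upper;
	return 0;
}

/* Example driver callback that accepts every LAG. */
static int model_lag_join_ok(int port)
{
	(void)port;
	return 0;
}
```

With a return value of 0 from the lower layer in both cases, the upper
layer in this model would have no branch to take for the fallback.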

Signed-off-by: Vladimir Oltean 
---
 net/dsa/switch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 4b5da89dc27a..162bbb2f5cec 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -213,7 +213,7 @@ static int dsa_switch_lag_join(struct dsa_switch *ds,
   info->port, info->lag,
   info->info);
 
-   return 0;
+   return -EOPNOTSUPP;
 }
 
 static int dsa_switch_lag_leave(struct dsa_switch *ds,
@@ -226,7 +226,7 @@ static int dsa_switch_lag_leave(struct dsa_switch *ds,
return ds->ops->crosschip_lag_leave(ds, info->sw_index,
info->port, info->lag);
 
-   return 0;
+   return -EOPNOTSUPP;
 }
 
 static bool dsa_switch_mdb_match(struct dsa_switch *ds, int port,
-- 
2.25.1



[RFC PATCH v2 net-next 16/16] net: bridge: switchdev: let drivers inform which bridge ports are offloaded

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

On reception of an skb, the bridge checks if it was marked as 'already
forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
is, it puts a mark of its own on that skb, with the switchdev mark of
the ingress port. Then during forwarding, it enforces that the egress
port must have a different switchdev mark than the ingress one (this is
done in nbp_switchdev_allowed_egress).

Non-switchdev drivers don't report any physical switch id (neither
through devlink nor .ndo_get_port_parent_id), therefore the bridge
assigns them a switchdev mark of 0, and packets coming from them will
always have skb->offload_fwd_mark = 0. So there aren't any restrictions.

Problems appear due to the fact that DSA would like to perform software
fallback for bonding and team interfaces that the physical switch cannot
offload.

         +-- br0 -+
        /   / |    \
       /   /  |     \
      /   /   |      \
     /   /    |       \
    /   /     |        \
   /    |     |       bond0
  /     |     |      /    \
 swp0  swp1  swp2  swp3  swp4

There, it is desirable that the presence of swp3 and swp4 under a
non-offloaded LAG does not preclude us from doing hardware bridging
between swp0, swp1 and swp2. The bandwidth of the CPU is often high
enough that software bridging between {swp0,swp1,swp2} and bond0 is not
impractical.

But this creates an impossible paradox given the current way in which
port switchdev marks are assigned. When the driver receives a packet
from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
something.

- If we set it to 0, then the bridge will forward it towards swp1, swp2
  and bond0. But the switch has already forwarded it towards swp1 and
  swp2 (not to bond0, remember, that isn't offloaded, so as far as the
  switch is concerned, ports swp3 and swp4 are not looking up the FDB,
  and the entire bond0 is a destination that is strictly behind the
  CPU). But we don't want duplicated traffic towards swp1 and swp2, so
  it's not ok to set skb->offload_fwd_mark = 0.

- If we set it to 1, then the bridge will not forward the skb towards
  the ports with the same switchdev mark, i.e. not to swp1, swp2 and
  bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
  have forwarded the skb there.

So the real issue is that bond0 will be assigned the same switchdev mark
as {swp0,swp1,swp2}, because the function that assigns switchdev marks
to bridge ports, nbp_switchdev_mark_set, recurses through bond0's lower
interfaces until it finds something that implements devlink.

A solution is to give the bridge explicit hints as to what switchdev
mark it should use for each port.
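
The paradox, and the effect of such a hint, can be reproduced with a
small userspace model of the two bridge checks involved (a sketch only,
not kernel code: the model_* types are invented for illustration, and the
mark values 1 and 0 are assumptions):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bridge port: only the switchdev mark matters here. */
struct model_brport {
	const char *name;
	int offload_fwd_mark;	/* mark assigned by the bridge, 0 = none */
};

/* Hypothetical skb: driver flag plus the bridge's control-block mark. */
struct model_skb {
	bool offload_fwd_mark;	/* set by the driver on reception */
	int ingress_mark;	/* copy of the ingress port's mark */
};

/* Same decision as nbp_switchdev_frame_mark, without the WARN_ON_ONCE. */
static void model_frame_mark(const struct model_brport *in,
			     struct model_skb *skb)
{
	assert(in && skb);

	skb->ingress_mark = 0;
	if (skb->offload_fwd_mark && in->offload_fwd_mark)
		skb->ingress_mark = in->offload_fwd_mark;
}

/* Same decision as nbp_switchdev_allowed_egress. */
static bool model_allowed_egress(const struct model_brport *out,
				 const struct model_skb *skb)
{
	return !skb->offload_fwd_mark ||
	       skb->ingress_mark != out->offload_fwd_mark;
}
```

In this model, a frame received on swp0 (mark 1) with offload_fwd_mark
set is refused towards any port that also carries mark 1, which is what
we want for swp1 and swp2 but not for a software bond0. Once bond0 is
given mark 0, the same frame is allowed towards it and still refused
towards swp1 and swp2.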

Currently, the bridging offload is very 'silent': a driver registers a
netdevice notifier, which is put on the netns's notifier chain, and
which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
bridge, and the lower is an interface it knows about (one registered by
this driver, normally). Then, from within that notifier, it does a bunch
of stuff behind the bridge's back, without the bridge necessarily
knowing that there's somebody offloading that port. It looks like this:

 ip link set swp0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v
        call_netdevice_notifiers
                  |
                  v
       dsa_slave_netdevice_event
                  |
                  v
        oh, hey! it's for me!
                  |
                  v
           .port_bridge_join

What we do to solve the conundrum is to be less silent, and emit a
notification back. Something like this:

 ip link set swp0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v                    bridge: Aye! I'll use this
        call_netdevice_notifiers           ^  ppid as the
                  |                        |  switchdev mark for
                  v                        |  this port, and zero
       dsa_slave_netdevice_event           |  if I got nothing.
                  |                        |
                  v                        |
        oh, hey! it's for me!              |
                  |                        |
                  v                        |
           .port_bridge_join               |
                  |                        |
                  +------------------------+
             switchdev_bridge_port_offload(swp0)

Then stacked interfaces (like bond0 on top of swp3/swp4) would be
treated differently in DSA, depending on whether we can or cannot
offload them.

The offload case:

    ip link set bond0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v                    bridge: Aye! I'll use this
        call_netdevice_notifiers

[RFC PATCH v2 net-next 14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA has recently gained the ability to deal gracefully with upper
interfaces it cannot offload, such as the bridge, bonding or team
drivers. When such uppers exist, the ports are still in standalone mode
as far as the hardware is concerned.

But when we deliver packets to the software bridge in order for that to
do the forwarding, there is an unpleasant surprise in that the bridge
will refuse to forward them. This is because we unconditionally set
skb->offload_fwd_mark = true, meaning that the bridge thinks the frames
were already forwarded in hardware by us.

Since dp->bridge_dev is populated only when there is hardware offload
for it, but not in the software fallback case, let's introduce a new
helper that can be called from the tagger data path which sets the
skb->offload_fwd_mark accordingly to zero when there is no hardware
offload for bridging. This lets the bridge forward packets back to other
interfaces of our switch, if needed.

Without this change, sending a packet to the CPU for an unoffloaded
interface triggers this WARN_ON:

void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
			      struct sk_buff *skb)
{
	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
}

Signed-off-by: Vladimir Oltean 
Reviewed-by: Tobias Waldekranz 
---
 net/dsa/dsa_priv.h | 14 ++
 net/dsa/tag_brcm.c |  2 +-
 net/dsa/tag_dsa.c  | 15 +++
 net/dsa/tag_hellcreek.c|  2 +-
 net/dsa/tag_ksz.c  |  2 +-
 net/dsa/tag_lan9303.c  |  3 ++-
 net/dsa/tag_mtk.c  |  2 +-
 net/dsa/tag_ocelot.c   |  2 +-
 net/dsa/tag_ocelot_8021q.c |  2 +-
 net/dsa/tag_rtl4_a.c   |  2 +-
 net/dsa/tag_sja1105.c  |  4 ++--
 net/dsa/tag_xrs700x.c  |  2 +-
 12 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 92282de54230..b61bef79ce84 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -349,6 +349,20 @@ static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb)
return skb;
 }
 
+/* If the ingress port offloads the bridge, we mark the frame as autonomously
+ * forwarded by hardware, so the software bridge doesn't forward it twice, back
+ * to us, because we already did. However, if we're in fallback mode and we do
+ * software bridging, we are not offloading it, therefore the dp->bridge_dev
+ * pointer is not populated, and flooding needs to be done by software (we are
+ * effectively operating in standalone ports mode).
+ */
+static inline void dsa_default_offload_fwd_mark(struct sk_buff *skb)
+{
+   struct dsa_port *dp = dsa_slave_to_port(skb->dev);
+
+   skb->offload_fwd_mark = !!(dp->bridge_dev);
+}
+
 /* switch.c */
 int dsa_switch_register_notifier(struct dsa_switch *ds);
 void dsa_switch_unregister_notifier(struct dsa_switch *ds);
diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c
index e2577a7dcbca..a8880b3bb106 100644
--- a/net/dsa/tag_brcm.c
+++ b/net/dsa/tag_brcm.c
@@ -150,7 +150,7 @@ static struct sk_buff *brcm_tag_rcv_ll(struct sk_buff *skb,
/* Remove Broadcom tag and update checksum */
skb_pull_rcsum(skb, BRCM_TAG_LEN);
 
-   skb->offload_fwd_mark = 1;
+   dsa_default_offload_fwd_mark(skb);
 
return skb;
 }
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 7e7b7decdf39..09ab9c25e686 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -162,8 +162,8 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
  u8 extra)
 {
+   bool trap = false, trunk = false;
int source_device, source_port;
-   bool trunk = false;
enum dsa_code code;
enum dsa_cmd cmd;
u8 *dsa_header;
@@ -174,8 +174,6 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
cmd = dsa_header[0] >> 6;
switch (cmd) {
case DSA_CMD_FORWARD:
-   skb->offload_fwd_mark = 1;
-
trunk = !!(dsa_header[1] & 7);
break;
 
@@ -194,7 +192,6 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 * device (like a bridge) that forwarding has
 * already been done by hardware.
 */
-   skb->offload_fwd_mark = 1;
break;
case DSA_CODE_MGMT_TRAP:
case DSA_CODE_IGMP_MLD_TRAP:
@@ -202,6 +199,7 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
/* Traps have, by definition, not been
 * forwarded by hardware, so don't mark them.
   

[RFC PATCH v2 net-next 13/16] net: ocelot: replay switchdev events when joining bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

The premise of this change is that the switchdev port attributes and
objects offloaded by ocelot might have been missed when we are joining
an already existing bridge port, such as a bonding interface.

The patch pulls these switchdev attributes and objects from the bridge,
on behalf of the 'bridge port' net device which might be either the
ocelot switch interface, or the bonding upper interface.

The ocelot_net.c belongs strictly to the switchdev ocelot driver, while
ocelot.c is part of a library shared with the DSA felix driver.
The ocelot_port_bridge_leave function (part of the common library) used
to call ocelot_port_vlan_filtering(false), something which is not
necessary for DSA, since the framework deals with that already there.
So we move this function to ocelot_switchdev_unsync, which is specific
to the switchdev driver.

The code movement described above makes ocelot_port_bridge_leave no
longer return an error code, so we change its type from int to void.

Signed-off-by: Vladimir Oltean 
---
 drivers/net/dsa/ocelot/felix.c |   4 +-
 drivers/net/ethernet/mscc/ocelot.c |  18 ++--
 drivers/net/ethernet/mscc/ocelot_net.c | 117 +
 include/soc/mscc/ocelot.h  |   6 +-
 4 files changed, 111 insertions(+), 34 deletions(-)

diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index 628afb47b579..6b5442be0230 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -719,7 +719,9 @@ static int felix_bridge_join(struct dsa_switch *ds, int port,
 {
struct ocelot *ocelot = ds->priv;
 
-   return ocelot_port_bridge_join(ocelot, port, br);
+   ocelot_port_bridge_join(ocelot, port, br);
+
+   return 0;
 }
 
 static void felix_bridge_leave(struct dsa_switch *ds, int port,
diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index ce57929ba3d1..1a36b416fd9b 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1514,34 +1514,28 @@ int ocelot_port_mdb_del(struct ocelot *ocelot, int port,
 }
 EXPORT_SYMBOL(ocelot_port_mdb_del);
 
-int ocelot_port_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+void ocelot_port_bridge_join(struct ocelot *ocelot, int port,
+struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
 
ocelot_port->bridge = bridge;
 
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_join);
 
-int ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
-struct net_device *bridge)
+void ocelot_port_bridge_leave(struct ocelot *ocelot, int port,
+ struct net_device *bridge)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
struct ocelot_vlan pvid = {0}, native_vlan = {0};
-   int ret;
 
ocelot_port->bridge = NULL;
 
-   ret = ocelot_port_vlan_filtering(ocelot, port, false);
-   if (ret)
-   return ret;
-
ocelot_port_set_pvid(ocelot, port, pvid);
ocelot_port_set_native_vlan(ocelot, port, native_vlan);
-
-   return 0;
+   ocelot_apply_bridge_fwd_mask(ocelot);
 }
 EXPORT_SYMBOL(ocelot_port_bridge_leave);
 
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index d1376f7b34fd..d38ffc7cf5f0 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,47 +1117,126 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
+static void ocelot_inherit_brport_flags(struct ocelot *ocelot, int port,
+   struct net_device *brport_dev)
+{
+   struct switchdev_brport_flags flags = {0};
+   int flag;
+
+   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+
+   for_each_set_bit(flag, &flags.mask, 32)
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val |= BIT(flag);
+
+   ocelot_port_bridge_flags(ocelot, port, flags);
+}
+
+static void ocelot_clear_brport_flags(struct ocelot *ocelot, int port)
+{
+   struct switchdev_brport_flags flags;
+
+   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+   flags.val = flags.mask & ~BR_LEARNING;
+
+   ocelot_port_bridge_flags(ocelot, port, flags);
+}
+
+static int ocelot_switchdev_sync(struct ocelot *ocelot, int port,
+struct net_device *brport_dev,
+struct net_device *bridge_dev,
+struct netlink_ext_ack *extack)
+{
+   clock_t ageing_time;
+   u8 stp_state;
+   int err;
+
+   ocelot_inherit_brport_flags(ocelot, port, brport_dev);
+
+   stp_state = br_port_get_s

[RFC PATCH v2 net-next 12/16] net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

Similar to the DSA situation, ocelot supports LAG offload but treats
this scenario improperly:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

We do the same thing as we do there, which is to simulate a 'bridge join'
on 'lag join', if we detect that the bonding upper has a bridge upper.

Again, same as DSA, ocelot supports software fallback for LAG, and in
that case, we should avoid calling ocelot_netdevice_changeupper.

Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/mscc/ocelot_net.c | 111 +++--
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index c08164cd88f4..d1376f7b34fd 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1117,10 +1117,15 @@ static int ocelot_port_obj_del(struct net_device *dev,
return ret;
 }
 
-static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
-   struct net_device *bridge)
+static int ocelot_netdevice_bridge_join(struct net_device *dev,
+   struct net_device *bridge,
+   struct netlink_ext_ack *extack)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1135,10 +1140,14 @@ static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port,
return 0;
 }
 
-static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
+static int ocelot_netdevice_bridge_leave(struct net_device *dev,
 struct net_device *bridge)
 {
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
struct switchdev_brport_flags flags;
+   int port = priv->chip_port;
int err;
 
flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
@@ -1151,43 +1160,89 @@ static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port,
return err;
 }
 
-static int ocelot_netdevice_changeupper(struct net_device *dev,
-   struct netdev_notifier_changeupper_info *info)
+static int ocelot_netdevice_lag_join(struct net_device *dev,
+struct net_device *bond,
+struct netdev_lag_upper_info *info,
+struct netlink_ext_ack *extack)
 {
struct ocelot_port_private *priv = netdev_priv(dev);
struct ocelot_port *ocelot_port = &priv->port;
struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
int port = priv->chip_port;
+   int err;
+
+   err = ocelot_port_lag_join(ocelot, port, bond, info);
+   if (err == -EOPNOTSUPP) {
+   NL_SET_ERR_MSG_MOD(extack, "Offloading not supported");
+   return 0;
+   }
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = ocelot_netdevice_bridge_join(dev, bridge_dev, extack);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   ocelot_port_lag_leave(ocelot, port, bond);
+   return err;
+}
+
+static int ocelot_netdevice_lag_leave(struct net_device *dev,
+ struct net_device *bond)
+{
+   struct ocelot_port_private *priv = netdev_priv(dev);
+   struct ocelot_port *ocelot_port = &priv->port;
+   struct ocelot *ocelot = ocelot_port->ocelot;
+   struct net_device *bridge_dev;
+   int port = priv->chip_port;
+
+   ocelot_port_lag_leave(ocelot, port, bond);
+
+   bridge_dev = netdev_master_upper_dev_get(bond);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   return ocelot_netdevice_bridge_leave(dev, bridge_dev);
+}
+
+static int ocelot_netdevice_changeupper(struct net_device *dev,
+   struct netdev_notifier_changeupper_info *info)
+{
+   struct netlink_ext_ack *extack;
int err = 0;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
-   if (info->linking) {
-   err = ocelot_netdevice_bridge_join(ocelot, port,
-  info->upper_

[RFC PATCH v2 net-next 11/16] net: ocelot: support multiple bridges

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

The ocelot switches are a bit odd in that they do not have an STP state
to put the ports into. Instead, the forwarding configuration is delayed
from the typical port_bridge_join into stp_state_set, when the port enters
the BR_STATE_FORWARDING state.

I can only guess that the implementation of this quirk is the reason that
led to the simplification of the driver such that only one bridge could
be offloaded at a time.

We can simplify the data structures somewhat, and introduce a per-port
bridge device pointer and STP state, similar to how the LAG offload
works now (there we have a per-port bonding device pointer and TX
enabled state). This allows offloading multiple bridges with relative
ease, while still keeping in place the quirk to delay the programming of
the PGIDs.

We actually need this change now because we need to remove the bogus
restriction from ocelot_bridge_stp_state_set that ocelot->bridge_mask
needs to contain BIT(port), otherwise that function is a no-op.
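
The per-port bookkeeping can be pictured with the following userspace
sketch (illustrative only, not the driver code: the model_* names, the
fixed port count and the simplified STP enum are assumptions, and the
bonding and CPU port handling of the real function is left out):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_PORTS 6

enum model_stp_state {
	MODEL_STP_DISABLED,
	MODEL_STP_LEARNING,
	MODEL_STP_FORWARDING,
};

/* Hypothetical per-port state: which bridge, and which STP state. */
struct model_port {
	const void *bridge;		/* NULL when standalone */
	enum model_stp_state stp_state;
};

/* All ports that are members of @bridge and currently forwarding. */
static uint32_t model_bridge_fwd_mask(const struct model_port *ports,
				      const void *bridge)
{
	uint32_t mask = 0;
	int port;

	for (port = 0; port < MODEL_NUM_PORTS; port++)
		if (ports[port].bridge == bridge &&
		    ports[port].stp_state == MODEL_STP_FORWARDING)
			mask |= UINT32_C(1) << port;

	return mask;
}

/* Destinations a bridged port may forward to: its bridge's forwarding
 * ports, minus itself. Standalone ports forward to nobody here.
 */
static uint32_t model_port_fwd_mask(const struct model_port *ports, int port)
{
	assert(port >= 0 && port < MODEL_NUM_PORTS);

	if (!ports[port].bridge)
		return 0;

	return model_bridge_fwd_mask(ports, ports[port].bridge) &
	       ~(UINT32_C(1) << port);
}
```

Because the mask is derived from each port's own bridge pointer instead
of a single global bridge_mask, ports of two different bridges never end
up in each other's forwarding mask in this model.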

Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/mscc/ocelot.c | 72 +++---
 include/soc/mscc/ocelot.h  |  7 ++-
 2 files changed, 39 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 9f0c9bdd9f5d..ce57929ba3d1 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -766,7 +766,7 @@ int ocelot_xtr_poll_frame(struct ocelot *ocelot, int grp, struct sk_buff **nskb)
/* Everything we see on an interface that is in the HW bridge
 * has already been forwarded.
 */
-   if (ocelot->bridge_mask & BIT(src_port))
+   if (ocelot->ports[src_port]->bridge)
skb->offload_fwd_mark = 1;
 
skb->protocol = eth_type_trans(skb, dev);
@@ -1183,6 +1183,26 @@ static u32 ocelot_get_bond_mask(struct ocelot *ocelot, struct net_device *bond,
return mask;
 }
 
+static u32 ocelot_get_bridge_fwd_mask(struct ocelot *ocelot,
+ struct net_device *bridge)
+{
+   u32 mask = 0;
+   int port;
+
+   for (port = 0; port < ocelot->num_phys_ports; port++) {
+   struct ocelot_port *ocelot_port = ocelot->ports[port];
+
+   if (!ocelot_port)
+   continue;
+
+   if (ocelot_port->stp_state == BR_STATE_FORWARDING &&
+   ocelot_port->bridge == bridge)
+   mask |= BIT(port);
+   }
+
+   return mask;
+}
+
 static u32 ocelot_get_dsa_8021q_cpu_mask(struct ocelot *ocelot)
 {
u32 mask = 0;
@@ -1232,10 +1252,12 @@ void ocelot_apply_bridge_fwd_mask(struct ocelot *ocelot)
 */
mask = GENMASK(ocelot->num_phys_ports - 1, 0);
mask &= ~cpu_fwd_mask;
-   } else if (ocelot->bridge_fwd_mask & BIT(port)) {
+   } else if (ocelot_port->bridge) {
+   struct net_device *bridge = ocelot_port->bridge;
struct net_device *bond = ocelot_port->bond;
 
-   mask = ocelot->bridge_fwd_mask & ~BIT(port);
+   mask = ocelot_get_bridge_fwd_mask(ocelot, bridge);
+   mask &= ~BIT(port);
if (bond) {
mask &= ~ocelot_get_bond_mask(ocelot, bond,
  false);
@@ -1256,29 +1278,16 @@ EXPORT_SYMBOL(ocelot_apply_bridge_fwd_mask);
 void ocelot_bridge_stp_state_set(struct ocelot *ocelot, int port, u8 state)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
-   u32 port_cfg;
-
-   if (!(BIT(port) & ocelot->bridge_mask))
-   return;
+   u32 learn_ena = 0;
 
-   port_cfg = ocelot_read_gix(ocelot, ANA_PORT_PORT_CFG, port);
+   ocelot_port->stp_state = state;
 
-   switch (state) {
-   case BR_STATE_FORWARDING:
-   ocelot->bridge_fwd_mask |= BIT(port);
-   fallthrough;
-   case BR_STATE_LEARNING:
-   if (ocelot_port->learn_ena)
-   port_cfg |= ANA_PORT_PORT_CFG_LEARN_ENA;
-   break;
-
-   default:
-   port_cfg &= ~ANA_PORT_PORT_CFG_LEARN_ENA;
-   ocelot->bridge_fwd_mask &= ~BIT(port);
-   break;
-   }
+   if ((state == BR_STATE_LEARNING || state == BR_STATE_FORWARDING) &&
+   ocelot_port->learn_ena)
+   learn_ena = ANA_PORT_PORT_CFG_LEARN_ENA;
 
-   ocelot_write_gix(ocelot, port_cfg, ANA_PORT_PORT_CFG, port);
+   ocelot_rmw_gix(ocelot, learn_ena, ANA_PORT_PORT_CFG_LEARN_ENA,
+  ANA_PORT_PORT_CFG, port);
 
ocelot_apply_bridge_fwd_mask(ocelot);
 }
@@ -1508,16 +1517,9 @@ EXPORT_SYMBOL(ocelot_port_m

[RFC PATCH v2 net-next 10/16] net: dsa: replay VLANs installed on port when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

Currently this simple setup:

ip link add br0 type bridge vlan_filtering 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

will not work because the bridge has created the PVID in br_add_if ->
nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
but that was too early, since swp0 was not yet a lower of bond0, so it
had no reason to act upon that notification.
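
The idea of the replay, reduced to a userspace sketch (illustrative
only: the model_* names are invented, the callback stands in for a
single driver's notifier block, and should_use stands in for
br_vlan_should_use()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VLAN entry as kept by the bridge for one port. */
struct model_vlan {
	uint16_t vid;
	bool should_use;
};

/* Hypothetical per-driver callback, i.e. one notifier block. */
typedef int (*model_vlan_add_fn)(void *ctx, uint16_t vid, bool pvid);

/* Walk the VLANs already installed on the bridge port and hand each
 * usable one to the late joiner only, stopping at the first error.
 */
static int model_vlan_replay(const struct model_vlan *vlans, size_t n,
			     uint16_t pvid, model_vlan_add_fn add, void *ctx)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		if (!vlans[i].should_use)
			continue;

		err = add(ctx, vlans[i].vid, vlans[i].vid == pvid);
		if (err)
			return err;
	}

	return 0;
}

/* Hypothetical hardware state of the port that joined late. */
struct model_hw {
	uint16_t vids[8];
	size_t count;
	uint16_t pvid;
};

static int model_hw_vlan_add(void *ctx, uint16_t vid, bool pvid)
{
	struct model_hw *hw = ctx;

	assert(hw->count < 8);
	hw->vids[hw->count++] = vid;
	if (pvid)
		hw->pvid = vid;

	return 0;
}
```

In the setup above, swp0 would be the caller of the replay once it
becomes a lower of bond0, and it would then learn about VLAN 1 as the
PVID even though the original notification predates it.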

Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h | 10 ++
 net/bridge/br_vlan.c  | 71 +++
 net/dsa/port.c|  6 
 3 files changed, 87 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 89596134e88f..ea176c508c0d 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -111,6 +111,8 @@ int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid);
 int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto);
 int br_vlan_get_info(const struct net_device *dev, u16 vid,
 struct bridge_vlan_info *p_vinfo);
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline bool br_vlan_enabled(const struct net_device *dev)
 {
@@ -137,6 +139,14 @@ static inline int br_vlan_get_info(const struct net_device *dev, u16 vid,
 {
return -EINVAL;
 }
+
+static inline int br_vlan_replay(struct net_device *br_dev,
+struct net_device *dev,
+struct notifier_block *nb,
+struct netlink_ext_ack *extack)
+{
+   return -EINVAL;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 8829f621b8ec..45a4eac1b217 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -1751,6 +1751,77 @@ void br_vlan_notify(const struct net_bridge *br,
kfree_skb(skb);
 }
 
+static int br_vlan_replay_one(struct notifier_block *nb,
+ struct net_device *dev,
+ struct switchdev_obj_port_vlan *vlan,
+ struct netlink_ext_ack *extack)
+{
+   struct switchdev_notifier_port_obj_info obj_info = {
+   .info = {
+   .dev = dev,
+   .extack = extack,
+   },
+   .obj = &vlan->obj,
+   };
+   int err;
+
+   err = nb->notifier_call(nb, SWITCHDEV_PORT_OBJ_ADD, &obj_info);
+   return notifier_to_errno(err);
+}
+
+int br_vlan_replay(struct net_device *br_dev, struct net_device *dev,
+  struct notifier_block *nb, struct netlink_ext_ack *extack)
+{
+   struct net_bridge_vlan_group *vg;
+   struct net_bridge_vlan *v;
+   struct net_bridge_port *p;
+   struct net_bridge *br;
+   int err = 0;
+   u16 pvid;
+
+   ASSERT_RTNL();
+
+   if (!netif_is_bridge_master(br_dev))
+   return -EINVAL;
+
+   if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   if (netif_is_bridge_master(dev)) {
+   br = netdev_priv(dev);
+   vg = br_vlan_group(br);
+   p = NULL;
+   } else {
+   p = br_port_get_rtnl(dev);
+   if (WARN_ON(!p))
+   return -EINVAL;
+   vg = nbp_vlan_group(p);
+   br = p->br;
+   }
+
+   if (!vg)
+   return 0;
+
+   pvid = br_get_pvid(vg);
+
+   list_for_each_entry(v, &vg->vlan_list, vlist) {
+   struct switchdev_obj_port_vlan vlan = {
+   .obj.orig_dev = dev,
+   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
+   .flags = br_vlan_flags(v, pvid),
+   .vid = v->vid,
+   };
+
+   if (!br_vlan_should_use(v))
+   continue;
+
+   err = br_vlan_replay_one(nb, dev, &vlan, extack);
+   if (err)
+   return err;
+   }
+
+   return err;
+}
 /* check if v_curr can enter a range ending in range_end */
 bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr,
 const struct net_bridge_vlan *range_end)
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 9850051071f2..6c3c357ac409 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -209,6 +209,12 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = br_vlan_replay(br, brport_dev,
+&dsa_slave_switchdev_blocking_notifier,
+extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
-- 
2.25.1



[RFC PATCH v2 net-next 08/16] net: dsa: replay port and host-joined mdb entries when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

I have udhcpcd in my system and this is configured to bring interfaces
up as soon as they are created.

I create a bridge as follows:

ip link add br0 type bridge

As soon as I create the bridge and udhcpcd brings it up, I have some
other crap (avahi) that starts sending some random IPv6 packets to
advertise some local services, and from there, the br0 bridge joins the
following IPv6 groups:

33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0

br_dev_xmit
-> br_multicast_rcv
   -> br_ip6_multicast_add_group
  -> __br_multicast_add_group
 -> br_multicast_host_join
-> br_mdb_notify

This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.

Then when we add a port to br0:

ip link set swp0 master br0

the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.

The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.

Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what the new guy's notifier block is, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.

To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.
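
A minimal user-space sketch of that two-phase pattern, for illustration
only. All names below (mdb_entry, queued_mdb, read_lock, mdb_replay_sketch,
stats_cb) are hypothetical stand-ins and not the API added by this patch,
and the counter merely models the RCU read-side critical section:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a bridge MDB entry. */
struct mdb_entry {
	unsigned char addr[6];
	unsigned short vid;
	struct mdb_entry *next;
};

/* Private copy of what the callback needs, queued for later delivery. */
struct queued_mdb {
	unsigned char addr[6];
	unsigned short vid;
	struct queued_mdb *next;
};

typedef int (*replay_cb_t)(const struct queued_mdb *obj, void *priv);

/* Models the RCU read-side critical section with a plain counter. */
static int read_section_depth;

static void read_lock(void)   { read_section_depth++; }
static void read_unlock(void) { read_section_depth--; }

static int mdb_replay_sketch(const struct mdb_entry *head,
			     replay_cb_t cb, void *priv)
{
	struct queued_mdb *queue = NULL, **tail = &queue, *q;
	const struct mdb_entry *e;
	int err = 0;

	/* Phase 1: inside the read-side section, only copy and queue. */
	read_lock();
	for (e = head; e; e = e->next) {
		q = malloc(sizeof(*q));
		if (!q) {
			err = -1;
			break;
		}
		memcpy(q->addr, e->addr, sizeof(q->addr));
		q->vid = e->vid;
		q->next = NULL;
		*tail = q;
		tail = &q->next;
	}
	read_unlock();

	/* Phase 2: outside the section, deliver; stop calling on first error. */
	while (queue) {
		q = queue;
		queue = q->next;
		if (!err)
			err = cb(q, priv);
		free(q);
	}

	return err;
}

/* Example callback: counts deliveries and notes any made inside the section. */
struct replay_stats {
	int calls;
	int inside_section;
};

static int stats_cb(const struct queued_mdb *obj, void *priv)
{
	struct replay_stats *st = priv;

	(void)obj;
	st->calls++;
	if (read_section_depth > 0)
		st->inside_section++;
	return 0;
}
```

The sketch copies each entry into the queue node on the assumption that the
original may go away once the read-side section ends, so the callback should
only ever see the private copy.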

Suggested-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |  9 +
 net/bridge/br_mdb.c   | 84 +++
 net/dsa/dsa_priv.h|  2 +
 net/dsa/port.c|  6 +++
 net/dsa/slave.c   |  2 +-
 5 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index ebd16495459c..4c25dafb013d 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device 
*dev, int proto);
 bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto);
 bool br_multicast_enabled(const struct net_device *dev);
 bool br_multicast_router(const struct net_device *dev);
+int br_mdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb, struct netlink_ext_ack *extack);
 #else
 static inline int br_multicast_list_adjacent(struct net_device *dev,
 struct list_head *br_ip_list)
@@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct 
net_device *dev)
 {
return false;
 }
+static inline int br_mdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb,
+   struct netlink_ext_ack *extack)
+{
+   return -EINVAL;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 8846c5bcd075..23973186094c 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -506,6 +506,90 @@ static void br_mdb_complete(struct net_device *dev, int 
err, void *priv)
kfree(priv);
 }
 
+static int br_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
+struct net_bridge_mdb_entry *mp, int obj_id,
+struct net_device *orig_dev,
+struct netlink_ext_ack *extack)
+{
+   struct switchdev_notifier_port_obj_info obj_info = {
+   .info = {
+   .dev = dev,
+   .extack = extack,
+   },
+   };
+   struct switchdev_obj_port_mdb mdb = {
+   .obj = {
+   .orig_dev = orig_dev,
+   .id = obj_id,
+   },
+ 

[RFC PATCH v2 net-next 09/16] net: dsa: replay port and local fdb entries when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

When a DSA port joins a LAG that already had an FDB entry pointing to it:

ip link set bond0 master br0
bridge fdb add dev bond0 00:01:02:03:04:05 master static
ip link set swp0 master bond0

the DSA port will have no idea that this FDB entry is there, because it
missed the switchdev event emitted at its creation.

Ido Schimmel pointed this out during a discussion about challenges with
switchdev offloading of stacked interfaces between the physical port and
the bridge, and recommended to just catch that condition and deny the
CHANGEUPPER event:
https://lore.kernel.org/netdev/20210210105949.gb287...@shredder.lan/

But in fact, we might need to deal with the hard thing anyway, which is
to replay all FDB addresses relevant to this port, because it isn't just
static FDB entries, but also local addresses (ones that are not
forwarded but terminated by the bridge). There, we can't just say 'oh
yeah, there was an upper already so I'm not joining that'.

So, similar to the logic for replaying MDB entries, add a function that
must be called by individual switchdev drivers and replays local FDB
entries as well as ones pointing towards a bridge port. This time, we
use the atomic switchdev notifier block, since that's what FDB entries
expect for some reason.
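
The selection rule (local entries plus entries pointing towards the bridge
port being offloaded) can be sketched in plain user-space C as below. The
struct and function names are hypothetical stand-ins, not the kernel's
types; only the comparison mirrors what the replay loop in this patch does:

```c
#include <stddef.h>

/* Hypothetical stand-ins for net_device, bridge port and FDB entry. */
struct netdev {
	const char *name;
};

struct brport {
	struct netdev *dev;
};

struct fdb_entry {
	struct brport *dst;	/* NULL means a local, bridge-terminated address */
};

/* An entry is replayed if it is local or points at the given bridge port. */
static int fdb_entry_is_relevant(const struct fdb_entry *fdb,
				 const struct netdev *br_dev,
				 const struct netdev *port_dev)
{
	const struct netdev *dst_dev = fdb->dst ? fdb->dst->dev : br_dev;

	return dst_dev == br_dev || dst_dev == port_dev;
}

/* Count how many entries of a table would be replayed for port_dev. */
static size_t fdb_count_relevant(const struct fdb_entry *fdbs, size_t n,
				 const struct netdev *br_dev,
				 const struct netdev *port_dev)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (fdb_entry_is_relevant(&fdbs[i], br_dev, port_dev))
			count++;

	return count;
}
```

With a table holding one local entry, one entry towards bond0 and one
towards an unrelated port, only the first two would be replayed for bond0.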

Reported-by: Ido Schimmel 
Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |  9 +++
 include/net/switchdev.h   |  1 +
 net/bridge/br_fdb.c   | 52 +++
 net/dsa/dsa_priv.h|  1 +
 net/dsa/port.c|  4 +++
 net/dsa/slave.c   |  2 +-
 6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 4c25dafb013d..89596134e88f 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -147,6 +147,8 @@ void br_fdb_clear_offload(const struct net_device *dev, u16 
vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
 clock_t br_get_ageing_time(struct net_device *br_dev);
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -175,6 +177,13 @@ static inline clock_t br_get_ageing_time(struct net_device 
*br_dev)
 {
return 0;
 }
+
+static inline int br_fdb_replay(struct net_device *br_dev,
+   struct net_device *dev,
+   struct notifier_block *nb)
+{
+   return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index b7fc7d0f54e2..7688ec572757 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -205,6 +205,7 @@ struct switchdev_notifier_info {
 
 struct switchdev_notifier_fdb_info {
struct switchdev_notifier_info info; /* must be first */
+   struct list_head list;
const unsigned char *addr;
u16 vid;
u8 added_by_user:1,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index b7490237f3fc..49125cc196ac 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -726,6 +726,58 @@ static inline size_t fdb_nlmsg_size(void)
+ nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */
 }
 
+static int br_fdb_replay_one(struct notifier_block *nb,
+struct net_bridge_fdb_entry *fdb,
+struct net_device *dev)
+{
+   struct switchdev_notifier_fdb_info item;
+   int err;
+
+   item.addr = fdb->key.addr.addr;
+   item.vid = fdb->key.vlan_id;
+   item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
+   item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
+   item.info.dev = dev;
+
+   err = nb->notifier_call(nb, SWITCHDEV_FDB_ADD_TO_DEVICE, &item);
+   return notifier_to_errno(err);
+}
+
+int br_fdb_replay(struct net_device *br_dev, struct net_device *dev,
+ struct notifier_block *nb)
+{
+   struct net_bridge_fdb_entry *fdb;
+   struct net_bridge *br;
+   int err = 0;
+
+   if (!netif_is_bridge_master(br_dev))
+   return -EINVAL;
+
+   if (!netif_is_bridge_port(dev))
+   return -EINVAL;
+
+   br = netdev_priv(br_dev);
+
+   rcu_read_lock();
+
+   hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+   struct net_device *dst_dev;
+
+   dst_dev = fdb->dst ? fdb->dst->dev : br->dev;
+   if (dst_dev != br_dev && dst_dev != dev)
+   continue;
+
+   err = br_fdb_replay_one(nb, fdb, dst_dev);
+   if (err)
+   break;
+   }
+
+   rcu_read_unlock();
+
+   return err;
+}
+EXPORT_SYMBOL(br_fdb_replay);
+
 static void fdb_notify(struct net_bridge *br,
   const stru

[RFC PATCH v2 net-next 07/16] net: dsa: sync ageing time when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:

sysfs/ioctl/netlink
-> br_set_ageing_time
   -> __set_ageing_time

therefore not at bridge port creation time, so:
(a) drivers had to hardcode the initial value for the address ageing time,
because they didn't get any notification
(b) that hardcoded value can be out of sync, if the user changes the
ageing time before enslaving the port to the bridge

Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 13 +
 net/dsa/port.c| 10 ++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 920d3a02cc68..ebd16495459c 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -137,6 +137,7 @@ struct net_device *br_fdb_find_port(const struct net_device 
*br_dev,
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
 u8 br_port_get_stp_state(const struct net_device *dev);
+clock_t br_get_ageing_time(struct net_device *br_dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -160,6 +161,11 @@ static inline u8 br_port_get_stp_state(const struct 
net_device *dev)
 {
return BR_STATE_DISABLED;
 }
+
+static inline clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   return 0;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 86b5e05d3f21..3dafb6143cff 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -639,6 +639,19 @@ int br_set_ageing_time(struct net_bridge *br, clock_t 
ageing_time)
return 0;
 }
 
+clock_t br_get_ageing_time(struct net_device *br_dev)
+{
+   struct net_bridge *br;
+
+   if (!netif_is_bridge_master(br_dev))
+   return 0;
+
+   br = netdev_priv(br_dev);
+
+   return jiffies_to_clock_t(br->ageing_time);
+}
+EXPORT_SYMBOL_GPL(br_get_ageing_time);
+
 /* called under bridge lock */
 void __br_set_topology_change(struct net_bridge *br, unsigned char val)
 {
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 8380509ee47c..9fde2371e1bc 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -173,6 +173,7 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
 {
struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
struct net_device *br = dp->bridge_dev;
+   clock_t ageing_time;
u8 stp_state;
int err;
 
@@ -193,6 +194,11 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   ageing_time = br_get_ageing_time(br);
+   err = dsa_port_ageing_time(dp, ageing_time);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -222,6 +228,10 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
 * allow this in standalone mode too.
 */
dsa_port_mrouter(dp->cpu_dp, true, NULL);
+
+   /* Ageing time may be global to the switch chip, so don't change it
+* here because we have no good reason (or value) to change it to.
+*/
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[RFC PATCH v2 net-next 06/16] net: dsa: sync multicast router state when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

Make sure that the multicast router setting of the bridge is picked up
correctly by DSA when joining, regardless of whether there are
sandwiched interfaces or not. The SWITCHDEV_ATTR_ID_BRIDGE_MROUTER port
attribute is only emitted from br_mc_router_state_change.

Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index ac1afe182c3b..8380509ee47c 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -189,6 +189,10 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = dsa_port_mrouter(dp->cpu_dp, br_multicast_router(br), extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -212,6 +216,12 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
 
/* VLAN filtering is handled by dsa_switch_bridge_leave */
+
+   /* Some drivers treat the notification for having a local multicast
+* router by allowing multicast to be flooded to the CPU, so we should
+* allow this in standalone mode too.
+*/
+   dsa_port_mrouter(dp->cpu_dp, true, NULL);
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[RFC PATCH v2 net-next 05/16] net: dsa: sync up VLAN filtering state when joining the bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

This is the same situation as for other switchdev port attributes: if we
join an already-created bridge port, such as a bond master interface,
then we can miss the initial switchdev notification emitted by the
bridge for this port.

Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 785374744462..ac1afe182c3b 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -172,6 +172,7 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
   struct netlink_ext_ack *extack)
 {
struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   struct net_device *br = dp->bridge_dev;
u8 stp_state;
int err;
 
@@ -184,6 +185,10 @@ static int dsa_port_switchdev_sync(struct dsa_port *dp,
if (err && err != -EOPNOTSUPP)
return err;
 
+   err = dsa_port_vlan_filtering(dp, br, extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
@@ -205,6 +210,8 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp)
 * so allow it to be in BR_STATE_FORWARDING to be kept functional
 */
dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
+
+   /* VLAN filtering is handled by dsa_switch_bridge_leave */
 }
 
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
-- 
2.25.1



[RFC PATCH v2 net-next 04/16] net: dsa: sync up with bridge port's STP state when joining

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

It may happen that we have the following topology:

ip link add br0 type bridge stp_state 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
ip link set swp1 master bond0

STP decides that it should put bond0 into the BLOCKING state, and
that's that. The ports that are actively listening for the switchdev
port attributes emitted for the bond0 bridge port (because they are
offloading it) and have the honor of seeing that switchdev port
attribute can react to it, so we can program swp0 and swp1 into the
BLOCKING state.

But if then we do:

ip link set swp2 master bond0

then as far as the bridge is concerned, nothing has changed: it still
has one bridge port. But this new bridge port will not see any STP state
change notification and will remain FORWARDING, which is how the
standalone code leaves it.

Add a function to the bridge which retrieves the current STP state, such
that drivers can synchronize to it when they may have missed switchdev
events.
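
To illustrate the missed-event problem and the pull-at-join remedy, here is
a small user-space sketch. Everything in it is a hypothetical stand-in: the
enum values are local substitutes for the BR_STATE_* constants, and
brport_set_state / hw_port_join only model the notification fan-out and the
join-time read, they are not functions from this patch:

```c
#include <stddef.h>

/* Local stand-ins for the bridge's STP port states. */
enum stp_state {
	STP_DISABLED = 0,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

struct bridge_port {
	enum stp_state state;
};

struct hw_port {
	enum stp_state state;
	int offloads_brport;	/* non-zero once the port is under the LAG */
};

/* Event path: only ports offloading the bridge port right now see it. */
static void brport_set_state(struct bridge_port *bp, enum stp_state state,
			     struct hw_port *ports, size_t n)
{
	size_t i;

	bp->state = state;
	for (i = 0; i < n; i++)
		if (ports[i].offloads_brport)
			ports[i].state = state;
}

/* Join path: a late port reads the current state instead of waiting. */
static void hw_port_join(struct hw_port *port, const struct bridge_port *bp)
{
	port->offloads_brport = 1;
	port->state = bp->state;
}
```

In this model a third port that joins after the transition to BLOCKING stays
FORWARDING until hw_port_join() copies the bridge port's state into it.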

Signed-off-by: Vladimir Oltean 
---
 include/linux/if_bridge.h |  6 ++
 net/bridge/br_stp.c   | 14 ++
 net/dsa/port.c|  7 +++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b979005ea39c..920d3a02cc68 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -136,6 +136,7 @@ struct net_device *br_fdb_find_port(const struct net_device 
*br_dev,
__u16 vid);
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
+u8 br_port_get_stp_state(const struct net_device *dev);
 #else
 static inline struct net_device *
 br_fdb_find_port(const struct net_device *br_dev,
@@ -154,6 +155,11 @@ br_port_flag_is_set(const struct net_device *dev, unsigned 
long flag)
 {
return false;
 }
+
+static inline u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   return BR_STATE_DISABLED;
+}
 #endif
 
 #endif
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 21c6781906aa..86b5e05d3f21 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -64,6 +64,20 @@ void br_set_state(struct net_bridge_port *p, unsigned int 
state)
}
 }
 
+u8 br_port_get_stp_state(const struct net_device *dev)
+{
+   struct net_bridge_port *p;
+
+   ASSERT_RTNL();
+
+   p = br_port_get_rtnl(dev);
+   if (!p)
+   return BR_STATE_DISABLED;
+
+   return p->state;
+}
+EXPORT_SYMBOL_GPL(br_port_get_stp_state);
+
 /* called under bridge lock */
 struct net_bridge_port *br_get_port(struct net_bridge *br, u16 port_no)
 {
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 346c50467810..785374744462 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -171,12 +171,19 @@ static int dsa_port_inherit_brport_flags(struct dsa_port 
*dp,
 static int dsa_port_switchdev_sync(struct dsa_port *dp,
   struct netlink_ext_ack *extack)
 {
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   u8 stp_state;
int err;
 
err = dsa_port_inherit_brport_flags(dp, extack);
if (err)
return err;
 
+   stp_state = br_port_get_stp_state(brport_dev);
+   err = dsa_port_set_state(dp, stp_state);
+   if (err && err != -EOPNOTSUPP)
+   return err;
+
return 0;
 }
 
-- 
2.25.1



[RFC PATCH v2 net-next 03/16] net: dsa: inherit the actual bridge port flags at join time

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA currently assumes that the bridge port starts off with this
constellation of bridge port flags:

- learning on
- unicast flooding on
- multicast flooding on
- broadcast flooding on

just by virtue of code copy-pasta from the bridge layer (new_nbp).
This was a simple enough strategy thus far, because the 'bridge join'
moment always coincided with the 'bridge port creation' moment.

But with sandwiched interfaces, such as:

 br0
  |
bond0
  |
 swp0

it may happen that the user has had time to change the bridge port flags
of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
assume that the bridge port flags are those determined by new_nbp, when
in fact this can happen:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set bond0 type bridge_slave learning off
ip link set swp0 master br0

Now swp0 has learning enabled, bond0 has learning disabled. Not nice.

Fix this by "dumpster diving" through the actual bridge port flags with
br_port_flag_is_set, at bridge join time.

We use this opportunity to split dsa_port_change_brport_flags into two
distinct functions called dsa_port_inherit_brport_flags and
dsa_port_clear_brport_flags, now that the implementation for the two
cases is no longer similar.
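
A user-space sketch of the inherit step, under stated assumptions: the SK_*
bits, struct sk_hw_port and both functions are hypothetical (the kernel's
BR_* flags have different values), and the fake port simply refuses bits it
lists as unsupported. It shows one request per flag bit, with -EOPNOTSUPP
tolerated and any other error propagated:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's BR_* flags have different values. */
#define SK_LEARNING	(1UL << 0)
#define SK_FLOOD	(1UL << 1)
#define SK_MCAST_FLOOD	(1UL << 2)
#define SK_BCAST_FLOOD	(1UL << 3)

struct sk_brport_flags {
	unsigned long val;
	unsigned long mask;
};

/* Fake hardware port: current flags plus the bits it cannot configure. */
struct sk_hw_port {
	unsigned long flags;
	unsigned long unsupported;
};

/* Apply one masked flag update, or refuse it if the bit is unsupported. */
static int sk_port_set_flags(struct sk_hw_port *hw, struct sk_brport_flags f)
{
	if (f.mask & hw->unsupported)
		return -EOPNOTSUPP;

	hw->flags = (hw->flags & ~f.mask) | (f.val & f.mask);
	return 0;
}

/* Copy each masked flag from the bridge port, one bit per request. */
static int sk_inherit_brport_flags(struct sk_hw_port *hw,
				   unsigned long brport_flags)
{
	const unsigned long mask = SK_LEARNING | SK_FLOOD | SK_MCAST_FLOOD |
				   SK_BCAST_FLOOD;
	unsigned int bit;

	for (bit = 0; bit < 8 * sizeof(mask); bit++) {
		struct sk_brport_flags f = { 0, 0 };
		int err;

		if (!(mask & (1UL << bit)))
			continue;

		f.mask = 1UL << bit;
		f.val = brport_flags & f.mask;

		err = sk_port_set_flags(hw, f);
		if (err && err != -EOPNOTSUPP)
			return err;
	}

	return 0;
}
```

Splitting the update into single-bit requests means a port that cannot
configure one flag still picks up the others from the bridge port.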

Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 123 -
 1 file changed, 82 insertions(+), 41 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index fcbe5b1545b8..346c50467810 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -122,26 +122,82 @@ void dsa_port_disable(struct dsa_port *dp)
rtnl_unlock();
 }
 
-static void dsa_port_change_brport_flags(struct dsa_port *dp,
-bool bridge_offload)
+static void dsa_port_clear_brport_flags(struct dsa_port *dp,
+   struct netlink_ext_ack *extack)
 {
struct switchdev_brport_flags flags;
-   int flag;
 
-   flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
-   if (bridge_offload)
-   flags.val = flags.mask;
-   else
-   flags.val = flags.mask & ~BR_LEARNING;
+   flags.mask = BR_LEARNING;
+   flags.val = 0;
+   dsa_port_bridge_flags(dp, flags, extack);
+
+   flags.mask = BR_FLOOD;
+   flags.val = BR_FLOOD;
+   dsa_port_bridge_flags(dp, flags, extack);
+
+   flags.mask = BR_MCAST_FLOOD;
+   flags.val = BR_MCAST_FLOOD;
+   dsa_port_bridge_flags(dp, flags, extack);
+
+   flags.mask = BR_BCAST_FLOOD;
+   flags.val = BR_BCAST_FLOOD;
+   dsa_port_bridge_flags(dp, flags, extack);
+}
+
+static int dsa_port_inherit_brport_flags(struct dsa_port *dp,
+struct netlink_ext_ack *extack)
+{
+   const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD |
+  BR_BCAST_FLOOD;
+   struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
+   int flag, err;
+
+   for_each_set_bit(flag, &mask, 32) {
+   struct switchdev_brport_flags flags = {0};
 
-   for_each_set_bit(flag, &flags.mask, 32) {
-   struct switchdev_brport_flags tmp;
+   flags.mask = BIT(flag);
 
-   tmp.val = flags.val & BIT(flag);
-   tmp.mask = BIT(flag);
+   if (br_port_flag_is_set(brport_dev, BIT(flag)))
+   flags.val = BIT(flag);
 
-   dsa_port_bridge_flags(dp, tmp, NULL);
+   err = dsa_port_bridge_flags(dp, flags, extack);
+   if (err && err != -EOPNOTSUPP)
+   return err;
}
+
+   return 0;
+}
+
+static int dsa_port_switchdev_sync(struct dsa_port *dp,
+  struct netlink_ext_ack *extack)
+{
+   int err;
+
+   err = dsa_port_inherit_brport_flags(dp, extack);
+   if (err)
+   return err;
+
+   return 0;
+}
+
+/* Configure the port for standalone mode (no address learning, flood
+ * everything, BR_STATE_FORWARDING, etc).
+ * The bridge only emits SWITCHDEV_ATTR_ID_PORT_* events when the user
+ * requests it through netlink or sysfs, but not automatically at port
+ * join or leave, so we need to handle resetting the brport flags ourselves.
+ * But we even prefer it that way, because otherwise, some setups might never
+ * get the notification they need, for example, when a port leaves a LAG that
+ * offloads the bridge, it becomes standalone, but as far as the bridge is
+ * concerned, no port ever left.
+ */
+static void dsa_port_switchdev_unsync(struct dsa_port *dp)
+{
+   dsa_port_clear_brport_flags(dp, NULL);
+
+   /* Port left the bridge, put in BR_STATE_DISABLED by the bridge layer,
+* so allow it to be in BR_STATE_FORWARDING to be kept functional
+*/
+   dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
 }
 
 int dsa_port_bridg

[RFC PATCH v2 net-next 02/16] net: dsa: pass extack to dsa_port_{bridge,lag}_join

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

This is a pretty noisy change that was broken out of the larger change
for replaying switchdev attributes and objects at bridge join time,
which is when these extack objects are actually used.

Signed-off-by: Vladimir Oltean 
---
 net/dsa/dsa_priv.h | 6 --
 net/dsa/port.c | 8 +---
 net/dsa/slave.c| 7 +--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 4c43c5406834..b8778c5d8529 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -181,12 +181,14 @@ int dsa_port_enable_rt(struct dsa_port *dp, struct 
phy_device *phy);
 int dsa_port_enable(struct dsa_port *dp, struct phy_device *phy);
 void dsa_port_disable_rt(struct dsa_port *dp);
 void dsa_port_disable(struct dsa_port *dp);
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br);
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
 int dsa_port_lag_change(struct dsa_port *dp,
struct netdev_lag_lower_state_info *linfo);
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev,
- struct netdev_lag_upper_info *uinfo);
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack);
 void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev);
 int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering,
struct netlink_ext_ack *extack);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index d39262a9fe0e..fcbe5b1545b8 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -144,7 +144,8 @@ static void dsa_port_change_brport_flags(struct dsa_port 
*dp,
}
 }
 
-int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br)
+int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
+struct netlink_ext_ack *extack)
 {
struct dsa_notifier_bridge_info info = {
.tree_index = dp->ds->dst->index,
@@ -241,7 +242,8 @@ int dsa_port_lag_change(struct dsa_port *dp,
 }
 
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag,
- struct netdev_lag_upper_info *uinfo)
+ struct netdev_lag_upper_info *uinfo,
+ struct netlink_ext_ack *extack)
 {
struct dsa_notifier_lag_info info = {
.sw_index = dp->ds->index,
@@ -263,7 +265,7 @@ int dsa_port_lag_join(struct dsa_port *dp, struct 
net_device *lag,
if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
return 0;
 
-   err = dsa_port_bridge_join(dp, bridge_dev);
+   err = dsa_port_bridge_join(dp, bridge_dev, extack);
if (err)
goto err_bridge_join;
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 992fcab4b552..1ff48be476bb 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1976,11 +1976,14 @@ static int dsa_slave_changeupper(struct net_device *dev,
 struct netdev_notifier_changeupper_info *info)
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
+   struct netlink_ext_ack *extack;
int err = NOTIFY_DONE;
 
+   extack = netdev_notifier_info_to_extack(&info->info);
+
if (netif_is_bridge_master(info->upper_dev)) {
if (info->linking) {
-   err = dsa_port_bridge_join(dp, info->upper_dev);
+   err = dsa_port_bridge_join(dp, info->upper_dev, extack);
if (!err)
dsa_bridge_mtu_normalization(dp);
err = notifier_from_errno(err);
@@ -1991,7 +1994,7 @@ static int dsa_slave_changeupper(struct net_device *dev,
} else if (netif_is_lag_master(info->upper_dev)) {
if (info->linking) {
err = dsa_port_lag_join(dp, info->upper_dev,
-   info->upper_info);
+   info->upper_info, extack);
if (err == -EOPNOTSUPP) {
NL_SET_ERR_MSG_MOD(info->info.extack,
   "Offloading not supported");
-- 
2.25.1



[RFC PATCH v2 net-next 01/16] net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

DSA can properly detect and offload this sequence of operations:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set swp0 master bond0
ip link set bond0 master br0

But not this one:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

The second one is actually more complicated: in the time elapsed
between the enslavement of bond0 and its offloading via swp0, a
lot of things could have happened to the bond0 bridge port in terms of
switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
a bit of a can of worms, and making sure that the DSA port's state is in
sync with this already existing bridge port is handled in the next
patches.
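
As a rough model of the goal (both orderings end with the port offloading
the bridge), consider the user-space sketch below. The types and the two
functions are hypothetical stand-ins for illustration; they do not model
the notifier calls or the error unwinding of the actual change:

```c
#include <stddef.h>

/* Hypothetical stand-in for an upper device (LAG or bridge). */
struct upper {
	struct upper *master;	/* upper of this device, if any */
	int is_bridge;
};

/* Hypothetical stand-in for a switch port. */
struct sw_port {
	struct upper *lag;
	struct upper *bridge;
};

/* Port enslaved to a LAG: also join the bridge if the LAG already has one. */
static void sw_port_lag_join(struct sw_port *p, struct upper *lag)
{
	p->lag = lag;
	if (lag->master && lag->master->is_bridge)
		p->bridge = lag->master;
}

/* LAG enslaved to a bridge: every port already under the LAG joins too. */
static void lag_bridge_join(struct upper *lag, struct upper *br,
			    struct sw_port *ports, size_t n)
{
	size_t i;

	lag->master = br;
	for (i = 0; i < n; i++)
		if (ports[i].lag == lag)
			ports[i].bridge = br;
}
```

Without the check in sw_port_lag_join(), the second ordering would leave the
port with a LAG but no bridge, which is the gap described above.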

Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index c9c6d7ab3f47..d39262a9fe0e 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -249,17 +249,31 @@ int dsa_port_lag_join(struct dsa_port *dp, struct 
net_device *lag,
.lag = lag,
.info = uinfo,
};
+   struct net_device *bridge_dev;
int err;
 
dsa_lag_map(dp->ds->dst, lag);
dp->lag_dev = lag;
 
err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info);
-   if (err) {
-   dp->lag_dev = NULL;
-   dsa_lag_unmap(dp->ds->dst, lag);
-   }
+   if (err)
+   goto err_lag_join;
 
+   bridge_dev = netdev_master_upper_dev_get(lag);
+   if (!bridge_dev || !netif_is_bridge_master(bridge_dev))
+   return 0;
+
+   err = dsa_port_bridge_join(dp, bridge_dev);
+   if (err)
+   goto err_bridge_join;
+
+   return 0;
+
+err_bridge_join:
+   dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info);
+err_lag_join:
+   dp->lag_dev = NULL;
+   dsa_lag_unmap(dp->ds->dst, lag);
return err;
 }
 
-- 
2.25.1



[RFC PATCH v2 net-next 00/16] Better support for sandwiched LAGs with bridge and DSA

2021-03-18 Thread Vladimir Oltean
From: Vladimir Oltean 

This series has two objectives:
- To make LAG uppers on top of DSA ports work regardless of which order
  we link interfaces to their masters (first make the port join the LAG,
  then the LAG join the bridge, or the other way around).
- To make DSA ports support non-offloaded LAG interfaces properly.

There was a design decision to be made in patches 2-4 on whether we
should adopt the "push" model, where the driver just calls:

  switchdev_bridge_port_offloaded(brport_dev,
  _notifier_block,
  _notifier_block,
  extack);

and the bridge just replays the entire collection of switchdev port
attributes and objects that it has, in some predefined order and with
some predefined error handling logic;


or the "pull" model, where the driver, apart from calling:

  switchdev_bridge_port_offloaded(brport_dev, extack);

has the task of "dumpster diving" (as Tobias puts it) through the bridge
attributes and objects by itself, by calling:

  - br_vlan_replay
  - br_fdb_replay
  - br_mdb_replay
  - br_vlan_enabled
  - br_port_flag_is_set
  - br_port_get_stp_state
  - br_multicast_router
  - br_get_ageing_time

(not necessarily all of them, and not necessarily in this order, and
with driver-defined error handling).

Even though I'm not in love myself with the "pull" model, I chose it
because there is a fundamental trick with replaying switchdev events
like this:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0 <- this will replay the objects once for
 the bond0 bridge port, and the swp0
 switchdev port will process them
ip link set swp1 master bond0 <- this will replay the objects again for
 the bond0 bridge port, and the swp1
 switchdev port will see them, but swp0
 will see them for the second time now

Basically I believe that it is implementation defined whether the driver
wants to error out on switchdev objects seen twice on a port, and the
bridge should not enforce a certain model for that. For example, for FDB
entries added to a bonding interface, the underlying switchdev driver
might have an abstraction for just that: an FDB entry pointing towards a
logical (as opposed to physical) port. So when the second port joins the
bridge, it doesn't really need to replay FDB entries, since there is
already at least one hardware port which has been receiving those
events, and the FDB entries don't need to be added a second time to the
same logical port.
In the other corner, we have the drivers that handle switchdev port
attributes on a LAG as individual switchdev port attributes on physical
ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set
helper facilitates this: it is a fan-out from a single orig_dev towards
multiple lowers that pass the check_cb().
But that's the point: switchdev_handle_port_attr_set is just a helper
which the driver _opts_ to use. The bridge can't enforce the "push"
model, because that would assume that all drivers handle port attributes
in the same way, which is probably false.

For this reason, I preferred to go with the "pull" model for this patch
set. Just to see how bad it is for other switchdev drivers to copy-paste
this logic, I added the pull support to ocelot too, and I think it's
pretty manageable.

This patch set is RFC because it is minimally tested, and I would like
to get some feedback/agreement regarding the design decisions taken,
before I spend any more time on this.

There are also some things I probably broke, but I couldn't figure out
anything better. For example, I can't seem to figure out if mlxsw does the right
thing when joining a bonding interface that is already a bridge port.
I think it probably doesn't, so in that case, the placement I found for
the switchdev_bridge_port_offload() probably needs some adjustment when
there exists a LAG upper.

If possible, I would like the maintainers of the switchdev drivers to
tell me if this change introduces any regressions to how packets are
flooded (actually not flooded) in software by the bridge between two
ports belonging to the same ASIC ID.

I should mention that this patch series is written on top of Tobias'
series:
https://patchwork.kernel.org/project/netdevbpf/cover/20210318192540.895062-1-tob...@waldekranz.com/
which should get applied soon.

Vladimir Oltean (16):
  net: dsa: call dsa_port_bridge_join when joining a LAG that is already
in a bridge
  net: dsa: pass extack to dsa_port_{bridge,lag}_join
  net: dsa: inherit the actual bridge port flags at join time
  net: dsa: sync up with bridge port's STP state when joining
  net: dsa: sync up VLAN filtering state when joining the bridge
  net: dsa: sy

Re: [PATCH net-next] net: phy: at803x: remove at803x_aneg_done()

2021-03-18 Thread Vladimir Oltean
On Thu, Mar 18, 2021 at 05:38:13PM +0100, Michael Walle wrote:
> Am 2021-03-18 17:21, schrieb Heiner Kallweit:
> > On 18.03.2021 16:17, Vladimir Oltean wrote:
> > > On Thu, Mar 18, 2021 at 03:54:00PM +0100, Heiner Kallweit wrote:
> > > > On 18.03.2021 15:23, Michael Walle wrote:
> > > > > at803x_aneg_done() is pretty much dead code since the patch series
> > > > > "net: phy: improve and simplify phylib state machine" [1].
> > > > > Remove it.
> > > > >
> > > >
> > > > Well, it's not dead, it's resting .. There are few places where
> > > > phy_aneg_done() is used. So you would need to explain:
> > > > - why these users can't be used with this PHY driver
> > > > - or why the aneg_done callback isn't needed here and the
> > > >   genphy_aneg_done() fallback is sufficient
> > >
> > > The piece of code that Michael is removing keeps the aneg reporting as
> > > "not done" even when the copper-side link was reported as up, but the
> > > in-band autoneg has not finished.
> > >
> > > > That was the _intended_ behavior when that code was introduced, and you
> > > > have said about it:
> > > https://www.spinics.net/lists/stable/msg389193.html
> > >
> > > | That's not nice from the PHY:
> > > | It signals "link up", and if the system asks the PHY for link details,
> > > | then it sheepishly says "well, link is *almost* up".
> > >
> > > > If the specification of phy_aneg_done behavior does not include in-band
> > > > autoneg (and it doesn't), then this piece of code does not belong here.
> > >
> > > The fact that we can no longer trigger this code from phylib is yet
> > > another reason why it fails at its intended (and wrong) purpose and
> > > should be removed.
> > >
> > I don't argue against the change, I just think that the current commit
> > description isn't sufficient. What you just said I would have expected
> > in the commit description.
>
> I'll come up with a better one, Vladimir, may I use parts of the text
> above?

My words aren't copyrighted, so feel free. However, you might want to
check with Heiner too for his part; you never know.


Re: [PATCH net-next] net: phy: at803x: remove at803x_aneg_done()

2021-03-18 Thread Vladimir Oltean
On Thu, Mar 18, 2021 at 03:54:00PM +0100, Heiner Kallweit wrote:
> On 18.03.2021 15:23, Michael Walle wrote:
> > at803x_aneg_done() is pretty much dead code since the patch series
> > "net: phy: improve and simplify phylib state machine" [1]. Remove it.
> > 
> 
> Well, it's not dead, it's resting .. There are few places where
> phy_aneg_done() is used. So you would need to explain:
> - why these users can't be used with this PHY driver
> - or why the aneg_done callback isn't needed here and the
>   genphy_aneg_done() fallback is sufficient

The piece of code that Michael is removing keeps the aneg reporting as
"not done" even when the copper-side link was reported as up, but the
in-band autoneg has not finished.

That was the _intended_ behavior when that code was introduced, and you
have said about it:
https://www.spinics.net/lists/stable/msg389193.html

| That's not nice from the PHY:
| It signals "link up", and if the system asks the PHY for link details,
| then it sheepishly says "well, link is *almost* up".

If the specification of phy_aneg_done behavior does not include in-band
autoneg (and it doesn't), then this piece of code does not belong here.

The fact that we can no longer trigger this code from phylib is yet
another reason why it fails at its intended (and wrong) purpose and
should be removed.
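
To make the distinction concrete, here is a toy model in plain C. It is not
the at803x driver code and not the phylib API: struct toy_phy and both toy_*
functions are invented for illustration. It only contrasts an "aneg done"
check that looks at the copper side alone with one that also waits for
in-band autoneg, which is the behavior described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PHY state; the fields are assumptions, not real register bits. */
struct toy_phy {
	bool copper_aneg_complete;	/* media-side autoneg finished */
	bool inband_aneg_complete;	/* MAC-side in-band autoneg finished */
};

/* Copper-only check: what a generic fallback is assumed to report here. */
static bool toy_generic_aneg_done(const struct toy_phy *phy)
{
	assert(phy != NULL);
	return phy->copper_aneg_complete;
}

/*
 * Stricter check in the spirit of the removed callback: keep reporting
 * "not done" until the in-band side has finished as well.
 */
static bool toy_strict_aneg_done(const struct toy_phy *phy)
{
	if (!toy_generic_aneg_done(phy))
		return false;

	return phy->inband_aneg_complete;
}
```

With copper autoneg complete and in-band autoneg still pending, the two
functions disagree: the copper-only one says "done", the strict one says
"not done". That disagreement is the "link is *almost* up" situation quoted
above.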


Re: [PATCH stable 0/6] net: dsa: b53: Correct learning for standalone ports

2021-03-17 Thread Vladimir Oltean
On Tue, Mar 16, 2021 at 05:35:43PM -0700, Florian Fainelli wrote:
> Hi Greg, Sasha, Jakub and David,
> 
> This patch series contains backports for a change that recently made it
> upstream as f9b3827ee66cfcf297d0acd6ecf33653a5f297ef ("net: dsa: b53:
> Support setting learning on port") however that commit depends on
> infrastructure that landed in v5.12-rc1.
> 
> The way this was fixed in the netdev group's net tree is slightly
> different from how it should be backported to stable trees which is why
> you will find a patch for each branch in the thread started by this
> cover letter. The commit used as a Fixes: base dates back from when the
> driver was first introduced into the tree since this should have been
> fixed from day one ideally.
> 
> Let me know if this does not apply for some reason. The changes from 4.9
> through 4.19 are nearly identical and then from 5.4 through 5.11 are
> about the same.
> 
> Thank you very much!

Florian, same comment I just sent to Tobias applies to you as well:
could you please call b53_br_fast_age when disabling address learning?
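
Roughly what I mean, as a toy model in plain C rather than b53 code. The
toy_switch structure and the toy_* functions are made up for illustration;
the assumption is that turning learning off does not by itself remove the
dynamic entries that were already learned on the port, so they have to be
flushed explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PORTS	4
#define TOY_FDB_SIZE	8

/* Toy FDB entry; not the hardware ARL layout. */
struct toy_fdb_entry {
	bool valid;
	bool is_static;		/* static entries survive a fast age */
	int port;
};

struct toy_switch {
	bool learning[TOY_NUM_PORTS];
	struct toy_fdb_entry fdb[TOY_FDB_SIZE];
};

/* Drop every dynamically learned entry that points at the given port. */
static void toy_fast_age(struct toy_switch *sw, int port)
{
	size_t i;

	for (i = 0; i < TOY_FDB_SIZE; i++) {
		struct toy_fdb_entry *e = &sw->fdb[i];

		if (e->valid && !e->is_static && e->port == port)
			e->valid = false;
	}
}

/* Hypothetical learning toggle: flush dynamic entries when disabling. */
static void toy_port_set_learning(struct toy_switch *sw, int port, bool enable)
{
	assert(port >= 0 && port < TOY_NUM_PORTS);

	sw->learning[port] = enable;
	if (!enable)
		toy_fast_age(sw, port);
}

/* Count the valid entries (static or dynamic) that point at a port. */
static size_t toy_count_entries(const struct toy_switch *sw, int port)
{
	size_t i, count = 0;

	for (i = 0; i < TOY_FDB_SIZE; i++)
		if (sw->fdb[i].valid && sw->fdb[i].port == port)
			count++;

	return count;
}
```

After toy_port_set_learning(sw, port, false), only the static entries for
that port remain, and entries learned on other ports are untouched.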


Re: [PATCH net-next v2 3/3] net: ocelot: Remove ocelot_xfh_get_cpuq

2021-03-16 Thread Vladimir Oltean
On Tue, Mar 16, 2021 at 09:10:19PM +0100, Horatiu Vultur wrote:
> Now when extracting frames from CPU the cpuq is not used anymore so
> remove it.
> 
> Signed-off-by: Horatiu Vultur 
> ---

OCELOT_MRP_CPUQ should have disappeared too. Doesn't matter too much
though.

Reviewed-by: Vladimir Oltean 

