Re: merge vlan and carp input back into ether_input

2018-01-14 Thread Florian Riehm

On 01/11/18 14:51, Martin Pieuchot wrote:

On 11/01/18(Thu) 21:59, David Gwynne wrote:

[...]
when you say i break carp balancing, are you talking about the removal of the 
PACKET_TAG_CARP_BAL_IP tagging? PACKET_TAG_CARP_BAL_IP is only used in 
carp_lsdrop to clear the M_MCAST flag on the mbuf. M_MCAST wont be set on 
packets destined for the carp interface because we check ac_enaddr before 
checking if the packet is multicast or broadcast.


I might be mistaken.  I just know that this code is fragile and I'd
prefer to see such change tested isolated because carp(4) itself has
multiple configurations.  CARP balancing has been fixed for 6.2 by
friehm@ after being broken for multiple releases.


Hi David,

let me know if you all you carp changes are commited, then
I will test if carp balancing is still working.
I tested this commit already and it seems to be fine.

Thanks & Regards,

friehm



Re: merge vlan and carp input back into ether_input

2018-01-11 Thread David Gwynne
On Thu, Jan 11, 2018 at 02:51:40PM +0100, Martin Pieuchot wrote:
> On 11/01/18(Thu) 21:59, David Gwynne wrote:
> > [...] 
> > when you say i break carp balancing, are you talking about the removal of 
> > the PACKET_TAG_CARP_BAL_IP tagging? PACKET_TAG_CARP_BAL_IP is only used in 
> > carp_lsdrop to clear the M_MCAST flag on the mbuf. M_MCAST wont be set on 
> > packets destined for the carp interface because we check ac_enaddr before 
> > checking if the packet is multicast or broadcast.
> 
> I might be mistaken.  I just know that this code is fragile and I'd
> prefer to see such change tested isolated because carp(4) itself has
> multiple configurations.  CARP balancing has been fixed for 6.2 by
> friehm@ after being broken for multiple releases.

ok.

i had a look the commit related to this at
https://github.com/openbsd/src/commit/76dda2b0279f3c37adf1c059c3bab4d74bc96602

that change happened because ethernet_input checks and sets the
multicast bits before doing the ac_enaddr comparison. that's reversed
in the big diff i sent, but i pulled it out and attached it below.
it doesn't take away the mbuf tag carp uses, but it would make it
unnecessary.

> > are you sure? carp and vlan on top of trunk should still function. however, 
> > trunk or bridge/switch on vlan is broken though :(
> 
> I'm not sure, but I believe mixing input handler and hardcoding some in
> ether_input() won't fly :)

agreed.

> 
> > > If if_input() is called on a pseudo-interface we know we're already in
> > > a softnet process.
> > 
> > or a syscall.
> > 
> > the code above has pseudo interfaces recurse, where they'd loop either on 
> > ifih or at the task level. probably not a huge concern though.
> 
> We can unroll the loop afterward.  What we need now is get rid of the
> queues.

ok.

> 
> > > We could also think of doing something similar for if_enqueue() and call
> > > if_start() directly for pseudo-interface.
> > 
> > ill think about that. ive had some other experiments in that area we could 
> > look at too.
> 
> I'd be glad to look at your experiments :)

but im shy :(

> > >> note that trunk and bridge/switch are still implemented using
> > >> interface input handlers at the moment.
> > > 
> > > If you want to get rid of the input handlers, I'd suggest doing it in
> > > the beginning of a release cycle and for all pseudo drivers at once.
> > 
> > considering the trunk and bridge/switch issue, it probably is best to do 
> > them all at once.
> 
> Yes.  I'm aware that the SRP/input handler loop might be considered.
> We can probably gain some percents if we remove it.  However this is
> a micro optimization compared to other improvements that can be done.
> Plus it has the advantage of not having fragile #ifdef maze in the
> rest of the code.

im not expecting a performance difference with this stuff, it's
more about correctness. right now the behaviour of the stack is
arguably incorrect depending on when you configure pseudo-interfaces.
here are two examples:

if you configure a carp interface on myx0 after configuring a vlan
interface, carp_input input handler will be in the SRPL before
vlan_input. because of that we have this chunk in carp_input:

#if NVLAN > 0
/*
 * If the underlying interface removed the VLAN header itself,
 * it's not for us.
 */
if (ISSET(m->m_flags, M_VLANTAG))
return (0);
#endif

however, myx does not do hw vlan tagging and we dont check for
ETHERTYPE_VLAN or _QINQ there. carp may take a packet that vlan
should have taken.

secondly, if you configure switch(4) on an interface after configuring
it as a trunk port, switch will happily take all the packets on it.

it could be argued that these examples are a bit contrived, which
i will accept, and they could be fixed with some stricter code
checks, but they do demonstrate a problem with the stack.

in my mind, the order of processing by pseudo interfaces attached
to an ethernet interface would be:

1. let trunk(4) look at it

if trunk is configured on the port...

while we're talking about this, i would like to implement "independent"
ports on trunks. independent ports mean that if lacp isnt negotiated
on the ethernet interface, it can be used as a normal interface.
cisco do this by default on port channels, but other vendors require
explicit config. eg, on a force10^Wdell i need to add something
like "lacp ungroup member-independent port-channel 12" to allow
members of po12 to function as normal ports when lacp isnt negotiated.
this is useful if you want to pxe boot boxes that are usually
connected using lacp.

if we supported independent trunk ports, then we should allow
otherwise normal ethernet interface configuration.

2. let bridge(4) or switch(4) look at it

not both.

it only makes sense for a bridge/switch to get the packet if trunk
doesnt want it.

3. if it is vlan or qinq tagged, give it to vlan or drop it.

4. check ac_enaddr to see if it is for the current interface.

5. otherwise, give 

Re: merge vlan and carp input back into ether_input

2018-01-11 Thread Martin Pieuchot
On 11/01/18(Thu) 21:59, David Gwynne wrote:
> [...] 
> when you say i break carp balancing, are you talking about the removal of the 
> PACKET_TAG_CARP_BAL_IP tagging? PACKET_TAG_CARP_BAL_IP is only used in 
> carp_lsdrop to clear the M_MCAST flag on the mbuf. M_MCAST wont be set on 
> packets destined for the carp interface because we check ac_enaddr before 
> checking if the packet is multicast or broadcast.

I might be mistaken.  I just know that this code is fragile and I'd
prefer to see such change tested isolated because carp(4) itself has
multiple configurations.  CARP balancing has been fixed for 6.2 by
friehm@ after being broken for multiple releases.

> are you sure? carp and vlan on top of trunk should still function. however, 
> trunk or bridge/switch on vlan is broken though :(

I'm not sure, but I believe mixing input handler and hardcoding some in
ether_input() won't fly :)

> > If if_input() is called on a pseudo-interface we know we're already in
> > a softnet process.
> 
> or a syscall.
> 
> the code above has pseudo interfaces recurse, where they'd loop either on 
> ifih or at the task level. probably not a huge concern though.

We can unroll the loop afterward.  What we need now is get rid of the
queues.

> > We could also think of doing something similar for if_enqueue() and call
> > if_start() directly for pseudo-interface.
> 
> ill think about that. ive had some other experiments in that area we could 
> look at too.

I'd be glad to look at your experiments :)

> >> note that trunk and bridge/switch are still implemented using
> >> interface input handlers at the moment.
> > 
> > If you want to get rid of the input handlers, I'd suggest doing it in
> > the beginning of a release cycle and for all pseudo drivers at once.
> 
> considering the trunk and bridge/switch issue, it probably is best to do them 
> all at once.

Yes.  I'm aware that the SRP/input handler loop might be considered.
We can probably gain some percents if we remove it.  However this is
a micro optimization compared to other improvements that can be done.
Plus it has the advantage of not having fragile #ifdef maze in the
rest of the code.



Re: merge vlan and carp input back into ether_input

2018-01-11 Thread David Gwynne


> On 11 Jan 2018, at 8:58 pm, Martin Pieuchot  wrote:
> 
> On 11/01/18(Thu) 11:50, David Gwynne wrote:
>> while we were working on making the various pseudo interfaces you
>> stack on top of ethernet mpsafe, we split their input processing
>> off so they could be attacked one by one. they're all mpsafe now,
>> so this separation is not strictly necessary anymore.
> 
> Well at that time we weren't sure how to split the work between
> CPUs.  The current design allow to have multiple tasks doing some
> part of the work and, like we are pushing now, process all the
> incoming packets in the same context.

yes.

now though we assign interfaces to softnettqs, no matter what their 
relationship to each other is. a physical interface could be on tq 0, and it's 
child vlan on 1, which goes against the goal of processing all incoming packets 
in the same context.

> 
>> this moves carp and vlan input back into ether_input. [...]
> 
> It does much more than that.  It's also doing a conversion from mbuf to
> mbuf_list, it contains some IFF_RUNNING "fixes",  it breaks carp(4)
> balancing, it breaks vlan/carp on top of trunk(4)...

it doesnt replace mbufs with mbuf_lists, it gets rid of the cookies for vlan 
and carp input and gives them the list to requeue their packets on for input.

wrt to IFF_RUNNING, if we agree that we should be looking at it to know if an 
interface is functional i can make a bunch of those fixes without oks.

when you say i break carp balancing, are you talking about the removal of the 
PACKET_TAG_CARP_BAL_IP tagging? PACKET_TAG_CARP_BAL_IP is only used in 
carp_lsdrop to clear the M_MCAST flag on the mbuf. M_MCAST wont be set on 
packets destined for the carp interface because we check ac_enaddr before 
checking if the packet is multicast or broadcast.

are you sure? carp and vlan on top of trunk should still function. however, 
trunk or bridge/switch on vlan is broken though :(

> 
>> this diff also gets rid of the use of the pseudo interfaces input
>> queues, it processes their packets off an mbuf list for each real
>> ethernet packet. if we can tie all the work done on behalf of a
>> physical ring to a single task it makes rx ring moderation for
>> physical interfaces a lot easier to implement.
> 
> This needs to be done.  But please don't mix that with a rewrite of
> the input handlers.  I believe that's the easiest way to do that is
> to modify if_input() do process the packets directly if `ifp' is a
> pseudo-interface:
> 
>   if (ISSET(ifp->if_xflags, IFXF_CLONED) {
>   struct ifih *ifih;
>   struct srp_ref sr;
> 
>   NET_ASSERT_LOCKED();
> 
>SRPL_FOREACH(ifih, &sr, &ifp->if_inputs, ifih_next) {
>if ((*ifih->ifih_input)(ifp, m, ifih->ifih_cookie))
>break;
>}
>SRPL_LEAVE(&sr)
>   } else
>   ifiq_input(&ifp->if_rcv, ml, 2048);
> 
> If if_input() is called on a pseudo-interface we know we're already in
> a softnet process.

or a syscall.

the code above has pseudo interfaces recurse, where they'd loop either on ifih 
or at the task level. probably not a huge concern though.

> 
> We could also think of doing something similar for if_enqueue() and call
> if_start() directly for pseudo-interface.

ill think about that. ive had some other experiments in that area we could look 
at too.

> 
>> note that trunk and bridge/switch are still implemented using
>> interface input handlers at the moment.
> 
> If you want to get rid of the input handlers, I'd suggest doing it in
> the beginning of a release cycle and for all pseudo drivers at once.

considering the trunk and bridge/switch issue, it probably is best to do them 
all at once.



Re: merge vlan and carp input back into ether_input

2018-01-11 Thread Martin Pieuchot
On 11/01/18(Thu) 11:50, David Gwynne wrote:
> while we were working on making the various pseudo interfaces you
> stack on top of ethernet mpsafe, we split their input processing
> off so they could be attacked one by one. they're all mpsafe now,
> so this separation is not strictly necessary anymore.

Well at that time we weren't sure how to split the work between
CPUs.  The current design allow to have multiple tasks doing some
part of the work and, like we are pushing now, process all the
incoming packets in the same context.

> this moves carp and vlan input back into ether_input. [...]

It does much more than that.  It's also doing a conversion from mbuf to
mbuf_list, it contains some IFF_RUNNING "fixes",  it breaks carp(4)
balancing, it breaks vlan/carp on top of trunk(4)...

> this diff also gets rid of the use of the pseudo interfaces input
> queues, it processes their packets off an mbuf list for each real
> ethernet packet. if we can tie all the work done on behalf of a
> physical ring to a single task it makes rx ring moderation for
> physical interfaces a lot easier to implement.

This needs to be done.  But please don't mix that with a rewrite of
the input handlers.  I believe that's the easiest way to do that is
to modify if_input() do process the packets directly if `ifp' is a
pseudo-interface:

if (ISSET(ifp->if_xflags, IFXF_CLONED) {
struct ifih *ifih;
struct srp_ref sr;

NET_ASSERT_LOCKED();

SRPL_FOREACH(ifih, &sr, &ifp->if_inputs, ifih_next) {
if ((*ifih->ifih_input)(ifp, m, ifih->ifih_cookie))
break;
}
SRPL_LEAVE(&sr)
} else
ifiq_input(&ifp->if_rcv, ml, 2048);

If if_input() is called on a pseudo-interface we know we're already in
a softnet process.

We could also think of doing something similar for if_enqueue() and call
if_start() directly for pseudo-interface.

> note that trunk and bridge/switch are still implemented using
> interface input handlers at the moment.

If you want to get rid of the input handlers, I'd suggest doing it in
the beginning of a release cycle and for all pseudo drivers at once.



merge vlan and carp input back into ether_input

2018-01-10 Thread David Gwynne
while we were working on making the various pseudo interfaces you
stack on top of ethernet mpsafe, we split their input processing
off so they could be attacked one by one. they're all mpsafe now,
so this separation is not strictly necessary anymore.

this moves carp and vlan input back into ether_input. a lot of care
is taken to correctly order when we give the packets to the sub
interfaces. basically, any ethernet packet with a vlan tag is
unconditionally given to vlan_input. carp input is only attempted
if the packet is not for the parent interface, but before the
multicast handling is done. by checking the interfaces mac address
first, carp interfaces can get their packets immediately, which
means we can  stop messing around with the M_BCAST and M_MCAST flags
on carp mbufs. the relevant chunk of code is:

#if NVLAN > 0
if (ISSET(m->m_flags, M_VLANTAG) ||
etype == ETHERTYPE_VLAN ||
etype == ETHERTYPE_QINQ) {
vlan_input(ifp, ml, m);
return;
}
#endif

if (memcmp(ac->ac_enaddr, eh->ether_dhost, ETHER_ADDR_LEN) != 0) {
/* The packet doesn't match the ether addr on this iface */

#if NCARP > 0
/* It may be addressed to a child carp iface */
if (ifp->if_type != IFT_CARP &&
!SRPL_EMPTY_LOCKED(&ifp->if_carp) &&
carp_input(ifp, ml, m)) {
/* carp_input has consumed the packet */
return;
}
#endif

/* It must be multicast if it isn't for us or a child carp */
if (!ETHER_IS_MULTICAST(eh->ether_dhost))
goto dropanyway;

/* Drop it if it came from us in the first place */
if (!ISSET(ifp->if_flags, IFF_SIMPLEX) &&
memcmp(ac->ac_enaddr, eh->ether_shost,
 ETHER_ADDR_LEN) == 0)
goto dropanyway;

SET(m->m_flags, (memcmp(etherbroadcastaddr, eh->ether_dhost,
ETHER_ADDR_LEN) == 0) ? M_BCAST : M_MCAST);
ifp->if_imcasts++; /* XXX lock? */
}

doing vlan and carp input here let's us remove special handling for
VLAN packets in carp code.

this diff also gets rid of the use of the pseudo interfaces input
queues, it processes their packets off an mbuf list for each real
ethernet packet. if we can tie all the work done on behalf of a
physical ring to a single task it makes rx ring moderation for
physical interfaces a lot easier to implement.

note that trunk and bridge/switch are still implemented using
interface input handlers at the moment.

ok?

Index: net/if_ethersubr.c
===
RCS file: /cvs/src/sys/net/if_ethersubr.c,v
retrieving revision 1.250
diff -u -p -r1.250 if_ethersubr.c
--- net/if_ethersubr.c  10 Jan 2018 00:14:38 -  1.250
+++ net/if_ethersubr.c  11 Jan 2018 01:20:54 -
@@ -103,6 +103,16 @@ didn't get a copy, you may request one f
 #include 
 #endif
 
+#include "vlan.h"
+#if NVLAN > 0
+#include 
+#endif
+
+#include "carp.h"
+#if NCARP > 0
+#include 
+#endif
+
 #include "pppoe.h"
 #if NPPPOE > 0
 #include 
@@ -121,6 +131,8 @@ didn't get a copy, you may request one f
 #include 
 #endif /* MPLS */
 
+void   ether_input_m(struct ifnet *, struct mbuf_list *, struct mbuf *);
+
 u_int8_t etherbroadcastaddr[ETHER_ADDR_LEN] =
 { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
 u_int8_t etheranyaddr[ETHER_ADDR_LEN] =
@@ -306,69 +318,108 @@ bad:
return (error);
 }
 
-/*
- * Process a received Ethernet packet;
- * the packet is in the mbuf chain m without
- * the ether header, which is provided separately.
- */
+void
+ether_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *m)
+{
+#if NBPFILTER > 0
+   caddr_t if_bpf;
+#endif
+
+   m->m_pkthdr.ph_ifidx = ifp->if_index;
+   m->m_pkthdr.ph_rtableid = ifp->if_rdomain;
+
+   /* XXX lock? */
+   ifp->if_ipackets++;
+   ifp->if_ibytes += m->m_pkthdr.len;
+
+#if NBPFILTER > 0
+   if_bpf = ifp->if_bpf;
+   if (if_bpf) {
+   if (bpf_mtap_ether(if_bpf, m, BPF_DIRECTION_IN)) {
+   m_freem(m);
+   return;
+   }
+   }
+#endif
+
+   ml_enqueue(ml, m);
+}
+
 int
 ether_input(struct ifnet *ifp, struct mbuf *m, void *cookie)
 {
+   struct mbuf_list ml = MBUF_LIST_INITIALIZER();
+
+   /* Drop short frames */
+   if (m->m_len < ETHER_HDR_LEN) {
+   m_freem(m);
+   return (1);
+   }
+
+   /* We have a reference to this ifp already */
+   ether_input_m(ifp, &ml, m);
+
+   /* Run the packet through any child interfaces */
+   while ((m = ml_dequeue(&ml)) != NULL) {
+   ifp = if_get(m->m_pkthdr.ph_ifidx);
+   if (ifp != NULL)
+   ether_input_m(ifp, &ml, m);
+   else
+